Sequence File Formats

Sequence File Formats
• Different formats for different uses
• Competing formats developed in parallel
• Some are easy to read, some are easy to write programs for
• You don't have to stick to these formats, but parsers are already written for them!
• Most formats are plain text (unlike binary formats such as .bam files)

IDs versus accessions
• When people first started, they used gene names as IDs
• But there are too few gene names, and databases require unique IDs
• Now there is a variety of accession numbers
• The simplest ID is a number that you increment, since you can (almost) never run out of IDs

Standard nucleotide codes (IUPAC)
Symbol  Meaning             Origin
G       G                   Guanine
A       A                   Adenine
C       C                   Cytosine
T       T                   Thymine
R       G or A              puRine
Y       T or C              pYrimidine
M       A or C              aMino
K       G or T              Keto
N       G or A or T or C    aNy
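One way to put the ambiguity codes listed above to work is to turn a motif into a regular expression. The sketch below is only an illustration: it covers just the nine symbols in the table, and the function names `iupac_to_regex` and `matches` are made up for this example.

```python
import re

# Only the symbols listed in the table above; other IUPAC codes are not handled.
IUPAC = {
    "G": "G", "A": "A", "C": "C", "T": "T",
    "R": "GA",    # puRine
    "Y": "TC",    # pYrimidine
    "M": "AC",    # aMino
    "K": "GT",    # Keto
    "N": "GATC",  # aNy
}


def iupac_to_regex(motif: str) -> str:
    """Translate a motif containing ambiguity codes into a regex pattern."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC[symbol]  # raises KeyError for symbols outside the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def matches(motif: str, sequence: str) -> bool:
    """Return True if the whole sequence is compatible with the motif."""
    return re.fullmatch(iupac_to_regex(motif), sequence.upper()) is not None
```

For example, `iupac_to_regex("GAR")` gives `GA[GA]`, so both `GAG` and `GAA` match the motif `GAR`.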

Standard protein codes
One  Three  Amino acid
A    Ala    Alanine
C    Cys    Cysteine
D    Asp    Aspartic acid
E    Glu    Glutamate
F    Phe    Phenylalanine
G    Gly    Glycine
H    His    Histidine
I    Ile    Isoleucine
K    Lys    Lysine
L    Leu    Leucine
M    Met    Methionine
N    Asn    Asparagine
P    Pro    Proline
Q    Gln    Glutamine
R    Arg    Arginine
S    Ser    Serine
T    Thr    Threonine
V    Val    Valine
W    Trp    Tryptophan
Y    Tyr    Tyrosine
X    Xaa    Unknown

Fasta
• Simplest file format. Easy to parse, easy to use

>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactACGCTAGCATCAGTCATA
CAT
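To show how little is needed to read this format, here is a minimal sketch of a FASTA reader. It assumes the identifier is the first whitespace-delimited word after ">" and that a sequence may wrap over several lines; the function name `parse_fasta` is chosen for this example only.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into a dict of identifier -> sequence."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word; the rest is optional information.
            words = line[1:].split(None, 1)
            current_id = words[0] if words else ""
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks[current_id].append(line)
    # Join wrapped sequence lines into one string per record.
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}
```

Typical use would be `with open("input.fasta") as handle: records = parse_fasta(handle)`, where `input.fasta` stands for any FASTA file. Case is left untouched, so lowercase stretches in the sequence are preserved.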

GenBank
• More complex; includes detailed information on genes, CDS, annotation, etc.
• Human readable
• Difficult to parse
• Use standard parsers (BioPerl, BioJava, etc.)

LOCUS       NC_001418               5833 bp ss-DNA     circular PHG 17-APR-2009
DEFINITION  Pseudomonas phage Pf3, complete genome.
ACCESSION   NC_001418
VERSION     NC_001418.1  GI:9626316
DBLINK      Project:14061
KEYWORDS    .
SOURCE      Pseudomonas phage Pf3
  ORGANISM  Pseudomonas phage Pf3
            Viruses; ssDNA viruses; Inoviridae; Inovirus.
FEATURES             Location/Qualifiers
     source          1..5833
                     /organism="Pseudomonas phage Pf3"
                     /mol_type="genomic DNA"
                     /host="Pseudomonas aeruginosa"
                     /db_xref="taxon:10872"
                     /note="Pf3 bacteriophage DNA from P.aeruginosa infected with plasmid RP1."
     gene            join(5763..5833,1..106)
                     /locus_tag="Pf3_1"
                     /db_xref="GeneID:1260905"
     CDS             join(5763..5833,1..106)
                     /locus_tag="Pf3_1"
                     /note="orf 58, part 2"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_040651.1"
                     /db_xref="GI:9626317"
                     /db_xref="GeneID:1260905"
                     /translation="MSYYVCVQLVNDVCHEWAERSDLLSLPEGSGLQIGGMLLLLSAT
                     AWGIQQIARLLLNR"

     3241 aggtcctgtt ggccttaaga tcacccaagg gcatcttgcc agatggtacc gtcattactt
     3301 atgagaaaat atcctcaatg ggtaatggct ataccttcga gcttgagtcg cttatatttg
     3361 cggctcttgc tcggtcttta tgcgaattac tgggcttacg accgtcagat gttacggtct
     3421 atggcgatga cataatattg ccatcagacg cgtgcagtcc tctagttgaa gttttctcct
     3481 atgttggttt tcgtaccaac aagaagaaaa cgttttctag tggaccgttc cgagagtcgt
     3541 gcggaaagca ctactttttg ggcgttgacg tcacaccttt ctacatacgt cgccgtatag
     3601 tgagtccctc cgatctcata ctggttttga accagatgta tcgttgggcc acaattgacg
     3661 gcgtatggga tcctagggta tatcctgtat acaccaagta tagacgttac cttccggaaa
     3721 ttctccggag gaatgtcgtg cctgatggat acggtgatgg tgccctcgtc ggatctgtct
     3781 taatcagtcc tttcgcagaa aatcgcggtt gggttcggcg tgtgccgatg attatagaca
     3841 agaggaaaga ccgagttcgt gacgaatatg gttcgtatct ctacgagcta tggtcgttgc
     3901 agcaactcga atgtgacagt gagttcccct ttaacgggtc gctggtcgtt ggttccactg
     3961 atggcactct cgcttacgca caccgagaac ggttacctac cgttatcagt gatgccgtaa
     4021 gtgcgtttga catcatgtgg ataccgtgca gtagtcgtgt cctggctccc tacggggatt
     4081 tccggaggca cgaaggctct atcctaaaaa tggggtagcg cctgggaggg gtgcattatg
     4141 caccctaggt tagcaatact taaactaacc ttctcaaaag agagagtgaa ggctctgctt
     4201 tgccctcact cctccca
//
LOCUS       NC_003301               3192 bp ds-RNA     linear   PHG 23-AUG-2008
DEFINITION  Pseudomonas phage phi8 segment S, complete sequence.
ACCESSION   NC_003301
VERSION     NC_003301.1  GI:17736965
DBLINK      Project:14731
KEYWORDS    .
SOURCE      Pseudomonas phage phi8
  ORGANISM  Pseudomonas phage phi8
            Viruses; dsRNA viruses; Cystoviridae; Cystovirus.
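As the excerpt shows, a GenBank file can hold several records, each ending with a line containing only "//". The sketch below only splits a file into records and reads the name and length from each LOCUS line; it is not a full parser (the feature table is left untouched), and for real work one of the standard parsers mentioned above is the safer choice. The function names are invented for this example.

```python
from typing import Dict, List


def split_genbank_records(text: str) -> List[List[str]]:
    """Split GenBank flat-file text into records, using '//' as the terminator."""
    records: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        records.append(current)  # final record without a closing '//'
    return records


def locus_summary(record: List[str]) -> Dict[str, object]:
    """Return the locus name and length taken from the LOCUS line of one record."""
    for line in record:
        if line.startswith("LOCUS"):
            fields = line.split()
            return {"name": fields[1], "length": int(fields[2])}
    raise ValueError("record has no LOCUS line")
```

Applied to the two LOCUS lines shown above, this would report NC_001418 with length 5833 and NC_003301 with length 3192.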

GFF3
• Tab separated format
• Easy to parse
• Attributes are tag/value pairs separated by ";"
• Columns:
1. Contig
2. Source database
3. Feature type
4. Start
5. Stop
6. Score
7. Strand
8. Phase
9. Attributes
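The nine columns map naturally onto a small record type. The following is a sketch under a few assumptions: the field names in `Feature` are chosen for this example, each attribute is written as `tag=value` (as in the GFF3 specification), and "." marks an empty score or phase. The sample line in the usage note is made up for illustration.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    contig: str
    source: str
    feature_type: str
    start: int
    stop: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 line; return None for comments, directives and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    attributes: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)  # GFF3 percent-encodes reserved characters
    return Feature(
        contig=cols[0],
        source=cols[1],
        feature_type=cols[2],
        start=int(cols[3]),
        stop=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attributes,
    )
```

For instance, `parse_gff3_line("NC_001418\tRefSeq\tCDS\t1\t106\t.\t+\t0\tID=cds1;locus_tag=Pf3_1")` yields a `Feature` whose `attributes` dict contains `locus_tag` mapped to `Pf3_1`.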

ASN.1
• Developed as computer readable form of GenBank
• Not widely used

ASN.1
seq {
  id { local id 1 },
  descr { title "" },
  inst { repr raw, mol aa, length 131, topology linear,
  { seq-data iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQA
TGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFA
PTLMSSCITSTTGPPAWAGDRSHE" } } ,
seq {
  id { local id 1 },
  descr { title "" },
  inst { repr raw, mol aa, length 131, topology linear,
  { seq-data iupacaa "TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTL
MSSCITSTTGPPAWAGDRSHE" } }

Base calling
• Need to be sure which base you have identified
• Depends on the technology
• Each machine includes its own base-calling software
• Phred is a historical package developed at U. Washington
• Phred scores encode the probability that the base call is wrong: Q = -10 × log10(P_error), so a higher score means a more reliable base

Quality values
• Phred 10: 1 × 10^-1 (1 in 10) chance that the base is wrong
• Phred 20: 1 × 10^-2 (1 in 100) chance that the base is wrong
• Phred 30: 1 × 10^-3 (1 in 1,000) chance that the base is wrong
• Phred 40: 1 × 10^-4 (1 in 10,000) chance that the base is wrong
• Phred 99: the base is (for all practical purposes) correct!
• FASTQ scores are the score + 33, then converted to ASCII text
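The arithmetic behind these values can be sketched in a few lines. This assumes the usual Phred relationship Q = -10 × log10(P), where P is the probability that the base is wrong, together with the +33 ASCII offset described above; the function names are chosen for this example.

```python
import math
from typing import List

ASCII_OFFSET = 33  # FASTQ stores each score as the character with code (score + 33)


def phred_to_error_probability(score: float) -> float:
    """Probability that the base call is wrong for a given Phred score."""
    return 10 ** (-score / 10)


def error_probability_to_phred(probability: float) -> int:
    """Phred score (rounded to the nearest integer) for an error probability."""
    return round(-10 * math.log10(probability))


def encode_quality(scores: List[int]) -> str:
    """Turn a list of Phred scores into a FASTQ quality string."""
    return "".join(chr(score + ASCII_OFFSET) for score in scores)


def decode_quality(quality: str) -> List[int]:
    """Turn a FASTQ quality string back into a list of Phred scores."""
    return [ord(char) - ASCII_OFFSET for char in quality]
```

For example, a Phred score of 30 is stored as the character with ASCII code 63, which is "?".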

FastQ
• Based on fasta format
• Contains information about the quality of the sequence
• Quality comes from sequencing machines!
• Four lines per sequence:
• Line starting with @ = identifier line before the sequence
• DNA sequence
• Line starting with + = identifier line before the quality scores
• Quality string = one character per base, each the ASCII character for (quality score + 33)
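The four-line layout can be read with a short generator. This is a minimal sketch that assumes strictly four lines per record (no wrapped sequence or quality lines) and the +33 offset; `parse_fastq` is a name chosen for this example.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier line, sequence, Phred scores) for each four-line record."""
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None:
            return  # end of input
        if header == "":
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        plus = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or plus is None or quality is None:
            raise ValueError("truncated record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Subtract the ASCII offset of 33 to recover the Phred scores.
        yield header[1:], sequence, [ord(char) - 33 for char in quality]
```

Because it is a generator, a large file can be processed one record at a time, for example `for name, seq, scores in parse_fastq(open("input.fastq")): ...`, where `input.fastq` is a placeholder file name.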

ASCII character codes
ASCII Char   ASCII Char   ASCII Char   ASCII Char   ASCII Char
33    !      50    2      70    F      90    Z      110   n
34    "      51    3      71    G      91    [      111   o
35    #      52    4      72    H      92    \      112   p
36    $      53    5      73    I      93    ]      113   q
37    %      54    6      74    J      94    ^      114   r
38    &      55    7      75    K      95    _      115   s
39    '      56    8      76    L      96    `      116   t
40    (      57    9      77    M      97    a      117   u
41    )      58    :      78    N      98    b      118   v
42    *      59    ;      79    O      99    c      119   w
43    +      60    <      80    P      100   d      120   x
44    ,      61    =      81    Q      101   e      121   y
45    -      62    >      82    R      102   f      122   z
46    .      63    ?      83    S      103   g      123   {
47    /      64    @      84    T      104   h      124   |
48    0      65    A      85    U      105   i      125   }
49    1      66    B      86    V      106   j      126   ~

fastq
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@/=<?7=9<2A8==
• The line after the @ line is the DNA sequence; the line after the + line holds the quality scores
• Note: older Illumina pipelines (before version 1.8) wrote quality scores with an ASCII offset of 64 instead of 33, so those fastq files are not compatible with everyone else's format!

How to convert fastq to fasta
• prinseq-lite.pl -fastq input.fastq -out_format 2
• https://edwards.sdsu.edu/research/fastq-to-fasta/
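For comparison, the same conversion can be sketched in a few lines of Python. This is not how PRINSEQ does it; it is a minimal illustration that assumes four lines per record and simply drops the quality information. The function names are chosen for this example.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq_handle: TextIO, fasta_handle: TextIO) -> int:
    """Copy identifiers and sequences from FASTQ to FASTA; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        sequence = fastq_handle.readline().rstrip("\r\n")
        plus = fastq_handle.readline()
        fastq_handle.readline()  # quality line, not needed for FASTA
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record: " + header.strip())
        fasta_handle.write(">" + header[1:].rstrip("\r\n") + "\n" + sequence + "\n")
        count += 1
    return count


def convert_text(fastq_text: str) -> str:
    """Convenience wrapper for converting FASTQ text held in memory."""
    output = io.StringIO()
    fastq_to_fasta(io.StringIO(fastq_text), output)
    return output.getvalue()
```

With real files this would be called as `fastq_to_fasta(open("input.fastq"), open("output.fasta", "w"))`, where both file names are placeholders.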
