25-Mar-15

Databases
Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 26th 2015

Biology is Big Data science
• Moore's Law: computer power doubles every ~2 years.
[Figure: # sequenced genomes]

History
• First protein sequence: bovine insulin (51 amino acids, 1956)
• Atlas of Protein Sequence and Structure (1965)
  – Margaret Oakley Dayhoff
• Protein DataBank (10 proteins, 1972)
  – X-ray crystallographic protein structures
• SWISSPROT (1987)
  – Protein sequence database
• Genbank (1982)
  – Nucleotide and protein sequences
[Figure: IBM 7090 computer]

How would you figure out the function of a protein?
• Activity assay
• X-ray structure
• Knock-out mouse
• BLAST search

Fasta files
• Biological sequences are stored in Fasta files
• Fasta files are plain text files (open e.g. in a text editor)
• Every new sequence entry starts with a “>” sign at the start of a line
• Each sequence has an identifier that has to be unique in the file
• The sequence can be on one or more lines, until the next “>” at the start of a new line
• Spaces and newlines just make sequences easier to read/count; they do not have any meaning

    >protein_sequence_A
    MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
    >protein_sequence_B
    MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
    >protein_sequence_C
    MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
    RPLLRPIKSP FGAWVVIV

Fasta file extensions
• The file extension of a Fasta file is .fa or .fasta
• The preferred extension for protein Fasta files is .faa
  – Fasta Amino Acid
• The preferred extension for DNA Fasta files is .fna
  – Fasta Nucleic Acid

    >DNA_sequence_X
    GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
    >DNA_sequence_Y
    CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
    >DNA_sequence_Z
    GAAGCTGACC CGTTTCCGGA AGAGGGAGG
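The Fasta rules above (a “>” at the start of a line opens a new entry, identifiers must be unique, spaces and newlines carry no meaning) can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function name `parse_fasta` is made up for this example, and it assumes the whole header line after “>” is the identifier.

```python
def parse_fasta(text):
    """Return a dict mapping each identifier to its sequence (whitespace removed)."""
    chunks = {}      # identifier -> list of sequence pieces
    current = None   # identifier of the entry being read
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" at the start of a line opens a new entry.
            current = line[1:].strip()
            if current in chunks:
                raise ValueError(f"identifier is not unique: {current}")
            chunks[current] = []
            continue
        # Spaces inside a line have no meaning, so drop them.
        piece = "".join(line.split())
        if not piece:
            continue  # skip empty lines
        if current is None:
            raise ValueError("sequence data found before the first '>' line")
        chunks[current].append(piece)
    # Newlines have no meaning either: join the pieces of each entry.
    return {name: "".join(pieces) for name, pieces in chunks.items()}


example = """>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
>DNA_sequence_Z
GAAGCTGACC CGTTTCCGGA AGAGGGAGG
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

Rejecting a repeated identifier, rather than silently overwriting the earlier entry, follows the rule that identifiers have to be unique in the file.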
DNA sequencing
• DNA sequencing depends on A/C/G/T signal being “read”
  – Differently colored fluorophore signals
  – Signal is not always unambiguous
• DNA sequencing machines estimate the quality of a sequenced nucleotide
[Figure: bad quality sequencing read]
[Figure: good quality sequencing read]

DNA sequencing quality scores
• DNA sequencing quality is measured in Phred scores
  – Phred 10: 10^-1 chance that the base is wrong
    • 90% accuracy; 10% error rate
  – Phred 20: 10^-2 chance that the base is wrong
    • 99% accuracy; 1% error rate
  – Phred 30: 10^-3 chance that the base is wrong
    • 99.9% accuracy; 0.1% error rate
  – Etcetera
• Phred scores in Fastq files are stored as ASCII characters
  – Phred score + 33, converted to ASCII text

Fastq
• Sequencing output and quality are stored in Fastq format
  – Based on Fasta format
  – Contains information about quality of each nucleotide
  – Quality score is estimated by sequencing machine

  Fasta:
    >sequence_identifier_1
    GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
    >sequence_identifier_2
    AAGCATCCGAATGACGAGCTAGGAGAGATCTGAGCCTTTCAAA

  Fastq:
    @sequence_identifier_1
    GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
    +sequence_identifier_1
    hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%

• Four lines per sequence:
  – Identifier line starting with @
  – DNA sequence on one line
  – Second identifier line starting with +
  – String of quality scores on one line, encoded in ASCII characters

Genbank format
• Used by the Genbank database
  – Used in sequence similarity searches (more about this later)
• Contains the sequence and all related information
• Click on “FASTA” to get Fasta format
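The Phred encoding and the four-line Fastq layout described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the read `read_1` are made up for illustration, the offset of 33 is the one named on the slide, and every record is assumed to occupy exactly four lines.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(phred):
    """Chance that a base is wrong for a given Phred score: 10^(-phred/10)."""
    return 10 ** (-phred / 10)


def parse_fastq(text):
    """Return a list of (identifier, sequence, phred_scores) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per sequence")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("need one quality character per nucleotide")
        records.append((header[1:], seq, phred_scores(qual)))
    return records


# Hypothetical read, invented for this example.
example = "@read_1\nGATTACA\n+read_1\nII?5+I!\n"

for name, seq, scores in parse_fastq(example):
    for base, q in zip(seq, scores):
        print(name, base, q, error_probability(q))
```

Reading in fixed groups of four lines, rather than looking for “@” at the start of a line, avoids misreading a quality string that itself begins with “@”.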
Central paradigm of Bioinformatics
• Central dogma of (molecular) biology
• Biological sequences encode a lot of information
• One of the most important applications of Bioinformatic Data Analysis is to extract this information

International Nucleotide Sequence Database Collaboration
• INSDC is a collaboration between:
  – DNA Data Bank of Japan (DDBJ)
  – National Center for Biotechnology Information (NCBI)
  – European Molecular Biology Laboratory / European Bioinformatics Institute (EMBL-EBI)

Protein families
• Pfam database
• SEED database
• eggNOG database

Ribosomal RNA genes
• Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)
  – 16S rRNA (Bacteria and Archaea)
  – 18S rRNA (Eukaryotes)

Protein structures
Transcription factor binding sites (TFBS)

Metabolic pathways (etc.)

Scientific literature

Protein interactions

Using databases reproducibly in science
• Databases are not static, but are constantly updated
• Thus, every entry (sequence, protein, structure, function, etc.) in a database has a unique identifier
  – Sometimes identifiers are changed in new versions of the database
• Entries can be retrieved from the database
  – Search possibilities depend on the database
  – Often the complete database can be downloaded for large-scale analyses
  – When you access a database, note the version of the database or the date of accessing the database for reproducible science!
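One simple way to follow the advice above is to store a small provenance record next to every downloaded entry. The sketch below is only an illustration: the function name, the field names, the database name `ExampleDB` and the identifier `EX000001` are all hypothetical placeholders, not real database conventions.

```python
import json
from datetime import date


def provenance_record(database, identifier, version=None, accessed=None):
    """Describe where an entry came from and when it was retrieved."""
    if accessed is None:
        accessed = date.today()  # default to the day of access
    return {
        "database": database,
        "identifier": identifier,
        "version": version,                # database release, if it has one
        "accessed": accessed.isoformat(),  # date of access, e.g. 2015-03-26
    }


# Hypothetical database name, identifier and release label.
record = provenance_record(
    "ExampleDB", "EX000001", version="release 1", accessed=date(2015, 3, 26)
)
print(json.dumps(record, indent=2))
```

Keeping both the version and the access date covers databases that publish numbered releases as well as those that are updated continuously.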