I.33 Looking back at part 1: Key items
- Profiles provide a reasonable estimate of the potential for a TF to bind to a sequence in vitro (i.e. in the lab)
- In vitro binding is not predictive of in vivo function (i.e. in the cell)
- Prediction of promoters with CpG islands is useful, but detection of the other 50% of promoters is poor
- There are two reasonable methods to improve the prediction of individual TF binding sites
  - Phylogenetic footprinting identifies sites conserved across evolution, improving specificity by an order of magnitude in the best cases
  - Analysis of clusters of TFBS for biologically linked TFs can improve specificity by two orders of magnitude
Analysis of regulatory sequences controlling the expression of biological networks
Part II: Discovery of novel regulatory controls for co-expressed genes
Wyeth Wasserman, Albin Sandelin, Boris Lenhard
II.1 Tutorial AM4: Analysis of regulatory sequences controlling the expression of biological networks
A. Introduction: the problem
B. Selection of promoter sequences
C. Algorithms for pattern discovery
D. Enhancing pattern discovery
II.2 Approaches in promoter analysis
- Predicting binding sites for factor X in promoter Y: sliding models over sequences (pattern detection)
- Analyzing shared properties in promoter sets: over-representation (pattern discovery)
[Figure: the sequence ATGCTATAGTGTGCACGATCGATGCTAGTGCATCAA shown with the patterns CAGCTG, CAGCTG and TGTCGGGAA]
II.3 The problem
Given a set of "co-regulated" genes, define motifs over-represented in the regulatory regions
Definitions
- Co-regulation: Genes with similar expression patterns resulting from the influence of one or more common control mechanisms
II.4 Gene Networks
- What are gene networks?
- How are gene networks defined?
II.5 Definitions of Gene Networks in Genomics Data (1)
Why is co-regulation occurring? Why is it frequent?
- Large protein systems are often activated simultaneously
- Evolution works by modifying existing systems
[Figure: the cell cycle, labelled G1, DNA duplication, G2 and nuclear division]
II.6 Definitions of Gene Networks in Genomics Data (2)
Systems potentially co-regulated:
- Protein complexes
- Pathways
- Proteins associated with a specific process
- Developmental program
II.7 Selection of Promoter Sequences for Analysis
How do we define our set of promoters?
The good, the bad and the ugly:
- Benchwork
- Microarrays
- Other...
II.8 Selection of Promoter sequences for analysis (1)
Microarrays
- Provide 'snapshots' of mRNA levels in the cell
- mRNA distributions for certain genes are related
- Cluster genes that are expressed in similar fashion
- Concept: co-expressed genes are believed to be co-regulated
II.9 Selection of Promoter sequences for analysis (2)
Pros:
- Wealth of data
Cons:
- Often noisy
- Expensive
- Secondary effects a problem
- Coexpression != Co-regulation
[Figure: relative concentrations of microarrays in ISMB abstracts, 2000 to 2002]
II.10 Selection of Promoter sequences for analysis (3)
Other approaches:
- Literature-based selection
- Chromatin immunoprecipitation
- Green Fluorescent Protein based approaches
II.11 Selection of Promoter sequences for analysis (4)
Online Resources
Expression databases, general:
- NCBI Gene Expression Omnibus
- EMBL ArrayExpress
- Stanford Microarray Database
II.12 Selection of Promoter sequences for analysis: Key items
- Functionally related proteins are often co-regulated
- By analyzing co-regulated genes, we aim to find shared regulatory signals
- Selection of co-regulated genes is non-trivial. High throughput methods tend to be too noisy for our needs
- Filtering selected sequences using complementary data is often beneficial
II.13 Methods for Pattern Discovery
- Word-based vs matrix-based
- Exhaustive
- Probabilistic
- Enhancements
Methods for Pattern Discovery
Example site: AAGTTAATGATTAAC
Word-based (TFBS are words)
- Pros: Words are easily counted. Realistic complexity. Based on well-understood statistics
- Cons: TF binding properties are often degenerate
Matrix-based (TFs do not bind to words)
- Pros: Matrix models are more accurate descriptions of binding preferences
- Cons: Unrealistically expensive in computer time (if optimals are sought)
II.14 Exhaustive Methods for Pattern Discovery
- What is an exhaustive method?
- Types of exhaustive methods
  - Word based
  - Matrix based
II.15 Exhaustive methods (1)
Computer science: Exhaustive algorithm: all possible solutions are evaluated; often VERY CPU intensive for large input data
In this context: Count all possible motifs/words. Analyze over-representation
II.16 Exhaustive methods (2)
Word based methods: How likely are X words in a set of sequences, given sequence characteristics?
CCCG CCGGAA TGAAATCTGATTGACATTTTCC
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
TTCAAATTTTAACG CCGGAA TAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TCGCTGTAA CCGGAA TATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
GCTTATCAATGCGC CCGGAA TAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATGGTCTCTA CCGGAA AGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
TTTCAAAT CCGGAA TTTCCAC CCGGAA TTACT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCCTTCTTC CCGGAA TCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACA CCGGAA
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
TCAATG CCGGAA TTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B; range
II.17 Exhaustive methods (3)
Over-representation
How many words of type 'AGGAGTGA' are found in our sequences?
\[ P[w \text{ begins in } i] = \prod_{j=1}^{k} p(a_j) \]
\[ E[X_w] = (n - k + 1) \prod_{j=1}^{k} p(a_j) \]
How likely is this result?
\[ Z_w = \frac{X_w - E[X_w]}{\sqrt{\mathrm{Var}[X_w]}} \]
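The over-representation score above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the background is a uniform, independent-nucleotide model chosen for the example, the variance is approximated by the binomial variance (which ignores overlaps between neighbouring words and is not given on the slide), and the function names `word_prob`, `count_word` and `z_score` are hypothetical.

```perl
use strict;
use warnings;

# Probability that word $w starts at a given position under an
# independent-nucleotide background: the product of p(a_j).
sub word_prob {
    my ($w, $bg) = @_;
    my $p = 1;
    $p *= $bg->{$_} for split //, uc($w);
    return $p;
}

# Count occurrences of $w in $seq; overlapping occurrences are counted.
sub count_word {
    my ($w, $seq) = @_;
    ($w, $seq) = (uc($w), uc($seq));
    my ($n, $pos) = (0, 0);
    while (($pos = index($seq, $w, $pos)) >= 0) {
        $n++;
        $pos++;
    }
    return $n;
}

# Z-score of the observed count against E[X_w] = (n - k + 1) * P.
# ASSUMPTION: Var[X_w] is approximated by E[X_w] * (1 - P), the binomial
# variance, which ignores dependence between overlapping positions.
sub z_score {
    my ($w, $seq, $bg) = @_;
    my $positions = length($seq) - length($w) + 1;
    my $p   = word_prob($w, $bg);
    my $exp = $positions * $p;
    my $var = $exp * (1 - $p);
    return (count_word($w, $seq) - $exp) / sqrt($var);
}

my %bg  = (A => 0.25, C => 0.25, G => 0.25, T => 0.25);   # hypothetical background
my $seq = 'TTTCAAATCCGGAATTTCCACCCGGAATTACT';              # one sequence from slide II.16
printf "count=%d z=%.2f\n", count_word('CCGGAA', $seq), z_score('CCGGAA', $seq, \%bg);
```

In practice the background frequencies would be estimated from the sequences themselves or from a reference set, and a variance that accounts for word self-overlap would replace the binomial shortcut.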
II.18 Exhaustive methods (4)
Background properties
- Simple: How likely are single nucleotides? (extended Bernoulli)
- Complex:
  - Neglect certain words
  - Locations of TFBS
  - Higher-order descriptions of DNA
II.19 Exhaustive methods (5)
Find all words of length 7 in the yeast genome
Make a lookup table:
  AAACCTTT    456
  TTTTTTTT  57788
  GATAGGCA    589
  Etc...
[Figure: excerpt of yeast genomic sequence, beginning GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG]
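A word lookup table of this kind is simply a hash from each word to its count. The sketch below is a minimal illustration, not the tutorial's own implementation; the function name `kmer_table` is hypothetical and the input is only the first line of the sequence excerpt shown on the slide.

```perl
use strict;
use warnings;

# Build a lookup table (hash) mapping every word of length $k to the
# number of times it occurs in $seq; overlapping windows are counted.
sub kmer_table {
    my ($seq, $k) = @_;
    my %count;
    for my $i (0 .. length($seq) - $k) {
        $count{ uc(substr($seq, $i, $k)) }++;
    }
    return \%count;
}

my $table = kmer_table('GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG', 7);

# Show the five most frequent words, ties broken alphabetically.
my @words = sort { $table->{$b} <=> $table->{$a} || $a cmp $b } keys %$table;
print "$_\t$table->{$_}\n" for @words[0 .. 4];
```

For a whole genome the same loop applies, but the table has up to 4^k entries, which is why exhaustive counting is limited to short words.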
II.20 Exhaustive methods (6)
Matrix based methods
Aligned sites (upper case) with flanking sequence:
  cagagcgat AGGTCA acgataatat
  gcgatagca AGGTCG ccccgtatag
  aacttggtt AGGTCA ttagcgagta
  ggggatggg CCCTCA aatacgcgga
  aaccggaag GGTTCA acgatctatt
Count matrix (one column per site position):
  A  3 0 0 0 0 4
  C  1 1 1 0 5 0
  G  1 4 3 0 0 1
  T  0 0 1 5 0 0
= local multiple alignment
No current exhaustive methods, due to NP-completeness
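Once the sites are aligned, the count matrix follows by tallying each column. The following Perl sketch shows that step using the five sites from the slide; `build_pfm` is a hypothetical helper name, and the sketch assumes sites of equal length containing only A, C, G and T.

```perl
use strict;
use warnings;

# Build a position frequency (count) matrix from equal-length aligned sites.
# Returns a hash ref: base => array ref holding one count per column.
sub build_pfm {
    my @sites = map { uc($_) } @_;
    my $width = length $sites[0];
    my %pfm   = map { $_ => [ (0) x $width ] } qw(A C G T);
    for my $site (@sites) {
        die "sites must have equal length\n" unless length($site) == $width;
        my @bases = split //, $site;
        $pfm{ $bases[$_] }[$_]++ for 0 .. $width - 1;
    }
    return \%pfm;
}

my $pfm = build_pfm(qw(AGGTCA AGGTCG AGGTCA CCCTCA GGTTCA));
print "$_  @{ $pfm->{$_} }\n" for qw(A C G T);
```

Run on these five sites, the rows printed are the same counts as in the matrix above. The hard part of matrix-based discovery is finding the alignment, not building the matrix from it.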
II.21 Exhaustive methods (7)
Resources
- Moby Dick (Bussemaker et al) (not online)
- Dyad analysis (van Helden et al)
- YMF (Sinha and Tompa)
II.22 Exhaustive methods: Key items
- Algorithms with high complexity: large sequences and/or many possible word lengths are not possible
- Often word-based
- TFBS are not words ('fuzzy' binding)
- Sensitivity susceptible to noisy input data (e.g. microarrays)
II.23 Probabilistic Methods for Pattern Discovery
- What is a probabilistic method?
- The Gibbs sampler algorithm
- Improving background models
II.24 Probabilistic Methods for Pattern Discovery (1)
Computer science: Probabilistic algorithm: uses randomness
Bioinformatics: A probabilistic algorithm is often the same as a Monte Carlo algorithm: an approximation algorithm that is always fast but does not always give the best solution
II.25 Probabilistic Methods for Pattern Discovery (2)
Overview: Find a local alignment of width x of sites that maximizes information content in reasonable time
- Usually by Gibbs sampling or EM methods
Motivation:
- TFBS are not words
- Efficiency
- Can be intentionally influenced by biological data
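The quantity being maximized can be made concrete with a short Perl sketch. It assumes a uniform background (so a perfectly conserved column scores 2 bits) and applies no small-sample correction; both are simplifications chosen for the example, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Information content (bits) of one alignment column given its base counts,
# assuming a uniform background: IC = 2 + sum_b f_b * log2(f_b).
sub column_ic {
    my @counts = @_;
    my $total  = 0;
    $total += $_ for @counts;
    my $ic = 2;
    for my $c (grep { $_ > 0 } @counts) {
        my $f = $c / $total;
        $ic += $f * log($f) / log(2);
    }
    return $ic;
}

# Total information content of an alignment given as a list of sites.
sub alignment_ic {
    my @sites = @_;
    my $ic = 0;
    for my $col (0 .. length($sites[0]) - 1) {
        my %n = (A => 0, C => 0, G => 0, T => 0);
        $n{ uc(substr($_, $col, 1)) }++ for @sites;
        $ic += column_ic(@n{qw(A C G T)});
    }
    return $ic;
}

printf "%.2f bits\n", alignment_ic(qw(AGGTCA AGGTCG AGGTCA CCCTCA GGTTCA));
```

A pattern finder compares this score across candidate alignments; the probabilistic methods below search that space without enumerating it.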
II.26 Probabilistic Methods for Pattern Discovery (3)
The Gibbs Sampling algorithm
Example alignment of sites:
  tgacttcc
  tgatctct
  agacctca
  tgacctct
Two data structures used:
1) Current pattern nucleotide frequencies $q_{i,1}, \dots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \dots, p_{i,4}$
2) Current positions of site startpoints in the N sequences, $a_1, \dots, a_N$, i.e. the alignment that contributes to $q_{i,j}$. One starting point in each sequence is chosen randomly initially.
II.27 Probabilistic Methods for Pattern Discovery (4)
Iteration step
A. Remove one sequence z from the set. Update the current pattern according to
\[ q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B} \]
where $c_{i,j}$ is the count of symbol j at pattern position i in the remaining N - 1 sites, $b_j$ is the pseudocount for symbol j, and B is the sum of all pseudocounts in the column.
B. 'Score' the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on the respective score divided by the background model.
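One iteration of the sampler can be sketched in Perl as follows. This is an illustrative sketch rather than any published implementation: the pseudocounts, the uniform background and the left-out sequence are hypothetical values, and the helper names `update_pattern`, `position_weights` and `draw_position` are invented for the example.

```perl
use strict;
use warnings;

# Step A: q_{i,j} = (c_{i,j} + b_j) / (N - 1 + B), computed from the N-1
# sites that remain after one sequence has been set aside.
sub update_pattern {
    my ($sites, $pseudo) = @_;     # array ref of sites, hash ref of b_j
    my $B = 0;
    $B += $_ for values %$pseudo;
    my $width = length $sites->[0];
    my %q;
    for my $i (0 .. $width - 1) {
        my %c = map { $_ => 0 } keys %$pseudo;
        $c{ substr($_, $i, 1) }++ for @$sites;
        $q{$_}[$i] = ($c{$_} + $pseudo->{$_}) / (@$sites + $B) for keys %$pseudo;
    }
    return \%q;
}

# Step B, scoring: weight of every possible start position in the left-out
# sequence = pattern probability of the window / its background probability.
sub position_weights {
    my ($seq, $q, $bg, $width) = @_;
    my @w;
    for my $start (0 .. length($seq) - $width) {
        my $ratio = 1;
        for my $i (0 .. $width - 1) {
            my $base = substr($seq, $start + $i, 1);
            $ratio *= $q->{$base}[$i] / $bg->{$base};
        }
        push @w, $ratio;
    }
    return @w;
}

# Step B, sampling: draw an index with probability proportional to its weight.
sub draw_position {
    my @w = @_;
    my $total = 0;
    $total += $_ for @w;
    my $r = rand($total);
    for my $i (0 .. $#w) {
        return $i if ($r -= $w[$i]) < 0;
    }
    return $#w;    # guard against floating-point round-off
}

my %pseudo = map { $_ => 0.25 } qw(A C G T);   # hypothetical pseudocounts, B = 1
my %bg     = map { $_ => 0.25 } qw(A C G T);   # hypothetical uniform background
my @others = qw(TGATCTCT AGACCTCA TGACCTCT);   # sites in the N-1 other sequences
my $z      = 'GGTGACTTCCAA';                   # hypothetical left-out sequence

my $q   = update_pattern(\@others, \%pseudo);
my @w   = position_weights($z, $q, \%bg, 8);
my $new = draw_position(@w);
print "new start position in z: $new\n";
```

A full sampler repeats these steps, cycling through the sequences, until the alignment stops improving. Because the new position is drawn rather than taken as the maximum, the procedure can escape poor local alignments.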
II.28 Probabilistic Methods for Pattern Discovery (5)
Sensitivity weaknesses: 'Pattern drowning'
[Figure: pattern similarity vs. true MEF2 profile (y-axis) plotted against sequence length, 0 to 600 (x-axis), for true Mef2 binding sites]
II.29 Probabilistic Methods for Pattern Discovery (6)
Correction for background properties
- Workman & Stormo (ANN-Spec): Train on a background set as well to find 'commonly occurring' patterns. Maximization of the probability of finding the pattern in positive sequences and not in background sequences. In effect: try to discriminate between 'common' and 'novel' patterns
- Thijs et al, Bailey and Elkan: Markov background model describing DNA in m:th order
II.30 Probabilistic Methods for Pattern Discovery (7)
What is a higher-order background model?
Zero-order: with p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29,
\[ P(seq) = \prod_{i=1}^{N} P(nucleotide_i) \]
First-order: [Figure: diagram of base-to-base dependencies showing the bases G, A, A, T, A, C]
m:th-order: The chance of drawing base x is dependent on the identity of the previous m bases
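The difference between a zero-order and a higher-order background can be illustrated with a small Perl sketch. It estimates an m:th-order model from training DNA and scores a sequence with it. The training string, the function names and the decision to skip the first m bases (which lack a full context) are assumptions made for this example only.

```perl
use strict;
use warnings;

# Estimate an m-th order Markov model from training DNA: for every
# context of m bases, the probability of each following base.
sub train_markov {
    my ($seq, $m) = @_;
    my (%count, %total);
    for my $i ($m .. length($seq) - 1) {
        my $ctx = substr($seq, $i - $m, $m);
        $count{$ctx}{ substr($seq, $i, 1) }++;
        $total{$ctx}++;
    }
    my %prob;
    for my $ctx (keys %count) {
        $prob{$ctx}{$_} = $count{$ctx}{$_} / $total{$ctx} for keys %{ $count{$ctx} };
    }
    return \%prob;
}

# Log-probability of $seq under the model. The first m bases are skipped
# (a simplification); returns undef if a transition was never seen in training.
sub log_prob {
    my ($seq, $model, $m) = @_;
    my $lp = 0;
    for my $i ($m .. length($seq) - 1) {
        my $ctx  = substr($seq, $i - $m, $m);
        my $base = substr($seq, $i, 1);
        my $p = exists $model->{$ctx} ? $model->{$ctx}{$base} : undef;
        return undef unless $p;
        $lp += log($p);
    }
    return $lp;
}

my $train = 'ACGTACGTACGT';    # hypothetical, strongly periodic training DNA
for my $m (0, 1) {
    my $model = train_markov($train, $m);
    printf "order %d: P(ACGT) = %.4f\n", $m, exp(log_prob('ACGT', $model, $m));
}
```

On this deliberately periodic training string the zero-order model sees four equally likely bases, while the first-order model learns that each base fully determines the next, so it assigns the same test sequence a much higher probability. Real background DNA is less extreme, but the same effect is what lets higher-order models explain away commonly occurring words.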
II.31 Probabilistic Methods for Pattern Discovery (8)
Online resources
- Gibbs Motif Sampler (Lawrence et al)
- MEME (Bailey and Elkan)
- AnnSpec (Workman and Stormo)
- AlignAce (Roth et al)
II.32 Probabilistic Methods for Pattern Discovery: Key items
Gibbs Sampling/EM algorithms
- Complexity is moderate. Optimality not guaranteed.
- Low sensitivity: patterns 'drown' in large sequences (~>500 bp)
- Sensitivity susceptible to noisy input data (e.g. microarrays)
II.33 Enhancing pattern discovery sensitivity
- Cross-species comparison
- Modelling of pattern constraints
  - information content
  - structural constraints
  - palindromicity
- Usage of prior knowledge
II.34 Enhancing pattern detection sensitivity (1)
Search only where TFBS are likely: Cross-species comparison
- Use as filtering
- Or include orthologous sequences in analysis
II.35 Enhancing pattern detection sensitivity (2)
TFBSs are not randomly drawn
- Information segmentation: Information content distributions of TFBS are distinctly non-random (Wasserman et al 2000)
- Palindromicity, dyads (van Helden et al 2000)
- Variable gaps (Hu 2003)
II.36 Enhancing pattern detection sensitivity (3)
Building biological knowledge into pattern finding: priors
How do priors work?
- Essentially by increasing the pseudocounts by some fraction submitted in the prior
- Example: According to our prior knowledge, a certain residue is an A in 47/100 cases. New pseudocount for A at the first residue: 47/100 x k x number of sites
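The prior-to-pseudocount conversion can be sketched in Perl as below. Only the value 0.47 for A comes from the example above; the remaining prior frequencies, the weight k = 0.5, the observed counts and the function names are hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;

# Turn prior base frequencies for one column into pseudocounts:
# b_j = prior_j * k * N, with N the number of sites and k a user-chosen weight.
sub prior_pseudocounts {
    my ($prior, $k, $n_sites) = @_;
    my %pc;
    $pc{$_} = $prior->{$_} * $k * $n_sites for keys %$prior;
    return \%pc;
}

# Column frequencies with pseudocounts added: (c_j + b_j) / (N + B).
sub column_freqs {
    my ($counts, $pseudo) = @_;
    my ($N, $B) = (0, 0);
    $N += $_ for values %$counts;
    $B += $_ for values %$pseudo;
    my %f;
    $f{$_} = ($counts->{$_} + $pseudo->{$_}) / ($N + $B) for keys %$counts;
    return \%f;
}

my %prior  = (A => 0.47, C => 0.20, G => 0.18, T => 0.15);   # only A = 0.47 is from the slide
my %counts = (A => 2, C => 1, G => 1, T => 1);               # hypothetical observed counts
my $pseudo = prior_pseudocounts(\%prior, 0.5, 5);
my $freqs  = column_freqs(\%counts, $pseudo);
printf "%s %.3f\n", $_, $freqs->{$_} for sort keys %$freqs;
```

The weight k controls how strongly the prior pulls the column towards the expected base: with k = 0 the pattern is driven by the data alone, while a large k lets the prior dominate a small set of sites.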
II.37 Enhancing pattern detection sensitivity (4)
'Biasing' a probabilistic pattern finder with prior knowledge: an unexplored area
Examples
- Structural constraints (example: HLH factors have certain shared binding preferences)
- Information content 'landscape'
II.38 Enhancing pattern detection sensitivity (5)
Example of enhanced sensitivity using biologically based priors
[Figure: comparison of 'use prior', 'no prior' and 'background (no sites)']
II.39 Enhancing pattern detection sensitivity: Key items
Difficulties in pattern detection are attributed to:
- Low signal strength
  - Cross species comparison as filters
- Complexity of background DNA
  - New >0 order models
- Simplified models of TFBS properties
  - Dyads, palindromes
  - Segmentation of information content distributions
  - Usage of biological knowledge as priors
II.40 Evaluation of Patterns
- How relevant is our new pattern?
- Algorithms for pattern comparison
II.41 Evaluation of patterns (1)
Pattern finders can generally NOT distinguish between patterns and over-represented 'junk'
Patterns:
- Q: How do we know if a pattern is true? A: In the lab!
- Q: How can we avoid labwork? A: Compare to already known patterns!
The sequence analogy:
- Q: How do we know a function of a gene? A: In the lab!
- Q: How can we avoid labwork? A: Compare to already known sequences!
II.42 Evaluation of patterns (2)
Algorithms for pattern comparison
- Hughes et al: Based on protein BLOCKS alignment algorithm (Pietrokovski)
- Sandelin & Wasserman: Needleman-Wunsch variant
II.43 Evaluation of patterns (3)
Online Resources
- CompareAce (Hughes et al)
- JASPAR (Sandelin & Wasserman)
- Integrated systems (yeast): Atlas, YRSA
II.44 Looking back at part II:
- Pattern discovery is based on finding shared TFBS for co-regulated genes
- Co-regulated sequences can be based on different experiments and different clustering analysis algorithms
- Algorithms for pattern discovery can be
  - exhaustive
  - probabilistic
- Enhanced sensitivity can be achieved by
  - Cross-species comparison
  - Information segmentation
  - Biologically based priors
- New patterns can be compared to verified patterns
Analysis of regulatory sequences controlling the expression of biological networks
Part III: Programming Resources for the Analysis of Regulatory Sequences
Wyeth Wasserman, Albin Sandelin, Boris Lenhard
III.1 Section 3.1 Orientation
- Objectives
- TFBS: basic features
- Prerequisites for this part of the tutorial
III.2 3.1 Orientation
Objectives
- Purpose: Introduction to the tools and "discipline" of practical analysis of regulatory regions and networks
- After mastering the material presented here, you should be able to script and automate computational methods for the analysis of regulatory sequences.
- More advanced Perl users will be provided with enough information to contribute extensions and enhancements to the existing computational framework for regulatory sequence analysis.
III.3 3.1 Orientation
Regulatory regions problem space
[Figure: the problem space, with the labels: regulatory nucleotide sequences; transcription factor binding sites; transcription factors (URF, Pol-II) at the elements URE and TATA; clusters of binding sites; sets of binding sites; sets of specificity profiles for binding sites]
Example specificity profile shown in the figure:
  A [ -2     0     -2     -0.415  0.585 -2     -2  2.088 -2     -2     -1      0.585 ]
  C [  1     0.585  0      0     -1     -2     -2 -2      2.088 -2      0.585  0.807 ]
  G [  0.585 0.322  0.807  1.585  1     -2      2 -2     -2      2.088 -2      0     ]
  T [  0.319 0.322  1     -2      0      2.088 -1 -2     -2     -2      1.459 -0.415 ]
Example set of binding sites shown in the figure:
  AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA
  AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT
  GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
III.4 3.1 Orientation
TFBS: basic features
- A computational framework for transcription factor binding site analysis and manipulation
- Implemented in Perl
- Models and manipulates common objects from the regulatory regions object space
  - Patterns representing TF specificity
  - Sets of patterns
  - Detected binding sites
  - Sets of detected binding sites
  - Unified DB interface to pattern databases
  - Pattern generators
- Enables easy scripting and automation of regulatory region analysis
III.5 3.1 Orientation
Prerequisites for full benefit from this part of the tutorial
- Several concepts covered in parts I and II
  - Profile matrices (PFM, ICM, PWM)
  - Phylogenetic footprinting
- Basic to intermediate knowledge of Perl programming
  - General syntax and semantics
  - References and complex data structures
- Basic knowledge of Unix OSs, and access to one, is needed at the moment
III.6 Section 3.2 Perl and Bioperl
- Perl and Perl objects
- Bioperl
- Selected Bioperl objects
III.7 3.2 Perl and Bioperl
Bioinformatics problem|object space
- Examples: BLAST and GenBank
- Combination of hierarchical features and sequential elements: can be used to represent most complex data entities encountered in bioinformatics
[Figure: a TBLASTN 2.2.3 BLAST report (query of 958 letters against a database of 3265 sequences) with its Hit and HSP levels marked, shown next to the GenBank record NM_006513 (Homo sapiens seryl-tRNA synthetase (SARS), mRNA, 1942 bp) with its Annotation, feature and Sequence parts marked]
III.8 3.2 Perl and Bioperl
Perl and Perl objects (1/4)
- Programmer's efficiency
  - For most tasks in bioinformatics, our time is much more valuable than CPU time
- Highly suitable for biological sequence (string) manipulation
  - Built-in regular expressions
- Abundance of existing, reusable code
  - Bioperl
  - Numerical methods
  - Web programming
- Objects make programming faster and easier
  - Complexity hiding, Abstraction, Code reuse
III.9 3.2 Perl and Bioperl
Perl and Perl objects (2/4)
Creating and using an object: an example of Bio::Seq
    use Bio::Seq;                        # import the module
    my $seqobj = Bio::Seq->new(-id  => "MYSEQ_001",
                               -seq => "ACGCTAGGGATGGATAGGGATGGA");
    print $seqobj->seq;                  # prints "ACGCTAGGGATGGATAGGGATGGA"
    print $seqobj->id;                   # prints "MYSEQ_001"
    my $revseqobj = $seqobj->revcom;     # creates a new Bio::Seq object with the
                                         # reverse complement of the original sequence
    print $revseqobj->seq;               # prints "TCCATCCCTATCCATCCCTAGCGT"
III.10 3.2 Perl and Bioperl
Perl and Perl objects (4/4)
Bio::Seq object $seqobj: "logical" structure
- primary sequence (basic info): Bio::PrimarySeq with alphabet "dna", display_id "MYSEQ_001", seq "ACGCTAGGGATGGATAGGGATGGA"
- annotation collection: Bio::Annotation::Collection (empty)
- list of sequence features: Bio::SeqFeature::Exon, Bio::SeqFeature::Exon, Bio::SeqFeature::Generic, Bio::SeqFeature::CDS, ...
$seqobj: internal structure (good news: you don't have to care about it)
    Bio::Seq=HASH(0x8330dcc)
       '_as_feat' => ARRAY(0x831f6f4)
            empty array
       '_root_verbose' => 0
       'annotation' => Bio::Annotation::Collection=HASH(0x831f850)
          '_annotation' => HASH(0x831f634)
               empty hash
          '_root_verbose' => 0
          '_typemap' => Bio::Annotation::TypeManager=HASH(0x831f928)
             '_root_verbose' => 0
             '_type' => HASH(0x831f79c)
                'comment' => 'Bio::Annotation::Comment'
                'dblink' => 'Bio::Annotation::DBLink'
                'reference' => 'Bio::Annotation::Reference'
       'primary_seq' => Bio::PrimarySeq=HASH(0x831f6e8)
          '_root_verbose' => 0
          'alphabet' => 'dna'
          'display_id' => 'MYSEQ_001'
          'seq' => 'ACGCTAGGGATGGATAGGGATGGA'
All the components are accessed via specific methods:
    $seqobj->each_SeqFeature   # returns the list of Bio::SeqFeature objects
    $seqobj->seq               # returns the string "ACGCTAGGGATGGATAGGGATGGA"
    $seqobj->annotation        # returns the Bio::Annotation::Collection object
    $seqobj->display_id        # returns the string "MYSEQ_001"
III.11 3.2 Perl and Bioperl
BioPerl (1/5)
- A library of Perl modules for managing and manipulating life-science information
  - Biological sequences and alignments
  - Reading and writing to files
  - Retrieval from existing databases
  - Running blast searches and the processing of results
  - Much, much more....
- Reduces otherwise complex tasks to only a few lines of code
- http://bioperl.org
III.12 3.2 Perl and Bioperl
BioPerl (2/5)
Some of the most frequently used classes of Bioperl objects
- Bio::Seq: seq(), length(), start(), end(), strand(), desc(), revcom(), trunc(), id(), accession_number(), species(), annotation(), all_SeqFeatures(), add_SeqFeature(), ...
- Bio::DB::EMBL: get_Seq_by_acc(), get_Seq_by_id(), get_Seq_by_gi(), get_Stream_by_acc(), get_Stream_by_id(), get_Stream_by_gi(), proxy(), ua(), ...
- Bio::Tools::Run::StandAloneBlast: exists_blast(), blastall(), blastpgp(), bl2seq()
- Bio::SeqFeature::Generic: location(), start(), end(), length(), strand(), score(), frame(), add_tag_value(), each_tag_value(), ...
- Bio::SeqIO: next_seq(), write_seq()
III.13 3.2 Perl and Bioperl
BioPerl (3/5)
Examples of bioperl usage (adapted from Stajich et al., Genome Res. 2002)
To retrieve a sequence from EMBL and print it in GenBank format:
    use Bio::DB::EMBL;                                     # import the required modules
    use Bio::SeqIO;
    my $db     = Bio::DB::EMBL->new();                     # $db is a Bio::DB::EMBL object
    my $seqobj = $db->get_Seq_by_acc("U14680");            # $seqobj is a Bio::Seq object
    my $seqout = Bio::SeqIO->new(-format => "genbank");    # $seqout is a new Bio::SeqIO object
    if (defined $seqobj) {                                 # "if the database retrieval has succeeded"
        $seqout->write_seq($seqobj);                       # write sequence data from $seqobj
    }                                                      # to STDOUT (default) in genbank format
III.14 3.2 Perl and Bioperl
BioPerl (3/5)
Examples of bioperl usage
Hierarchical structure of a BLAST report is decomposed into a hierarchy of objects:
- Bio::SearchIO::Result::BLAST
  - contains Bio::Search::Hit::BlastHit, Bio::Search::Hit::BlastHit, Bio::Search::Hit::BlastHit, ...
- Bio::SearchIO::Hit::BlastHit
  - contains Bio::Search::HSP, Bio::Search::HSP, Bio::Search::HSP, ...
- Bio::SearchIO::Hit::HSP
  - contains Bio::PrimarySeq (alphabet "dna", display_id "MYSEQ_001", seq "ACGCTAGGGATGGATAGGGATGGA")
[Figure: the objects are drawn with name, description and seq attributes; values shown include name "M18833" and description "Homo sapiens presenilin B"]