
CSI5126. Algorithms in bioinformatics: Deterministic Sequence Motifs

CSI5126. Algorithms in bioinformatics. Deterministic Sequence Motifs. Marcel Turcotte, School of Electrical Engineering and


  1. Issues. How to represent patterns? How to search for a pattern? How to discover patterns automatically? Let's distinguish between two kinds of motifs/patterns: deterministic and probabilistic. Reference: Brazma, A., Jonassen, I., Eidhammer, I. & Gilbert, D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol 5, 279–305 (1998).

  2. How to define a motif? The most basic pattern is a substring (aka rigid pattern). We have seen algorithms to process strings: exact and approximate string matching. Search algorithm: fast algorithms exist to check for the presence of a motif, Boyer & Moore for example; Motif discovery: the longest common substring of K strings can be found with the help of generalized suffix trees; Mismatches can be allowed (mismatch check algorithm); Insertions/deletions and a weighted alphabet scoring scheme (string edit distance) are also possible. ⇒ BLOCKS and PRINTS are examples of databases that contain substrings.

  3. Automated approaches to detect conserved substrings. Overrepresented l-mers: find an efficient algorithm to enumerate conserved or overrepresented l-mers (l-words appearing k times in the input string (genome), or l-words appearing in at least k input strings (genes)). What are the pros/cons of these approaches (or representation)?
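As a minimal sketch of the second variant (l-words appearing in at least k input strings), one could count each l-mer once per sequence. The function name and the toy sequences are illustrative assumptions, not part of the course material, and this direct enumeration is not claimed to be the efficient algorithm the slide asks for.

```python
from collections import Counter


def lmers_in_at_least_k(sequences, l, k):
    """Return the l-mers that occur in at least k of the input sequences."""
    counts = Counter()
    for s in sequences:
        s = s.upper()
        # A set ensures each l-mer is counted at most once per sequence.
        counts.update({s[i:i + l] for i in range(len(s) - l + 1)})
    return sorted(word for word, n in counts.items() if n >= k)


# Hypothetical toy input, for illustration only.
toy = ["ACGTAC", "TTACGA", "GGGACG"]
print(lmers_in_at_least_k(toy, 3, 3))
print(lmers_in_at_least_k(toy, 3, 2))
```

The drawback raised on the next slide applies here: an l-mer must occur exactly, so a motif whose instances each carry a mismatch can be missed.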

  4. Motifs are not 100 % conserved. The motif instances are shown in upper case:
     attgcgggacgcggCGCATTCCgaaacggaagccgatgat
     agctctccgggactcgtagccaaCGGATCCGaatctagataatagtggcaatca
     atgtcgactacgcaggttCGCATCGCaaacagcccggga
     ttacgagtagcctctgaaactcCGCATCCGtaagggtgccaagaattaagt
     gacatcacactacgCGCACCCCacgtgtatttctt
     atgggacggcgtacggCACATCCCtctttgcgaggcg
     catttgtaattgtggaccacCACATCCCctagacaccagatacgcgg
     agggtcgcgtactgtaagCGCATCGCgagtgcaaagatgaaa
     gtcgtttaaacagTGCATCCGaaccgcagccgtag
     tggtaccgacccccTGCATCCCgtgagtgtaattcaattta
     Consensus: CGCATCCC
     Here, the consensus sequence is not found in any of the input sequences.
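The consensus can be checked with a short sketch that takes the most frequent base in each column of the ten upper-case instances. The function name is an assumption made for illustration; ties, which do not occur in this example, would be broken arbitrarily.

```python
from collections import Counter

# The ten upper-case motif instances from the slide.
instances = [
    "CGCATTCC", "CGGATCCG", "CGCATCGC", "CGCATCCG", "CGCACCCC",
    "CACATCCC", "CACATCCC", "CGCATCGC", "TGCATCCG", "TGCATCCC",
]


def consensus(aligned):
    """Most frequent symbol in each column of equal-length strings."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned))


print(consensus(instances))               # CGCATCCC
print(consensus(instances) in instances)  # False
```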

  5. Inferring motifs automatically: Median string problem. Input: K input sequences and the length l of the motif to be found. Problem: find a string v (of length l) minimizing $\sum_{k=1}^{K} \min_{i_k \in [1..|S_k|]} d_{\mathrm{Hamming}}(v, S_k[i_k, i_k + l - 1])$.

  6. Inferring motifs automatically: Median string problem (cont.) Sequence 1: attgcgggacgcggCGCATTCCgaaacggaagccgatgat. [Figure: the candidate CGCATCCC is slid along the sequence; the Hamming distances at successive offsets are 8, 8, 5, 7, ..., 1, ...]

  7. Inferring motifs automatically: Median string problem (cont.) Sequence K: agctctccgggactcgtagccaaCGGATCCGaatctagataatagtggcaatca. [Figure: the candidate CGCATCCC is slid along the sequence; the Hamming distances at successive offsets are 4, 6, 5, 8, ..., 2, ...]

  8. Inferring motifs automatically: Median string problem (cont.) Given v, calculating $\sum_{k=1}^{K} \min_{i_k \in [1..|S_k|]} d_{\mathrm{Hamming}}(v, S_k[i_k, i_k + l - 1])$.
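A minimal sketch of this computation, for a given candidate v, is shown below. The function names are assumptions; offsets are 0-based and restricted to windows that fit entirely inside the sequence. The two sequences are sequence 1 and sequence K from the preceding slides.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(v, s):
    """Smallest Hamming distance between v and any window of s of length len(v)."""
    s = s.upper()
    l = len(v)
    return min(hamming(v, s[i:i + l]) for i in range(len(s) - l + 1))


def total_distance(v, sequences):
    """Sum over all sequences of the best (minimum) window distance."""
    return sum(min_distance(v, s) for s in sequences)


seq_1 = "attgcgggacgcggCGCATTCCgaaacggaagccgatgat"
seq_k = "agctctccgggactcgtagccaaCGGATCCGaatctagataatagtggcaatca"
print(min_distance("CGCATCCC", seq_1))             # 1
print(min_distance("CGCATCCC", seq_k))             # 2
print(total_distance("CGCATCCC", [seq_1, seq_k]))  # 3
```

The cost for one candidate is proportional to l times the total length of the input sequences.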

  9. Inferring motifs automatically: Median string problem (cont.) However, there are $4^l$ choices of v. For small values of l, an exhaustive search can be considered; for instance, there are 65,536 8-mers.

  10. Exhaustive search. [Figure: search tree over the alphabet {A, C, G, T}; level v(1) has 4 nodes, level v(2) has 16 nodes, level v(3) has 64 nodes.]

  11. Branch-and-bound. Set best to ∞. Traverse the search tree (depth-first using a stack or best-first using a priority queue). If the current node is a leaf and the total distance between the motif represented by the leaf and the K input sequences is less than best, then set best to the score of this motif and memorize the current motif. If the current node is an internal node and its total distance is larger than best, then prune this sub-tree.
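The procedure above can be sketched as a depth-first traversal with an explicit stack. This is an illustrative sketch under stated assumptions, not the course's reference implementation: the function names are invented, every sequence is assumed to be at least l long, and the bound used is the total distance of the current prefix, which can never exceed the total distance of any of its extensions.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def total_distance(v, sequences):
    """Sum over sequences of the minimum window distance to v."""
    l = len(v)
    return sum(min(hamming(v, s[i:i + l]) for i in range(len(s) - l + 1))
               for s in sequences)


def median_string(sequences, l, alphabet="ACGT"):
    """Depth-first branch-and-bound search for a string of length l
    minimizing the total distance to the input sequences."""
    sequences = [s.upper() for s in sequences]
    best, best_motif = float("inf"), None
    stack = [""]                        # the root is the empty prefix
    while stack:
        prefix = stack.pop()
        d = total_distance(prefix, sequences) if prefix else 0
        if len(prefix) == l:            # leaf: a complete candidate motif
            if d < best:
                best, best_motif = d, prefix
        elif d > best:                  # internal node: prune this sub-tree
            continue
        else:
            # Push in reverse so that children are visited in alphabet order.
            for symbol in reversed(alphabet):
                stack.append(prefix + symbol)
    return best_motif, best


# Hypothetical toy input, for illustration only.
print(median_string(["ttACGTtt", "ggACGTcc", "ACGTaaaa"], 4))
```

Replacing the stack by a priority queue keyed on the prefix distance would give the best-first variant mentioned on the slide.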

  12. Branch-and-bound (cont.) How to improve this approach? Finding more aggressive bounds. References: L. Marsan, M.-F. Sagot (2000) Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7(3-4):345–62. E. Eskin, P.A. Pevzner (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics 18 Suppl 1:S354-63.

  13. Branch-and-bound. [Figure: search tree over the alphabet {A, C, G, T}; level v(1) has 4 nodes, level v(2) has 16 nodes, level v(3) has 64 nodes.]

  14. Practical application: PRINTS. “PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family”; Release 39.0 of PRINTS (02.02.2009) contains 1950 entries; bioinf.man.ac.uk/dbbrowser/PRINTS/ Reference: Attwood, T.K., Mitchell, A., Gaulton, A., Moulton, G. & Tabernero, L. (2006) The PRINTS protein fingerprint database: functional and evolutionary applications. In Encyclopaedia of Genetics, Genomics, Proteomics and Bioinformatics, M. Dunn, L. Jorde, P. Little & A. Subramaniam (Eds.). John Wiley & Sons.

  15. Practical application: PRINTS (cont.)

  16. Practical application: PRINTS (cont.)

  17. Practical application: PRINTS (cont.)

  18. Practical application: PRINTS (cont.)

  19. PRINTS OPSIN Entry. The degree of conservation along a multiple sequence alignment (MSA) varies; An MSA often consists of a number of blocks with a high degree of conservation, interspersed by more variable regions; Each entry in PRINTS consists of a collection of ungapped, unweighted local alignments; In PRINTS, 3 conserved segments of the OPSIN alignment serve to represent the OPSIN motif.

  20. PRINTS Entry/Header.
     OPSIN. Opsin signature. Type of fingerprint: COMPOUND with 3 elements.
     Links:
     PRINTS; PR00237 GPCRRHODOPSN; PR00247 GPCRCAMP; PR00248 GPCRMGR
     PRINTS; PR00249 GPCRSECRETIN; PR00250 GPCRSTE2; PR00899 GPCRSTE3
     PRINTS; PR00251 BACTRLOPSIN
     PRINTS; PR00574 OPSINBLUE; PR00575 OPSINREDGRN; PR00576 OPSINRH1RH2
     PRINTS; PR00577 OPSINRH3RH4; PR00578 OPSINLTRLEYE; PR01244 PEROPSIN
     PRINTS; PR00666 PINOPSIN; PR00579 RHODOPSIN; PR00239 RHODOPSNTAIL
     PRINTS; PR00667 RPERETINALR
     INTERPRO; IPR001760
     PROSITE; PS00238 OPSIN
     BLOCKS; BL00238
     Creation date 20-DEC-1993; UPDATE 22-JUN-1999
     Visual pigments are the light-absorbing molecules that mediate vision [1,2]. They comprise an apoprotein (opsin), covalently linked to the chromophore (...) cis-retinal. Vision is effected through the absorption of a photon by the chromophore, which is isomerised to the all-trans form, promoting a conformational change in the protein. ...

  21. PRINTS Entry/Diagnostic. SUMMARY INFORMATION: 123 codes involving 3 elements; 7 codes involving 2 elements. COMPOSITE FINGERPRINT INDEX: [Table: number of codes matching each of the 3 elements; the 3-element row reads 123, 123, 123.] True positives: OPSD_CHICK OPSD_CANFA OPSD_RABIT OPSD_MOUSE OPSD_CRIGR OPSD_PIG OPSD_MACFA OPSD_TRIMA ....

  22. PRINTS Entry/Motifs. OPSIN1. Length of motif = 13. Motif number = 1. Opsin motif I - 1.
     Sequence       PCODE        ST   INT
     YIFATTKSLRTPA  OPS1_DROME   73   73
     AATMKFKKLRHPL  OPSR_HUMAN   76   76
     AATMKFKKLRHPL  OPSG_HUMAN   76   76
     YVTVQHKKLRTPL  OPSD_SHEEP   60   60
     YVTVQHKKLRTPL  OPSD_HUMAN   60   60
     YVTVQHKKLRTPL  OPSD_BOVIN   60   60
     VATLRYKKLRQPL  OPSB_HUMAN   57   57
     ...
     YLFTKTKSLQTPA  OPSD_LOLFO   57   57
     YLFSKTKSLQTPA  OPSD_OCTDO   58   58
     WIFSTSKSLRTPS  OPS4_DROME   77   77
     WVFSAAKSLRTPS  OPS3_DROME   81   81
     YIFGGTKSLRTPA  OPS2_DROME   80   80

  23. PRINTS Entry/Motifs. OPSIN2. Length of motif = 13. Motif number = 2. Opsin motif II - 1.
     Sequence       PCODE        ST   INT
     GWSRYVPEGNLTS  OPS1_DROME   187  101
     GWSRYWPHGLKTS  OPSR_HUMAN   190  101
     GWSRYWPHGLKTS  OPSG_HUMAN   190  101
     GWSRYIPQGMQCS  OPSD_SHEEP   174  101
     GWSRYIPEGLQCS  OPSD_HUMAN   174  101
     GWSRYIPEGMQCS  OPSD_BOVIN   174  101
     GWSRFIPEGLQCS  OPSB_HUMAN   171  101
     ...
     GWGAYTLEGVLCN  OPSD_LOLFO   173  103
     NWGAYVPEGILTS  OPSD_OCTDO   174  103
     FWDRFVPEGYLTS  OPS4_DROME   190  100
     TWGRFVPEGYLTS  OPS3_DROME   194  100
     GWSAYVPEGNLTA  OPS2_DROME   194  101

  24. PRINTS Entry/Motifs. OPSIN3. Length of motif = 13. Motif number = 3. Opsin motif III - 1.
     Sequence       PCODE        ST   INT
     PLNTIWGACFAKS  OPS1_DROME   308  108
     PLMAALPAYFAKS  OPSR_HUMAN   301  98
     PLMAALPAFFAKS  OPSG_HUMAN   301  98
     PIFMTIPAFFAKS  OPSD_SHEEP   285  98
     PIFMTIPAFFAKS  OPSD_HUMAN   285  98
     PIFMTIPAFFAKT  OPSD_BOVIN   285  98
     LRLVTIPSFFSKS  OPSB_HUMAN   282  98
     ...
     PYAAQLPVMFAKA  OPSD_LOLFO   294  108
     PYAAELPVLFAKA  OPSD_OCTDO   295  108
     QGATMIPACTCKL  OPS4_DROME   313  110
     PGATMIPACACKM  OPS3_DROME   317  110
     PLTTIWGATFAKT  OPS2_DROME   315  108

  25. Deriving a motif. “(…) from a small multiple sequence alignment, conserved motifs are identified and excised manually for database searching (…)”; “Results are examined manually (…)”; “(…) if there are more matches than were in the initial alignment, the additional information from these new sequences is added to the motifs.”; “(…) the database is searched again.”; “This iterative process is repeated until no further complete fingerprint matches can be identified.”

  26. PRINTS: Summary. Pros: Since raw alignments are stored, they can be used to derive regular expressions, profiles, etc.; High signal-to-noise ratio (curated database); The combination of local motifs together with the iterative process helps to detect more remote homologues. Cons: High level of human intervention (construction/interpretation); Lack of a theory for composite motifs.

  27. Substring Motifs: Cons. Selecting appropriate parameters: number of mismatches, gap penalty, etc.; Pairwise sequence comparison might not be applicable: sequences do not align on their entire length or are too divergent; Sometimes we would like to emphasize that certain identities are mandatory (see the WW domain for instance).

  28. Motifs: Regular Expressions. Regular expressions are often used to represent key residues composing a motif. A large database of regular expressions exists: PROSITE. Methods have been developed to derive PROSITE signatures automatically: see PRATT (pattern driven) and eMOTIF (data driven). ⇒ Consult the appendix for a brief summary of regular expressions and finite state automata.

  29. How to? Most of the regular expressions found in PROSITE have been created by hand. Build a multiple alignment; Reduce the alignment to a consensus regular expression; Refine the expression based on database search results.
     Alignment         Regular expression
     --------------------------------------------------------
     ADLGAVFALCDRYFQ   [AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
     SDVGPRSCFCERFYQ
     ADLGRTQNRCDRYYQ
     ADIGQPHSLCERYFQ
      * *     * *  *

  30. How to? (cont.)
     Prosite   Perl
     {PG}      [^PG]
     x4        .{4}
     The “-” are simply spacers. ⇒ http://www.expasy.ch/prosite/
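The translation table above can be sketched as a small converter. Python is used here; its `re` syntax agrees with Perl for these constructs. The function name is an assumption, and only the elements that appear on these slides are handled (single letters, x, [..] classes, {..} exclusions, and repeat counts written either as x4 or as x(4) / x(2,4)); this is not a complete PROSITE parser.

```python
import re

# One PROSITE element, optionally followed by a repeat count.
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\)|(\d+))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python/Perl-style regex."""
    parts = []
    for token in pattern.split("-"):          # "-" are simply spacers
        m = _ELEMENT.fullmatch(token)
        if m is None:
            raise ValueError(f"unsupported element: {token!r}")
        element, low, high, bare = m.groups()
        if element == "x":                    # wildcard
            atom = "."
        elif element.startswith("{"):         # {PG} means anything but P or G
            atom = "[^" + element[1:-1] + "]"
        else:                                 # letter or [..] class, unchanged
            atom = element
        if high is not None:                  # x(2,4) becomes .{2,4}
            atom += "{%s,%s}" % (low, high)
        elif low is not None or bare is not None:
            atom += "{%s}" % (low or bare)    # x4 or x(4) becomes .{4}
        parts.append(atom)
    return "".join(parts)


regex = prosite_to_regex("[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q")
print(regex)
alignment = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
             "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
print(all(re.fullmatch(regex, s) for s in alignment))
```

The four aligned sequences from the previous slide all match the translated expression.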

  31. How to? (cont.) Sometimes, such patterns are published (they might not be in the form of a regular expression, but given as a list of functionally important residues and their spacing); Start with a group or family of sequences; Identify regions of the alignment that are important for function, ideally supported by experimental evidence, such as: enzyme catalytic site, prosthetic group (heme, etc.) attachment sites, metal binding sites, disulfide bonds, binding of a molecule (ATP, calcium, DNA, etc.); Identify core residues in the region, < 4 or 5 conserved residues, and scan a sequence database with the core pattern; normally this would also match non-members, so the pattern is then further extended.

  32. How to? (cont.)
            *
     ALRDFATHDDF
     SMTAEATHDSI
     ECDQAATHEAS
          ATH[DE]
     Experimental data might suggest that the histidine participates in the active site. A first pattern, ATH[DE], is constructed and used to scan a sequence database. If there are no false positives, the pattern is kept; otherwise it is extended, which may involve starting from a new core pattern.

  33. Prosite. URL = www.expasy.ch/prosite Release 20.131 of 27-Oct-2016 contains 1773 documentation entries, 1309 patterns, 1172 profiles and 1193 ProRule. Approximately 146 Mb, updated twice per year. Typically, a rule involves 10-20 conserved residues. Pros/Cons: Biased towards sensitivity at the expense of specificity (many false positives); Documented (biological properties of the family/domain); Maintained; Tightly linked to the development of SwissProt. ⇒ Now also part of InterPro: www.ebi.ac.uk/interpro.

  34. Prosite Motifs. What are they? Short universal motifs: N-glycosylation site N-{P}-[ST]-{P}; Phosphorylation site [ST]-x-[RK]; Another phosphorylation site [ST]-x(2)-[DE]; Asp or Asn hydroxylation site C-x-[DN]-x(4)-[FY]-x-C-x-C; Some have a structural basis (WW, helix-turn-helix); Families. ⇒ How many hits for the pattern N-{P}-[ST]-{P} would occur by chance when matched against SwissProt? SwissProt is a popular sequence database.
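One rough way to approach the question is sketched below, assuming (as a simplification not made in the slides) that the 20 amino acids are equiprobable and that positions are independent. Real amino acid frequencies are not uniform, so this is only an order-of-magnitude estimate; the database size in the usage line is a hypothetical placeholder, not the size of SwissProt.

```python
from fractions import Fraction

ALPHABET_SIZE = 20  # assumption: 20 equiprobable amino acids

# Probability that a random position starts a match of N-{P}-[ST]-{P}.
p_match = (
    Fraction(1, ALPHABET_SIZE)     # N: exactly one residue
    * Fraction(19, ALPHABET_SIZE)  # {P}: anything but P
    * Fraction(2, ALPHABET_SIZE)   # [ST]: S or T
    * Fraction(19, ALPHABET_SIZE)  # {P}: anything but P
)


def expected_chance_hits(n_positions):
    """Expected number of chance matches among n_positions start positions."""
    return p_match * n_positions


print(float(p_match))                          # 0.0045125
print(float(expected_chance_hits(1_000_000)))  # 4512.5 (hypothetical size)
```

Under these assumptions roughly one start position in 220 matches by chance, which illustrates why such short universal motifs produce many false positives.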

  35. SCOP: Protein Structure Classification. Class ⇒ Fold ⇒ Superfamily ⇒ Family ⇒ Domain. Reference: Brenner, S. E. et al. (1996) Understanding protein structure: using SCOP for fold interpretation. Methods in Enzymology, 266:635–643.

  36. PROSITE Matches/SCOP. 6 % universal (phosphorylation, amidation, etc.); 17 % specific to a class; 8 % specific to a fold; 17 % specific to a superfamily; 12 % specific to a family; 40 % specific to a sub-set of a family.

  37. Automated approaches. Issues related to automated pattern discovery: Search space (valid regular expressions); Algorithm: pattern driven (PRATT) or data driven (eMOTIF); Evaluation function (a measure of surprise).

  38. Preliminaries: information theory. The information content measures the reduction of the uncertainty (also called entropy) after some message has been received. In the case of regular expression motifs, the interpretation is “how much information is gained by knowing that a sequence segment matches a given regular expression”. Merriam-Webster Online about “entropy”: 1: ... usually considered to be a measure of the system's disorder ... 3: Chaos, disorganization, randomness.

  39. Uncertainty. Information is based on the notion of uncertainty about an event: what symbol do you expect to find at a given position of the sequence? Uncertainty is defined as follows, $H = -\sum_{i=1}^{M} P_i \log_2 P_i$.

  40. Uncertainty (cont.) Consider a sample space that has two outcomes, one occurring with probability p, and the other outcome occurring with probability 1 − p.

  41. Uncertainty (cont.) [Figure: entropy H (vertical axis, 0 to 1 bit) as a function of p (horizontal axis, 0 to 1).] The above picture shows how the entropy varies as a function of p.

  42. Uncertainty (cont.) In particular, you can clearly see that the entropy is maximum when the events are all equiprobable; its value is then $\log_2 M$ bits, where M is the number of outcomes (the cardinality of the sample space, M = |S|). Here, the entropy maximum is $\log_2 2 = 1$ bit. Notice also that the entropy approaches zero whenever the probability of one of the events approaches 1 (and hence, the probabilities of the other events approach 0). This models quite well the concept of uncertainty (entropy). When all the outcomes are equiprobable you can't predict the result of an experiment, but any bias towards one of the outcomes reduces the uncertainty.

  43. Uncertainty (cont.) The entropy is maximal when the M outcomes are equally likely, and zero when only one outcome out of M occurs. Consider the case where all the outcomes are equiprobable, $P_i = \frac{1}{M}$ for all $i \in 1 \ldots M$: $H = -\sum_{i=1}^{M} P_i \log_2 P_i = -\sum_{i=1}^{M} \frac{1}{M} \log_2 \frac{1}{M} = -M \times \frac{1}{M} \log_2 \frac{1}{M} = -\log_2 \frac{1}{M} = \log_2 M$.

  44. Uncertainty (cont.) Finally, consider the case where one outcome occurs with probability 1, and the other M − 1 outcomes occur with probability 0: $H = -\left(1 \times \log_2 1 + \sum_{i=1..M,\, P_i \neq 1} 0 \times \log_2 0\right) = -\left(1 \times 0 + \sum_{i=1..M,\, P_i \neq 1} 0\right) = 0$; the uncertainty is zero, as expected. ⇒ It is customary to let $0 \log 0 = 0$.
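The special cases worked out on the uncertainty slides can be checked numerically with a short sketch. The function name is an assumption; terms with probability 0 are skipped, which implements the convention 0 log 0 = 0.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))                # 1.0  (two equiprobable outcomes)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  (log2 of 4)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0  (one certain outcome)
print(entropy([0.9, 0.1]) < 1.0)          # True (any bias lowers the entropy)
```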

  45. Information content. The information content is defined as $I = H_{\mathrm{before}} - H_{\mathrm{after}}$, i.e. the difference of entropy between two probability distributions.

  46. Information content. Consider the case of DNA strings, Σ = {A, C, G, T}, where all four bases are equiprobable, i.e. $P_i = 0.25$. Considering a wild card, [ACGT], no information is gained: $I = H_{\mathrm{before}} - H_{\mathrm{after}} = \left(-\sum_{i=1}^{4} \frac{1}{4} \log_2 \frac{1}{4}\right) - \left(-\sum_{i=1}^{4} \frac{1}{4} \log_2 \frac{1}{4}\right) = 2 - 2 = 0$.

  47. Information content. When a regular expression contains a single character, say C, then the amount of information gained is maximal, $\log_2 4 = 2$ bits: $I = H_{\mathrm{before}} - H_{\mathrm{after}} = 2 - 0 = 2$.

  48. Information content. In the case of a character class containing two elements, [AG], $I = \left(-\sum_{i=1}^{4} \frac{1}{4} \log_2 \frac{1}{4}\right) - \left(-\left[2\left(\frac{1}{2} \log_2 \frac{1}{2}\right) + 2\,(0 \log_2 0)\right]\right) = 2 - 1 = 1$; 1 bit of information is gained.

  49. Information content. The information content for a regular expression will be the sum of the information content at each position, $I_{G[GA]C[ACGT]} = 2 + 1 + 2 + 0 = 5$.
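The per-position sum can be sketched as follows for the equiprobable DNA case. The function names are assumptions. One modelling choice is also an assumption: after a match, the probability mass is taken to be the background restricted to the allowed symbols and renormalized, which reproduces the values 2, 1 and 0 derived on the preceding slides for a uniform background.

```python
import re
from math import log2

UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def entropy(probabilities):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def information_content(pattern, background=UNIFORM_DNA):
    """Sum over positions of H_before - H_after, in bits."""
    h_before = entropy(background.values())
    total = 0.0
    # Each position is either a class such as [GA] or a single letter.
    for symbols, letter in re.findall(r"\[([A-Z]+)\]|([A-Z])", pattern):
        allowed = set(symbols or letter)
        mass = sum(background[s] for s in allowed)
        # Assumption: background restricted to the allowed symbols, renormalized.
        h_after = entropy(background[s] / mass for s in allowed)
        total += h_before - h_after
    return total


print(information_content("G[GA]C[ACGT]"))  # 5.0
```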

  50. Exercise. Consider an organism whose genome has the following nucleotide frequencies: $P_A = \frac{1}{6}$, $P_C = \frac{1}{3}$, $P_G = \frac{1}{3}$, $P_T = \frac{1}{6}$. Calculate the information content of the following expression: G[GA]C[ACGT]. $I_{G[GA]C[ACGT]} = I_G + I_{[GA]} + I_C + I_{[ACGT]}$

  51. Comparing motifs, signals, active sites, etc. www.lecb.ncifcrf.gov/~toms/sequencelogo.html weblogo.berkeley.edu

  52. FYI: Claude Shannon, Father of the Information Age. Half-hour video presenting Claude Shannon's work. “This fascinating program explores his life and the major influence his work had on today's digital world through interviews with his friends and colleagues.” (includes comments from Andrew Viterbi, Ian Blake, and others) www.ucsd.tv/search-details.asp?showID=6090 cm.bell-labs.com/cm/ms/what/shannonday/paper.html

  53. Motifs: Regular Expressions. Regular expressions are often used to represent key residues forming motifs. A large database of regular expressions exists: PROSITE. Methods have been developed to derive PROSITE signatures automatically: see PRATT and eMOTIF.

  54. Things we like about REs! They make it possible to model mandatory amino acids. They are easy to interpret in terms of biological concepts, such as binding sites, etc.

  55. Issues. High level of human intervention, often derived from the literature; Subjective choice of the region in some cases; Entries must be revised as new sequences become available; Too rigid! Does not allow for mismatches; Compromise between sensitivity/specificity, flexibility/noise; Will not perform well on new entries (overfitting); Short motifs can occur by chance.

  56. Pattern Discovery. Approaches to derive patterns automatically can be classified as “pattern driven” (PRATT) or “data (sequence) driven” (eMOTIF). Issues: Search algorithm; Performance measure or fitness function.

  57. Pattern driven approaches (PRATT [1]). Input: a set of related but unaligned sequences. Problem: automatically construct regular expressions (patterns) consisting of single letters (A), character classes ([KER]) and range patterns (x(i,j)). For example, A-x-[KER]-x(2)-D-[ILV]-E-x(4)-[KR]. Based on an exhaustive search from the most general motifs to the most specific ones. This is done in two steps: Single letter pattern search, A-x(4)-D-x-E; Pattern refinement, A-x-[KER]-x(2)-D-[ILV]-E-x(4)-[KR]. www.ii.uib.no/~inge/Pratt.html

  58. Step 1: Single Letter Pattern Search. Starting with the empty pattern (the most general motif), all possible extensions of a motif are considered. The process is repeated recursively until a pattern fails to reach the required minimum number of matches c (coverage, support). This is a tree-based search with pruning based on coverage. Specifically, a regular expression α (corresponding to a node of the search tree) is extended with all the possible suffixes of the form $-x(i,j)-\beta$ for $0 \le i \le j \le t$ and $\beta \in \Sigma$. Notice that i and j can both be zero, which corresponds to an extension by a single letter. For some small t and large c it is possible to exhaustively search the space of all possible motifs.

  59. Step 1: Single Letter Pattern Search (cont.). [Figure: search tree. The node α has one child α-x(i,j)-β for each gap range and each letter, e.g. α-x(0,0)-A, α-x(0,1)-A, α-x(1,1)-A, and likewise for C, G and T.]

  60. Step 1: Single Letter Pattern Search (cont.). Why is the introduction of the character classes delayed until the refinement step? Character classes are not introduced earlier because there are too many of them! How many? $2^{|\Sigma|}$. Consider the case where Σ represents all 20 amino acids: $2^{20} = 1{,}048{,}576$. Since the extensions determine the branching factor of the tree-based search, this cannot be afforded.

  61. Step 1 (cont.). Given the pattern P, the children of P are as follows: P-x(0,0)-A ... P-x(0,0)-Y; P-x(0,1)-A ... P-x(0,1)-Y; ...; P-x(0,5)-A ... P-x(0,5)-Y; ...; P-x(5,5)-A ... P-x(5,5)-Y. The resulting patterns are checked against the set of sequences and retained if they match enough sequences.
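
The extension and pruning steps of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not PRATT's implementation: a pattern is represented as a list of (i, j, letter) triples, the empty pattern is extended by single letters only, and `max_letters` is a hypothetical parameter added here to bound the depth of the search.

```python
import re

def children(pattern, alphabet, t):
    """Extensions of `pattern` by -x(i,j)-b, with 0 <= i <= j <= t and b in alphabet.

    The empty pattern is only extended by single letters, since a leading
    gap carries no information.
    """
    gaps = [(i, j) for i in range(t + 1) for j in range(i, t + 1)] if pattern else [(0, 0)]
    return [pattern + [(i, j, b)] for (i, j) in gaps for b in alphabet]

def to_regex(pattern):
    """x(i,j) becomes .{i,j}; each triple is a gap followed by one letter."""
    return "".join(".{%d,%d}%s" % (i, j, b) for (i, j, b) in pattern)

def support(pattern, sequences):
    """Number of sequences containing at least one match of the pattern."""
    rx = re.compile(to_regex(pattern))
    return sum(1 for s in sequences if rx.search(s))

def search(sequences, alphabet, t, c, max_letters):
    """Depth-first search from the empty pattern; a branch is pruned as soon
    as its support drops below c."""
    kept = []

    def extend(pattern):
        for child in children(pattern, alphabet, t):
            if support(child, sequences) >= c:
                kept.append(child)
                if len(child) < max_letters:
                    extend(child)

    extend([])
    return kept

toy = ["ACGT", "AGGT", "TTAT"]
found = search(toy, "ACGT", t=1, c=2, max_letters=2)
print([to_regex(p) for p in found])
```

Pruning is safe because an extension can only match a subset of the sequences matched by its parent. With 20 amino acids and t = 5, each non-empty pattern has 21 × 20 = 420 children.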

  62. Step 2: pattern refinement. Wildcard positions, x(i,j), such that i = j are considered for refinement. Nota: the information that I could obtain is vague. As far as I understand, the list of groups is supplied by the user (a nice way to derive groups will be presented along with the presentation on eMOTIF [2]). In the meantime, you can imagine that the groups are obtained from the Venn diagrams based on the properties of amino acids, tiny = [SGA], small = [SGAPTNCV], … For a given pattern, all the sequences that it matches are retrieved. For all the positions k of all the range patterns -x(i,i)-, find a group, if it exists, that is a superset of the amino acids found at this position.

  63. Step 2: pattern refinement (cont.). [Figure: Venn diagram of amino acid properties, with sets labelled aliphatic, hydrophobic, small, tiny, aromatic, positive, negative, charged and polar, drawn over the 20 amino acids.]

  64. Step 2: pattern refinement (cont.). Example. Consider the following user defined groups: [FAMILYVW], [KREND], [PGSTQ], [HC]. The expression C-x(3,3)-C matches the following three sequences: CDFGC, CEIMC, CRIMC. The amino acids at the second position are a subset of the group [KREND], those at the third position are a subset of the group [FAMILYVW], but there is no group containing both M and G (fourth position). The following expressions can be derived: C-x(3,3)-C, C-[KREND]-x(2,2)-C, C-x(1,1)-[FAMILYVW]-x(1,1)-C and C-[KREND]-[FAMILYVW]-x(1,1)-C.
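
The example can be reproduced with a few lines of code. This is a sketch under simplifying assumptions: all matches have the same length, the first and last columns are the fixed letters, and adjacent wild cards are not merged (so C-x-x-x-C stands for C-x(3,3)-C). The function names are hypothetical.

```python
from itertools import product

GROUPS = ["FAMILYVW", "KREND", "PGSTQ", "HC"]
matches = ["CDFGC", "CEIMC", "CRIMC"]

def covering_group(column, groups):
    """Return the first group that is a superset of the column's letters, or None."""
    for g in groups:
        if set(column) <= set(g):
            return g
    return None

def refinements(matches, groups):
    """Enumerate the patterns obtained by keeping each wild card or replacing
    it by the group that covers the letters observed in that column."""
    first, last = matches[0][0], matches[0][-1]
    columns = list(zip(*matches))[1:-1]  # the wild-card columns
    options = []
    for col in columns:
        g = covering_group(col, groups)
        options.append(["x"] if g is None else ["x", "[" + g + "]"])
    return ["-".join([first, *choice, last]) for choice in product(*options)]

for pattern in refinements(matches, GROUPS):
    print(pattern)
# C-x-x-x-C
# C-x-[FAMILYVW]-x-C
# C-[KREND]-x-x-C
# C-[KREND]-[FAMILYVW]-x-C
```

Two of the three wild cards have a covering group, which gives 2 × 2 = 4 candidate expressions, the same four listed on the slide.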

  65. Step 2: pattern refinement (cont.). Given k wild cards, $2^k$ expressions can be derived (not all of them will have the minimum coverage); “a heuristic refinement algorithm” is used.

  66. Scoring Pattern. PRATT has three scoring schemes: Positive Predictive Value (PPV) (requires a set of negative examples); Information Content (default); Minimum Description Length (MDL) (takes into account the number of matches and the complexity of the motif). Alternatively, a Z-score (aka standard score, normal score) could be used as a measure of surprise, $z(w) = \frac{f(w) - E(w)}{N(w)}$, where f(w) is the number of observed occurrences, E(w) is the expected number of occurrences, and N(w) is a normalization factor.
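
The slide leaves E(w) and N(w) unspecified. As one possible instantiation, the sketch below assumes a binomial model in which the word w occurs independently at each of n positions with probability p, so that E(w) = np and N(w) is the standard deviation, the square root of np(1 − p). Both the model and the function name are assumptions made for illustration.

```python
import math

def z_score(observed, n_positions, p_word):
    """z(w) = (f(w) - E(w)) / N(w) under an assumed binomial model."""
    expected = n_positions * p_word                            # E(w)
    std = math.sqrt(n_positions * p_word * (1.0 - p_word))     # N(w)
    return (observed - expected) / std

# A word expected 10 times but observed 30 times in 1000 positions.
print(z_score(30, 1000, 0.01))
```

A large positive value indicates a word that occurs more often than the background model predicts.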

  67. PRATT. Pros: automated approach; uses unaligned sequences. Cons: unsatisfactory solution to the over-fitting problem.

  68. Data driven approaches (eMOTIF). Automatically defined motifs. Strategies to overcome the rigidity of REs: classes of amino acids; regular expressions with approximate matching, agrep (allows 0, 1, 2, 3 or 4 mismatches); variable specificity. The eMOTIFs are derived from the multiple sequence alignments in the BLOCKS+ database, the PRINTS database, and the eBLOCKS database. Originally constituted of 50,000 motifs from 7,000 alignments. ⇒ motif.stanford.edu

  69. Input data
      MFGKRAFVHHYVGEGMEENEFTDARQDLYELEVDYANL
      MFRRKAFLHWYTGEGMDEMEFTEAESNMNDPVAEYQQY
      MFRRKAFLHWYTGEGMDEMEFTEAESNMNDPVAEYQQY
      MFKKRAFVHWYVGEGMEEGEFTEARENIAVLERDFEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MFKRKAFLHWYTGEGMDEMEFTEAESNMNDLVSEYQQY
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV
      MFRRKAFLHWFTGEGMDEMEFTEAESNMNDLVSEYQQY
      MFKRKGFLHWYTGEGMEPVEFSEAQSDLEDLILEYQQY
      MFKRKAFLHWYTSEGMDELEFSEAESNMNDLVSEYQQY
      MFKRKAFLHWYTGEGMDEMEFTEVRANMNDLVAEYQQY

  70. Creating a Motif: ad hoc. Each position consists of a character class that contains all the amino acids observed at that position. The motif for that block would start with M[FY][AGKR].
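
The ad hoc construction amounts to collecting the set of letters in each column of the alignment. The sketch below uses four of the input sequences from slide 69; the function name `ad_hoc_motif` is hypothetical.

```python
import re

block = [
    "MFGKRAFVHHYVGEGMEENEFTDARQDLYELEVDYANL",
    "MFRRKAFLHWYTGEGMDEMEFTEAESNMNDPVAEYQQY",
    "MFKKRAFVHWYVGEGMEEGEFTEARENIAVLERDFEEV",
    "MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV",
]

def ad_hoc_motif(alignment):
    """One character class per column, holding every letter observed there."""
    parts = []
    for column in zip(*alignment):
        letters = "".join(sorted(set(column)))
        # A conserved column is written as a single letter.
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "".join(parts)

motif = ad_hoc_motif(block)
print(motif[:11])  # M[FY][AGKR]
```

By construction, the motif matches every sequence of the block, and nothing guarantees that it matches any other member of the family.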

  71. Creating a Motif: ad hoc (cont.). [Figure: the input alignment of slide 69 with the ad hoc motif displayed beneath it, each column listing every amino acid observed at that position.]

  72. What do you think?

  73. Remarks. The ad hoc motif is too specific. For example, position 3 contains amino acids that have nothing in common. Evolution does not constrain this position. It can be expected that most mutations at that position would be tolerated, including mutations to an amino acid type other than [AGKR]. Because REs are deterministic (match/no match), several true positives will be missed. Over-fitting problem!

  74. eMOTIF: Substitution groups. Input: columns of multiple sequence alignments from BLOCKS and HSSP. Of all $2^k$ subsets, select all the groups with the following properties: 1. All the amino acids from this group substitute frequently with other amino acids from the same group (compactness); 2. All the amino acids that are not part of the group substitute with members of the group with low frequencies (isolation). 20 groups were found.

  75. eMOTIF: Substitution groups. [Figure: hierarchy of the substitution groups MIVLFY, MIVLF, IVLFY, MIVL, RKQE, IVL, LFY, FWY, RKQ, KQE, TSA, IV, FY, YH, RK, QE, ED, DN, TS, SA, drawn above the single amino acids M I V L F W Y H R K Q E D N T S A F C G P.] ⇒ Amino acids of the same group are more likely to substitute for one another.

  76. Creating a Motif: most specific motif. Each position consists of the most specific substitution group that contains all the amino acid types observed at that position. Observation. For a given set of input sequences the most specific motif is unique.

  77. Creating a Motif: most specific motif (cont.). [Figure: the input alignment of slide 69 with the most specific motif displayed beneath it as stacked substitution groups; columns with no covering group are shown as wild cards (.). Bottom row of the motif as extracted: MF.KKAFIHWF..EGMDE.EFSE.E.DI.....DFEEF]

  78. Remarks. Consider position 3: there is no group that contains “A”, “G”, “K” and “R”, therefore a wild-card is inserted at that position. Consider position 8: although “I” is not observed at that position, we can expect that other members of this family would have an “I” there, since “L” and “V” are often substituted by “I”. The most specific motif is more general than the ad hoc motif. The most specific motif is sensitive to noise: consider the 8th position from the right, where all the sequences have an “L” but the first one has a “P”. This could be the result of an experimental error. However, the most specific motif will have a wild-card because of that. Consequently, the RE may be too general and will produce many false positive results!
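
These remarks can be checked with a small sketch that uses the 20 substitution groups listed on slide 75. It assumes that “most specific” means the smallest group covering the column (ties are broken by list order), which is a simplification chosen here and not necessarily eMOTIF's criterion; the function names are hypothetical.

```python
GROUPS = ["MIVLFY", "MIVLF", "IVLFY", "MIVL", "RKQE", "IVL", "LFY", "FWY",
          "RKQ", "KQE", "TSA", "IV", "FY", "YH", "RK", "QE", "ED", "DN", "TS", "SA"]

def most_specific_group(column, groups=GROUPS):
    """Smallest group covering the column; a letter if conserved; '.' if none."""
    letters = set(column)
    if len(letters) == 1:
        return next(iter(letters))
    candidates = [g for g in groups if letters <= set(g)]
    if not candidates:
        return "."  # wild card: no group covers these letters
    return "[" + min(candidates, key=len) + "]"

def most_specific_motif(alignment):
    return "".join(most_specific_group(col) for col in zip(*alignment))

block = [
    "MFGKRAFVHHYVGEGMEENEFTDARQDLYELEVDYANL",
    "MFRRKAFLHWYTGEGMDEMEFTEAESNMNDPVAEYQQY",
    "MFKKRAFVHWYVGEGMEEGEFTEARENIAVLERDFEEV",
    "MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV",
]
print(most_specific_motif(block)[:6])           # M[FY].
print(most_specific_group("VLVV"))              # [IVL]  (position 8)
print(most_specific_group("LPLL"))              # .      (8th position from the right)
```

A single “P” among the “L” residues is enough to turn the column into a wild card, which is the sensitivity to noise described above.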

  79. Exploring the space of RE motifs. Because some REs may be too general and will produce many false positive results, we would like to explore the space of possible REs to find new ones that are more specific (but also cover fewer sequences).

  80. Coverage/Sensitivity. eMOTIF proposes an ensemble of motifs with different coverage and sensitivity. ⇒ The ideal motif would be found in the bottom-right corner of the plot.

  81. Coverage/Specificity. eMOTIF exhaustively generates all possible motifs using the allowable substitution groups.

  82. Probability that a motif matches a random sequence. Assumptions: amino acids (AA) are independent and identically distributed; the AA probabilities are estimated from the frequencies observed in a large database (SWISSPROT); wild card characters (.) match with probability 1. $P(\mathrm{M[FWY].[KR]} \ldots \mathrm{[FYW]}) = p(M) \times [p(F)+p(W)+p(Y)] \times 1 \times [p(K)+p(R)] \times \cdots \times [p(F)+p(W)+p(Y)]$
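
The product above is straightforward to compute. In the sketch below, the uniform frequencies (1/20 for each amino acid) are a placeholder assumption, not the SWISSPROT frequencies, and `motif_probability` is a hypothetical name. The motif is written in regular expression style, with `.` as the wild card.

```python
import re

def motif_probability(motif, freq):
    """Probability that the motif matches at a given position of a random
    sequence, assuming independent, identically distributed amino acids."""
    prob = 1.0
    for token in re.findall(r"\[[A-Z]+\]|[A-Z]|\.", motif):
        if token == ".":
            continue  # a wild card matches with probability 1
        letters = token.strip("[]")
        prob *= sum(freq[a] for a in letters)
    return prob

# Placeholder background distribution: every amino acid equally likely.
uniform = {a: 1.0 / 20 for a in "ACDEFGHIKLMNPQRSTVWY"}

p = motif_probability("M[FWY].[KR]", uniform)
print(p)  # 0.05 * 0.15 * 1 * 0.10, about 0.00075
```

With real background frequencies, rare amino acids such as W contribute much smaller factors than frequent ones such as L.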

  83. Choosing the right RE. When using an RE for detecting new members of a sequence family, the expected number of random sequences matching the RE should be less than 1. The expected number of matches depends on the size of the database: $P_{RE} \times N$, where N is the size of the database. You should select an RE with probability less than $1/N$. Obviously, such an RE will match fewer sequences!
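
A short numerical illustration of this selection rule follows. The probabilities and the database size are made-up values chosen for the example, and the function names are hypothetical.

```python
def expected_random_matches(p_re, n):
    """Expected number of random sequences matching the RE: P_RE * N."""
    return p_re * n

def is_selective_enough(p_re, n):
    """True when fewer than one random match is expected, i.e. P_RE < 1/N."""
    return p_re < 1.0 / n

N = 100_000  # hypothetical database size
for p_re in (7.5e-4, 1e-7):
    print(p_re, expected_random_matches(p_re, N), is_selective_enough(p_re, N))
```

The first RE is expected to match about 75 random sequences and is rejected; the second is expected to match about 0.01 and is accepted.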

  84. Disjunction of REs can be used to represent a family. Find an RE with probability $1/N$ of matching a random sequence for a database of size N. Remove all the sequences that it matches and apply the algorithm to the remaining sequences. A family is therefore represented by a disjunction of REs (high specificity and coverage). AKA sequential covering in machine learning.
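
The covering loop itself is short. In the sketch below, `find_re` stands for whatever procedure returns a sufficiently specific RE for the remaining sequences; the toy version used here (a two-letter prefix) is only a stand-in so that the loop can be run, and it does not enforce the 1/N threshold.

```python
import re

def sequential_covering(sequences, find_re, matches):
    """Repeatedly find an RE, record it, and remove the sequences it matches."""
    remaining = list(sequences)
    disjunction = []
    while remaining:
        rx = find_re(remaining)
        covered = [s for s in remaining if matches(rx, s)]
        if not covered:
            break  # no progress: stop instead of looping forever
        disjunction.append(rx)
        remaining = [s for s in remaining if not matches(rx, s)]
    return disjunction, remaining

# Toy stand-ins (assumptions for illustration only).
def toy_find_re(seqs):
    return "^" + seqs[0][:2]

def toy_matches(rx, s):
    return re.search(rx, s) is not None

family = ["MFAA", "MFBB", "MYCC"]
res, left = sequential_covering(family, toy_find_re, toy_matches)
print("|".join(res))  # ^MF|^MY
```

The family is then represented by the alternation of the REs that were found, and a sequence belongs to the family if any one of them matches.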

  85. Size of the space of motifs. The space of all the possible motifs is huge: $(m + 20)^n$, where m is the number of character classes that are used to construct the motifs and n is the number of columns, e.g. $(20 + 20)^{38} \simeq 10^{60}$.

  86. Exploring the space of all possible motifs: Solution 1. Each subset of sequences induces a most specific motif. Let’s generate the most specific motif for all the subsets of the input sequences. The number of motifs is independent of the number of columns and the number of groups! For 10 sequences there are 1,024 ($= 2^{10}$) most specific motifs, which is far fewer than $(20 + 20)^{10} \simeq 10^{16}$. However, for 158 sequences, there are $10^{48}$ subsets …
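
The orders of magnitude quoted on this slide and the previous one can be checked by comparing base-10 logarithms. The variable names are chosen here for illustration.

```python
import math

# log10 of the number of motifs over n columns with m = 20 classes: (20 + 20)^n
log_motifs_38 = 38 * math.log10(40)
log_motifs_10 = 10 * math.log10(40)

# Number of subsets of k sequences: 2^k
subsets_10 = 2 ** 10
log_subsets_158 = 158 * math.log10(2)

print(round(log_motifs_38, 1))    # about 60.9
print(round(log_motifs_10, 1))    # about 16.0
print(subsets_10)                 # 1024
print(round(log_subsets_158, 1))  # about 47.6
```

Enumerating subsets is therefore attractive only for small families: the count doubles with every additional sequence.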
