10-601 Machine Learning HMM applica)ons in computa)onal - PowerPoint PPT Presentation

10-601 Machine Learning HMM ¡applica)ons ¡in ¡computa)onal ¡ biology ¡

Central ¡dogma ¡ CCTGAGCCAACTATTGATGAA DNA transcription mRNA CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 ¡

Biological ¡data ¡is ¡rapidly ¡ accumula)ng ¡ Next ¡genera*on ¡sequencing ¡ Transcrip2on ¡factors ¡ DNA ¡ transcription RNA ¡ translation Proteins ¡

Biological ¡data ¡is ¡rapidly ¡ accumula)ng ¡ Array ¡/ ¡sequencing ¡ Transcrip2on ¡factors ¡ technology ¡ DNA ¡ transcription RNA ¡ translation Proteins ¡

Biological ¡data ¡is ¡rapidly ¡ accumula)ng ¡ Transcrip2on ¡factors ¡ Protein ¡interac*ons ¡ DNA ¡ transcription RNA ¡ translation Proteins ¡ • ¡38,000 ¡ ¡iden)fied ¡interac)ons ¡ • ¡Hundreds ¡of ¡thousands ¡of ¡ predic)ons ¡

FDA ¡Approves ¡Gene-‑Based ¡Breast ¡Cancer ¡ Test* ¡ “ ¡MammaPrint ¡is ¡a ¡DNA ¡ ¡ microarray-‑based ¡test ¡that ¡ measures ¡the ¡ac)vity ¡of ¡70 ¡ genes ¡in ¡a ¡sample ¡of ¡a ¡ woman's ¡breast-‑cancer ¡tumor ¡ and ¡then ¡uses ¡a ¡specific ¡ formula ¡to ¡determine ¡whether ¡ the ¡pa)ent ¡is ¡deemed ¡low ¡risk ¡ or ¡high ¡risk ¡for ¡the ¡spread ¡of ¡ the ¡cancer ¡to ¡another ¡site.” ¡ *Washington ¡Post, ¡2/06/2007 ¡

Ac)ve ¡Learning ¡ 9 ¡

Sequencing ¡DNA ¡ First ¡human ¡genome ¡draT ¡in ¡2001 ¡ Due ¡to ¡ accumulated ¡errors , ¡we ¡could ¡only ¡reliably ¡read ¡at ¡most ¡ 100-‑200 ¡nucleo*des. ¡ ¡

DARPA ¡Shredder ¡ Challenge ¡

Shotgun ¡Sequencing ¡ Wikipedia ¡

Caveats ¡ • ¡Errors ¡in ¡reading ¡ • ¡Non-‑trivial ¡assembly ¡task: ¡repeats ¡in ¡the ¡genome ¡ ¡ MacCallum ¡et ¡al., ¡GB ¡2009 ¡

Error ¡Correc*on ¡ in ¡DNA ¡sequencing ¡ • ¡The ¡fragmenta)on ¡happens ¡at ¡random ¡loca)ons ¡of ¡the ¡molecules. ¡ ¡ ¡We ¡expect ¡all ¡posi)ons ¡in ¡the ¡genome ¡to ¡have ¡the ¡same ¡# ¡number ¡of ¡ reads ¡ ¡ K-‑mers ¡= ¡substrings ¡of ¡length ¡K ¡of ¡the ¡reads. ¡Errors ¡create ¡error ¡k-‑mers. ¡ Kellly ¡et ¡al., ¡GB ¡2010 ¡

Transcriptome ¡Shotgun ¡Sequencing ¡(RNA-‑Seq) ¡ Sequencing ¡RNA ¡transcripts. ¡ @Friedrich ¡Miescher ¡Laboratory ¡ Reminder: ¡ • (mRNA) ¡Transcripts ¡are ¡“expression ¡products” ¡of ¡genes. ¡ • Different ¡genes ¡having ¡different ¡expression ¡levels ¡so ¡some ¡ transcripts ¡are ¡more ¡or ¡less ¡abundant ¡than ¡others. ¡ ¡

Challenges ¡ • Large ¡datasets: ¡10-‑100 ¡millions ¡reads ¡of ¡75-‑150 ¡bps. ¡ • Memory ¡efficiency: ¡Too ¡)me ¡consuming ¡to ¡perform ¡out-‑ memory ¡processing ¡of ¡data. ¡ ¡ ¡ DNA ¡Sequencing ¡ ¡+ ¡ others ¡: ¡ alterna)ve ¡slicing, ¡RNA ¡edi)ng, ¡ post-‑transcrip)on ¡modifica)on. ¡ ¡

Errors ¡are ¡non ¡uniformly ¡distributed ¡ • Some ¡transcripts ¡are ¡more ¡prone ¡to ¡errors ¡ • Errors ¡are ¡harder ¡to ¡correct ¡in ¡reads ¡from ¡lowly ¡expressed ¡transcripts ¡ ¡

SEECER ¡ Error ¡Correc*on ¡+ ¡Consensus ¡sequence ¡ es*ma*on ¡for ¡RNA-‑Seq ¡data ¡ ¡

Key ¡idea: ¡HMM ¡model ¡ Salmela ¡et ¡al., ¡Bioinforma)cs ¡2011 ¡ The ¡way ¡sequencers ¡work: ¡ • Read ¡leher ¡by ¡leher ¡sequen)ally ¡ • Possible ¡errors: ¡Inser)on ¡, ¡Dele)on ¡or ¡Misread ¡of ¡a ¡nucleo)de ¡

Building ¡(Learning) ¡the ¡HMMs ¡ and ¡Making ¡Correc*ons ¡(Inference) ¡ Learning ¡= ¡Expecta)on-‑Maximiza)on ¡ ¡ Inference ¡= ¡Viterbi ¡algorithm ¡ Seeding : ¡ ¡ Guessing ¡possible ¡reads ¡using ¡k-‑mer ¡overlaps. ¡ Construc)ng ¡the ¡HMM ¡from ¡these ¡reads. ¡ ¡ ¡ Speed ¡up: ¡ The ¡k-‑mer ¡overlaps ¡yield ¡approximate ¡mul)ple ¡alignments ¡of ¡reads. ¡ We ¡can ¡learn ¡HMM ¡parameters ¡from ¡this ¡directly. ¡

Clustering ¡to ¡improve ¡seeding ¡ Real ¡biological ¡differences ¡should ¡be ¡supported ¡by ¡a ¡set ¡of ¡reads ¡with ¡ similar ¡mismatches ¡to ¡the ¡consensus ¡

1. Clustering ¡posi)ons ¡with ¡mismatches ¡to ¡ iden)fy ¡clusters ¡of ¡correlated ¡posi)ons. ¡ 2. Build ¡a ¡similarity ¡matrix ¡between ¡these ¡ posi)ons. ¡ 3. Use ¡Spectral ¡clustering ¡to ¡find ¡clusters ¡of ¡ correlated ¡posi)ons. ¡ 4. Filter ¡reads ¡have ¡mismatches ¡in ¡these ¡clusters. ¡

Comparison ¡to ¡other ¡methods ¡

Using ¡the ¡corrected ¡reads, ¡the ¡assembler ¡can ¡ recover ¡ more ¡ transcripts ¡

Things ¡that ¡work ¡ • Approximate ¡learning ¡to ¡speed ¡up ¡on ¡large ¡datasets. ¡ • In ¡real ¡world, ¡one ¡technique ¡is ¡not ¡enough. ¡A ¡solu)on ¡involves ¡using ¡ many ¡techniques. ¡ • Precision ¡and ¡Recall ¡are ¡trade-‑offs. ¡ ¡

Central ¡dogma ¡ Different ¡regulators ¡control ¡the ¡informa)on ¡flow ¡from ¡DNA ¡to ¡protein ¡ Transcrip)on ¡factors ¡(TFs) ¡bind ¡to ¡DNA ¡and ¡ TF ¡ ac)vate ¡genes ¡ CCTGAGCCAACTATTGATGAA DNA transcription miR ¡ mRNA CCUGAGCCAACUAUUGAUGAA Micro ¡RNAs ¡(miRs) ¡bind ¡to ¡mRNA ¡to ¡down ¡ regulate ¡their ¡expression ¡ translation Protein PEPTIDE 28 ¡

Integra)ng ¡expression ¡and ¡protein-‑ DNA ¡interac)on ¡data ¡ Lee ¡et ¡al ¡Science ¡ 2002 ¡ Bar-‑Joseph ¡et ¡al ¡Nature ¡Biotechnology ¡ 2003 ¡

Methods ¡for ¡reconstruc)ng ¡ ¡ networks ¡in ¡cells ¡ Amit et al Venancio et al Genome Science 2009 Biology 2009 Gerstein et al Science 2010

Key ¡problem: ¡Most ¡high-‑throughput ¡ data ¡is ¡sta)c ¡ Time-‑series ¡measurements ¡ Sta)c ¡data ¡sources ¡ Sequencing ¡ mo)f ¡ CHIP-‑chip ¡ microarray ¡ PPI ¡ Time ¡

DREM: ¡Dynamic ¡Regulatory ¡Events ¡Miner ¡

a ¡ Time ¡Series ¡Expression ¡Data ¡ b ¡ Sta*c ¡TF-‑DNA ¡Binding ¡Data ¡ ¡ Expression ¡ TF ¡A ¡ Level ¡ TF ¡B ¡ )me ¡ TF ¡D ¡ TF ¡C ¡

a ¡ Time ¡Series ¡Expression ¡Data ¡ b ¡ Sta*c ¡TF-‑DNA ¡Binding ¡Data ¡ ¡ Expression ¡ TF ¡A ¡ Level ¡ TF ¡B ¡ )me ¡ TF ¡D ¡ TF ¡C ¡ c ¡ Model ¡Structure ¡ IOHMM ¡Model ¡ d ¡ 0.1 ¡ Expression ¡ Level ¡ ? ¡ 0.95 ¡ 0.9 ¡ 1 ¡ ? ¡ )me ¡ 0.05 ¡ 1 ¡

Things ¡are ¡a ¡bit ¡more ¡complicated: ¡Real ¡ data ¡

A ¡Hidden ¡Markov ¡Model ¡ Hidden ¡States ¡ H 0 H 2 H 3 H 1 1 ¡ Observed ¡outputs ¡ O 0 O 1 O 2 O 3 (expression ¡levels) ¡ t=0 ¡ t=1 ¡ t=2 ¡ t=3 ¡ n T T ⎡ ⎤ ⎡ ⎤ L ( H , O ; ) p ( O ( i ) | H ( i )) p ( H ( i ) | H ( i )) ∏ ∏ ∏ Θ = ⎢ ⎥ ⎢ ⎥ t t t t 1 − ⎣ ⎦ ⎣ ⎦ i 1 t 1 t 2 = = = Schliep ¡et ¡al ¡ Bioinforma2cs ¡2003 ¡

Input ¡– ¡Output ¡Hidden ¡ Markov ¡Model ¡ Input ¡ (Sta)c ¡TF-‑gene ¡interac)ons) ¡ I g Hidden ¡States ¡ (transi)ons ¡ between ¡states ¡form ¡a ¡tree ¡ H 0 H 2 H 3 H 1 structure) ¡ Emissions ¡ (Distribu)on ¡of ¡ O 0 O 1 O 2 O 3 expression ¡values) ¡ t=1 ¡ t=2 ¡ t=3 ¡ t=0 ¡ Log ¡Likelihood ¡ But ¡how ¡do ¡we ¡express ¡these ¡condi)onal ¡probabili)es? ¡ Product ¡over ¡all ¡Gaussian ¡ Sum ¡over ¡all ¡ Sum ¡over ¡ Product ¡over ¡all ¡transi)on ¡probabili)es ¡on ¡path ¡ ¡ ¡ emission ¡density ¡values ¡ paths ¡ Q ¡ all ¡genes ¡ on ¡path ¡

Input-‑Output ¡Hidden ¡Markov ¡Model ¡ learning ¡the ¡transi)on ¡probabili)es ¡ I g How ¡do ¡compute ¡ P ¡for ¡a ¡state ¡with ¡2 ¡children? ¡ We ¡can ¡write ¡it ¡as ¡a ¡logis)c ¡regression ¡ ¡ 2 H 2 classifica)on ¡problem! ¡ 1 = q (2)| H 1 = q (1), I g ) = ? H 1 P ( H 2 P ( H 2 2 = q (2)| H 1 = q (1), I g ) = ? 1 H 2 O 1 t=1 ¡ O 2 t=2 ¡

10-601 Machine Learning HMM applica)ons in computa)onal - PowerPoint PPT Presentation

10-601 Machine Learning HMM applica)ons in computa)onal biology Central dogma CCTGAGCCAACTATTGATGAA DNA transcription mRNA CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 Biological data

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

A Machine Learning Approach A Machine Learning Approach A Machine Learning Approach A Machine

Ac#ve Learning Aarti Singh Machine Learning 10-601 Dec 6, 2011 Slides Courtesy: Burr

Update on Therapies for Idiopathic Pulmonary Fibrosis Paul Wolters Associate Professor University

WHO WHY HOW ? GOAL to share the love that we are receiving from our Lord. Make friends, be

Flex Ray: Coding and Decoding, Media Access Control, Frame and Symbol Processing Seminar: The

Type-Logical Grammar and Natural Language Syntax Yusuke Kubota University of Tsukuba

Financial Disclosures Vertebral body stapling in children with idiopathic Theologis: none

Challenges for Socially-Beneficial AI Daniel S. Weld University of Washington Outline

Beneficial AI Daniel S. Weld 1 Outline Distractions Important Concerns Unemployment

The Wizardry of Scaling Ayende Rahien / Oren Eini http://ayende.com/Blog ayende@ayende.com