Gene regulation, protein networks and disease: a computational perspective

  1. Gene regulation, protein networks and disease: a computational perspective • Ron Shamir, School of Computer Science, Tel Aviv University • CPM, Helsinki, July 3, 2012

  2. Outline • Finding regulatory motifs I, II, III • Utilizing case-control expression profiles and networks I, II DEGAS • Chromosomal aberrations in cancer

  3. Regulation of Transcription • A gene’s transcription regulation is mainly encoded in the DNA in a region called the promoter • Each promoter contains several short DNA subsequences, called binding sites (BSs), that are bound by specific proteins called transcription factors (TFs) [Figure: a gene and its promoter, oriented 5’ to 3’, with TFs bound to BSs inside the promoter]

  4. Position Weight Matrix (PWM) • Score: product of base probabilities. • Need score threshold for hits.
     Position   1     2     3     4     5     6
     A          0.1   0.8   0     0.7   0.2   0
     C          0     0.1   0.5   0.1   0.4   0.6
     G          0     0     0.5   0.1   0.4   0.1
     T          0.9   0.1   0     0.1   0     0.3
     Example sequences and their scores:
     ATGCAGGATACACCGATCGGTA   0.0605
     GGAGTAGAGCAAGTCCCGTGA    0.0605
     AAGACTCTACAATTATGGCGT    0.0151
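The scoring rule on the PWM slide (product of base probabilities, then a threshold) can be sketched as follows. This is a minimal illustration, not the speaker's code: the function names `window_score`, `best_hit` and `hits` are hypothetical, and the sketch assumes the slide's scores refer to the best-scoring 6-base window in each sequence.

```python
# Matrix values are taken from the slide; everything else is illustrative.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
WIDTH = 6  # number of PWM columns


def window_score(window):
    """Product of the per-position probabilities of the bases in the window."""
    score = 1.0
    for pos, base in enumerate(window):
        score *= PWM[base][pos]
    return score


def best_hit(seq):
    """Return (score, window) for the highest-scoring window in seq."""
    windows = [seq[i:i + WIDTH] for i in range(len(seq) - WIDTH + 1)]
    return max(((window_score(w), w) for w in windows), key=lambda p: p[0])


def hits(seq, threshold):
    """All (position, window, score) whose score reaches the threshold."""
    out = []
    for i in range(len(seq) - WIDTH + 1):
        w = seq[i:i + WIDTH]
        s = window_score(w)
        if s >= threshold:
            out.append((i, w, s))
    return out


if __name__ == "__main__":
    for seq in ("ATGCAGGATACACCGATCGGTA",
                "GGAGTAGAGCAAGTCCCGTGA",
                "AAGACTCTACAATTATGGCGT"):
        score, window = best_hit(seq)
        print(seq, window, round(score, 4))
```

Under that assumption the best windows are TACACC, TAGAGC and TACAAT, with scores of about 0.0605, 0.0605 and 0.0151, which agrees with the values listed on the slide.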

  5. I. Finding Regulatory Motifs (C. Linhart, Y. Halperin, Genome Research 08)

  6. Motif discovery: The two-step strategy • Step 1: obtain a co-regulated gene set, e.g. by clustering gene expression microarrays (Cluster I, Cluster II, Cluster III), by location analysis (ChIP-chip, …), or from a functional group (e.g., GO term) • Step 2: motif discovery on the promoter sequences of the gene set

  7. Amadeus: A Motif Algorithm for Detecting Enrichment in mUltiple Species • Supports diverse motif discovery tasks: 1. Find over-represented motifs in given sets of genes. 2. Identify motifs with global spatial features given only the genomic sequences. • How? A general pipeline architecture for enumerating motifs. Different statistical scoring schemes of motifs for different motif discovery tasks.

  8. Motif search algorithm • Pipeline of refinement phases of increased complexity • Phases: Preprocess, Mismatch, Merge, PWM Optimization • Motif model: k-mer, list of k-mers, PWM • Cutoff = 0.005

  9. Scoring over-represented motifs • Input: Target set (size T) = co-regulated genes; Background (BG) set (size B) = entire genome • Motif enrichment scoring: Hyper-geometric (b = motif hits in the BG set, t = motif hits in the target set); Binned enrichment score (genes binned by GC-content, 20-40% and 40-60%, and by length, 0.4-0.7 kbp and 0.7-1 kbp, giving per-bin counts B1..B4, T1..T4, b1..b4); Binomial
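The hyper-geometric enrichment score named on this slide is usually computed as an upper-tail probability. The sketch below assumes that standard form and reads the slide's symbols as B = background size, T = target-set size, b = background genes with a motif hit, t = target genes with a motif hit. The function name `hypergeom_tail` is hypothetical, and this is not taken from the Amadeus source.

```python
from math import comb


def hypergeom_tail(B, T, b, t):
    """P(X >= t) when T genes are drawn without replacement from B genes,
    b of which contain the motif. Small values indicate enrichment."""
    total = comb(B, T)
    upper = min(b, T)  # cannot see more hits than hit-genes or draws
    favourable = sum(comb(b, i) * comb(B - b, T - i)
                     for i in range(t, upper + 1))
    return favourable / total


if __name__ == "__main__":
    # Toy numbers, chosen only for illustration.
    print(hypergeom_tail(B=10, T=3, b=4, t=2))
```

Exact integer binomials keep the sketch short; for genome-scale B a log-space or library implementation would be the more practical choice.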

  10. Metazoan motif discovery benchmark: 42 target sets of 26 TFs, 8 miRNAs from 29 studies (expression, ChIP-chip, ...) in human, mouse, fly, worm. All motifs are experimentally verified. Average target set size: 400 genes (383 Kbp)

  13. Amadeus: Global spatial analysis [Figure: the motif discovery workflow, from a co-regulated gene set (gene expression microarrays, location analysis (ChIP-chip, …), functional group (e.g., GO term)) and promoter sequences to the output motif(s)]

  14. Task II: Global analyses • Scores for spatial features of motif occurrences • Input: Sequences (no target-set / expression data) • Motif scoring: Localization w.r.t. the TSS; Strand-bias; Chromosomal preference

  15. Global analysis: Chromosomal preference in C. elegans • Input: All worm promoters (~18,000) • Score: chromosomal preference • Results: Novel motif on chrom IV

  17. II. Finding Transcriptional Programs (Y. Halperin, C. Linhart, I. Ulitsky, NAR 10)

  18. Goal • Given expression profiles, find the transcriptional programs active in them: the co-regulated genes, and the motifs that govern their co-regulation

  19. Our goal: bypass the two-step approach • Simultaneous inference of the motifs and the expression profiles of their targets [Figure: expression data and promoter sequences lead directly to the output motif(s), in place of the path through clustering (Cluster I, II, III) and a co-regulated gene set]

  20. Allegro: expression model
     • Discretization of expression patterns into a Discrete Expression Pattern (DEP): e1 = Up (U): ≥ 1.0; e2 = Same (S): (-1.0, 1.0); e3 = Down (D): ≤ -1.0
       Conditions:                c1     c2     …   cm
       Expression pattern of g:   -2.3   -0.8   …   1.5
       DEP of g:                  D      S      …   U
     • Condition frequency matrix (CFM), F:
              c1     c2     …   cm
         U    0.05   0.1    …   0.78
         S    0.9    0.2    …   0.14
         D    0.05   0.7    …   0.08
     • Condition weight matrix (CWM): $W = \{w_{ij}\}$ with $w_{ij} = \log(f_{ij} / r_{ij})$, where $F = \{f_{ij}\}$ and $R = \{r_{ij}\}$ is the BG CFM
     ⇒ Log-likelihood ratio (LLR) score
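The expression model on this slide (discretize, build a CFM, take log-ratios against the background CFM) can be sketched as below. This is a hedged illustration rather than Allegro's implementation. The function names are hypothetical, the pseudocount is an added assumption to avoid log(0), and scoring a gene by summing the CWM entries along its DEP is an assumed reading of "LLR score".

```python
from math import log

LEVELS = ("U", "S", "D")


def discretize(values, cutoff=1.0):
    """Map expression values to U (>= cutoff), D (<= -cutoff) or S."""
    return ["U" if v >= cutoff else "D" if v <= -cutoff else "S"
            for v in values]


def cfm(deps, pseudocount=1.0):
    """Condition frequency matrix: fraction of genes at each level per
    condition. The pseudocount is an assumption, not from the slide."""
    m = len(deps[0])
    counts = {lvl: [pseudocount] * m for lvl in LEVELS}
    for dep in deps:
        for j, lvl in enumerate(dep):
            counts[lvl][j] += 1
    denom = len(deps) + pseudocount * len(LEVELS)
    return {lvl: [c / denom for c in counts[lvl]] for lvl in LEVELS}


def cwm(F, R):
    """Condition weight matrix: w_ij = log(f_ij / r_ij)."""
    return {lvl: [log(f / r) for f, r in zip(F[lvl], R[lvl])]
            for lvl in LEVELS}


def llr_score(dep, W):
    """Assumed LLR score of one gene: sum of the weights along its DEP."""
    return sum(W[lvl][j] for j, lvl in enumerate(dep))


if __name__ == "__main__":
    # Toy profiles; only the first row comes from the slide.
    profiles = [[-2.3, -0.8, 1.5], [-1.5, 0.2, 2.0], [0.1, 0.3, -0.2],
                [0.5, -1.2, 0.0], [1.4, 0.0, -1.1], [0.0, 0.9, 0.4]]
    deps = [discretize(p) for p in profiles]
    R = cfm(deps)        # background: all genes
    F = cfm(deps[:2])    # hypothetical target set: first two genes
    W = cwm(F, R)
    for dep in deps:
        print("".join(dep), round(llr_score(dep, W), 3))
```

In this toy run, genes whose pattern resembles the target set (D, S, U) get a positive score and the opposite pattern gets a negative one.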

  21. Allegro overview

  22. Yeast osmotic shock pathway • ~6,000 genes, 133 conditions [O’Rourke et al. ’04] • Allegro can discover multiple motifs with diverse expression patterns, even if the response is in a small fraction of the conditions • Extant two-step techniques recovered only 4 of the above motifs: K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE; Iclust + FIRE: RRPE, PAC, Rap1, STRE

  23. 3’ UTR analysis: Human stem cells • ~14,000 genes, 124 conditions (various types of proliferating cells) [Mueller et al., Nature ’08] • Biases in length / GC-content of 3’ UTRs, e.g. for the 100 highly-expressed genes in:
     Cell type                  3’ UTR length   GC
     Embryoid bodies            584             47%
     Undifferentiated ESCs      774             44%
     ESC-derived fibroblasts    1240            39%
     Fetal NSCs                 1422            43%
     (ESCs = embryonic stem cells, NSCs = neural stem cells)
     • Extant methods / Allegro with HG score: report only false positives

  24. Human stem cells: results using binned score [Table columns: miRNA; targets expression; miRNA expression; current knowledge] • Current knowledge entries: Most highly expressed miRNAs in human/mouse ESCs; Abundant & functional in neural cell lineage; Expressed specifically in neural lineage, active role in neurogenesis • miRNA expression from [Laurent ’08]

  25. Yonit Halperin, Chaim Linhart, Igor Ulitsky, Yaron Orenstein

  26. Open questions • Better PWM inference: new scores, algorithms • Richer models for in vivo / in vitro data: really helpful or diminishing return? • How to evaluate model quality: match to literature? Ranking based? In vivo? In vitro? • Integration of motif finding & expression • Principled means to find motif pairs

  27. Using expression profiles and protein networks to understand cancer I (I. Ulitsky, R. M. Karp, RECOMB 09; I. Ulitsky, A. Krishnamurthy, R. M. Karp, PLoS One 10)

  28. DNA chips / Microarrays • Simultaneous measurement of expression levels of all genes. • Global view of cellular processes. • > 800,000 profiles available in ArrayExpress

  29. Protein-protein interactions (PPIs) • A regulates/binds to B • High throughput: abundant, noisy • Large, readily available resource

  30. Case/control studies • A typical study: 100s of expression profiles of sick (case) and healthy (control) individuals • Classification: Given a partition of the samples into types, classify the types of new samples • Can the network help? [Figure: expression matrix of genes by individuals, split into sick and healthy, with a new sample of unknown type]

  31. The network angle • Integrate case-control profiles with network information • Extract dysregulated pathways specific to the cases • Account for heterogeneity among cases • Meaningful pathway: connected

  32. Preprocessing • For each gene, use the distribution of values among the controls to decide if the gene is dysregulated in each of the cases [Figure: expression matrix of genes A-E over Controls 1-4 and Cases 1-3, converted into a binary gene-by-case dysregulation matrix and the matching bipartite graph between genes A-E and Cases 1-3]
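The preprocessing step can be illustrated with a simple rule. The slide does not say which statistic is used, so the z-score test and the cutoff of 2.0 below are assumptions, and the function name and expression values are hypothetical.

```python
from statistics import mean, stdev


def dysregulated(controls, cases, z_cutoff=2.0):
    """Flag each case value whose z-score, relative to the control
    distribution, is at least z_cutoff in absolute value.
    The z-score rule is an assumed stand-in for the actual test."""
    mu = mean(controls)
    sigma = stdev(controls)
    if sigma == 0:
        # Degenerate controls: any deviation counts as dysregulated.
        return [x != mu for x in cases]
    return [abs(x - mu) / sigma >= z_cutoff for x in cases]


if __name__ == "__main__":
    # Toy data: gene -> (control values, case values); not from the slide.
    expression = {
        "A": ([1.0, 1.2, 0.9, 1.1], [1.0, 3.0, 2.9]),
        "B": ([5.0, 5.1, 4.9, 5.0], [2.0, 5.0, 5.1]),
    }
    # Binary gene-by-case matrix: 1 if the gene is dysregulated in the case.
    for gene, (controls, cases) in expression.items():
        print(gene, [int(flag) for flag in dysregulated(controls, cases)])
```

Each row of the resulting binary matrix lists the cases in which a gene is dysregulated, which is the edge set of the gene-case bipartite graph used on the following slide.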

  33. Dysregulated pathway • Input: – Bipartite graph: genes, cases – Edge (gene g, case c) if g is dysregulated in c – A network over the genes • Dysregulated pathway (DP): smallest connected subnetwork s.t. sufficiently many genes (≥ k) are dysregulated in all but few (≤ l) cases • Small pathway → focused disease explanation • Min connected set cover problem [Figure: bipartite graph between genes A-E and Cases 1-3, example with k = 2, l = 1]
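The DP definition can be made concrete with a brute-force search. It is exponential in the number of genes and only suitable for toy inputs, and it is not the DEGAS algorithm. The network and dysregulation sets below are invented for illustration and do not reproduce the slide's figure.

```python
from itertools import combinations


def is_connected(nodes, adj):
    """True if the subnetwork induced by `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def covers(nodes, dysreg, cases, k, l):
    """True if at most l cases have fewer than k dysregulated genes
    among `nodes`."""
    uncovered = 0
    for c in cases:
        if sum(1 for g in nodes if c in dysreg[g]) < k:
            uncovered += 1
    return uncovered <= l


def smallest_dp(adj, dysreg, cases, k, l):
    """Smallest connected gene set meeting the (k, l) requirement,
    found by exhaustive search in order of increasing size."""
    genes = sorted(adj)
    for size in range(1, len(genes) + 1):
        for subset in combinations(genes, size):
            if is_connected(subset, adj) and covers(subset, dysreg, cases, k, l):
                return set(subset)
    return None


if __name__ == "__main__":
    # Hypothetical path network A-B-C-D-E and dysregulation sets.
    adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
           "D": {"C", "E"}, "E": {"D"}}
    dysreg = {"A": {1}, "B": {1, 2}, "C": {2, 3}, "D": {3}, "E": {1, 3}}
    print(smallest_dp(adj, dysreg, cases=[1, 2, 3], k=2, l=1))
```

With k = 2 and l = 1 the toy input yields the three-gene pathway A, B, C. Tightening to l = 0 forces a fourth gene, which shows how allowing a few uncovered cases keeps the pathway small.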

  34. Complexity • Set cover problem: Given sets of elements, find fewest sets that cover all elements
     k    l     G        Problem
     1    0     Clique   Set cover
     k    0     Clique   Set k-cover
     1    >0    Clique   Partial set cover
     1    0     Any      Connected set cover (Shuai & Hu 06)
     • All are NP-Hard • Devised approximation and heuristic algorithms • DEGAS: DysrEgulated Gene set Analysis via Subnetworks
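For the plain set cover row of the table (k = 1, l = 0, clique network), the classical greedy heuristic gives a feel for what an approximation algorithm looks like. It is shown only as a textbook example and is not claimed to be the algorithm used in DEGAS. The set names and contents are made up.

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements.
    `sets` maps a set name to its elements. Ties go to the first name
    in dictionary order of insertion."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets cannot cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


if __name__ == "__main__":
    # In the DP setting, elements would be cases and each set would be
    # the cases in which one gene is dysregulated.
    sets = {"S1": {1, 2, 3}, "S2": {2, 4}, "S3": {3, 4}, "S4": {4, 5}}
    print(greedy_set_cover({1, 2, 3, 4, 5}, sets))
```

The greedy choice ignores connectivity in the gene network, which is exactly the extra constraint that turns the problem into connected set cover.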
