GS 559 Winter 2010, Lecture 11: Sequence Motifs (Larry Ruzzo)

New web site soon (but old links should redirect): http://www.cs.washington.edu/homes/ruzzo/courses/gs559/10wi

Who Am I?
Prof., Computer Science & Engineering
Adjunct Prof., Genome Sciences
Joint Member, FHCRC
Main research interest: noncoding RNA
http://www.cs.washington.edu/homes/ruzzo
ruzzo@uw.edu
554 CSE, 543-6298
Office Hours: Mondays 2:30-3:20, or by appt

Outline
Bioinformatics: Sequence Motifs
  Sequence Logos
  Weight Matrix Models (WMMs), aka Position Specific Scoring Matrices (PSSMs, "possums"), aka 0th order Markov models
  Construction, statistics, uses
Programming: Regular expressions

Motifs
Motif: “a recurring salient thematic element”

MyoD http://www.rcsb.org/pdb/explore/jmol.do?structureId=1MDY&bionumber=1

Sea Urchin - Endo16

Sequence Motifs
Motif: “a recurring salient thematic element”
E.g., structural motifs in proteins (zinc finger, H-T-H, leucine zipper, ... are various DNA binding motifs)
E.g., the DNA sequence motifs to which these proteins bind; e.g., one leucine zipper dimer might bind (with varying affinities) to 10s or 100s or 1000s of similar sequences

E. coli Promoters
“TATA Box” ~10bp upstream of transcription start
How to define it?

Example sites:
  TACGAT
  TAAAAT
  TATACT
  GATAAT
  TATGAT
  TATGTT

Consensus is TATAAT, BUT all differ from it
Allow k mismatches?
Equally weighted?
Wildcards like R, Y? ({A,G}, {C,T}, resp.)
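The two simplest answers to "how to define it?" can be tried directly on the six example sites. The following is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the wildcard pattern `TATRAT` is a hypothetical example rather than a published consensus.

```python
import re

CONSENSUS = "TATAAT"
# The six example sites listed on the slide.
SITES = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]

def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def within_k(site, k, consensus=CONSENSUS):
    """True if the site differs from the consensus in at most k positions."""
    return mismatches(site, consensus) <= k

# Wildcards from the slide: R = {A,G}, Y = {C,T}.
WILDCARDS = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}

def wildcard_regex(pattern):
    """Translate a pattern with R/Y wildcards into a compiled regular expression."""
    return re.compile("".join(WILDCARDS[c] for c in pattern))

for site in SITES:
    print(site, mismatches(site, CONSENSUS))

# Hypothetical wildcard pattern: position 4 may be A or G.
pattern = wildcard_regex("TATRAT")
print([s for s in SITES if pattern.fullmatch(s)])
```

Both approaches treat every position and every substitution as equally important, which is the limitation the weight matrix below addresses.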

E. coli Promoters
“TATA Box”: consensus TATAAT, ~10bp upstream of transcription start
Not exact: of 168 studied (mid 80’s)
– nearly all had 2/3 of TAxyzT
– 80-90% had all 3
– 50% agreed in each of x,y,z
– no perfect match
(Other common features at -35, etc.)

TATA Box Frequencies

pos     1    2    3    4    5    6
base
A       2   94   26   59   50    1
C       9    2   14   13   20    3
G      10    1   16   15   13    0
T      79    3   44   13   17   96

Sequence Logo: http://weblogo.berkeley.edu

Frequency ⇒ Scores: log2(freq/background)
(For convenience, scores multiplied by 10, then rounded)

Frequencies
pos     1    2    3    4    5    6
base
A       2   94   26   59   50    1
C       9    2   14   13   20    3
G      10    1   16   15   13    0
T      79    3   44   13   17   96

Scores
pos     1    2    3    4    5    6
base
A     -36   19    1   12   10  -46
C     -15  -36   -8   -9   -3  -31
G     -13  -46   -6   -7   -9  -46
T      17  -31    8   -9   -6   19
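The conversion from frequencies to scores can be reproduced in a few lines. This is a sketch under two assumptions that the slide does not state explicitly: the table entries are percentages (each column sums to 100), and the background frequency is uniform at 0.25 per base. Note that a frequency of 0 gives a score of -∞ under this formula; the slide lists -46 for G at position 6, which is the score a 1% frequency would give.

```python
import math

BASES = "ACGT"
# Frequencies from the table, positions 1..6 (assumed to be percentages).
FREQ = {
    "A": [2, 94, 26, 59, 50, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}
BACKGROUND = 0.25  # assumption: all four bases equally likely in the background

def score(percent, background=BACKGROUND):
    """10 * log2(freq / background), rounded; -inf when the frequency is 0."""
    if percent == 0:
        return float("-inf")
    return round(10 * math.log2(percent / 100 / background))

SCORES = {b: [score(p) for p in FREQ[b]] for b in BASES}

for b in BASES:
    print(b, SCORES[b])
```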

Scanning for TATA

Score matrix:
pos     1    2    3    4    5    6
A     -36   19    1   12   10  -46
C     -15  -36   -8   -9   -3  -31
G     -13  -46   -6   -7   -9  -46
T      17  -31    8   -9   -6   19

Slide the matrix along the sequence A C T A T A A T C G and add the per-position scores for each window:
  window CTATAA: score = -90
  window TATAAT: score = 85
  window ATAATC: score = -91

Stormo, Ann. Rev. Biophys. Biophys Chem, 17, 1988, 241-263
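The scan can be written as a short loop over windows. This is a minimal sketch using the score matrix exactly as printed on the slide; the function names and the threshold value of 50 are illustrative assumptions, not values from the lecture.

```python
# Score matrix from the slide (positions 1..6 of the motif).
SCORES = {
    "A": [-36, 19, 1, 12, 10, -46],
    "C": [-15, -36, -8, -9, -3, -31],
    "G": [-13, -46, -6, -7, -9, -46],
    "T": [17, -31, 8, -9, -6, 19],
}
WIDTH = 6

def window_score(window):
    """Sum the per-position scores for one window of length WIDTH."""
    return sum(SCORES[base][i] for i, base in enumerate(window))

def scan(seq):
    """Return (start, window, score) for every window of length WIDTH in seq."""
    return [
        (i, seq[i:i + WIDTH], window_score(seq[i:i + WIDTH]))
        for i in range(len(seq) - WIDTH + 1)
    ]

results = scan("ACTATAATCG")
for start, window, s in results:
    print(start, window, s)

THRESHOLD = 50  # hypothetical cutoff, chosen only for illustration
hits = [(start, window) for start, window, s in results if s >= THRESHOLD]
print(hits)
```

The windows starting at offsets 1, 2 and 3 (0-based) give the three scores shown on the slide: -90, 85 and -91.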

Scanning for TATA
[Figure: score (y-axis, -150 to 100) at each window position along the sequence ACTATAATCGATCGATGCTAGCATGCGGATATGAT; labeled scores include 85, 66, 50, 23, -90 and -91.]

TATA Scan at 2 genes
[Figure: two panels, LacI and LacZ, each plotting Score (y-axis, -150 to 50) along the gene.]

Score Distribution (Simulated)
[Figure: histogram of scores; x-axis score from -150 to 90, y-axis count from 0 to 3500.]
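A distribution of this kind can be approximated by scoring random windows. The slide does not say how its simulation was run, so the following sketch makes its own assumptions: windows are drawn with uniform base frequencies, the sample size is 10,000, the bin width is 20, and the seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter

# Score matrix from the slide (positions 1..6 of the motif).
SCORES = {
    "A": [-36, 19, 1, 12, 10, -46],
    "C": [-15, -36, -8, -9, -3, -31],
    "G": [-13, -46, -6, -7, -9, -46],
    "T": [17, -31, 8, -9, -6, 19],
}

def window_score(window):
    """Sum the per-position scores for one 6-base window."""
    return sum(SCORES[base][i] for i, base in enumerate(window))

def simulate(n, width=6, seed=0):
    """Score n random windows drawn with uniform base frequencies (an assumption)."""
    rng = random.Random(seed)
    return [
        window_score("".join(rng.choice("ACGT") for _ in range(width)))
        for _ in range(n)
    ]

def histogram(scores, bin_width=20):
    """Count scores per bin; each key is the lower edge of its bin."""
    return Counter((s // bin_width) * bin_width for s in scores)

scores = simulate(10000)
for lower in sorted(histogram(scores)):
    print(f"{lower:5d} to {lower + 19:5d}: {histogram(scores)[lower]}")
```

Most random windows score well below zero, which is what makes a high-scoring window such as TATAAT (85) stand out.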

Weight Matrices: Thermodynamics
Experiments show ~80% correlation of (log likelihood) weight matrix scores to measured binding energy of RNA polymerase to variations on TATAAT consensus [Stormo & Fields]

What’s best WMM?
Given, say, 168 sequences s1, s2, ..., sk of length 6, assumed to be generated at random according to a WMM defined by 6 x (4-1) parameters θ, what’s the best θ?
Answer: count frequencies per position.
More justification next time, but if you saw 900 Heads in 1000 coin flips, you’d perhaps estimate P(Heads) = 900/1000
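Counting frequencies per position is straightforward once the sites are aligned. The sketch below uses the six example sites from the earlier slide as toy input, not the 168 promoters behind the frequency table; the function name is made up for illustration.

```python
from collections import Counter

BASES = "ACGT"
# Toy input: the six aligned example sites from the earlier slide.
SITES = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]

def position_frequencies(seqs):
    """Return one {base: frequency} dict per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({b: counts[b] / len(seqs) for b in BASES})
    return freqs

theta = position_frequencies(SITES)
for pos, column in enumerate(theta, start=1):
    print(pos, {b: round(f, 2) for b, f in column.items()})
```

Each column's four frequencies sum to 1, so only three of them are free, which is where the 6 x (4-1) parameter count comes from.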

Pseudocounts
Freq/count of 0 ⇒ -∞ score; a problem?
Certain that a given residue never occurs in a given position? Then -∞ just right.
Else, it may be a small-sample artifact
Typical fix: add a pseudocount to each observed count, a small constant (e.g., .5, 1)
Sounds ad hoc; there is a Bayesian justification
Influence fades with more data
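The effect of a pseudocount on a zero count can be seen in a small example. This is an illustrative sketch: the counts below reuse the position-6 column of the frequency table as if it came from 100 sequences, the pseudocount is 1, and the background is assumed uniform at 0.25; none of these choices is prescribed by the slides.

```python
import math

BACKGROUND = 0.25  # assumption: uniform background

def smoothed_freq(count, total, pseudocount=1.0, alphabet_size=4):
    """Frequency estimate after adding a pseudocount to every base's count."""
    return (count + pseudocount) / (total + pseudocount * alphabet_size)

def score(freq, background=BACKGROUND):
    """10 * log2(freq / background), rounded; -inf when freq is 0."""
    if freq == 0:
        return float("-inf")
    return round(10 * math.log2(freq / background))

# Illustrative counts for one position, treated as counts out of 100 sequences.
counts = {"A": 1, "C": 3, "G": 0, "T": 96}
total = sum(counts.values())

raw = {b: score(c / total) for b, c in counts.items()}
smooth = {b: score(smoothed_freq(c, total)) for b, c in counts.items()}
print("raw:   ", raw)
print("smooth:", smooth)

# With ten times as much data, the pseudocount shifts the estimate much less.
print(smoothed_freq(96, 100), smoothed_freq(960, 1000))
```

The zero count for G no longer produces -∞, and the estimate for T moves closer to the raw 0.96 as the sample grows.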

How-to Questions
Given aligned motif instances, build model?
  Frequency counts (above, maybe w/ pseudocounts)
Given a model, find (probable) instances
  Scanning, as above
Given unaligned strings thought to contain a motif, find it? (e.g., upstream regions of co-expressed genes)
  Hard ... maybe another lecture.

WMM Summary
Weight Matrix Model (aka Position Specific Scoring Matrix, PSSM, “possum”, 0th order Markov model)
Simple statistical model assuming independence between adjacent positions
To build: align, count (+ pseudocount) letter frequency per position, log likelihood ratio to background
To scan: add per position scores, compare to threshold, slide
Databases & tools: Transfac, Jaspar, MEME/MAST, ...
