4/17/09 CSCI1950‐Z Computa4onal Methods for Biology Lecture 19 Ben Raphael April 13, 2009 hIp://cs.brown.edu/courses/csci1950‐z/ Copy number Aberra4ons Normal cells: Cancer cells: Extensive gene duplica4on/dele4on Red and gene probes to regions on X chromosome: >2 copies of green region hIp://www.coriell.org/images/gm10636fish.jpg hIp://www.imppc.org/ 1
4/17/09 DNA Microarrays Measuring Muta4ons in Cancer Compara4ve Genomic Hybridiza4on (CGH) 2
4/17/09 CGH Analysis (1) Divide genome into segments of equal copy number 0.5 Log 2 (R/G) 0 Genomic posi4on ‐0.5 Dele4on Amplifica4on 0.5 0 Genomic posi4on ‐0.5 CGH Analysis (2) Iden4fy aberra4ons common to mul4ple samples Chromosome 3 of 26 lung tumor samples on mid‐ Samples density cDNA array. Common dele4on located in 3p21 and common amplifica4on – in 3q. 3
4/17/09 CGH Analysis (1) Divide genome into segments of equal copy number Copy number profile Numerous methods Copy number (e.g. clustering, Hidden Markov Model, Bayesian, Genome etc.) coordinate Segmenta4on Input : X i = log 2 T i / R i , clone i = 1, …, N Output : Assignment s(y i ) ∈ {S 1 , …, S K } S i represent copy number states An Approach to CGH Segmenta4on • Circular Binary Segmenta4on (CBS), Olshen et al. 2004 • Use hypothesis test to compare means of two intervals using t‐test (whiteboard) Dele4on Amplifica4on 0.5 Genomic posi4on 0 ‐0.5 4
4/17/09 Interval Score Lipson, et al. J. Computa7onal Biology , 2006 Assume: • X i are independent, normally distributed • µ and σ denote the mean and standard devia4on of the normal genomic data. Given an interval I spanning k probes, we define its score as: Significance of Interval Score Lipson, et al. J. Computa7onal Biology , 2006 Assume: • X i ~ N( µ, σ ) 5
4/17/09 The MaxInterval Problem A vector X =( X 1 … X n ) Input: Output: An interval I ⊂ [1… n ], that maximizes S( I ) Other intervals with high scores may be found by recursively calling this func4on. Exhaus4ve algorithm: O(n 2 ) MaxInterval Algorithm I: LookAhead Assume given: • m : An upper bound for the value of a single element X i • t : A lower bound on the maximum score I = [ i, …, i + k ‐1] I ’ = [ i ,…, i+k+r ‐1] s = Σ j ∈ I Xj s + m r sum k + r length k score Solve for first r for which S( I ) may exceed t. Complexity: Expected O( n 1.5 ) (unproved) 6
4/17/09 MaxInterval Algorithm II: Geometric Family Approximation (GFA) For ε >0 define the Ω (j 1 ) following geometric Ω (j 2 ) family of intervals: Ω (j 3 ) Δ j k j Theorem: Let I * be the op4mal scoring interval. Let J be the leomost longest interval of Ω fully contained in I *. Then S( J ) ≥ S( I *)/ α , where α ∝ (1‐ ε ‐2 ) ‐1 . Complexity: O( n ) Benchmarking synthe4c vectors of varying lengths Linear regression suggests that the complexi4es of the Exhaus4ve, LookAhead and GFA algorithms are O(n 2 ), O(n 1.5 ), O(n), respec4vely. 7
4/17/09 Applications: Single Samples 1 A2BP1 FRA16B Chromosome 16 of Log 2 (ra4o) HCT116 colon carcinoma 0 cell line on high‐density oligo array (n=5,464). ‐1 0 25 50 75 Mbp ERBB2 1 Chromosome 17 of several 0 breast carcinoma cell lines on 1 mid‐density cDNA array Log 2 (ra4o) 0 (n=364). 1 0 1 0 0 25 50 75 Mbp Another Approach to CGH Segmenta4on Use Hidden Markov Model (HMM) to “parse” sequence of probes into copy number states Dele4on Amplifica4on 0.5 Genomic posi4on 0 ‐0.5 8
4/17/09 Hidden Markov Models 1 1 1 1 … 1 … 2 2 2 2 2 2 … … … … K K K K … K x 1 x 2 x 3 x K Example: The Dishonest Casino A casino has two dice: Fair die • P(1) = P(2) = P(3) = P(5) = P(6) = 1/6 Loaded die • P(1) = P(2) = P(3) = P(5) = 1/10 P(6) = 1/2 Casino player switches back‐&‐forth between fair and loaded die once every 20 turns Game: 1. You bet $1 2. You roll (always with a fair die) 3. Casino player rolls (maybe with fair die, maybe with loaded die) 4. Highest number wins $2 9
4/17/09 Ques4on # 1 – Evalua4on GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 Prob = 1.3 x 10 ‐35 QUESTION How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs Ques4on # 2 – Decoding GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 FAIR LOADED FAIR QUESTION What por4on of the sequence was generated with the fair die, and what por4on with the loaded die? This is the DECODING ques4on in HMMs. This is what we want to solve for CGH analysis 10
4/17/09 Ques4on # 3 – Learning GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 Prob(6) = 64% QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How ooen does the casino player change from fair to loaded, and back? This is the LEARNING ques4on in HMMs Defini4on of a hidden Markov model DefiniSon: A hidden Markov model (HMM) • Alphabet Σ = { b 1 , b 2 , …, b M } Set of states Q = { 1, ..., K } • Transi4on probabili4es between any two states • 1 2 a ij = transi4on prob from state i to state j a i1 + … + a iK = 1, for all states i = 1…K • Start probabili4es a 0i K … a 01 + … + a 0K = 1 • Emission probabili4es within each state e i (b) = P( x i = b | π i = k) e i (b 1 ) + … + e i (b M ) = 1, for all states i = 1…K 11
4/17/09 The dishonest casino model 0.05 0.95 0.95 FAIR LOADED P(1|F) = 1/6 P(1|L) = 1/10 P(2|F) = 1/6 P(2|L) = 1/10 0.05 P(3|F) = 1/6 P(3|L) = 1/10 P(4|F) = 1/6 P(4|L) = 1/10 P(5|F) = 1/6 P(5|L) = 1/10 P(6|F) = 1/6 P(6|L) = 1/2 A HMM is memory‐less 1 2 At each 4me step t, the only thing that affects future states is the current state π t K … P( π t+1 = k | “whatever happened so far”) = P( π t+1 = k | π 1 , π 2 , …, π t , x 1 , x 2 , …, x t ) = P( π t+1 = k | π t ) 12
4/17/09 A parse of a sequence Given a sequence x = x 1 ……x N , A parse of x is a sequence of states π = π 1 , ……, π N … 1 1 1 1 1 … 2 2 2 2 2 2 … … … … … K K K K K x 1 x 2 x 3 x K Likelihood of a Parse Simply, mulSply all the orange arrows ! (transi4on probs and emission probs) … 1 1 1 1 1 … 2 2 2 2 2 2 … … … … … K K K K K x 1 x 2 x 3 x K 13
4/17/09 Likelihood of a parse A compact way to write 1 1 1 1 1 … a 0 π 1 a π 1 π 2 ……a π N‐1 π N e π 1 (x 1 )……e π N (x N ) 2 2 2 2 2 2 … Number all parameters a ij and e i (b); n params Given a sequence x = x 1 ……x N … … … … Example: and a parse π = π 1 , ……, π N , a 0Fair : θ 1 ; a 0Loaded : θ 2 ; … e Loaded (6) = θ 18 K K K K … K Then, count in x and π the # of 4mes each parameter j = 1, …, n occurs To find how likely is the parse: x 1 x 2 x 3 x K (given our HMM) F(j, x, π ) = # parameter θ j occurs in (x, π ) P(x, π ) = P(x 1 , …, x N , π 1 , ……, π N ) = (call F(.,.,.) the feature counts ) Then, P(x N , π N | π N‐1 ) P(x N‐1 , π N‐1 | π N‐2 )……P(x 2 , π 2 | π 1 ) P(x 1 , π 1 ) = P(x N | π N ) P( π N | π N‐1 ) ……P(x 2 | π 2 ) P( π 2 | π 1 ) P(x 1 | π 1 ) P( π 1 ) = P(x, π ) = Π j=1…n θ j F(j, x, π ) = a 0 π 1 a π 1 π 2 ……a π N‐1 π N e π 1 (x 1 )……e π N (x N ) = exp [ Σ j=1…n log( θ j ) × F(j, x, π ) ] Example: the dishonest casino Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 Then, what is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say ini4al probs a 0Fair = ½, a oLoaded = ½) ½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6) 10 × (0.95) 9 = .00000000521158647211 ~= 0.5 × 10 ‐9 14
4/17/09 Example: the dishonest casino So, the likelihood the die is fair in this run is just 0.521 × 10 ‐9 OK, but what is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded? ½ × P(1 | Loaded) P(Loaded, Loaded) … P(4 | Loaded) = ½ × (1/10) 9 × (1/2) 1 (0.95) 9 = .00000000015756235243 ~= 0.16 × 10 ‐9 Therefore, it somewhat more likely that all the rolls are done with the fair die, than that they are all done with the loaded die Example: the dishonest casino Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6 Now, what is the likelihood π = F, F, …, F ? ½ × (1/6) 10 × (0.95) 9 = 0.5 × 10 ‐9 , same as before What is the likelihood π = L, L, …, L? ½ × (1/10) 4 × (1/2) 6 (0.95) 9 = .00000049238235134735 ~= 0.5 × 10 ‐7 So, it is 100 4mes more likely the die is loaded 15
4/17/09 HMM Model for CGH data A+ A+ C+ C+ G+ G+ Fridlyand et al. (2004) S1 S1 S2 S2 S3 S3 S4 S4 A model for CGH data K states S 1 S 2 S 3 S 4 copy numbers 1 2 3 4 Homozygous Heterozygous Normal Duplication Deletion Deletion (copy =2) (copy >2) (copy =1) (copy =0) µ 1 , , σ 1 µ 2 , σ 2 µ 3 , σ 3 µ 4 , σ 4 1 Emissions: Copy number Gaussians Genome coordinate 16
Recommend
More recommend