U-curve Search for Biological States Characterization and Genetic Network Design
Marcelo Ris – Universidade de São Paulo – Instituto de Matemática e Estatística
Junior Barrera – Universidade de São Paulo – Instituto de Matemática e Estatística
Helena Brentani – Hospital do Câncer, Fundação Antônio Prudente
Outline
• Introduction
• Feature selection problem
• U-curve search algorithm
• Characterization of biological states
• Genetic network design
• Application
Introduction
• Biological problems
  – P1. Biological states characterization
  – P2. Genetic network design
• Gene expression data
  – P1. State samples
  – P2. Time-course samples
• Mathematical approach
  – Feature selection problem
  – State of the art: heuristic optimizations
  – U-curve algorithm
Feature selection problem
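Feature Selection
• Samples drawn from the joint distribution $P(X, Y)$:
$$\{(x[0], y[0]), (x[1], y[1]), \ldots, (x[m], y[m])\}$$
• Feature vector:
$$x[t] = \begin{pmatrix} x_1[t] \\ x_2[t] \\ \vdots \\ x_n[t] \end{pmatrix}, \qquad x_i[t] \in R = \{r_1, r_2, \ldots, r_l\}$$
• Labels: $y[t] \in K = \{1, \ldots, c\}$
• Feature selection chooses a subset $A \subset \{1, 2, \ldots, n\}$ and a mapping $\psi : R^{|A|} \to K$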
Y Distribution
$$P : \{-1, 0, 1\} \to [0, 1], \qquad \sum_{y \in \{-1, 0, 1\}} P(y) = 1$$
[Figure: three example distributions P(Y), P(Y'), P(Y'') over the values −1, 0, 1]
Entropy
$$H(Y) = -\sum_{y \in \{-1, 0, 1\}} P(y) \log P(y)$$
For the example distributions: $H(Y) > H(Y')$ and $H(Y') = H(Y'')$.
Mutual Information
$$I(X, Y) = H(Y) - H(Y \mid X) \geq 0$$
Mean Conditional Entropy
$$E[H(Y \mid X)] = -\sum_{x \in \{-1, 0, 1\}} P_X(x) \sum_{y \in \{-1, 0, 1\}} P_{Y|X}(y \mid x) \log P_{Y|X}(y \mid x)$$
Estimation
$$\hat{E}[H(Y \mid X)] = -\sum_{x \in \{-1, 0, 1\}} \hat{P}_X(x) \sum_{y \in \{-1, 0, 1\}} \hat{P}_{Y|X}(y \mid x) \log \hat{P}_{Y|X}(y \mid x)$$
Mean Mutual Information
$$E[I(X, Y)] = H(Y) - E[H(Y \mid X)]$$
Estimation
$$\hat{E}[I(X, Y)] = \hat{H}(Y) - \hat{E}[H(Y \mid X)]$$
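The estimators above can be illustrated with a minimal sketch that replaces each probability by its relative frequency in the samples. The function names, the base-2 logarithm, and the toy data are assumptions made for this example; the slides do not fix a logarithm base or an implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Plug-in estimate of H(Y): relative frequencies replace P(y)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mean_conditional_entropy(xs, ys):
    """Plug-in estimate of E[H(Y|X)]: entropy of Y within each observed x,
    weighted by the relative frequency of that x."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mean_mutual_information(xs, ys):
    """Estimated E[I(X,Y)] = estimated H(Y) minus estimated E[H(Y|X)]."""
    return entropy(ys) - mean_conditional_entropy(xs, ys)

# Toy data (assumed): Y is fully determined by the single quantized feature.
xs = [(-1,), (-1,), (1,), (1,)]
ys = [0, 0, 1, 1]
h_y = entropy(ys)
h_y_given_x = mean_conditional_entropy(xs, ys)
info = mean_mutual_information(xs, ys)
```

In this toy case the feature removes all uncertainty about Y, so the estimated conditional entropy is zero and the estimated mutual information equals the estimated H(Y).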
• Problem
  – Find the subset A that optimizes the cost function
  – Example cost function: the mean conditional entropy, to be minimized
  – The number of candidate subsets is exponential in n
• Search space
  – Complete Boolean lattice of order n
  – Each node represents a candidate subset A
  – The cost function is estimated at each node
  – Goal: find the node with the minimum cost
[Figure: Boolean lattice of order 4, with a 4-element chain emphasized]
• Heuristics: SFS, SFFS
  – Incremental
  – Do not search the whole candidate space
  – May fail to obtain the best result
    • Example: two features that each make the result worse when added alone can improve it considerably when added together
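The limitation of incremental heuristics can be seen in a small sketch. It assumes a plain forward selection that stops when no single added feature lowers the cost, a plug-in conditional entropy as cost, and toy data where Y is the XOR of two features; none of these choices come from the slides. With this plug-in cost a single feature gives no improvement rather than a worse result, which is already enough to stall the incremental search.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def cond_entropy(samples, subset):
    """Plug-in estimate of E[H(Y|X_A)] for the features indexed by `subset`."""
    n = len(samples)
    groups = {}
    for x, y in samples:
        key = tuple(x[i] for i in subset)
        groups.setdefault(key, Counter())[y] += 1
    total = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        h = -sum(c / size * log2(c / size) for c in counts.values())
        total += size / n * h
    return total

def sfs(samples, n_features):
    """Forward selection: add the best single feature while the cost drops."""
    selected, best = (), cond_entropy(samples, ())
    while len(selected) < n_features:
        remaining = [i for i in range(n_features) if i not in selected]
        cost, feature = min(
            (cond_entropy(samples, selected + (i,)), i) for i in remaining
        )
        if cost >= best:
            break  # no single feature improves the cost: stop
        selected, best = selected + (feature,), cost
    return selected, best

def exhaustive(samples, n_features):
    """Visit every node of the Boolean lattice, smallest subsets first."""
    subsets = (
        s for k in range(n_features + 1)
        for s in combinations(range(n_features), k)
    )
    cost, subset = min(
        ((cond_entropy(samples, s), s) for s in subsets), key=lambda t: t[0]
    )
    return subset, cost

# Toy data (assumed): y = x0 XOR x1, and x2 is irrelevant.
samples = [(x, x[0] ^ x[1]) for x in product((0, 1), repeat=3)]
sfs_result = sfs(samples, 3)
full_result = exhaustive(samples, 3)
```

Here forward selection stops at the empty set, because no single feature reduces the cost, while the exhaustive search finds that features 0 and 1 together determine Y.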
U-curve search algorithm
• U-curve property of Ê[H(Y|X)]
  – For a fixed number of samples
  – For any chain of the search space
  – Ê[H(Y|X)] forms a U-curve as a function of |A|
  – Why?
  – The estimate is composed of:
    • True measure: decreases from H(Y) to the true value E[H(Y|X)]
    • Estimation error: increases as more attributes are added to X
[Figure: Ê[H(Y|X)] plotted against |A|]
• Features of the algorithm
  – Branch-and-bound: covers the whole space without having to visit every candidate
  – Stochastic
  – Some definitions:
    • U-cost Boolean lattice
    • Local minimum
    • Exhausted minimum
    • Global minimum
• Search space characterized by:
  – Upper bound list
  – Lower bound list
• An element is reachable if there is a chain to it from an element of the upper or lower bound list
• At each step:
  – Select, with some probability, the list to start from
  – Select a random element from this list
  – Build a chain iteratively:
    • Append to the chain a random reachable element adjacent to the last one
    • Stop when the cost of the last element is greater than the cost of the previous one
[Figure: example chain 0000, 0100, 0110, 1110 in the lattice, with node labels 9, 7, 6, 10; the prune procedure is indicated]
• Additional procedures
  – Minimum exhausting
    • Avoids visiting the same candidate more than once
    • Uses a stack
  – Pruning from an element E
    • Upper bound list: remove every element U that contains E, and insert the elements reachable from U that do not contain E
    • Lower bound list: remove every element L that is contained in E, and insert the elements reachable from L that are not contained in E
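The chain-building step can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm: it always starts from the empty set and omits the bound lists, pruning, and minimum exhausting. The function name `walk_chain_up` and the toy cost are assumptions made for this example.

```python
import random

def walk_chain_up(cost, n, rng):
    """Walk one random chain upward from the empty set.

    At each step a random adjacent element (one more feature) is tried;
    the walk stops as soon as the cost rises, so the element returned is
    the minimum found along this chain.
    """
    current = frozenset()
    current_cost = cost(current)
    while len(current) < n:
        feature = rng.choice([i for i in range(n) if i not in current])
        candidate = current | {feature}
        candidate_cost = cost(candidate)
        if candidate_cost > current_cost:
            break  # cost went up: the previous element is the chain minimum
        current, current_cost = candidate, candidate_cost
    return current, current_cost

def toy_cost(subset):
    """Assumed U-shaped cost: lowest for subsets of exactly two features."""
    return (len(subset) - 2) ** 2

best, best_cost = walk_chain_up(toy_cost, 4, random.Random(0))
```

Because the toy cost depends only on the subset size, every chain has the same U shape and the walk always stops at a two-feature subset. With a real estimated cost, different chains can stop at different local minima, which is why the bound lists and pruning are needed to cover the rest of the space.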
Characterization of biological states
• Samples drawn from the joint distribution $P(X, Y)$:
$$\{(x[0], y[0]), (x[1], y[1]), \ldots, (x[m], y[m])\}$$
• Quantized microarray:
$$x[t] = \begin{pmatrix} x_1[t] \\ x_2[t] \\ \vdots \\ x_n[t] \end{pmatrix}$$
• Quantized values: $x_i[t] \in R = \{r_1, r_2, \ldots, r_l\}$
• Biological states: $y[t] \in K = \{1, \ldots, c\}$
• The U-curve algorithm yields $A \subset \{1, 2, \ldots, n\}$ and $\psi : R^{|A|} \to K$
Genetic network design
• Dynamical systems
  – State: vector x
  – Transition function $\phi$
  – $x[t+1] = \phi(x[t])$
• Stochastic process
  – Stochastic transition function
    • The next state is a realization of a random vector
  – Example: Markov chain $(\pi_{Y|X}, \pi_0)$
    • Discrete time, finite-size vector, finite domain
    • Random state sequence
$$\pi_0 = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_{|R|^n} \end{pmatrix}, \qquad \pi_{Y|X} = \begin{pmatrix} p_{1|1} & p_{2|1} & p_{3|1} & \cdots & p_{|R|^n|1} \\ p_{1|2} & p_{2|2} & p_{3|2} & \cdots & p_{|R|^n|2} \\ p_{1|3} & p_{2|3} & p_{3|3} & \cdots & p_{|R|^n|3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p_{1||R|^n} & p_{2||R|^n} & p_{3||R|^n} & \cdots & p_{|R|^n||R|^n} \end{pmatrix}$$
• Probabilistic Genetic Networks (PGN)
  – A Markov chain $(\pi_{Y|X}, \pi_0)$ with the following axioms:
    a. $\pi_{Y|X}$ is homogeneous, that is, $p_{y|x}$ does not depend on $t$;
    b. $p_{y|x} > 0$ for all $x, y \in R^n$;
    c. $\pi_{Y|X}$ is conditionally independent, that is, for all $x, y \in R^n$, $p_{y|x} = \prod_{i=1}^{n} p(y_i \mid x)$;
    d. $\pi_{Y|X}$ is almost deterministic, that is, for all $x \in R^n$ and $i \in N = \{1, \ldots, n\}$, there is $r \in R$ such that $p(y_i = r \mid x) \approx 1$;
    e. for all $x \in R^n$ and $i \in N$, there is a subspace of dimension $j$, $j \ll n$, such that $p(y_i \mid x) = p(y_i \mid x')$, where $x'$ is the projection of $x$ on this subspace.
• Markov chain
$$\pi_{Y|X} = \begin{pmatrix} p_{1|1} & p_{2|1} & p_{3|1} & \cdots & p_{3^n|1} \\ p_{1|2} & p_{2|2} & p_{3|2} & \cdots & p_{3^n|2} \\ p_{1|3} & p_{2|3} & p_{3|3} & \cdots & p_{3^n|3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p_{1|3^n} & p_{2|3^n} & p_{3|3^n} & \cdots & p_{3^n|3^n} \end{pmatrix}$$
• Probabilistic Genetic Networks (PGN)
  – Described by $P_{X_1|X}, P_{X_2|X}, \ldots, P_{X_n|X}$
  – Almost deterministic
$$P_{X_i|X} = \begin{pmatrix} p_{r_1|1} & p_{r_2|1} & p_{r_3|1} & \cdots & p_{r_l|1} \\ p_{r_1|2} & p_{r_2|2} & p_{r_3|2} & \cdots & p_{r_l|2} \\ p_{r_1|3} & p_{r_2|3} & p_{r_3|3} & \cdots & p_{r_l|3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p_{r_1||R|^n} & p_{r_2||R|^n} & p_{r_3||R|^n} & \cdots & p_{r_l||R|^n} \end{pmatrix}$$
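The relation between the per-gene tables and the full Markov transition matrix can be illustrated with a small sketch of the conditional independence axiom: each entry of the full matrix is the product of the per-gene conditional probabilities. The two-gene network, the binary quantization, the rules in `gene_rules`, and the value 0.95 are all hypothetical choices made for this example.

```python
from itertools import product
from math import prod

R = (0, 1)  # assumed quantization levels
n = 2       # assumed number of genes

def gene_rules(x):
    """Hypothetical almost-deterministic rules: gene 0 copies gene 1 and
    gene 1 negates gene 0, each with probability 0.95."""
    targets = (x[1], 1 - x[0])
    return [{r: (0.95 if r == t else 0.05) for r in R} for t in targets]

def transition_probability(y, x):
    """Conditional independence: p(y|x) is the product over genes of p(y_i|x)."""
    per_gene = gene_rules(x)
    return prod(per_gene[i][y[i]] for i in range(n))

# Full transition matrix over the |R|^n states, built from the per-gene tables.
states = list(product(R, repeat=n))
matrix = {x: {y: transition_probability(y, x) for y in states} for x in states}
row_sums = {x: sum(matrix[x].values()) for x in states}
```

Only n tables with |R| columns each need to be specified, instead of a full matrix with |R|^n columns. Every entry stays strictly positive, and each row has one dominant entry, which matches the almost-deterministic behaviour.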
Time-Course Gene Expression Data
[Figure: expression of genes 1 to n plotted against time; expression measurement techniques sample the states x[1], x[2], ..., x[m−1], x[m]]
• For each gene $j = 1, \ldots, n$, samples drawn from the joint distribution $P(X, Y_j)$:
$$\{(x[0], y_j[0]), (x[1], y_j[1]), \ldots, (x[m], y_j[m])\}$$
• Quantized microarray at $t$:
$$x[t] = \begin{pmatrix} x_1[t] \\ x_2[t] \\ \vdots \\ x_n[t] \end{pmatrix}$$
• Quantized values: $x_i[t] \in R = \{r_1, r_2, \ldots, r_l\}$
• Gene $j$ quantized expression at $t+1$: $y_j[t] = x_j[t+1]$
• The U-curve algorithm yields $A_j \subset \{1, 2, \ldots, n\}$ and $\psi_j : R^{|A_j|} \to K$
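Building the per-gene samples from a time course can be sketched in a few lines: for gene j, each state x[t] is paired with the value of gene j at the next time point. The function name `training_pairs` and the toy series are assumptions made for this example.

```python
def training_pairs(series, j):
    """Pair each quantized state x[t] with y_j[t] = x_j[t+1].

    `series` is a list of quantized expression vectors ordered in time;
    the last time point has no successor, so it yields no pair.
    """
    return [(tuple(series[t]), series[t + 1][j]) for t in range(len(series) - 1)]

# Toy time course (assumed): 3 genes quantized to {-1, 0, 1}, 4 time points.
series = [(-1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, -1)]
pairs_gene0 = training_pairs(series, 0)
pairs_gene2 = training_pairs(series, 2)
```

Each list of pairs can then be passed to a feature selection procedure, one run per gene, to choose the predictor subset for that gene.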