
A Parallel Approximation Hitting Set Algorithm for Gene Expression Analysis
D. P. Ruchkys, Universidade de São Paulo
S. W. Song, Universidade de São Paulo

Gene Expression Analysis
• Consider an experiment in which the expression levels of thousands of genes are measured.
• We consider the problem of determining which genes affect the expression level of a given gene.

Our Problem
• Given an experiment with $n$ genes of a set $E = \{a_0, a_1, \ldots, a_{n-1}\}$ whose expression levels are measured in a time series of $m$ measurements (typically $n \gg m$), we have a total of $nm$ values, each 0 or 1.
• Our algorithm (based on Ideker et al. [ITK00]) receives an $m \times n$ matrix of such values and determines, for a given gene $a_{n-1}$, which other genes are responsible for the expression level of $a_{n-1}$.
• Example (rows $p_0, \ldots, p_4$ are the measurements; column $x_j$ holds the values of gene $a_j$):

          x0   x1   x2   x3
     p0    1    1    1    0
     p1    -    1    0    1
M =  p2    1    -    0    0
     p3    1    1    -    1
     p4    1    1    1    +

Example of Execution of the Algorithm
Infer the truth table for $a_3$ from the matrix $M$ shown.

          x0   x1   x2   x3
     p0    1    1    1    0
     p1    -    1    0    1
M =  p2    1    -    0    0
     p3    1    1    -    1
     p4    1    1    1    +

(1) In Step (1), the expression levels of $a_3$ differ in the row pairs (0,1), (0,3), (1,2) and (2,3). We find:
• for (0,1), $S_{01} = \{a_0, a_2\}$, containing all the other genes whose expression levels also differ between rows $p_0$ and $p_1$.
• likewise for (0,3), $S_{03} = \{a_2\}$.
• for (1,2), $S_{12} = \{a_0, a_1\}$.
• for (2,3), $S_{23} = \{a_1\}$.
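Step (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `difference_sets` is hypothetical, and the handling of the perturbed entries is an assumption chosen to reproduce the sets listed above ("-" is read as 0, "+" as 1, and rows in which the target gene itself is perturbed are skipped).

```python
from itertools import combinations

# Example matrix M: rows are measurements p0..p4, columns are genes a0..a3.
M = [
    ["1", "1", "1", "0"],
    ["-", "1", "0", "1"],
    ["1", "-", "0", "0"],
    ["1", "1", "-", "1"],
    ["1", "1", "1", "+"],
]

# Assumption: a "-" entry behaves like level 0 and a "+" entry like level 1.
LEVEL = {"0": 0, "1": 1, "-": 0, "+": 1}


def difference_sets(matrix, target):
    """Return {(i, j): S_ij} for every row pair in which the target gene differs."""
    # Assumption: rows where the target gene is perturbed carry no information.
    usable = [i for i, row in enumerate(matrix) if row[target] in ("0", "1")]
    result = {}
    for i, j in combinations(usable, 2):
        if LEVEL[matrix[i][target]] == LEVEL[matrix[j][target]]:
            continue  # the target gene does not change between these rows
        result[(i, j)] = {
            g
            for g in range(len(matrix[i]))
            if g != target and LEVEL[matrix[i][g]] != LEVEL[matrix[j][g]]
        }
    return result


S = difference_sets(M, target=3)
for pair, genes in S.items():
    print(pair, sorted(genes))
```

Each of the at most $m(m-1)/2$ row pairs is compared across all $n$ columns, so the sketch does $O(m^2 n)$ work.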

Result of Step 1
$S_{01} = \{a_0, a_2\}$, $S_{03} = \{a_2\}$, $S_{12} = \{a_0, a_1\}$, $S_{23} = \{a_1\}$.
(2) In Step (2), find $S_{min} = \{a_1, a_2\}$, a smallest set that contains at least one element of each of the sets $S_{ij}$ of the previous step.
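For an instance this small, $S_{min}$ can be checked by exhaustive search. The sketch below tries candidate sets in order of increasing size; the function name `minimum_hitting_set` is hypothetical, genes are written as integers 0, 1, 2, and the search is exponential in the number of genes, so it is only a sanity check and not the algorithm of these slides.

```python
from itertools import combinations


def minimum_hitting_set(universe, sets):
    """Return a smallest subset of `universe` that intersects every set in `sets`."""
    for size in range(len(universe) + 1):
        for candidate in combinations(sorted(universe), size):
            chosen = set(candidate)
            # A candidate is valid when it shares an element with every set.
            if all(chosen & s for s in sets):
                return chosen
    return None  # only reached when some set is empty


# The sets S_01, S_03, S_12, S_23 from Step (1).
sets = [{0, 2}, {2}, {0, 1}, {1}]
s_min = minimum_hitting_set({0, 1, 2}, sets)
print(sorted(s_min))
```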

The Hitting Set Problem
• Given a finite set $E$ and a finite collection $S = \{S_1, \ldots, S_w\}$ of subsets of $E$, find a subset $A \subseteq E$ of the smallest size such that $A \cap S_i \neq \emptyset$ for all $i = 1, \ldots, w$.

The Hitting Set Problem
[Figure: a ground set E = {1, ..., 9}, a collection S of subsets of E, and a hitting set A = {1, 4}.]

The Hitting Set Problem
Primal-Dual Approximation Algorithm [FMCF01]
• The algorithm is due to Bar-Yehuda and Even [BYE81] and was originally conceived for the minimum set cover problem.
• It is an $\alpha$-approximation algorithm, where $\alpha = \max_{i=1}^{w} |S_i|$.
• In our problem each set $S_i$ contains at most $n - 1$ genes, so $\alpha = \max_{i=1}^{w} |S_i| = O(n)$.
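The following is a minimal sketch of the standard primal-dual (local-ratio) scheme applied to hitting set. It is not taken from the slides or from [FMCF01]: the function name `primal_dual_hitting_set` is hypothetical, and unit weights and non-empty sets are assumed. For each set that is not yet hit, the residual weight of all its elements is lowered by the smallest residual among them, and every element whose residual reaches zero joins the solution.

```python
def primal_dual_hitting_set(sets, weights):
    """Primal-dual sketch: returns a hitting set within a factor max|S_i| of optimal."""
    residual = dict(weights)  # remaining weight of each element
    chosen = set()
    for s in sets:
        if chosen & s:
            continue  # this set is already hit
        # Raise the dual variable of this set until some element becomes tight.
        eps = min(residual[e] for e in s)
        for e in s:
            residual[e] -= eps
            if residual[e] == 0:
                chosen.add(e)
    return chosen


sets = [{0, 2}, {2}, {0, 1}, {1}]
unit_weights = {0: 1, 1: 1, 2: 1}  # assumption: every gene costs the same
cover = primal_dual_hitting_set(sets, unit_weights)
alpha = max(len(s) for s in sets)
print(sorted(cover), "alpha =", alpha)
```

On the example sets this returns three genes while the optimum has two, which is within the guaranteed factor $\alpha = 2$.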

The Hitting Set Problem
Greedy Approximation Algorithm [J74]
• The strategy is to construct the set $A$ by repeatedly choosing the element that occurs in the largest number of subsets of $S$ not yet hit.
• The approximation ratio is $\ln |S| + 1$.
• Since there are at most $m(m-1)/2$ row pairs, $|S| = O(m^2)$ and $\ln |S| + 1 = O(\log m^2)$.

The Hitting Set Problem
Greedy Approximation Algorithm
[Figure: the ground set E = {1, ..., 9} and the collection S, with the hitting set A still empty.]

The Hitting Set Problem
Greedy Approximation Algorithm
[Figure: the same instance after the first greedy choice, A = {1}.]

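The Hitting Set Problem
Greedy Approximation Algorithm
[Figure: the gene expression instance, with $E = \{a_0, a_1, a_2, a_3\}$, the collection $S$ of sets over $a_0$, $a_1$ and $a_2$, and the hitting set $A = \{a_1, a_2\}$.]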
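The greedy strategy can be sketched in a few lines of Python. The function name `greedy_hitting_set` and the tie-breaking rule (smallest element first) are assumptions made for this illustration, and all input sets are assumed to be non-empty.

```python
def greedy_hitting_set(sets):
    """Repeatedly pick the element occurring in the most sets that are not yet hit."""
    remaining = [set(s) for s in sets]
    chosen = []
    while remaining:
        # Count occurrences of each element among the sets still to be hit.
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        # Assumption: ties are broken in favour of the smallest element.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen


sets = [{0, 2}, {2}, {0, 1}, {1}]
picked = greedy_hitting_set(sets)
print(picked)
```

The result depends on how ties are resolved. With the rule assumed here, the three genes start with two occurrences each, gene 0 is picked first, and the sketch ends with three genes, whereas $\{a_1, a_2\}$ hits every set with two. Both outcomes are within the $\ln |S| + 1$ ratio.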

The Sequential Algorithm
[Figure: the two data structures built from the example matrix M. The gene vector holds, for each gene, its number of occurrences and an occurrence list of the sets that contain it. The set vector holds, for each set, its row pair (i1, j1), a covered flag and the list of its genes. The figure shows the state after the first row pair has been processed.]

The Sequential Algorithm
[Figure: the gene vector and the set vector after all row pairs of M have been processed; every set is still marked covered = false.]

The Sequential Algorithm
[Figure: the gene vector and the set vector after gene 0 has been added to the hitting set (HS: {0}); the sets that contain gene 0 are now marked covered = true.]

The Sequential Algorithm
Time and Space Complexities
• Constructing the data structures takes $O(m^2 n)$ time.
• Let $k$ be the size of the hitting set. We have to find, $k$ times, the element with the largest number of occurrences, which takes $O(kn)$ time in total.
• For each such element, we have to update the data structures in $O(m^2 n)$ time. Since we have $k$ elements, the total time to update the data structures is $O(k m^2 n)$.

The Sequential Algorithm
Time and Space Complexities
• The total time complexity is therefore $O(m^2 n) + O(kn) + O(k m^2 n) = O(k m^2 n)$.
• The size $k$ of the hitting set is $O(m^2)$. Therefore, the time complexity of the algorithm can be expressed as $O(m^4 n)$.
• The space complexity is $O(m^2 n)$.
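The data structures named on the preceding slides can be sketched as follows. The field names (`count`, `occurrences`, `pair`, `covered`, `genes`) and both function names are hypothetical, chosen to mirror the gene vector and set vector; the slides do not give the authors' actual layout. Ties are assumed to go to the lowest gene index.

```python
def build_structures(sets, n_genes):
    """Build the set vector (one record per S_ij) and the gene vector."""
    set_vector = [
        {"pair": pair, "covered": False, "genes": sorted(genes)}
        for pair, genes in sets
    ]
    gene_vector = [{"count": 0, "occurrences": []} for _ in range(n_genes)]
    for idx, record in enumerate(set_vector):
        for g in record["genes"]:
            gene_vector[g]["count"] += 1
            gene_vector[g]["occurrences"].append(idx)
    return gene_vector, set_vector


def sequential_hitting_set(sets, n_genes):
    gene_vector, set_vector = build_structures(sets, n_genes)
    uncovered = len(set_vector)
    hitting_set = []
    while uncovered:
        # O(n) scan for the gene with the most occurrences in uncovered sets.
        best = max(range(n_genes), key=lambda g: gene_vector[g]["count"])
        if gene_vector[best]["count"] == 0:
            raise ValueError("an empty set cannot be hit")
        hitting_set.append(best)
        # Mark the sets of the chosen gene as covered and update the counts.
        for idx in gene_vector[best]["occurrences"]:
            record = set_vector[idx]
            if record["covered"]:
                continue
            record["covered"] = True
            uncovered -= 1
            for g in record["genes"]:
                gene_vector[g]["count"] -= 1
    return hitting_set


example = [((0, 1), {0, 2}), ((0, 3), {2}), ((1, 2), {0, 1}), ((2, 3), {1})]
hs = sequential_hitting_set(example, n_genes=3)
print(hs)
```

The loop runs $k$ times, each iteration doing an $O(n)$ scan followed by an update of the structures, which matches the $O(kn)$ and $O(k m^2 n)$ terms above.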

The Parallel Algorithm
• The input matrix $M$ is partitioned vertically (by columns), and each processor stores one block of columns.
• Example of the partitioning among three processors:

            processor 0                  |  processor 1                  |  processor 2
            a0         a1         a2     |  a3         a4         a5     |  a6         a7         a8
      x_{0,0}    x_{0,1}    x_{0,2}      |  x_{0,3}    x_{0,4}    x_{0,5}      |  x_{0,6}    x_{0,7}    x_{0,8}
      x_{1,0}    x_{1,1}    x_{1,2}      |  x_{1,3}    x_{1,4}    x_{1,5}      |  x_{1,6}    x_{1,7}    x_{1,8}
M =   x_{2,0}    x_{2,1}    x_{2,2}      |  x_{2,3}    x_{2,4}    x_{2,5}      |  x_{2,6}    x_{2,7}    x_{2,8}
      x_{3,0}    x_{3,1}    x_{3,2}      |  x_{3,3}    x_{3,4}    x_{3,5}      |  x_{3,6}    x_{3,7}    x_{3,8}
      ...
      x_{m-1,0}  x_{m-1,1}  x_{m-1,2}    |  x_{m-1,3}  x_{m-1,4}  x_{m-1,5}    |  x_{m-1,6}  x_{m-1,7}  x_{m-1,8}
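A column partition of this kind can be computed as below. The sketch assumes contiguous blocks of nearly equal width, which is one common choice; the slides do not state which columns go to which processor, and the function name `column_blocks` is hypothetical.

```python
def column_blocks(n_columns, p):
    """Split column indices 0..n_columns-1 into p contiguous, nearly equal blocks."""
    base, extra = divmod(n_columns, p)
    blocks = []
    start = 0
    for rank in range(p):
        # The first `extra` processors receive one additional column.
        width = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + width))
        start += width
    return blocks


blocks = column_blocks(9, 3)
for rank, block in enumerate(blocks):
    print("processor", rank, "holds columns", list(block))
```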

The Parallel Algorithm
• Each processor reads a piece of the input of size $m \times \frac{n-1}{p}$, where $p$ is the number of processors.
• All the processors store a vector $v$, corresponding to the expression levels of the gene under study, $a_{n-1}$.
• Each processor $p_i$ also stores a gene vector, with information about the genes it is responsible for. The gene vector in each processor has size $O(m^2 n / p)$.
• Each processor also has a set vector, in which the list of each set $S_{ij}$ contains only the genes that the processor is responsible for.

The Parallel Algorithm
• Example:

          x0   x1   x2   x3   x4
     p0    1    1    1    0    0
     p1    -    1    0    0    1
E =  p2    1    -    0    0    0
     p3    1    1    -    0    1
     p4    1    1    1    0    +

[Figure: the gene vector (occurrence counts and occurrence lists) and the set vector (i1, j1, covered flag and gene list) held by Processor 0 and by Processor 1 for this matrix; every set is still marked covered = false.]

The Parallel Algorithm
Time and Space Complexities
• Time complexity: $O(m^4 n / p)$.
• The algorithm requires $O(k)$ communication rounds, where $k$ is the size of the hitting set. In terms of $m$, this is $O(m^2)$ rounds.
• The algorithm requires $O(m^2 n / p)$ space.
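The round structure can be illustrated with a sequential simulation. This is a hedged sketch, not the authors' implementation: it assumes that in each round every processor proposes its locally best gene, that a reduction selects the global maximum, and that the owner of the winning gene announces which sets are now covered. The function name `simulate_parallel_greedy` is hypothetical, and no message-passing library is used.

```python
def simulate_parallel_greedy(sets, blocks):
    """Simulate one communication round per chosen gene; returns (genes, rounds)."""
    # Each simulated processor keeps, for every set, only the genes it owns.
    local_sets = [[s & set(block) for s in sets] for block in blocks]
    covered = [False] * len(sets)  # replicated on every processor
    hitting_set, rounds = [], 0
    while not all(covered):
        proposals = []
        for rank, proc_sets in enumerate(local_sets):
            # Local step: count the owned genes in the uncovered sets.
            counts = {}
            for idx, s in enumerate(proc_sets):
                if not covered[idx]:
                    for g in s:
                        counts[g] = counts.get(g, 0) + 1
            if counts:
                g = max(sorted(counts), key=counts.get)
                # -g makes ties resolve to the lowest gene index (assumption).
                proposals.append((counts[g], -g, rank))
        if not proposals:
            raise ValueError("an empty set cannot be hit")
        # Communication round: reduce to the global best, then broadcast it.
        _, neg_gene, owner = max(proposals)
        gene = -neg_gene
        rounds += 1
        hitting_set.append(gene)
        # The owner knows which sets contain the chosen gene.
        for idx, s in enumerate(local_sets[owner]):
            if gene in s:
                covered[idx] = True
    return hitting_set, rounds


sets = [{0, 2}, {2}, {0, 1}, {1}]
blocks = [range(0, 2), range(2, 4)]  # genes 0-1 on processor 0, genes 2-3 on processor 1
result = simulate_parallel_greedy(sets, blocks)
print(result)
```

In this simulation the number of rounds equals the number of chosen genes, in line with the $O(k)$ bound above.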

[Figure: running time in seconds (vertical axis, 0 to 0.05) against the number of processors (horizontal axis, up to 8) for inputs of size 20x1024, 20x2048 and 20x4096.]

Bibliographical References
[BYE81] R. Bar-Yehuda and S. Even. A linear time approximation algorithm for the weighted vertex cover problem. Journal of Algorithms, 2:198-203, 1981.
[FMCF01] C. G. Fernandes, F. K. Miyazawa, M. Cerioli, P. Feofiloff. Uma introdução sucinta a algoritmos de aproximação. 23º Colóquio Brasileiro de Matemática, 2001.
[ITK00] T. E. Ideker, V. Thorsson, R. Karp. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing, 5:302-313, 2000.
[J74] D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.
