Automated Gene Classification using Nonnegative Matrix Factorization - PowerPoint PPT Presentation

Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature
Kevin Heinrich
PhD Dissertation Defense
Department of Computer Science, UTK
March 19, 2007

2/1 Kevin Heinrich Using NMF for Gene Classification

What Is The Problem?
Understanding functional gene relationships requires expert knowledge.
Gene sequence analysis does not necessarily imply function.
Gene structure analysis is difficult.
Issue of scale.
Biologists know a small subset of genes.
Thousands of genes.
Time & Money.

Defining Functional Gene Relationships
Direct Relationships. Known gene relationships (e.g. A-B). Based on term co-occurrence. [1]
Indirect Relationships. Unknown gene relationships (e.g. A-C). Based on semantic structure.
[1] Jenssen et al., Nature Genetics, 28:21, 2001.

Semantic Gene Organizer
Gene information is compiled in human-curated databases:
Medical Literature Analysis and Retrieval System Online (MEDLINE)
EntrezGene (LocusLink)
Medical Subject Headings (MeSH)
Gene Ontology (GO)
Gene documents are formed by taking titles and abstracts from MEDLINE citations cross-referenced in the Mouse, Rat, and Human EntrezGene entries for that gene.
Examines literature (phenotype) instead of genotype.
Can be used as a guide for future gene exploration.

Vector Space Model
Gene documents are parsed into tokens.
Each token is assigned a weight $w_{ij}$, the weight of the $i$th token in the $j$th document.
An $m \times n$ term-by-document matrix $A = [w_{ij}]$ is created.
Genes are $m$-dimensional vectors (columns of $A$).
Tokens are $n$-dimensional vectors (rows of $A$).

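Term-by-Document Matrix
$$
\begin{array}{c|ccccc}
       & d_1    & d_2    & d_3    & \cdots & d_n    \\ \hline
t_1    & w_{11} & w_{12} & w_{13} & \cdots & w_{1n} \\
t_2    & w_{21} & w_{22} & w_{23} & \cdots & w_{2n} \\
t_3    & w_{31} & w_{32} & w_{33} & \cdots & w_{3n} \\
t_4    & w_{41} & w_{42} & w_{43} & \cdots & w_{4n} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_m    & w_{m1} & w_{m2} & w_{m3} & \cdots & w_{mn}
\end{array}
$$
Typically, a term-document matrix is sparse and unstructured.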
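As a rough illustration of how such a matrix might be assembled, the sketch below counts tokens per document and stores raw counts as the entries. The function name `term_document_matrix` and the three toy token lists are hypothetical and not taken from the presentation; a real pipeline would also apply a weighting scheme.

```python
import numpy as np

def term_document_matrix(docs):
    """Return (A, vocab) where A[i, j] counts token i in document j."""
    # Rows follow the sorted vocabulary; columns follow the input order.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc:
            A[index[tok], j] += 1.0
    return A, vocab

# Hypothetical, already-tokenized gene documents (illustrative only).
docs = [
    ["kinase", "receptor", "kinase"],
    ["receptor", "membrane"],
    ["membrane", "transport", "membrane"],
]
A, vocab = term_document_matrix(docs)
```

Each column of `A` is one gene document and each row is one token, matching the $m \times n$ layout above.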

Weighting Schemes
Term weights are the product of a local component, a global component, and a document normalization factor:
$$w_{ij} = l_{ij} \, g_i \, d_j$$
The log-entropy weighting scheme is used, where
$$l_{ij} = \log_2(1 + f_{ij}), \qquad g_i = 1 + \frac{\sum_j p_{ij} \log_2 p_{ij}}{\log_2 n}, \qquad p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}}$$
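The log-entropy formulas can be sketched in a few lines of NumPy. This is a minimal illustration, assuming raw frequencies $f_{ij}$ in a dense array, $n > 1$ documents, no document normalization (that is, $d_j = 1$), and the usual convention $0 \log_2 0 = 0$; the function name `log_entropy_weights` is hypothetical.

```python
import numpy as np

def log_entropy_weights(F):
    """F[i, j] is the raw frequency of term i in document j (m x n, n > 1)."""
    F = np.asarray(F, dtype=float)
    n = F.shape[1]
    local = np.log2(1.0 + F)                                  # l_ij
    row_sums = F.sum(axis=1, keepdims=True)
    # p_ij = f_ij / sum_j f_ij, left at 0 for terms that never occur.
    P = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)
    # Treat 0 * log2(0) as 0 (assumed convention).
    plogp = np.where(P > 0, P * np.log2(np.where(P > 0, P, 1.0)), 0.0)
    global_w = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log2(n)  # g_i
    return local * global_w                                   # d_j = 1 assumed

F = np.array([[1.0, 1.0, 1.0, 1.0],    # term spread evenly over all documents
              [4.0, 0.0, 0.0, 0.0]])   # term concentrated in one document
weights = log_entropy_weights(F)
```

A term spread evenly over every document gets global weight 0, while a term confined to a single document keeps its full local weight.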

Latent Semantic Indexing (LSI)
LSI performs a truncated singular value decomposition (SVD) on $A$, factoring it into three matrices:
$$A = U \Sigma V^T$$
$U$ is the $m \times r$ matrix whose columns are eigenvectors of $AA^T$.
$V^T$ is the $r \times n$ matrix whose rows are eigenvectors of $A^TA$.
$\Sigma$ is the $r \times r$ diagonal matrix of the $r$ nonnegative singular values of $A$.
$r$ is the rank of $A$.

SVD Properties
A rank-$k$ approximation is generated by keeping only the first $k$ columns of $U$ and $V$ and the first $k$ singular values in $\Sigma$, i.e.,
$$A_k = U_k \Sigma_k V_k^T$$
$A_k$ is the closest of all rank-$k$ approximations, i.e.,
$$\|A - A_k\|_F \le \|A - B\|_F$$
for any rank-$k$ matrix $B$.
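The truncation and the closest-approximation property can be checked numerically. The following is a small sketch on a random matrix (not data from the presentation); `truncated_svd` is a hypothetical helper name.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, the first k singular values, and V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((8, 6))
k = 2

Uk, sk, Vtk = truncated_svd(A, k)
Ak = Uk @ np.diag(sk) @ Vtk                 # A_k = U_k Sigma_k V_k^T

err_svd = np.linalg.norm(A - Ak, "fro")
# Some other rank-k matrix (random here) should not approximate A better.
B = rng.random((8, k)) @ rng.random((k, 6))
err_other = np.linalg.norm(A - B, "fro")
```

The Frobenius error of $A_k$ equals the square root of the sum of the squared discarded singular values, which gives a quick sanity check on any implementation.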

SVD Querying
Document-to-Document Similarity
$$A_k^T A_k = (V_k \Sigma_k)(V_k \Sigma_k)^T$$
Term-to-Term Similarity
$$A_k A_k^T = (U_k \Sigma_k)(U_k \Sigma_k)^T$$
Document-to-Term Similarity
$$A_k = U_k \Sigma_k V_k^T$$
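These identities mean that similarities can be computed from the scaled factor matrices without forming $A_k$ explicitly. A minimal numerical check on a random matrix (illustrative only) might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((7, 5))          # 7 terms, 5 documents (toy sizes)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

doc_vecs = Vk @ Sk              # one scaled k-dimensional row per document
term_vecs = Uk @ Sk             # one scaled k-dimensional row per term
Ak = Uk @ Sk @ Vk.T             # document-to-term similarities

doc_doc = doc_vecs @ doc_vecs.T       # equals A_k^T A_k
term_term = term_vecs @ term_vecs.T   # equals A_k A_k^T
```

Storing `doc_vecs` once is what allows the scaled document vectors to be reused for fast retrieval.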

Advantages of LSI
$A$ is sparse, while the factor matrices are dense. This improves recall for concept-based matching.
Scaled document vectors can be computed once and stored for quick retrieval.
Components of factor matrices represent concepts.
Decreasing the number of dimensions compares documents in a broader sense and achieves better compression.
Similar word usage patterns get mapped to the same geometric space.
Genes are compared at a concept level rather than a simple term co-occurrence level, resulting in vocabulary-independent comparisons.

Presentation of Results
Problem: Biologists are familiar with interpreting trees. LSI produces ranked lists of related terms/documents.
Solution:
Generate pairwise distance data, i.e., $1 - \cos\theta_{ij}$.
Apply a distance-based tree-building algorithm:
Fitch: $O(n^4)$
NJ: $O(n^3)$
FastME: $O(n^2)$
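The pairwise distance step can be sketched as follows, assuming each document is given as a row vector (for example, a scaled LSI document vector). The function name `cosine_distance_matrix` is hypothetical, and the tree-building step itself (Fitch, NJ, or FastME) is not shown.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Rows of X are document vectors; returns D[i, j] = 1 - cos(theta_ij)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms > 0, norms, 1.0)   # avoid division by zero
    D = 1.0 - unit @ unit.T
    np.fill_diagonal(D, 0.0)                     # remove rounding noise
    return np.clip(D, 0.0, 2.0)

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 0.0]])
D = cosine_distance_matrix(X)
```

The resulting symmetric matrix `D` is the kind of input a distance-based tree-building program expects.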

Defining Functional Gene Relationships on Test Data

“Problems” with LSI
Initial term weights are nonnegative; SVD introduces negative components.
Dimensions of the factored space do not have an immediate interpretation.
Want the advantages of a factored/reduced-dimension space, but want to interpret dimensions for clustering/labeling trees.
Issue of scale: understand small collections better rather than huge collections.

Defining Functional Gene Relationships
Direct Relationships. Known gene relationships (e.g. A-B). Based on term co-occurrence. [2]
Indirect Relationships. Unknown gene relationships (e.g. A-C). Based on semantic structure.
Label Relationships (e.g. x & y).
[Figure: diagram with genes A, B, C and labels x, y.]
[2] Jenssen et al., Nature Genetics, 28:21, 2001.


NMF Problem Definition
Given nonnegative $V$, find $W$ and $H$ such that
$$V \approx WH, \qquad W, H \ge 0$$
$W$ has size $m \times k$.
$H$ has size $k \times n$.
$W$ and $H$ are not unique, i.e., $WH = (WD)(D^{-1}H)$ for any invertible $D$ for which $WD$ and $D^{-1}H$ remain nonnegative (e.g., a positive diagonal scaling).
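The non-uniqueness is easy to see numerically. In the sketch below, $D$ is taken to be a positive diagonal matrix (one convenient choice, assumed here for illustration), so both rescaled factors stay nonnegative while their product is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.random((6, 3))          # m x k, nonnegative
H = rng.random((3, 5))          # k x n, nonnegative

# A positive diagonal D has a nonnegative inverse.
D = np.diag([0.5, 2.0, 4.0])
D_inv = np.diag(1.0 / np.diag(D))

W2 = W @ D                      # rescaled basis vectors
H2 = D_inv @ H                  # compensating rescale of the coefficients
```

Here `W2 @ H2` reproduces `W @ H` even though `W2` differs from `W`, which is why the factors are often normalized in practice.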

NMF Interpretation
$$V \approx WH$$
Columns of $W$ are $k$ “feature” or “basis” vectors; they represent semantic concepts.
Each column of $H$ holds the coefficients of the linear combination of feature vectors that approximates the corresponding column of $V$.
The choice of $k$ determines the accuracy and quality of the basis vectors.
Ultimately produces a “parts-based” representation of the original space.

[Figure: block diagram of the factorization with $W$ ($m \times k$) and $H$ ($k \times n$). Annotations: feature indices 3, 8, 9 with weights 0.1, 2.2, 0.7; terms cerebrovascular, disturbance, microcephaly, spectroscopy, neuromuscular.]

Euclidean Distance (Cost Function)
$$E(W,H) = \|V - WH\|_F^2 = \sum_{i,j} \left( V_{ij} - (WH)_{ij} \right)^2$$
Minimize $E(W,H)$ subject to $W, H \ge 0$.
$E(W,H) \ge 0$.
$E(W,H) = 0$ if and only if $V = WH$.
$\|V - WH\|$ is convex in $W$ or $H$ separately, but not in both simultaneously.
There is no guarantee of finding a global minimum.

Initialization Methods
Since NMF is an iterative algorithm, $W$ and $H$ must be initialized.
Random positive entries.
Structured initialization typically speeds convergence.
Run $k$-means on $V$.
Choose a representative vector from each cluster to form $W$ and $H$.
Most methods do not provide a static starting point.

Non-Negative Double SVD
NNDSVD is one way to provide a static starting point. [3]
Observe that $A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T$, i.e., a sum of rank-1 matrices.
For each $j$:
Compute $C = u_j v_j^T$.
Set all negative elements of $C$ to 0.
Compute the maximum singular triplet of $C$, i.e., $[\hat{u}, \hat{s}, \hat{v}]$.
Set the $j$th column of $W$ to $\hat{u}$ and the $j$th row of $H$ to $\sigma_j \hat{s} \hat{v}^T$.
The resulting $W$ and $H$ are influenced by the SVD.
[3] Boutsidis & Gallopoulos, Tech Report, 2005.
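The loop above can be sketched directly in NumPy. This follows the steps as listed on the slide and may differ in detail from the cited report; the function name `nndsvd_slide` is hypothetical. Because an SVD routine may return the leading singular vectors of a nonnegative matrix with a flipped sign, the sketch takes absolute values, which is an added assumption that holds when the leading singular value of $C$ is simple.

```python
import numpy as np

def nndsvd_slide(A, k):
    """Deterministic nonnegative W (m x k) and H (k x n) from the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    m, n = A.shape
    W = np.zeros((m, k))
    H = np.zeros((k, n))
    for j in range(k):
        C = np.outer(U[:, j], Vt[j, :])      # rank-1 term u_j v_j^T
        C[C < 0] = 0.0                       # zero out negative elements
        Uc, sc, Vct = np.linalg.svd(C, full_matrices=False)
        # Maximum singular triplet of C; abs() fixes the sign ambiguity.
        u_hat, s_hat, v_hat = np.abs(Uc[:, 0]), sc[0], np.abs(Vct[0, :])
        W[:, j] = u_hat
        H[j, :] = s[j] * s_hat * v_hat
    return W, H

rng = np.random.default_rng(3)
A = rng.random((6, 5))
W, H = nndsvd_slide(A, k=3)
```

Since no random numbers enter the procedure, repeated calls on the same matrix return the same starting point.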

NNDSVD Variations
Zero elements remain “locked” during the MM update.
NNDSVDz keeps zero elements.
NNDSVDe assigns $\epsilon = 10^{-9}$ to zero elements.
NNDSVDa assigns the average value of $A$ to zero elements.
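Zeros stay locked because a multiplicative update scales the current entry, and zero times anything is zero. The three variants differ only in what replaces those zeros before iterating, which might be sketched as below; the helper `fill_zeros` and its `variant` codes are hypothetical names chosen for illustration.

```python
import numpy as np

def fill_zeros(M, A, variant="z", eps=1e-9):
    """Return a copy of M with zero entries handled per NNDSVD variant.

    variant "z": keep zeros; "e": replace with eps; "a": replace with mean(A).
    """
    M = np.array(M, dtype=float)             # copy; the input is not modified
    if variant == "z":
        return M
    if variant == "e":
        fill = eps
    elif variant == "a":
        fill = float(np.mean(A))
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    M[M == 0] = fill
    return M

M = np.array([[0.0, 1.0], [2.0, 0.0]])       # toy initial factor
A = np.array([[1.0, 3.0], [5.0, 7.0]])       # toy data matrix, mean 4.0
M_z = fill_zeros(M, A, "z")
M_e = fill_zeros(M, A, "e")
M_a = fill_zeros(M, A, "a")
```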

Update Rules
Update rules should:
decrease the approximation error.
maintain nonnegativity constraints.
maintain other constraints imposed by the application (smoothness/sparsity).

Multiplicative Method (MM)
$$H_{cj} \leftarrow H_{cj} \frac{(W^T V)_{cj}}{(W^T W H)_{cj} + \epsilon}$$
$$W_{ic} \leftarrow W_{ic} \frac{(V H^T)_{ic}}{(W H H^T)_{ic} + \epsilon}$$
$\epsilon$ ensures numerical stability.
Lee and Seung proved that MM is non-increasing under the Euclidean cost function.
Most implementations update $H$ and $W$ “simultaneously.”
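A compact sketch of the multiplicative updates is given below. It assumes random positive initialization and updates $H$ first, then $W$ using the new $H$ (an alternating variant rather than a strictly simultaneous one); the function name `nmf_mm` and its defaults are hypothetical.

```python
import numpy as np

def nmf_mm(V, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF; returns W, H and the cost per iteration."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps             # random positive start
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Elementwise updates; eps keeps the denominators away from zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H, "fro") ** 2)
    return W, H, errors

V = np.random.default_rng(42).random((10, 8))    # toy nonnegative data
W, H, errors = nmf_mm(V, k=3)
```

Tracking `errors` makes the non-increasing behavior of the Euclidean cost visible, and the factors stay nonnegative because every update multiplies by a nonnegative ratio.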

Other Objective Functions
$$\|V - WH\|_F^2 + \alpha J_1(W) + \beta J_2(H)$$
$\alpha$ and $\beta$ are parameters that control the level of the additional constraints.

Smoothing Update Rules
For example, set $J_2(H) = \|H\|_F^2$ to enforce smoothness on $H$ to try to force uniqueness on $W$. [4]
$$H_{cj} \leftarrow H_{cj} \frac{(W^T V)_{cj} - \beta H_{cj}}{(W^T W H)_{cj} + \epsilon}$$
$$W_{ic} \leftarrow W_{ic} \frac{(V H^T)_{ic} - \alpha W_{ic}}{(W H H^T)_{ic} + \epsilon}$$
[4] Piper et al., AMOS, 2004.
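The constrained updates differ from the plain multiplicative method only in the numerators. The sketch below is one possible reading: because subtracting $\beta H_{cj}$ or $\alpha W_{ic}$ can make a numerator negative, it clamps negative numerators to zero, which is an assumption added here to preserve nonnegativity and is not stated on the slide. The name `nmf_smooth` and its default parameters are hypothetical.

```python
import numpy as np

def nmf_smooth(V, k, alpha=0.0, beta=0.1, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative NMF with penalty terms alpha (on W) and beta (on H)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Clamp at zero (assumption) so the factors cannot turn negative.
        H *= np.maximum(W.T @ V - beta * H, 0.0) / (W.T @ W @ H + eps)
        W *= np.maximum(V @ H.T - alpha * W, 0.0) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(7).random((9, 6))      # toy nonnegative data
W, H = nmf_smooth(V, k=2, alpha=0.05, beta=0.1)
```

With `alpha=0` and `beta=0` the updates reduce to the unconstrained multiplicative method.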
