Cancer gene discovery via network analysis of somatic mutation data Insuk Lee
Cancer is a progressive genetic disorder. • Accumulation of somatic mutations causes cancer. • For example, in colorectal cancer, the first gatekeeping mutation (often occurring in APC) is followed by a series of oncogene activations and loss-of-function events in tumor suppressor genes, which eventually generates a malignant tumor.
Sequencing approach to the comprehensive catalog of cancer genes • Tumor samples and adjacent healthy tissue (or blood) samples (i.e., matched normal samples) are sequenced by whole-exome sequencing (WES) and aligned to identify cancer-associated somatic mutations (and cancer genes). Nat. Rev. Genet. 15:556 (2014)
Driver vs. Passenger mutations • Driver mutation: a mutation that directly or indirectly confers a selective growth advantage on the cell in which it occurs (the opposite of a passenger mutation). • Not all mutations are driver mutations. Therefore, not all genes that contain somatic mutations are cancer driver genes. Nature 458:719 (2009)
Distinguishing Drivers from Passengers § Based on recurrent mutations • Use deleteriousness of the mutations
Using additional information to reduce false positives • In MutSigCV, mutation frequency is normalized by the gene-specific background mutation rate (BMR), which is estimated using expression level and replication timing. Nat. Rev. Genet. 15:556 (2014)
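The idea of normalizing recurrence by a gene-specific background rate can be sketched as follows. This is not MutSigCV's actual model; it is a minimal one-sided Poisson test, and the function names, BMR values, gene lengths, and counts are all hypothetical.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly from the k-th term."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # k-th Poisson term, computed in log space to avoid overflow
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    tail = 0.0
    i = k
    while True:
        tail += term
        i += 1
        term *= lam / i  # move to the next Poisson term
        if i > lam and term <= tail * 1e-17:
            break
    return min(1.0, tail)

def recurrence_p_value(observed, bmr_per_base, gene_length, n_samples):
    """Chance of seeing >= observed mutations from background alone (assumed model)."""
    expected = bmr_per_base * gene_length * n_samples
    return poisson_tail(observed, expected)

# Two hypothetical genes with the same raw count (12 mutations in 100 tumors)
p_short_quiet = recurrence_p_value(12, 1e-6, 1_500, 100)   # expected 0.15
p_long_noisy = recurrence_p_value(12, 5e-6, 20_000, 100)   # expected 10
print(p_short_quiet, p_long_noisy)
```

With identical raw counts, only the short gene with a low background rate looks significantly recurrent, which is the kind of false positive the normalization is meant to remove.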
What about cancer genes with low mutation rates? Many hills but only a few mountains in the genomic landscapes of human colorectal cancers (Wood et al. Science 2007) • Map of mutations in 11 breast and 11 colorectal cancers. • In the landscape, the heights of the peaks reflect the mutation frequency of each gene. A few gene "mountains" are mutated in a large proportion of tumors; most genes are mutated in <5% of tumors and are represented as "hills" in the figure. • We observed a similar distribution of mutation frequency in TCGA data.
Long-tail distribution of mutation frequency • The majority of cancer genes are infrequently mutated and carry somatic mutations in only a few patients, which results in a long-tail distribution of mutation frequency. • Therefore, methods based on recurrent mutations have an intrinsic limitation in cancer gene identification. • Among 422 known cancer genes by CGC: 7 genes are mutated in >5% of tumors; 128 genes are mutated in >1% of tumors; 12 genes have no mutations in tumors.
[Figure: Mutation distribution across 422 CGC (Cancer Gene Census) genes in 6764 Pan-cancer samples (April 2014 TCGA); 410 mutated genes. Y-axis: mutation count. Most frequently mutated genes shown: TP53, PIK3CA, PTEN, BRAF, KMT2C, KMT2D, APC, ATRX, IDH1, ARID1A.]
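A minimal sketch of how such per-gene frequencies and threshold counts could be tabulated from per-sample mutation calls. The input layout (one set of mutated genes per tumor) and the toy data, including GENE_A and GENE_B, are assumptions for illustration, not the TCGA data behind the slide.

```python
from collections import Counter

def mutation_frequency(mutated_genes_per_sample):
    """Fraction of samples in which each gene carries at least one mutation."""
    n_samples = len(mutated_genes_per_sample)
    counts = Counter()
    for genes in mutated_genes_per_sample:
        counts.update(set(genes))  # count each gene once per sample
    return {gene: c / n_samples for gene, c in counts.items()}

def count_above(freq, threshold):
    """Number of genes mutated in more than `threshold` of samples."""
    return sum(f > threshold for f in freq.values())

# Toy cohort of 100 tumors: one "mountain" and two "hills" (hypothetical)
samples = [set() for _ in range(100)]
for i in range(40):
    samples[i].add("TP53")
for i in range(3):
    samples[i].add("GENE_A")
samples[50].add("GENE_B")

freq = mutation_frequency(samples)
print(count_above(freq, 0.05), count_above(freq, 0.01))
```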
Cancer is a disease of pathway disorders • However, mutations are concentrated in known cancer-related pathways, which suggests that a pathway-centric approach will be useful in the analysis of cancer genomics data. Nat. Rev. Cancer Poster (2002)
MUFFINN: mutations for functional impact on network neighbors • Predict driver genes based on pathway-level mutational information. Genome Biology (2016)
3 ways to take into account neighbors' mutational burden • Applied on the following two functional gene networks: Genome Res. (2011), Nucleic Acids Res. (2015)
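The slide text does not spell out the three approaches, so the following is only an illustrative sketch of the general idea: score each gene by its own mutation count plus an aggregate over its direct network neighbors. The `max` and `sum` aggregations, the function name, and the toy network and counts are assumptions, not the published MUFFINN definitions.

```python
def neighbor_burden_scores(mutation_count, network, how="max"):
    """Score = own mutation count + aggregated count over direct neighbors."""
    scores = {}
    for gene, neighbors in network.items():
        own = mutation_count.get(gene, 0)
        burden = [mutation_count.get(nb, 0) for nb in neighbors]
        if not burden:
            aggregated = 0          # isolated gene: no neighbor information
        elif how == "max":
            aggregated = max(burden)
        elif how == "sum":
            aggregated = sum(burden)
        else:
            raise ValueError(f"unknown aggregation: {how}")
        scores[gene] = own + aggregated
    return scores

# Hypothetical network: GENE_X is rarely mutated but neighbors two hot genes
network = {
    "TP53": ["GENE_X"],
    "PTEN": ["GENE_X"],
    "GENE_X": ["TP53", "PTEN"],
    "GENE_Y": [],
}
counts = {"TP53": 300, "PTEN": 100, "GENE_X": 1, "GENE_Y": 5}

by_max = neighbor_burden_scores(counts, network, how="max")
by_sum = neighbor_burden_scores(counts, network, how="sum")
print(by_max["GENE_X"], by_sum["GENE_X"], by_max["GENE_Y"])
```

In this toy case GENE_X outranks GENE_Y despite having fewer mutations of its own, because its neighbors carry a heavy mutational burden.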
Cancer gene sets for benchmarking prediction • No comprehensive gold-standard cancer gene set exists. • We compiled multiple cancer gene sets from various sources of annotations. • Each cancer gene set has a different trade-off between accuracy, coverage, and bias.
• CGC (422 genes): from the CGC (Cancer Gene Census)
• CGC PointMut (118 genes): CGC genes which contribute to cancer via point mutations
• 20/20 Rules (124 genes): based on mutational patterns
• HCD (288 genes): high-confidence driver genes by a rule-based approach
• MouseMut (797 genes): ortholog-mapped genes identified by mutagenesis experiments in mice
Sources: Vogelstein et al. 2013; Futreal et al. 2004; Tamborero et al. 2013; March et al. 2011; Mann et al. 2012
Result 1: MUFFINN performs better than gene-based methods 18 cancer types ~6700 TCGA samples
Result 1: MUFFINN performs better than gene-based methods • Evaluation based on all candidates • Evaluation based on the top candidates, which go into follow-up studies
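The two evaluation views can be illustrated with two generic ranking metrics: area under the ROC curve over the whole ranking, and precision among the top k candidates. The slide does not name the metrics actually used, so these choices, the function names, and the toy ranking are assumptions.

```python
def auroc(ranked_genes, positives):
    """AUROC of a ranking (best first); assumes no tied scores."""
    n_pos = sum(g in positives for g in ranked_genes)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    correct_pairs = 0
    negatives_below = 0
    # Walk from worst to best: each positive beats every negative ranked below it
    for gene in reversed(ranked_genes):
        if gene in positives:
            correct_pairs += negatives_below
        else:
            negatives_below += 1
    return correct_pairs / (n_pos * n_neg)

def precision_at_k(ranked_genes, positives, k):
    """Fraction of the top-k candidates that are known cancer genes."""
    top = ranked_genes[:k]
    return sum(g in positives for g in top) / len(top)

# Hypothetical ranking of six genes, three of them in the benchmark set
ranking = ["A", "B", "C", "D", "E", "F"]
known = {"A", "B", "E"}
print(auroc(ranking, known), precision_at_k(ranking, known, 2))
```

The whole-ranking metric rewards good ordering everywhere, while the top-k metric reflects only the candidates that would realistically be followed up.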
Testing the significance of using mutational information among indirect network neighbors for MUFFINN • Use mutation information of direct neighbors only • Use mutation information of all genes
Result 2: MUFFINN predicts cancer drivers better when taking only direct neighbors' mutational information. • GS: Gaussian smoothing • IR: Iterative Rank • RWR: Random walk with restart
Result 3: The larger size of Pan-cancer data yields only marginal improvement in predictions.
Result 4: MUFFINN effectively predicts cancer genes with only 10% of tumor samples.
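For reference, random walk with restart (RWR), one of the diffusion methods named on this slide, can be sketched generically as below. This is a textbook formulation, not the configuration used in the comparison; the restart probability, iteration limits, and toy network are assumptions, and the network is assumed to be an undirected adjacency dict in which every neighbor also appears as a key.

```python
def random_walk_with_restart(network, seed_scores, restart=0.5,
                             max_iter=100, tol=1e-10):
    """Stationary visiting probabilities of a walker that restarts at seeds."""
    genes = list(network)
    total = sum(seed_scores.get(g, 0.0) for g in genes)
    if total <= 0:
        raise ValueError("seed_scores must contain positive mass")
    p0 = {g: seed_scores.get(g, 0.0) / total for g in genes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {g: restart * p0[g] for g in genes}      # restart step
        for g in genes:
            neighbors = network[g]
            if not neighbors:
                nxt[g] += (1 - restart) * p[g]         # isolated gene keeps its mass
                continue
            share = (1 - restart) * p[g] / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share                       # spread evenly to neighbors
        delta = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if delta < tol:
            break
    return p

# Path network A - B - C with all mutation "heat" starting on A (hypothetical)
path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
p = random_walk_with_restart(path, {"A": 1.0})
print(p)
```

Unlike a direct-neighbor score, the diffusion also passes mutational signal to indirect neighbors such as C, which is the contrast this result examines.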
Manual examination of the novel candidate drivers
• Selected 199 novel candidate drivers that pass all of the following criteria:
1. Predicted in the top 1000 by MUFFINN (Prob > 0.5)
2. Predicted in the top 1000 by neither MutSig nor MutationAssessor
3. Annotated by neither the CGC nor the 20/20 cancer gene set (to exclude all known genes)
• Among the 199 candidate cancer genes, 128 (64%) have direct or indirect supporting evidence in the literature.
• Class 1 (11 genes): already reported as cancer genes but not yet annotated by the CGC or 20/20 database.
• Class 2 (14 genes): known to increase cancer susceptibility through germline variants.
• Class 3 (14 genes): known to be involved in cancer through copy number variation (CNV) or structural variation (SV).
• Class 4 (89 genes): associated with cancer via expression dysregulation with non-genetic alterations (e.g., epigenetic regulation, miRNA target).
• Class 5 (71 genes): no evidence (novel candidates to be investigated in the future).
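The three selection criteria amount to a rank cutoff followed by set exclusions, which can be sketched as below. The input formats (a MUFFINN list of gene and probability pairs sorted best first, plain ranked lists for the gene-based methods, sets for known genes), the function name, and the toy data are all assumptions for illustration.

```python
def select_novel_candidates(muffinn_ranking, other_rankings, known_gene_sets,
                            top_n=1000, min_prob=0.5):
    """Genes ranked highly by MUFFINN only and absent from known gene sets."""
    # Criterion 1: within the MUFFINN top N and above the probability cutoff
    top = [gene for gene, prob in muffinn_ranking[:top_n] if prob > min_prob]
    excluded = set()
    # Criterion 2: drop anything in the top N of the gene-based methods
    for ranking in other_rankings:
        excluded.update(ranking[:top_n])
    # Criterion 3: drop anything already annotated as a cancer gene
    for gene_set in known_gene_sets:
        excluded.update(gene_set)
    return [gene for gene in top if gene not in excluded]

# Toy example with a cutoff of 3 instead of 1000 (hypothetical genes)
muffinn = [("G1", 0.9), ("G2", 0.8), ("G3", 0.6), ("G4", 0.55)]
method_a = ["G2", "G9", "G8", "G1"]
method_b = ["G7", "G6", "G5", "G3"]
known = [{"G1"}, set()]

novel = select_novel_candidates(muffinn, [method_a, method_b], known, top_n=3)
print(novel)
```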
Novel candidate drivers with low mutation occurrence have neighboring genes known to be involved in cancer pathways
Performing prediction using a companion web server www.inetbio.org/muffinn
Summary
• Cancer genome sequencing can facilitate the discovery of cancer driver genes.
• We can distinguish drivers from passengers based on recurrent mutations.
• Conventional methods based on recurrent mutations are intrinsically limited for cancer genes with low mutation occurrence.
• Since cancer is a pathway disease, incorporating pathway information will enhance cancer genomics data analysis.
• We developed a network-based method, MUFFINN, and a companion web server, and demonstrated its superiority in cancer gene prediction.
• Network-based analysis of cancer genomics data will provide a promising route to the comprehensive catalog of cancer genes.
Acknowledgements • MUFFINN: cancer gene discovery via network analysis of somatic mutation data. Genome Biology 17:129 (June 2016) • Yonsei University, Department of Biotechnology (Korea): Ara Cho, Jung Eun Shim, Eiru Kim • EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (Spain): Ben Lehner, Fran Supek
Network Biology Lab (www.netbiolab.org) Current members Former members PhD. Jung Eun Shim Sangyoung Lee PhD. Sohyun Hwang PhD. Taeyun Oh PhD. Eiru Kim Chan Yeong Kim PhD. Jawon Song PhD. Samuel Beck Tak Lee Muyoung Lee PhD. Jonghoon Lee PhD. Yoonhee Ko Sunmo Yang Jaewon Cho PhD. Junha Shin PhD. Hanhae Kim Kyungsoo Kim Eunbeen Kim PhD. Ara Cho PhD. Sungou Ji Heonjong Han Hongseok Shim Hyojin Kim Dasom Bae
Result: Accounting for mutational heterogeneity is not important for MUFFINN.
HotNet2 vs. MUFFINN
• HotNet2 (Nat. Genet. 2015)
1. Assign heat (mutation) to each gene
2. Diffuse heat from hot (highly mutated) to cold genes in the network
3. Extract significantly hot subnetworks (cancer pathways)
• MUFFINN (this study)
1. Assign heat (mutation) to each gene
2. For each gene, measure mutational burden over network neighbors
3. Rank genes (cancer genes) by the mutational burden
Result: HotNet2 and MUFFINN are complementary • Retrieval rate for known cancer genes among the 144 candidates by HotNet2 and the top 144 candidates by MUFFINN • Venn diagram among 422 CGC genes, 144 candidates by HotNet2, and top 144 candidates by MUFFINN