University of California, Santa Barbara and the National Science Foundation Integrative Graduate Education and Research Training Program in Network Science
Outline
1. Overview of Network Science
2. My research in Network Science
3. IGERT Network Science at UCSB
4. My background and experiences
5. Questions?
What Are Networks?
Network Science
Challenges of Big Data: networks, analysis, modeling
Two significant paradigm shifts:
- "Big Data"-driven discovery, in fields ranging from biology and engineering to the social sciences and psychology
- Holistic study of systems: the interconnections between individual units (the network) affect the behavior of a system much more than the individual components do
My Research in Network Science
Fast Clustering Methods for Genetic Mapping
In collaboration with: Aydin Buluc, Leonid Oliker, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, Daniel Rokhsar, John Gilbert
Clustering
Finding groups of similar or highly connected vertices in a network
[Figure: a network partitioned into clusters C_1, C_2, and C_3]
Genetic Mapping Overview
A genetic map is a list of genetic markers ordered according to their co-segregation patterns.
[Figure: markers 1-4 located on Chromosomes 1-3]
Genetic Mapping Overview
A genetic map is a list of genetic elements ordered according to their co-segregation patterns.
[Figure: markers 1-4 located on Chromosomes 1-3, alongside the corresponding genetic map with Linkage Groups 1-3]
Genetic Mapping Overview
The problem of genetic mapping can essentially be divided into three parts: (1) grouping, (2) ordering, and (3) spacing.
Genetic Mapping: (1) Grouping Phase
Data: genotype calls for markers m_1 to m_15 (rows) across individuals i_1 to i_6 (columns); "-" denotes missing data.

        i_1  i_2  i_3  i_4  i_5  i_6
m_1      A    B    -    -    A    -
m_2      A    B    A    A    B    A
m_3      A    A    -    -    -    B
m_4      A    -    B    -    B    B
m_5      B    -    B    A    -    A
m_6      A    A    B    A    -    -
m_7      -    -    -    A    B    B
m_8      A    B    A    B    -    A
m_9      A    B    -    B    -    -
m_10     B    B    B    -    A    A
m_11     A    A    A    A    B    B
m_12     B    -    A    B    A    -
m_13     B    B    -    A    A    -
m_14     -    -    -    B    A    A
m_15     B    -    -    A    A    B

The data are modeled as a network and clustered into linkage groups (LG 1, LG 2).
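As a rough illustration of how a genotype matrix like this one might be turned into a network, the sketch below treats each marker as a vertex and connects two markers when they disagree on at most a given fraction of the individuals observed in both. The mismatch fraction, the function names, the marker labels, and the 0.4 cutoff are all assumptions made for illustration; the slides do not define the distance, and the later slide titles suggest the actual method is based on LOD scores.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

MISSING = "-"  # missing-data symbol used in the genotype matrix


def mismatch_fraction(a: str, b: str) -> Optional[float]:
    """Fraction of disagreeing calls among individuals observed in both markers.

    Returns None when the two markers share no observed individual.
    This is a stand-in distance, not the LOD score.
    """
    shared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def build_network(genotypes: Dict[str, str], max_fraction: float) -> List[Tuple[str, str]]:
    """Connect every pair of markers whose mismatch fraction is small enough."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(genotypes.items(), 2):
        frac = mismatch_fraction(a, b)
        if frac is not None and frac <= max_fraction:
            edges.append((name_a, name_b))
    return edges


# Three rows taken from the genotype matrix; the labels are hypothetical.
genotypes = {
    "marker_a": "AB--A-",
    "marker_b": "ABAABA",
    "marker_c": "B-BA-A",
}
edges = build_network(genotypes, max_fraction=0.4)
print(edges)
```

Note that this builds the network by comparing every pair of markers, which is exactly the quadratic step that a scalable method needs to avoid.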
Problems in Large-Scale Genetic Mapping
- State-of-the-art mapping tools don't scale well
- Hundreds of thousands of genetic markers are available, but current software can handle only up to ~10,000 markers
- The bottleneck is the linkage-group-finding phase
- Popular mapping tools all handle this phase the same way, with an O(m^2) clustering algorithm for m markers
- Our solution: a fast, scalable clustering algorithm tailored to genetic marker data
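To make the scaling problem concrete, the short calculation below counts marker pairs, assuming (as is typical, though not stated on the slide) that the quadratic cost comes from comparing every pair of markers. The function name is hypothetical.

```python
def pairwise_comparisons(m: int) -> int:
    """Number of unordered marker pairs among m markers: m * (m - 1) / 2."""
    return m * (m - 1) // 2


for m in (10_000, 100_000):
    print(f"{m:>7,} markers -> {pairwise_comparisons(m):>13,} marker pairs")
```

Going from 10,000 to 100,000 markers multiplies the number of pairs by roughly 100, which is why an all-pairs approach stops being practical well before the marker counts now available.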
Our Approach: BubbleCluster Algorithm
Assume: clusters have a "linear" structure
Our Approach: BubbleCluster Algorithm
Assume: clusters have a "linear" structure
[Figure labels: threshold distance, representative points]
Key idea: maintain a set of representative points per cluster, the union of which "spans" an entire cluster
BubbleCluster Algorithm
[Figure: marker m, a representative point r_j at distance d(m, r_j), clusters C_1 and C_2]
Iteration i: find r_MIN := the r_j for which d(m, r_j) is minimal; set C_MIN := the cluster C_K ∈ C containing r_MIN
BubbleCluster Algorithm
[Figure: marker m, r_MIN, clusters C_1 and C_2]
If (d(m, r_MIN) > threshold)
BubbleCluster Algorithm
[Figure: marker m as a new cluster C_3, next to C_1 and C_2]
If (d(m, r_MIN) > threshold)
    C = C ∪ {m}
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, r_MIN, C_MAX = C_1, C_2]
Else If (IS_INTERIOR(r_MIN))
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, r_MIN, C_MAX = C_1, C_2]
Else If (IS_INTERIOR(r_MIN))
    C_MIN = C_MIN ∪ {m}
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, r_MIN, C_MAX = C_1, C_2]
Else If (IS_EXTERIOR(m, r_MIN))
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, r_MIN, C_MAX = C_1, C_2]
Else If (IS_EXTERIOR(m, r_MIN))
    Add m to representative points of C_MIN
    Add m to C_MIN
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, r_MIN, C_MAX = C_1, C_2]
Else // m is interior to the outer point r_MIN
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, r_MIN, C_MIN = C_1, C_2]
Else
    Add m to C_MIN
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, C_MIN = C_1, C_2]
If the marker has a distance below the threshold to two clusters
Our Approach: "LOD score bubbles" algorithm
[Figure: marker m, C_NEW = C_1 ∪ C_2]
If the marker has a distance below the threshold to two clusters, merge the clusters and add m to the merged cluster
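The cases walked through on these slides can be sketched roughly as follows. This is a toy version under several assumptions, not the authors' implementation: markers are plain numbers on a line, `dist` is supplied by the caller, and the slides do not define IS_INTERIOR or IS_EXTERIOR, so the definitions used here (a representative is interior if it is not an end of its chain; a marker is exterior if it lies beyond an end point as seen from that end point's neighbour) are guesses. The merge case is checked before the interior/exterior cases, and only the two nearest clusters are merged.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Dist = Callable[[float, float], float]


@dataclass(eq=False)
class Cluster:
    members: List[float] = field(default_factory=list)
    reps: List[float] = field(default_factory=list)  # ordered along the assumed "line"


def is_interior(cluster: Cluster, idx: int) -> bool:
    # Assumption: a representative is interior if it is not an end of the chain.
    return 0 < idx < len(cluster.reps) - 1


def is_exterior(m: float, cluster: Cluster, idx: int, dist: Dist) -> bool:
    # Assumption: m is exterior to an end point if it is farther from that
    # end point's neighbouring representative than the end point itself is.
    reps = cluster.reps
    if len(reps) == 1:
        return True
    neighbour = reps[1] if idx == 0 else reps[-2]
    return dist(m, neighbour) > dist(reps[idx], neighbour)


def oriented(reps: List[float], m: float, dist: Dist, nearest_last: bool) -> List[float]:
    # Return the chain ordered so that its end nearest to m comes last (or first).
    end_is_nearest = dist(m, reps[-1]) <= dist(m, reps[0])
    return list(reps) if end_is_nearest == nearest_last else reps[::-1]


def bubble_cluster(markers: List[float], threshold: float, dist: Dist) -> List[Cluster]:
    clusters: List[Cluster] = []
    for m in markers:
        near = []  # (distance, representative index, cluster) within the threshold
        for c in clusters:
            idx = min(range(len(c.reps)), key=lambda i: dist(m, c.reps[i]))
            d = dist(m, c.reps[idx])
            if d <= threshold:
                near.append((d, idx, c))
        near.sort(key=lambda t: t[0])
        if not near:
            # d(m, r_MIN) > threshold: m starts a new cluster.
            clusters.append(Cluster([m], [m]))
        elif len(near) > 1:
            # m is close to two clusters: merge them and add m.
            a, b = near[0][2], near[1][2]
            a.reps = oriented(a.reps, m, dist, True) + [m] + oriented(b.reps, m, dist, False)
            a.members += b.members + [m]
            clusters.remove(b)
        else:
            _, idx, c_min = near[0]
            if not is_interior(c_min, idx) and is_exterior(m, c_min, idx, dist):
                # m extends the cluster, so it becomes a new end representative.
                if idx == 0 and len(c_min.reps) > 1:
                    c_min.reps.insert(0, m)
                else:
                    c_min.reps.append(m)
            c_min.members.append(m)
    return clusters


clusters = bubble_cluster(
    [0.0, 1.0, 3.0, 4.0, 10.0, 2.0, 0.4],
    threshold=1.5,
    dist=lambda a, b: abs(a - b),
)
for c in clusters:
    print(sorted(c.members), c.reps)
```

In this toy run the marker at 2.0 bridges the clusters around {0, 1} and {3, 4}, so they are merged, while 0.4 is added as an ordinary member because it does not extend the span of representatives. Each marker is compared only against representative points rather than against every other marker, which is where the saving over an all-pairs method would come from.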
Evaluation Metric for Cluster Quality
- F-score range: 0 to 1
- Given a "gold standard" clustering, the F-score measures the quality of another clustering by comparing it to the gold standard
- The F-score is the harmonic mean of precision and recall
- An F-score of 1 indicates perfect precision and perfect recall for EVERY gold standard cluster
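One common way to compute a clustering F-score is sketched below. The slide does not say how per-cluster scores are combined, so the choice here (match each gold standard cluster to its best-scoring predicted cluster and weight by gold cluster size) is an assumption, as are the function names.

```python
from typing import List, Set


def f1(gold: Set[int], pred: Set[int]) -> float:
    """Harmonic mean of precision and recall for one gold/predicted cluster pair."""
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def clustering_f_score(gold_clusters: List[Set[int]], pred_clusters: List[Set[int]]) -> float:
    """Size-weighted average, over gold clusters, of the best F1 against any predicted cluster."""
    total = sum(len(g) for g in gold_clusters)
    return sum(
        len(g) / total * max(f1(g, p) for p in pred_clusters)
        for g in gold_clusters
    )


gold = [{1, 2, 3}, {4, 5}]
print(clustering_f_score(gold, [{1, 2, 3}, {4, 5}]))  # identical clustering
print(clustering_f_score(gold, [{1, 2, 3, 4, 5}]))    # everything lumped together
```

Under this definition the score reaches 1 only when every gold standard cluster is reproduced exactly, which matches the last point on the slide.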
Data
Real dataset: switchgrass, ~100,000 markers
- Jarrod Chapman at JGI provided us with his clustering results
Simulated data: SPAGHETTI [1]
- In our simulations, the number of markers varied from 1,000 to 100,000
- Missing-data experiments were performed on the 100,000-marker dataset, with the missing data rate varying from 5% to 50%
- Error rate was ~1%
[1] Nicholas A. Tinker. SPAGHETTI: Simulation Software to Test Genetic Mapping Programs.
Results: Our Algorithm Applied to Real Data
Switchgrass dataset:
- ~113,000 markers
- ~37% missing data
- No existing mapping tools could handle this much data
- The "gold standard clusters" are those provided by Jarrod Chapman
Overall F-score: 0.989806
Results: Clustering Algorithm Comparison
Simulated data

Clustering       12.5K markers         25K markers
                 F-score   Time        F-score   Time
JoinMap          0.99964   14 min      0.99982   46 min
MSTMap           0.99964   4.5 min     0.99982   20 min
BubbleCluster    0.99944   6 sec       0.99972   15 sec
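The runtime gap in this comparison is easier to read as speedup factors. The snippet below only restates the reported times in seconds and divides; the dictionary layout and function name are illustrative.

```python
# Reported runtimes from the comparison, converted to seconds.
runtimes_sec = {
    "12.5K markers": {"JoinMap": 14 * 60, "MSTMap": 4.5 * 60, "BubbleCluster": 6},
    "25K markers": {"JoinMap": 46 * 60, "MSTMap": 20 * 60, "BubbleCluster": 15},
}


def speedups(times: dict) -> dict:
    """Runtime of each other tool divided by BubbleCluster's runtime."""
    base = times["BubbleCluster"]
    return {tool: t / base for tool, t in times.items() if tool != "BubbleCluster"}


for dataset, times in runtimes_sec.items():
    for tool, factor in speedups(times).items():
        print(f"{dataset}: BubbleCluster is ~{factor:.0f}x faster than {tool}")
```

On these numbers BubbleCluster is 140x and 45x faster than JoinMap and MSTMap at 12.5K markers, and 184x and 80x faster at 25K markers, while its F-score is lower by 0.0002 and 0.0001 respectively.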
Conclusions
- By exploiting the structure underlying genetic marker clusters, we were able to design a fast algorithm tailored to genetic data
- While remaining highly accurate, we outperform existing mapping tools in runtime and scalability
- I think this is a good example of an interdisciplinary project in network science!
IGERT Program in Network Science
Prepare students to engineer and control large networks:
- measure and predict the dynamics
- design algorithms to operate at large scales
- make such networks robust
Growing demand from science, commerce, and national security:
- analysis of gene networks to find new therapies
- intervention strategies in social networks to counter the spread of misinformation
- discovery of clandestine terrorism activity
IGERT Program in Network Science
- Funded by the NSF (2013-2018)
- Recruit 5-7 students per year for 4 years, for a total of 25 students
- First cohort begins in Fall 2014
- Fellowship for the first two years:
  - $90,890 financial package for CA residents
  - $105,992 financial package for non-residents (tuition)
  - Fellowships include a $30,000 stipend
- Departmental RA/TA or campus fellowship for remaining years
- Students enter any of these seven departments: Communication; Geography; Computer Science; Mechanical Engineering; Ecology, Evolution, and Marine Biology; Sociology; Electrical and Computer Engineering
UCSB Graduate Education Costs*

Education Costs
Tuition                                  $12,192
Campus Based Fees                           $800
Health Insurance                          $2,453
Non-Resident Tuition Fee                 $14,694
Additional Non-Resident Education Fee       $408
Total                                    $30,547

Other Expenses
Personal Expenses                         $1,543
Loan Fees                                   $122
Rent                                     $13,468
Utilities                                   $431
Transportation                            $1,239
Telephone/Cell Phone                        $287
Books and Supplies                        $1,444
Food                                      $2,560
Total                                    $21,094

* Source: the UCSB Graduate Division website. These numbers, totaling $51,641, are used to determine financial aid for incoming graduate students.

IGERT fellowships cover 100% of the Education Costs and provide an additional annual stipend of $30,000 for the Other Expenses.
Departmental support after the first two years (RA or TA) provides a monthly salary and covers Education Costs. Salaries vary by department and research experience.
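The totals on this slide can be checked with a few lines of arithmetic, and the same figures appear to reproduce the two-year fellowship packages quoted on the previous slide ($90,890 and $105,992). That last step assumes the non-resident fees are charged for one year only; this reading is inferred from the numbers and is not stated in the slides. Variable names are illustrative.

```python
education_costs = {
    "Tuition": 12_192,
    "Campus Based Fees": 800,
    "Health Insurance": 2_453,
    "Non-Resident Tuition Fee": 14_694,
    "Additional Non-Resident Education Fee": 408,
}
other_expenses = {
    "Personal Expenses": 1_543,
    "Loan Fees": 122,
    "Rent": 13_468,
    "Utilities": 431,
    "Transportation": 1_239,
    "Telephone/Cell Phone": 287,
    "Books and Supplies": 1_444,
    "Food": 2_560,
}
ANNUAL_STIPEND = 30_000

education_total = sum(education_costs.values())
other_total = sum(other_expenses.values())

# Fees that apply to every student, resident or not.
resident_fees = sum(
    education_costs[k] for k in ("Tuition", "Campus Based Fees", "Health Insurance")
)
non_resident_fees = education_total - resident_fees

# Two fellowship years; non-resident fees assumed to apply in one year only.
resident_package = 2 * (resident_fees + ANNUAL_STIPEND)
non_resident_package = resident_package + non_resident_fees

print(education_total, other_total, education_total + other_total)
print(resident_package, non_resident_package)
```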