Scalable Methods for the Analysis of Network-Based Data
Principal Investigator: Professor Padhraic Smyth
Department of Computer Science, University of California Irvine
Slides online at www.datalab.uci.edu/muri
P. Smyth: Networks MURI Project Meeting, Nov 12 2010: 1
Today’s Meeting
• Goals
  – Review our research progress
  – Discussion, questions, interaction
  – Feedback from visitors
• Format
  – Introduction
  – Research talks
    • 25-minute slots
    • 5 minutes at end for questions/discussion
  – Poster session from 1:15 to 2:45
  – Questions/discussion encouraged during talks
  – Several breaks for discussion
Motivation and Background
Motivation
2007: interdisciplinary interest in analysis of large network data sets
Many of the available techniques are descriptive, cannot handle
- Prediction
- Missing data
- Covariates, etc.
Motivation
2007: interdisciplinary interest in analysis of large network data sets
Many of the available techniques are descriptive, cannot handle
- Prediction
- Missing data
- Covariates, etc.
2007: significant statistical body of theory available on network modeling
Many of the available techniques do not scale up to large data sets, not widely known/understood/used, etc.
Motivation
2007: interdisciplinary interest in analysis of large network data sets
Many of the available techniques are descriptive, cannot handle
- Prediction
- Missing data
- Covariates, etc.
2007: significant statistical body of theory available on network modeling
Many of the available techniques do not scale up to large data sets, not widely known/understood/used, etc.
Goal of this MURI project
Develop new statistical network models and algorithms to broaden their scope of application to large, complex, dynamic real-world network data sets
Project Dates
• Project Timeline
  – Start date: May 1 2008
  – End date: April 30 2011 (for 3-year award)
• Meetings
  – Kickoff Meeting, November 2008
  – Working Meeting, April 2009
  – Working Meeting, August 2009
  – Annual Review, December 2009
  – Working Meeting, May 2010
  – Annual Review, November 2010
MURI Team

Investigator          University   Department(s)      Expertise                            PhD Students  Postdocs
Padhraic Smyth (PI)   UC Irvine    Computer Science   Machine learning                     4
Carter Butts          UC Irvine    Sociology          Statistical social network analysis  6
Mark Handcock         UCLA         Statistics         Statistical social network analysis  1             1
Dave Hunter           Penn State   Statistics         Computational statistics             2             1
David Eppstein        UC Irvine    Computer Science   Graph algorithms                     2             1
Michael Goodrich      UC Irvine    Computer Science   Algorithms and data structures       1             1
Dave Mount            U Maryland   Computer Science   Algorithms and data structures       2
TOTALS                                                                                     18            4
Collaboration Network
[Figure: collaboration graph among project members]
Nodes: Joe Simon, Maarten Loffler, Ryan Acton, Zack Almquist, Chris Marcum, Darren Strash, Lowell Trott, Emma Spiro, Lorien Jasny, Sean Fitzhugh, David Eppstein, Duy Vu, Mike Goodrich, Dave Hunter, Carter Butts, Michael Schweinberger, Ruth Hummel, Padhraic Smyth, Dave Mount, Mark Handcock, Eunhui Park, Minkyoung Cho, Arthur Asuncion, Chris DuBois, Miruna Petrescu-Prahova, Nick Navaroli, Jimmy Foulds, Ranran Wang
Collaboration Network
[Figure: collaboration graph among project members, extended with additional collaborators]
Nodes: Ryan Acton, Joe Simon, Maarten Loffler, Nicole Pierski, Zack Almquist, Chris Marcum, Darren Strash, Lowell Trott, Emma Spiro, Lorien Jasny, Sean Fitzhugh, David Eppstein, Duy Vu, Mike Goodrich, Dave Hunter, Carter Butts, Michael Schweinberger, Ruth Hummel, Padhraic Smyth, Dave Mount, Mark Handcock, Krista Gile, Eunhui Park, Minkyoung Cho, Arthur Asuncion, Chris DuBois, Miruna Petrescu-Prahova, Nick Navaroli, Jimmy Foulds, Ranran Wang, Romain Thibaux
Data: Count matrix of 200,000 email messages among 3000 individuals over 3 months
Problem: Understand communication patterns and predict future communication activity
Challenges: sparse data, missing data, non-stationarity, unseen covariates
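As a rough illustration of the data structure described on this slide, the sketch below builds a sparse sender-by-recipient count matrix from a list of message events and ranks pairs by past volume as a naive predictor of future activity. The event format, the function names `build_count_matrix` and `predict_top_pairs`, and the frequency baseline are all assumptions made for illustration; they are not the project's models, which the slide does not specify.

```python
from collections import Counter


def build_count_matrix(events):
    """Count messages per (sender, recipient) pair.

    Only observed pairs are stored, which suits sparse data such as
    200,000 messages spread over 3000 x 3000 possible pairs.
    """
    counts = Counter()
    for sender, recipient in events:
        counts[(sender, recipient)] += 1
    return counts


def predict_top_pairs(counts, k):
    """Naive baseline (an assumption, not the project's method):
    predict that the k most active past pairs stay the most active."""
    return [pair for pair, _ in counts.most_common(k)]


# Hypothetical toy events: (sender, recipient)
events = [("a", "b"), ("a", "b"), ("b", "c"),
          ("a", "b"), ("c", "a"), ("b", "c")]
counts = build_count_matrix(events)
top = predict_top_pairs(counts, 2)
```

A baseline like this ignores the challenges listed on the slide (missing data, non-stationarity, unseen covariates), which is part of the motivation for model-based approaches.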
Data: Inter-organizational communication patterns over time, post-Katrina
Problem: Understand the processes underlying network growth
Challenge: noisy and sparse data, missing covariates
Key Scientific/Technical Challenges
• Parametrize models in a sensible and computable way
  – Respect theories of social behavior as well as explain observed data, in a computationally scalable manner
• Account for real data
  – Understand sampling methods: account for missing, error-prone data
• Make inference both principled and practical
  – Want accurate conclusions, but can’t wait forever for results
• Deal with rich and dynamic data
  – Real-world problems involve systems with complex covariates (text, geography, etc.) that change over time
In sum: statistically principled methods that respect the realities of data and computational constraints
Mapping the Project Terrain
[Diagram components]
– Domain Theory
– Data Collection
– Statistical Models
– Statistical Theory
Mapping the Project Terrain
[Diagram components]
– Domain Theory
– Data Collection
– Statistical Models
– Statistical Theory
– Data Structures and Algorithms
– Estimation Algorithms
Mapping the Project Terrain
[Diagram components]
– Domain Theory
– Data Collection
– Statistical Models
– Statistical Theory
– Data Structures and Algorithms
– Estimation Algorithms
– Inference
– Simulation
– Hypothesis Testing
– Prediction/Forecasting
– Decision Support
Summary of Accomplishments
Accomplishments: Theory and Methodology

Topic: General theory for handling missing data in social networks
– State of the art in 2008: Problem only partially understood. No software available for statistical modeling.
– State of the art now (with MURI): General statistical theory for treating missing data in a social network context. Publicly-available code in R. (Gile and Handcock, 2010)
– Potential applications and impact: Allows application of social network modeling to data sets with significant missing data

Topic: Hidden/network population sampling
– State of the art in 2008: No method for assessing sample quality. No method for sampling with no well-connected network.
– State of the art now (with MURI): New principled methods for assessing convergence. New multigraph sampling for non-connected networks (Butts et al., 2010)
– Potential applications and impact: Potentially significant new applications in areas such as criminology, epidemiology, etc.

Topic: Theory for complex network models
– State of the art in 2008: Little theory for non-Bernoulli models – knowledge based on approximate simulations
– State of the art now (with MURI): New method based on “Bernoulli graph bounds” (Butts, 2009)
– Potential applications and impact: Tools for understanding of model properties will allow us to focus on better models