
Statistical Science and Data Science

Nancy Reid. Fisher Memorial Lecture, 27 October 2016.


  1. Fisher Number
       Selected Correspondence of R. A. Fisher, edited by J.H. Bennett
       “Do not forget to look up Walter Bodmer, who also has some experience being ‘bawled down’ by the Neymanians” (11 Jan 1962)

  2. “Some aspect of big data”
       = Big Machines
       = Lots of Computing
       = Complex Architectures
       = Computer Science

     Small data
       $p(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^T v + b^T h + v^T W h\}$, $\eta = (a, b, W)$
       = equations and formulas
       = mathematical modelling
       = a little computing
       = Statistical Science

     Big Data / Small Data
       So yesterday
       • Interesting
       • Detailed
       • Informative
       • Fun

  3. Small Data

     Gartner Hype Cycle: Big Data 2013, Big Data 2014

  4. 2015 Machine Learning

     The push back
       • “if the assessment never asks about race, how could the algorithm throw up racially biased results?”
       • “Credit scores are used by nearly half of American employers to screen potential employees”
       • “Big data” has arrived, but big insights have not
       • How big data threatens democracy and increases inequality

  5. Canadian Institute for Statistical Sciences
     Fields Institute for Research in Mathematical Sciences
     Pacific Institute for the Mathematical Sciences
     Centre de Recherches Mathématiques

     Workshops
       • Opening Conference and Bootcamp
       • Statistical Machine Learning
       • Optimization and Matrix Methods
       • Visualization: Strategies and Principles
       • Big Data in Health Policy
       • Big Data for Social Policy
       • Networks, Web mining, and Cyber-security
       • Statistical Theory for Large-scale Data
       • Challenges in Environmental Science
       • Complex Spatio-temporal Data
       • Commercial and Retail Banking

     Opening Conference and Bootcamp
       Introduction to topics at following workshops
       One day on each topic
       Many speakers started by trying to define big data
       “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description, and perhaps I could never succeed in intelligibly doing so. But I know it when I see it …”
       Justice Potter Stewart; Jacobellis v. Ohio, 22 June 1964
       Robert Bell, Google, Plenary Opening Lecture

  6. Some highlights
       • Statistical Machine Learning
       • Optimization
       • Visualization
       • Health Policy
       • Social Policy

     Some highlights
       • Statistical Machine Learning

     Statistical Machine Learning: Restricted Boltzmann machine
       $f(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^T v + b^T h + v^T W h\}$, $\eta = (a, b, W)$
       • natural gradient ascent: $\eta \leftarrow \eta + \epsilon\, i(\eta)^{-1} \nabla_\eta \ell(\eta; v, h)$, where $\ell = \log f$ and $i = E(-\ell'')$
       • uses Fisher information as metric tensor: Girolami and Calderhead (2011); Amari (1987); Rao (1945)
       • Gaussian graphical model approximation to force sparse inverse: Grosse and Salakhutdinov (2016), 32nd Internat. Conf. on Machine Learning
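The natural gradient update on the restricted Boltzmann machine slide, $\eta \leftarrow \eta + \epsilon\, i(\eta)^{-1} \nabla_\eta \ell$, can be tried on a much smaller model. The sketch below is an illustration under stated assumptions, not the lecture's example: it swaps the RBM for a one-parameter Bernoulli model with log-odds $\eta$, where the score and the Fisher information have closed forms. The function name, step size, and data are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def natural_gradient_ascent(ys, eta=0.0, eps=0.5, steps=50):
    """Natural gradient ascent for a Bernoulli model with log-odds eta.

    Assumed toy model (not the RBM from the slides):
      average log-likelihood  l(eta) = ybar * eta - log(1 + exp(eta))
      score                   l'(eta) = ybar - p,  p = sigmoid(eta)
      Fisher information      i(eta) = E(-l'') = p * (1 - p)
    """
    ybar = sum(ys) / len(ys)
    for _ in range(steps):
        p = sigmoid(eta)
        score = ybar - p
        fisher = p * (1.0 - p)
        # eta <- eta + eps * i(eta)^{-1} * score
        eta = eta + eps * score / fisher
    return eta


ys = [1, 0, 1, 1, 0, 1, 1, 1]          # hypothetical binary data
eta_hat = natural_gradient_ascent(ys)   # approaches log(ybar / (1 - ybar))
```

For this toy model the iteration settles at the log-odds of the sample mean. Setting `eps=1.0` turns the same update into Fisher scoring.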

  7. Restricted Boltzmann machine
       $f(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^T v + b^T h + v^T W h\}$
       • if just one binary top node, model for $h \mid v$ is a logistic regression
       • with several binary top nodes, model for $h_t \mid v, h_{-t}$ is also a logistic regression, with odds ratio depending only on $v$
       • deep learning has ~10 layers, with millions of units in each layer
       • estimating parameters is an optimization problem

     Restricted Boltzmann machine
       Brendan Frey, Infinite Genomes Project
       Fields Live, January 27 2015
       Leung et al., Bioinformatics 2014

     Some highlights
       • Statistical Machine Learning
       • Optimization
       • Visualization
       • Health Policy
       • Social Policy

     Some highlights
       • Optimization
       $\max_\theta \left\{ \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) - P_\lambda(\theta) \right\}$
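The claim on the restricted Boltzmann machine slide that $h_t \mid v, h_{-t}$ follows a logistic regression depending only on $v$ can be checked numerically on a tiny model. The sketch below is an illustration, not code from the lecture: the sizes (3 visible and 2 hidden binary nodes), the random parameters, and all function names are assumptions. It compares the conditional probability obtained directly from the unnormalized density with the logistic form $\mathrm{sigmoid}(b_t + \sum_i v_i W_{it})$.

```python
import itertools
import math
import random


def unnorm_density(v, h, a, b, W):
    """exp{a'v + b'h + v'Wh}: the RBM density without the 1/Z(eta) factor."""
    s = sum(ai * vi for ai, vi in zip(a, v))
    s += sum(bj * hj for bj, hj in zip(b, h))
    s += sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return math.exp(s)


def cond_prob_brute(t, v, h, a, b, W):
    """P(h_t = 1 | v, h_{-t}) from the ratio of densities; Z(eta) cancels."""
    h1, h0 = list(h), list(h)
    h1[t], h0[t] = 1, 0
    f1 = unnorm_density(v, h1, a, b, W)
    f0 = unnorm_density(v, h0, a, b, W)
    return f1 / (f1 + f0)


def cond_prob_logistic(t, v, b, W):
    """Logistic regression form: intercept b_t, coefficients W[., t]."""
    logit = b[t] + sum(v[i] * W[i][t] for i in range(len(v)))
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(1)
n_v, n_h = 3, 2  # hypothetical sizes, small enough to enumerate
a = [random.gauss(0, 1) for _ in range(n_v)]
b = [random.gauss(0, 1) for _ in range(n_h)]
W = [[random.gauss(0, 1) for _ in range(n_h)] for _ in range(n_v)]

max_gap = 0.0
for v in itertools.product([0, 1], repeat=n_v):
    for h in itertools.product([0, 1], repeat=n_h):
        for t in range(n_h):
            gap = abs(cond_prob_brute(t, v, h, a, b, W)
                      - cond_prob_logistic(t, v, b, W))
            max_gap = max(max_gap, gap)
```

Because the model has no term coupling hidden nodes to each other, the other hidden values $h_{-t}$ cancel in the ratio, which is why `cond_prob_logistic` takes no `h` argument; `max_gap` should be at the level of floating-point rounding.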

  8. Optimization
       $\max_\theta \left\{ \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) - P_\lambda(\theta) \right\}$
       • lasso penalty $P_\lambda(\theta) = \lambda \|\theta\|_1 = \lambda \sum_j |\theta_j|$
       • $\|\theta\|_1$ is convex relaxation of $\|\theta\|_0$
       • many interesting penalties are non-convex
       • optimization routines may not find global optimum
       • statistical error $\hat\theta - \theta^*$: neighbourhood of true value
       • approximation error $\theta^t - \hat\theta$: iterating over $t$
       Wainwright, Fields Live, Jan 16 2015
       Loh and Wainwright, JMLR 2015

     Some highlights
       • Statistical Machine Learning
       • Optimization
       • Visualization
       • Health Policy
       • Social Policy

     Some highlights
       • Visualization
       Innovis.cpsc.ucalgary.ca
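The penalized likelihood on the Optimization slides, with the lasso penalty $P_\lambda(\theta) = \lambda \|\theta\|_1$, can be sketched for one concrete case. The example below is an assumption-laden illustration, not the method discussed in the lecture: it takes a Gaussian linear model with unit error variance, so that maximizing the penalized log-likelihood amounts to minimizing $\frac{1}{2n}\|y - X\theta\|^2 + \lambda\|\theta\|_1$, and it uses proximal gradient steps (ISTA) as the iterative routine producing $\theta^t$. The simulated data, `theta_star`, the value of `lam`, and the function names are all hypothetical.

```python
import random


def soft_threshold(z, tau):
    """Proximal map of tau * |.|: move z toward zero by tau, clipping at 0."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0


def objective(X, y, theta, lam):
    """(1/2n) * residual sum of squares + lam * ||theta||_1 (to be minimized)."""
    n = len(y)
    rss = sum((yi - sum(x * t for x, t in zip(row, theta))) ** 2
              for row, yi in zip(X, y))
    return rss / (2 * n) + lam * sum(abs(t) for t in theta)


def lasso_ista(X, y, lam, steps=200):
    n, p = len(X), len(X[0])
    # trace(X'X / n) bounds the largest eigenvalue of X'X / n,
    # so its reciprocal is a conservative but safe step size.
    step = n / sum(x * x for row in X for x in row)
    theta = [0.0] * p
    history = [objective(X, y, theta, lam)]
    for _ in range(steps):
        resid = [yi - sum(x * t for x, t in zip(row, theta))
                 for row, yi in zip(X, y)]
        # Gradient of the average Gaussian log-likelihood: X' resid / n
        score = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        theta = [soft_threshold(t + step * s, step * lam)
                 for t, s in zip(theta, score)]
        history.append(objective(X, y, theta, lam))
    return theta, history


random.seed(0)
n, p = 60, 5
theta_star = [2.0, 0.0, -1.5, 0.0, 0.0]   # hypothetical sparse true value
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(x * t for x, t in zip(row, theta_star)) + random.gauss(0, 0.5)
     for row in X]
theta_hat, history = lasso_ista(X, y, lam=0.1)
```

In the slide's terms, the gap between successive iterates and their limit plays the role of the approximation error $\theta^t - \hat\theta$, while the distance from the limit to `theta_star` is the statistical error $\hat\theta - \theta^*$. Since this objective is convex, the iterates cannot get stuck in a local optimum; the slide's warning about global optima concerns non-convex penalties, which this sketch does not cover.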

  9. Visualization
       • statistical graphics
         – data representation
         – data exploration
         – filtering, sampling, aggregation
       • information visualization
       • scientific visualization
       • cognitive science and design

     Visualization
       KPMG Data Observatory, IC
       fivethirtyeight.com

  10. Visualization
        “The duty of beauty”
        New York Times
        fivethirtyeight.com

      Some highlights
        • Statistical Machine Learning
        • Optimization
        • Visualization
        • Health Policy
        • Social Policy

      Some highlights
        • Health Policy

  11. Health Policy: Administrative Databases
        Institute for Clinical and Evaluative Sciences
        Thérèse Stukel, ICES

      Some highlights
        • Statistical Machine Learning
        • Optimization
        • Visualization
        • Health Policy
        • Social Policy

  12. Some highlights
        • Social Policy
        Thérèse Stukel, ICES

      Privacy
        • “Big Data and Innovation, Setting the Record Straight: De-identification Does Work”, Privacy Commissioner of Ontario, July 2014
        • “No silver bullet: De-identification still doesn’t work”, Narayanan & Felten, July 2014
        • Statistical Disclosure Limitation
        • Differential Privacy
        • Multi-party Communication

      Some highlights
        • Statistical Machine Learning
        • Optimization
        • Visualization
        • Health Policy
        • Social Policy
        • inference, environmental science, networks, genomics, finance, physical sciences, software infrastructure, …

  13. What did we learn?
        1. Statistical models for big data are complex, high-dimensional
           – inference is well-studied, but difficult
        2. Computational challenges include size and speed
           – ideas of statistical inference get lost in the machine
        3. Data owners understand 2., but not 1.
        4. Data science may be the best way to combine these

      What is data science?
        • a course?
        • a set of courses?
        • a job?
        • a technology?
        • a new field of research?
        • a collaboration?

      Data Science Program(s)
        • mathematical reasoning
        • statistical theory
        • statistical and machine learning methods
        • programming and software development
        • algorithms and data structures
        • communication of results and limitations

      Data Science Research
        • data collection and data quality
        • large N, small p
          – computational strategies, e.g. Spark, Hadoop
          – divide and conquer
        • small n, large p
          – inferential and computational strategies
          – dimension reduction
          – post-selection inference
          – inference for extremes
        • ‘new’ types of data: networks, graphs, text, images, …
          – “alternative sources”
