High-dimensional statistics and probability
Christophe Giraud (1), Matthieu Lerasle (2, 3) and Tristan Mary-Huard (4, 5)
(1) Université Paris-Saclay (2) CNRS (3) ENSAE (4) AgroParistech (5) INRA - Le Moulon
M2 Maths Aléa & MathSV
C. Giraud (Paris Saclay) High-dimensional statistics & probability M2 Maths Aléa & MathSV 1 / 34
Information on the course
Objective
1. To understand the main features of high-dimensional observations;
2. To learn the main concepts and methods to handle the curse of dimensionality;
3. To get prepared for a PhD in statistics or machine learning;
4. [MSV] Some biological illustrations by T. Mary-Huard.
→ conceptual and mathematical course
Agenda (1/2)
Structure: the course has two parts.
- Part 1 [MDA+MSV]: 7 weeks with C. Giraud: central concepts in the simple Gaussian setting
- Part 2 [MDA]: 7 weeks with M. Lerasle: essential probabilistic tools for stats and ML
- Part 2 [MSV]: 3 weeks with T. Mary-Huard: illustrations and supervised classification
Agenda (2/2)
[MDA+MSV] 29/09 – 17/11
1. Curse of dimensionality + principle of model selection
2. Model selection theory
3. Information theoretic lower bounds
4. Convexification: principle and theory
5. Iterative algorithms
6. Low rank regression
7. False discoveries and multiple testing
MSV (Tristan): 3 weeks with algorithmic aspects, illustrations and supervised classification
MDA (Matthieu): 7 weeks on central probabilistic tools for ML and statistics
Organisation for the first part
- Lectures: the lectures will be recorded and posted one week in advance on the Youtube channel https://www.youtube.com/channel/UCDo2g5DETs2s-GKu9-jT_BQ
- Lecture notes: lecture notes are available on the website of the course https://www.imo.universite-paris-saclay.fr/~giraud/Orsay/HDPS.html as well as handwritten notes for each lecture
- Exercises: the list of assigned exercises is given on the website
- Interactive sessions: every Tuesday at 10 am (room 1A7 or 1A14): a short recap, some time for questions, and discussions. Only half of you can attend in person; the others will follow on the BBB channel https://bbb3.imo.universite-paris-saclay.fr/b/mas-nul-mln
- December 15: exam on the first part of the course
  ◮ 7 pt: on 1 or 2 exercises from the assigned list
  ◮ 13 pt: research problem
Learn by doing
- You watch the recorded lectures actively:
  ◮ you try to understand all the explanations;
  ◮ if a point is not clear, press the pause button, try to understand, look at the lecture notes, release the pause button;
  ◮ when I ask questions: press the pause button, try to answer, release the pause button;
  ◮ do not forget coffee breaks ;-)
- You work through the lecture notes: take a pen and a sheet of paper, and redo all the computations. You have understood something when you are able to
  ◮ explain it to someone else;
  ◮ answer the question "why have we done that instead of anything else?"
- You work through the assigned exercises.
- You participate actively in the interactive sessions, either in person or via the BBB channel.
Documents
- Lecture notes: pdf & printed versions, handwritten notes
- Website of the course https://www.imo.universite-paris-saclay.fr/~giraud/Orsay/HDPS.html
- Youtube channel https://www.youtube.com/channel/UCDo2g5DETs2s-GKu9-jT_BQ
- A wiki website for sharing solutions to the exercises http://high-dimensional-statistics.wikidot.com
Evaluation
[MDA+MSV] Exam on December 15
- 1 or 2 exercises (or parts of exercises) from the list (7/20)
  ◮ list = those on the website https://www.imo.universite-paris-saclay.fr/~giraud/Orsay/HDPS.html
- a research problem (13/20)
[MDA] Second exam in late January, mainly on the material presented by Matthieu Lerasle
Any questions so far?
High-dimensional data (Chapter 1)
High-dimensional data
- biotech data (sensing thousands of features)
- images (millions of pixels / voxels)
- marketing, business data
- crowdsourcing data
- etc.
Blessing?
- We can sense thousands of variables on each "individual": potentially, we will be able to scan every variable that may influence the phenomenon under study.
- The curse of dimensionality: separating the signal from the noise is in general almost impossible in high-dimensional data, and computations can rapidly exceed the available resources.
Curse of dimensionality (Chapter 1)
Curse 1: fluctuations cumulate
Example: $X^{(1)}, \ldots, X^{(n)} \in \mathbb{R}^p$ i.i.d. with $\mathrm{cov}(X) = \sigma^2 I_p$. We want to estimate $\mathbb{E}[X]$ with the sample mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X^{(i)}.$$
Then
$$\mathbb{E}\big[\|\bar{X}_n - \mathbb{E}[X]\|^2\big] = \sum_{j=1}^{p} \mathbb{E}\Big[\big([\bar{X}_n]_j - \mathbb{E}[X_j]\big)^2\Big] = \sum_{j=1}^{p} \mathrm{var}\big([\bar{X}_n]_j\big) = \frac{p}{n}\,\sigma^2.$$
It can be huge when $p \gg n$...
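The identity above can be checked numerically. The following is a minimal sketch, not part of the course material: it assumes centred Gaussian coordinates (the slide only assumes $\mathrm{cov}(X) = \sigma^2 I_p$), and the function names and default parameters are illustrative choices.

```python
import random


def expected_sq_error(p, n, sigma2=1.0):
    """Theoretical value E||X_bar_n - E[X]||^2 = p * sigma^2 / n."""
    return p * sigma2 / n


def simulated_sq_error(p, n, sigma=1.0, reps=200, seed=0):
    """Monte Carlo estimate of E||X_bar_n - E[X]||^2.

    Assumption for illustration: X ~ N(0, sigma^2 I_p), so E[X] = 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sq_norm = 0.0
        for _ in range(p):
            # j-th coordinate of the sample mean of n centred observations
            mean_j = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
            sq_norm += mean_j ** 2
        total += sq_norm
    return total / reps


if __name__ == "__main__":
    n = 20
    for p in (1, 10, 100):
        print(p, expected_sq_error(p, n), simulated_sq_error(p, n))
```

With n fixed, the simulated error should grow roughly linearly in p, in line with the factor p/n.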
Curse 2: locality is lost
Observations $(Y_i, X^{(i)}) \in \mathbb{R} \times [0,1]^p$ for $i = 1, \ldots, n$.
Model: $Y_i = f(X^{(i)}) + \varepsilon_i$ with $f$ smooth.
Assume that $(Y_i, X^{(i)})_{i=1,\ldots,n}$ are i.i.d. and that $X^{(i)} \sim \mathcal{U}([0,1]^p)$.
Local averaging: $\hat{f}(x) = \text{average of } \{Y_i : X^{(i)} \text{ close to } x\}$
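A minimal sketch of local averaging, under the assumption that "close to x" means "within Euclidean distance h of x". The radius `h` and both function names are hypothetical and do not come from the slides. The second function counts how many of n uniform points fall near the centre of the hypercube, which illustrates why the neighbourhood empties out as p grows.

```python
import math
import random


def local_average(x, points, responses, h):
    """Average of the Y_i whose X^(i) lies within Euclidean distance h of x.

    Returns None when no observation falls in the neighbourhood.
    """
    selected = [y for pt, y in zip(points, responses) if math.dist(pt, x) <= h]
    return sum(selected) / len(selected) if selected else None


def count_neighbours(p, n=100, h=0.5, seed=0):
    """Number of n uniform points in [0,1]^p within distance h of the centre."""
    rng = random.Random(seed)
    centre = [0.5] * p
    pts = [[rng.random() for _ in range(p)] for _ in range(n)]
    return sum(1 for pt in pts if math.dist(pt, centre) <= h)


if __name__ == "__main__":
    for p in (1, 2, 10, 100):
        print(p, count_neighbours(p))
```

In dimension 1 every point is within 0.5 of the centre; in high dimension essentially none is, so the local average has nothing to average.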
Curse 2: locality is lost
[Figure: four histogram panels (dimension = 2, 10, 100, 1000); x-axis: distance between points; y-axis: frequency.]
Figure: Histograms of the pairwise distances between n = 100 points sampled uniformly in the hypercube $[0,1]^p$, for p = 2, 10, 100 and 1000.
Why?
Square distances.
$$\mathbb{E}\big[\|X^{(i)} - X^{(j)}\|^2\big] = \sum_{k=1}^{p} \mathbb{E}\Big[\big(X^{(i)}_k - X^{(j)}_k\big)^2\Big] = p\,\mathbb{E}\big[(U - U')^2\big] = p/6,$$
with $U, U'$ two independent random variables with $\mathcal{U}[0,1]$ distribution.
Standard deviation of the square distances.
$$\mathrm{sdev}\big[\|X^{(i)} - X^{(j)}\|^2\big] = \sqrt{\sum_{k=1}^{p} \mathrm{var}\Big[\big(X^{(i)}_k - X^{(j)}_k\big)^2\Big]} = \sqrt{p\,\mathrm{var}\big[(U' - U)^2\big]} \approx 0.2\,\sqrt{p}.$$
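These two moments can be checked by simulation. The sketch below is an illustration rather than the course's own code. It uses the exact value $\mathrm{var}[(U - U')^2] = 1/15 - 1/36 = 7/180$, whose square root is about 0.197, and it samples independent pairs of points rather than all pairwise distances within one sample, a simplification chosen here to keep the draws independent.

```python
import math
import random


def theoretical_moments(p):
    """Mean and standard deviation of ||X - X'||^2 for X, X' ~ U([0,1]^p).

    Uses E[(U - U')^2] = 1/6 and var[(U - U')^2] = 7/180.
    """
    return p / 6.0, math.sqrt(7.0 * p / 180.0)


def simulated_moments(p, pairs=2000, seed=0):
    """Empirical mean and standard deviation over independent pairs."""
    rng = random.Random(seed)
    sq = [
        sum((rng.random() - rng.random()) ** 2 for _ in range(p))
        for _ in range(pairs)
    ]
    mean = sum(sq) / pairs
    var = sum((s - mean) ** 2 for s in sq) / (pairs - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for p in (2, 10, 100, 1000):
        mean, sdev = theoretical_moments(p)
        # relative spread sdev / mean shrinks like 1 / sqrt(p)
        print(p, mean, sdev, sdev / mean, simulated_moments(p, pairs=500))
```

The mean grows like p while the standard deviation grows only like the square root of p, so the relative spread of the square distances vanishes: all points end up at a similar distance from one another.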
Curse 3: lost in high-dimensional spaces
High-dimensional balls have a vanishing volume!
$V_p(r)$ = volume of a ball of radius $r$ in dimension $p$
$$V_p(r) = r^p\, V_p(1)$$
with
$$V_p(1) \overset{p \to \infty}{\sim} \left(\frac{2\pi e}{p}\right)^{p/2} (p\pi)^{-1/2}.$$
[Figure: volume $V_p(1)$ of the unit ball as a function of the dimension p, for p from 0 to 100.]
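As an illustration, the asymptotic formula can be compared with the exact volume. The closed form $V_p(1) = \pi^{p/2} / \Gamma(p/2 + 1)$ used below is the standard expression for the volume of the unit ball; it is not stated on the slide, and the function names are illustrative.

```python
import math


def unit_ball_volume(p):
    """Exact volume V_p(1) = pi^(p/2) / Gamma(p/2 + 1), computed on the log scale."""
    return math.exp(0.5 * p * math.log(math.pi) - math.lgamma(0.5 * p + 1.0))


def unit_ball_volume_asymptotic(p):
    """Asymptotic equivalent (2 pi e / p)^(p/2) * (p pi)^(-1/2)."""
    log_main = 0.5 * p * math.log(2.0 * math.pi * math.e / p)
    return math.exp(log_main) / math.sqrt(p * math.pi)


if __name__ == "__main__":
    for p in (1, 2, 5, 20, 50, 100):
        print(p, unit_ball_volume(p), unit_ball_volume_asymptotic(p))
```

The volume first increases with p, peaks at p = 5, then decays to zero faster than exponentially.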
Curse 3: lost in high-dimensional space
Which sample size is needed to avoid the loss of locality?
Number n of points $x_1, \ldots, x_n$ required for covering $[0,1]^p$ by the balls $B(x_i, 1)$:
$$n \ge \frac{1}{V_p(1)} \overset{p \to \infty}{\sim} \left(\frac{p}{2\pi e}\right)^{p/2} \sqrt{p\pi}$$
p      n
20     39
30     45630
50     5.7 × 10^12
100    4.2 × 10^39
200    larger than the estimated number of particles in the observable universe
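The lower bound $1 / V_p(1)$ can be evaluated directly. This is a sketch that assumes the standard closed form $V_p(1) = \pi^{p/2} / \Gamma(p/2 + 1)$, which is not written on the slide; the function name is illustrative.

```python
import math


def min_points_to_cover(p):
    """Lower bound 1 / V_p(1) on the number of unit balls needed to cover [0,1]^p.

    Computed on the log scale to avoid overflow and underflow.
    """
    log_volume = 0.5 * p * math.log(math.pi) - math.lgamma(0.5 * p + 1.0)
    return math.exp(-log_volume)


if __name__ == "__main__":
    for p in (20, 30, 50, 100, 200):
        print(f"p = {p:3d}   n >= {min_points_to_cover(p):.3g}")
```

The bound comes from a volume argument: the hypercube has volume 1, so n balls of volume $V_p(1)$ can only cover it if $n\,V_p(1) \ge 1$.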
Curse 4: thin tails concentrate the mass!
[Figure: mass in the bell (y-axis, from 0 to 1) as a function of the dimension p (x-axis, from 0 to 100).]
Figure: Mass of the standard Gaussian distribution $g_p(x)\,dx$ in the "bell" $B_{p,0.001} = \{x \in \mathbb{R}^p : g_p(x) \ge 0.001\, g_p(0)\}$ for increasing dimensions p.
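The mass in the bell can be computed explicitly. Since $g_p(x) = (2\pi)^{-p/2} e^{-\|x\|^2/2}$, a point lies in $B_{p,0.001}$ exactly when $\|x\|^2 \le 2\ln(1000)$, and $\|X\|^2$ follows a chi-square distribution with p degrees of freedom. The sketch below is an illustration, not the code behind the figure. For even p it uses the classical identity $P(\chi^2_{2m} \le 2\lambda) = P(\mathrm{Poisson}(\lambda) \ge m)$, and it includes a Monte Carlo check; function names and defaults are illustrative.

```python
import math
import random

LAMBDA = math.log(1000.0)  # x is in the bell iff ||x||^2 <= 2 * LAMBDA


def bell_mass_even(p, terms=200):
    """Gaussian mass of the bell for even p, as the tail P(Poisson(LAMBDA) >= p/2).

    The infinite tail sum is truncated after `terms` terms, which is ample
    because the Poisson mean LAMBDA is about 6.9.
    """
    if p <= 0 or p % 2 != 0:
        raise ValueError("this closed form is only valid for positive even p")
    m = p // 2
    return sum(
        math.exp(k * math.log(LAMBDA) - math.lgamma(k + 1) - LAMBDA)
        for k in range(m, m + terms)
    )


def bell_mass_monte_carlo(p, samples=20000, seed=0):
    """Fraction of standard Gaussian draws in R^p that fall inside the bell."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
        if sq_norm <= 2.0 * LAMBDA:
            inside += 1
    return inside / samples


if __name__ == "__main__":
    for p in (2, 10, 20, 40, 100):
        print(p, bell_mass_even(p))
```

Summing the tail directly, rather than computing one minus the head of the sum, keeps the result accurate when the mass is extremely small.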