compsci 514: algorithms for data science
Cameron Musco, University of Massachusetts Amherst. Spring 2020.
Lecture 12
logistics
• Problem Set 2 is due this upcoming Sunday 3/8 at 8pm.
• Midterm is next Thursday, 3/12. See webpage for study guide/practice questions.
• I will hold office hours after class today.
• Next week office hours will be at the usual time after class Tuesday and also before class at 10:00am.
summary
Last Class: Finished Up Johnson-Lindenstrauss Lemma
• Completed the proof of the Distributional JL lemma.
• Showed two applications of random projection: faster support vector machines and k-means clustering.
• Started discussion of high-dimensional geometry.
This Class: High-Dimensional Geometry
• Bizarre phenomena in high-dimensional space.
• Connections to JL lemma and random projection.
orthogonal vectors
What is the largest set of mutually orthogonal unit vectors in d-dimensional space? Answer: d.
What is the largest set of unit vectors in d-dimensional space that have all pairwise dot products |⟨x, y⟩| ≤ ϵ? (think ϵ = .01) Answer: 2^Θ(ϵ²d).
In fact, an exponentially large set of random vectors will be nearly pairwise orthogonal with high probability!
Claim: 2^Θ(ϵ²d) random d-dimensional unit vectors will have all pairwise dot products |⟨x, y⟩| ≤ ϵ (be nearly orthogonal).
Proof: Let x_1, ..., x_t each have independent random entries set to ±1/√d.
• What is ∥x_i∥₂? Every x_i is always a unit vector.
• What is E[⟨x_i, x_j⟩]? E[⟨x_i, x_j⟩] = 0.
• By a Chernoff bound, Pr[|⟨x_i, x_j⟩| ≥ ϵ] ≤ 2e^(−ϵ²d/6).
• If we choose t = (1/2)·e^(ϵ²d/12), using a union bound over all (t choose 2) ≤ (1/8)·e^(ϵ²d/6) possible pairs, with probability ≥ 3/4 all pairs will be nearly orthogonal. A numerical check is sketched below.
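Purely as an illustration (not part of the lecture), here is a minimal Python sketch of the claim: it draws t random ±1/√d sign vectors, with assumed values of d, ϵ, and t, and reports the largest pairwise dot product.

```python
import numpy as np

rng = np.random.default_rng(0)

d, eps = 4000, 0.1
t = 200  # number of random unit vectors (illustrative; the claim allows exponentially many in eps^2 * d)

# Each vector has independent +/- 1/sqrt(d) entries, so it is exactly a unit vector.
X = rng.choice([-1.0, 1.0], size=(t, d)) / np.sqrt(d)

# All pairwise dot products are the off-diagonal entries of the Gram matrix.
G = X @ X.T
np.fill_diagonal(G, 0.0)

print("largest |<x_i, x_j>| over all pairs:", np.abs(G).max())  # typically well below eps
```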
curse of dimensionality
Upshot: In d-dimensional space, a set of 2^Θ(ϵ²d) random unit vectors have all pairwise dot products at most ϵ (think ϵ = .01), so
∥x_i − x_j∥₂² = ∥x_i∥₂² + ∥x_j∥₂² − 2⟨x_i, x_j⟩ ≥ 1.98.
Even with an exponential number of random vector samples, we don't see any nearby vectors.
• Curse of dimensionality for sampling/learning functions in high-dimensional space – samples are very 'sparse' unless we have a huge amount of data.
• Can make methods like nearest neighbor classification or clustering useless.
• Only hope is if we have lots of structure (which we typically do...).
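A minimal sketch, with illustrative (assumed) values of n and d, of why nearest-neighbor methods degrade: for random unit vectors, the nearest and farthest neighbors of a random query become nearly equidistant as d grows.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500  # number of random sample points (illustrative)
for d in [2, 10, 100, 1000, 10000]:
    # Random unit vectors: Gaussian entries, normalized.
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    q = rng.standard_normal(d)
    q /= np.linalg.norm(q)

    dists = np.linalg.norm(X - q, axis=1)
    # In high dimensions all distances cluster near sqrt(2): the nearest
    # neighbor is barely closer than the farthest one.
    print(f"d={d:6d}  min={dists.min():.3f}  max={dists.max():.3f}")
```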
curse of dimensionality
Another Interpretation: Tells us that random data can be a very bad model for actual input data.
[Figure: histograms of pairwise distances for random images and for MNIST digits, with example images from each set.]
connection to dimensionality reduction
Recall: The Johnson-Lindenstrauss lemma states that if Π ∈ R^(m×d) is a random matrix (linear map) with m = O(log n / ϵ²), then for x_1, ..., x_n ∈ R^d, with high probability, for all i, j:
(1 − ϵ)∥x_i − x_j∥₂² ≤ ∥Πx_i − Πx_j∥₂² ≤ (1 + ϵ)∥x_i − x_j∥₂².
Implies: If x_1, ..., x_n are nearly orthogonal unit vectors in d dimensions (with pairwise dot products bounded by ϵ/8), then Πx_1/∥Πx_1∥₂, ..., Πx_n/∥Πx_n∥₂ are nearly orthogonal unit vectors in m dimensions (with pairwise dot products bounded by ϵ).
• Similar to SVM analysis. Algebra is a bit messy but a good exercise to partially work through. A small numerical sketch follows.
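A small numerical sketch of the implication, using one standard choice of random projection (Gaussian entries scaled by 1/√m) and illustrative, assumed values of d, n, m, and ϵ; it is a plausible instance, not the exact construction from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, eps = 5000, 50, 0.1
m = 1000  # reduced dimension (illustrative stand-in for O(log n / eps^2))

# Nearly orthogonal unit vectors in d dimensions: random +/- 1/sqrt(d) entries.
X = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)

# Random Gaussian projection Pi with entries N(0, 1/m), so E[||Pi x||^2] = ||x||^2.
Pi = rng.standard_normal((m, d)) / np.sqrt(m)
Y = X @ Pi.T                                    # projected vectors, one per row
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # renormalize to unit length

# Largest off-diagonal entry of each Gram matrix = largest pairwise |dot product|.
before = np.abs(X @ X.T - np.eye(n)).max()
after = np.abs(Y @ Y.T - np.eye(n)).max()
print("max pairwise |dot| before projection:", before)
print("max pairwise |dot| after projection: ", after)  # still small, though a bit larger
```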
connection to dimensionality reduction
Claim 1: n nearly orthogonal unit vectors can be projected to m = O(log n / ϵ²) dimensions and still be nearly orthogonal.
Claim 2: In m dimensions, there are at most 2^O(ϵ²m) nearly orthogonal vectors.
• For both these to hold it must be that n ≤ 2^O(ϵ²m).
• 2^O(ϵ²m) = 2^O(log n) ≥ n. Tells us that the JL lemma is optimal up to constants.
• m is chosen just large enough so that the odd geometry of d-dimensional space still holds on the n points in question after projection to a much lower dimensional space.
bizarre shape of high-dimensional balls
Let B_d be the unit ball in d dimensions: B_d = {x ∈ R^d : ∥x∥₂ ≤ 1}. Volume of a radius R ball is π^(d/2)/(d/2)! · R^d.
What percentage of the volume of B_d falls within ϵ distance of its surface? Answer: all but a (1 − ϵ)^d ≤ e^(−ϵd) fraction, since the interior is a scaled copy of the ball with radius 1 − ϵ. Exponentially small in the dimension d!
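A quick arithmetic check (the values of ϵ and d are illustrative) of how fast the interior fraction (1 − ϵ)^d vanishes:

```python
# Fraction of the unit ball's volume NOT within eps of the surface is (1 - eps)^d,
# since the interior is a scaled copy of the ball with radius 1 - eps.
eps = 0.01
for d in [10, 100, 1000, 10000]:
    interior = (1 - eps) ** d
    print(f"d={d:6d}  interior fraction={interior:.3e}  within eps of surface={1 - interior:.4f}")
```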
bizarre shape of high-dimensional balls
All but an e^(−ϵd) fraction of a unit ball's volume is within ϵ of its surface. If we randomly sample points with ∥x∥₂ ≤ 1, nearly all will have ∥x∥₂ ≥ 1 − ϵ.
• Isoperimetric inequality: the ball has the minimum surface area/volume ratio of any shape.
• If we randomly sample points from any high-dimensional shape, nearly all will fall near its surface.
• 'All points are outliers.'
bizarre shape of high-dimensional balls
What fraction of the small cubes are visible on the surface of a 10 × 10 × 10 cube (10³ = 1000 small cubes)?
(10³ − 8³)/10³ = (1000 − 512)/1000 = .488.
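Though not on the slide, the same arithmetic generalizes to an n × n × ... × n cube in dim dimensions, where the surface fraction 1 − ((n − 2)/n)^dim approaches 1 as dim grows; a short sketch:

```python
# Fraction of small cubes visible on the surface of an n x n x ... x n cube
# in dim dimensions: 1 - ((n - 2)/n)^dim. With n=10, dim=3 this is the .488 above.
n = 10
for dim in [3, 10, 100]:
    surface_fraction = 1 - ((n - 2) / n) ** dim
    print(f"dim={dim:3d}  surface fraction={surface_fraction:.3f}")
```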
bizarre shape of high-dimensional balls
What percentage of the volume of B_d falls within ϵ distance of its equator? Answer: all but a 2^Θ(−ϵ²d) fraction. Formally: the volume of the set S = {x ∈ B_d : |x(1)| ≤ ϵ}.
By symmetry, all but a 2^Θ(−ϵ²d) fraction of the volume falls within ϵ of any equator! S = {x ∈ B_d : |⟨x, t⟩| ≤ ϵ} for any unit vector t.
bizarre shape of high-dimensional balls
Claim 1: All but a 2^Θ(−ϵ²d) fraction of the volume of a ball falls within ϵ of any equator.
Claim 2: All but a 2^Θ(−ϵd) fraction falls within ϵ of its surface.
How is this possible? High-dimensional space looks nothing like this picture!
concentration of volume at equator
Claim: All but a 2^Θ(−ϵ²d) fraction of the volume of a ball falls within ϵ of its equator. I.e., in S = {x ∈ B_d : |x(1)| ≤ ϵ}.
Proof Sketch:
• Let x have independent Gaussian N(0, 1) entries and let x̄ = x/∥x∥₂. Then x̄ is selected uniformly at random from the surface of the ball.
• Suffices to show that Pr[|x̄(1)| > ϵ] ≤ 2^Θ(−ϵ²d). Why?
• x̄(1) = x(1)/∥x∥₂. What is E[∥x∥₂²]? E[∥x∥₂²] = Σ_{i=1}^d E[x(i)²] = d, and Pr[∥x∥₂² ≤ d/2] ≤ 2^(−Θ(d)).
• Conditioning on ∥x∥₂² ≥ d/2, since x(1) is normally distributed,
  Pr[|x̄(1)| > ϵ] = Pr[|x(1)| > ϵ · ∥x∥₂] ≤ Pr[|x(1)| > ϵ · √(d/2)] = 2^Θ(−(ϵ√(d/2))²) = 2^Θ(−ϵ²d).
A quick Monte Carlo check is sketched below.
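A Monte Carlo sketch of the claim (the sample size, d, and ϵ are assumptions chosen for illustration): sample points uniformly from the surface of the ball via normalized Gaussians and measure the fraction near the equator.

```python
import numpy as np

rng = np.random.default_rng(3)

d, eps, n = 1000, 0.1, 10_000

# Uniform points on the surface of the unit ball: Gaussian vectors, normalized.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Fraction of samples within eps of the equator {x : |x(1)| <= eps}.
frac = np.mean(np.abs(X[:, 0]) <= eps)
print(f"fraction within eps of the equator: {frac:.4f}")  # close to 1 when eps*sqrt(d) is large
```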
high-dimensional cubes
Let C_d be the d-dimensional cube: C_d = {x ∈ R^d : |x(i)| ≤ 1 ∀i}.
In low dimensions, the cube is not that different from the ball. But the volume of C_d is 2^d while the volume of B_d is π^(d/2)/(d/2)! = 1/d^Θ(d). A huge gap! So something is very different about these shapes...
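A short numerical comparison of the two volume formulas (the dimensions listed are arbitrary):

```python
import math

# Volume of the cube {|x(i)| <= 1} is 2^d; volume of the unit ball is pi^(d/2) / (d/2)!.
for d in [2, 10, 50, 100]:
    cube_vol = 2.0 ** d
    ball_vol = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # (d/2)! via the gamma function
    print(f"d={d:3d}  cube volume={cube_vol:.3e}  ball volume={ball_vol:.3e}")
```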
high-dimensional cubes
Corners of the cube are √d times further away from the origin than the surface of the ball.
high-dimensional cubes
Data generated from the ball B_d will behave very differently than data generated from the cube C_d.
• x ∼ B_d has ∥x∥₂² ≤ 1.
• x ∼ C_d has E[∥x∥₂²] = d/3, and Pr[∥x∥₂² ≤ d/6] ≤ 2^(−Θ(d)).
• Almost all the volume of the unit cube falls in its corners, and these corners lie far outside the unit ball. A quick numerical check is sketched below.
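A minimal sketch (parameters are illustrative) checking the squared norms of uniform samples from the cube C_d:

```python
import numpy as np

rng = np.random.default_rng(4)

d, n = 200, 20_000

# Uniform samples from the cube C_d = [-1, 1]^d.
cube = rng.uniform(-1.0, 1.0, size=(n, d))
cube_sq = np.sum(cube ** 2, axis=1)

print("cube: mean ||x||^2 =", cube_sq.mean(), "(expect about d/3 =", d / 3, ")")
print("cube: fraction with ||x||^2 <= 1 =", np.mean(cube_sq <= 1.0))  # essentially 0
```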
takeaways
• High-dimensional space behaves very differently from low-dimensional space.
• Random projection (i.e., the JL Lemma) reduces to a much lower-dimensional space that is still large enough to capture this behavior on a subset of n points.
• Need to be careful when using low-dimensional intuition for high-dimensional vectors.
• Need to be careful when modeling data as random vectors in high dimensions.