compsci 514: algorithms for data science


  1. compsci 514: algorithms for data science. Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 9.

  2. logistics
     • Problem Set 2 was released on 9/28. Due Friday 10/11.
     • Problem Set 1 should be graded by the end of this week.
     • Midterm on Thursday 10/17. Will cover material through this week, but not material next week (10/8 and 10/10).
     • This Thursday, will have a MAP (Midterm Assessment Process).
     • Someone from the Center for Teaching & Learning will collect feedback from you during the first 20 minutes of class.
     • Will be summarized and relayed to me anonymously, so I can make any adjustments and incorporate suggestions to help you learn the material better.

  3. summary
     Last Class: The Frequent Elements Problem and Misra-Gries summaries.
     • Given a stream of items x_1, . . . , x_n and a parameter k, identify all elements that appear at least n/k times in the stream.
     • Deterministic algorithms: Boyer-Moore majority algorithm and Misra-Gries summaries.
     • Randomized algorithm: Count-Min sketch.
     • Analysis via Markov's inequality and repetition. 'Min trick' similar to the median trick.
     This Class: Randomized dimensionality reduction.
     • The extremely powerful Johnson-Lindenstrauss Lemma and random projection.
     • Linear algebra warm up.

  4. high dimensional data
     'Big Data' means not just many data points, but many measurements per data point. I.e., very high dimensional data.
     • Twitter has 321 million active monthly users. Records (tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.
     • A 3 minute YouTube clip with a resolution of 500 x 500 pixels at 15 frames/second with 3 color channels is a recording of ≥ 2 billion pixel values. Even a single 500 x 500 pixel color image has 750,000 pixel values.
     • The human genome contains 3 billion+ base pairs. Genetic datasets often contain information on 100s of thousands+ of mutations and genetic markers.

  5. datasets as vectors and matrices
     In data analysis and machine learning, data points with many attributes are often stored, processed, and interpreted as high dimensional vectors, with real valued entries. Similarities/distances between vectors (e.g., ⟨x, y⟩, ∥x − y∥_2) have meaning for the underlying datapoints.

  6. datasets as vectors and matrices
     Data points are interpreted as high dimensional vectors, with real valued entries. The dataset is interpreted as a matrix.
     Data Points: x_1, x_2, . . . , x_n ∈ R^d.
     Data Set: X ∈ R^{n×d} with i-th row equal to x_i.
     Many data points n ⟹ tall. Many dimensions d ⟹ wide.
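As a concrete illustration of this convention (not from the slides; sizes and random data are arbitrary), here is a minimal numpy sketch: n data points of dimension d stacked as the rows of an n × d matrix, with the inner product and Euclidean distance from the previous slide computed on two rows.

```python
import numpy as np

n, d = 100, 10_000                    # many data points (n), many dimensions (d)
rng = np.random.default_rng(0)

# Dataset as a matrix: X is n x d, with the i-th row equal to data point x_i.
X = rng.standard_normal((n, d))

x_i, x_j = X[0], X[1]                 # two data points (rows of X)
inner = x_i @ x_j                     # similarity  <x_i, x_j>
dist = np.linalg.norm(x_i - x_j)      # distance    ||x_i - x_j||_2
print(X.shape, inner, dist)
```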

  7. dimensionality reduction
     Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions:
     x_1, x_2, . . . , x_n ∈ R^d → x̃_1, x̃_2, . . . , x̃_n ∈ R^{d′} for d′ ≪ d.
     'Lossy compression' that still preserves important information about the relationships between x_1, . . . , x_n. Generally will not consider directly how well x̃_i approximates x_i.

  8. dimensionality reduction
     Dimensionality reduction is a ubiquitous technique in data science. Compressing data makes it more efficient to work with. May also remove extraneous information/noise.
     • Principal component analysis
     • Latent semantic analysis (LSA)
     • Linear discriminant analysis
     • Autoencoders

  11. low distortion embedding
     Low Distortion Embedding: Given x_1, . . . , x_n ∈ R^d, distance function D, and error parameter ϵ ≥ 0, find x̃_1, . . . , x̃_n ∈ R^{d′} (where d′ ≪ d) and distance function D̃ such that for all i, j ∈ [n]:
     (1 − ϵ) D(x_i, x_j) ≤ D̃(x̃_i, x̃_j) ≤ (1 + ϵ) D(x_i, x_j).
     Have already seen one example in class: MinHash.
     • With a large enough signature size r, (# matching entries in x̃_A, x̃_B) / r ≈ J(x_A, x_B).
     • Reduce dimension from d = |U| to r.
     • Note: here J(x_A, x_B) is a similarity rather than a distance, so not quite a low distortion embedding. But closely related.
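Since MinHash is the worked example here, a minimal sketch may help. This is illustrative only: the salted built-in hash below stands in for the random hash functions used in class, and the set sizes and signature length r are arbitrary. It embeds each set into an r-entry signature and estimates Jaccard similarity as the fraction of matching entries.

```python
import random

def minhash_signature(items, r, seed=0):
    """Embed a set into an r-entry signature: one min-hash per repetition."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(r)]   # r "independent" hash functions
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature entries; approximates J(A, B)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(0, 800))
B = set(range(400, 1000))             # true Jaccard similarity = 400 / 1000 = 0.4
r = 500                               # signature size: dimension after embedding
print(estimated_jaccard(minhash_signature(A, r), minhash_signature(B, r)))
```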

  12. embeddings for euclidean space
     Low Distortion Embedding for Euclidean Space: Given x_1, . . . , x_n ∈ R^d and error parameter ϵ ≥ 0, find x̃_1, . . . , x̃_n ∈ R^{d′} (where d′ ≪ d) such that for all i, j ∈ [n]:
     (1 − ϵ) ∥x_i − x_j∥_2 ≤ ∥x̃_i − x̃_j∥_2 ≤ (1 + ϵ) ∥x_i − x_j∥_2.
     Recall that for z ∈ R^m, ∥z∥_2 = √(∑_{i=1}^m z(i)^2).
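A small helper (illustrative, not from the lecture) makes this guarantee easy to check empirically: compute the worst relative error between embedded and original pairwise distances; the embedding has distortion ϵ exactly when this value is at most ϵ. It is reused in the sketches for the later slides.

```python
import numpy as np
from itertools import combinations

def max_distortion(X, X_tilde):
    """Largest relative error | ||x̃_i − x̃_j|| / ||x_i − x_j|| − 1 | over all pairs i < j.

    Assumes the rows of X are distinct points; X_tilde holds the embedded rows.
    """
    worst = 0.0
    for i, j in combinations(range(len(X)), 2):
        orig = np.linalg.norm(X[i] - X[j])
        emb = np.linalg.norm(X_tilde[i] - X_tilde[j])
        worst = max(worst, abs(emb / orig - 1.0))
    return worst
```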

  13. embeddings for euclidean space
     Low Distortion Embedding for Euclidean Space: Given x_1, . . . , x_n ∈ R^d and error parameter ϵ ≥ 0, find x̃_1, . . . , x̃_n ∈ R^{d′} (where d′ ≪ d) such that for all i, j ∈ [n]:
     (1 − ϵ) ∥x_i − x_j∥_2 ≤ ∥x̃_i − x̃_j∥_2 ≤ (1 + ϵ) ∥x_i − x_j∥_2.
     Can use x̃_1, . . . , x̃_n in place of x_1, . . . , x_n in many applications: clustering, SVM, near neighbor search, etc.
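For example (an illustrative sketch, not from the slides), near neighbor search can be run directly on the compressed vectors: if every pairwise distance is preserved up to (1 ± ϵ), the neighbor found in the embedded space is at most a (1 + ϵ)/(1 − ϵ) factor farther than the true nearest neighbor in the original space.

```python
import numpy as np

def nearest_neighbor(P, q):
    """Index of the row of P closest to row q (excluding q itself)."""
    dists = np.linalg.norm(P - P[q], axis=1)
    dists[q] = np.inf                  # do not return the query point itself
    return int(np.argmin(dists))

# With a low distortion embedding, searching the compressed rows X_tilde
# (in R^{d'}) stands in for searching the original rows X (in R^d):
#   nearest_neighbor(X_tilde, q)   vs.   nearest_neighbor(X, q)
```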

  15. embedding with assumptions
     A very easy case: Assume that x_1, . . . , x_n all lie on the 1st-axis in R^d.
     Set d′ = 1 and x̃_i = x_i(1) (i.e., x̃_i is just a single number).
     • For all i, j: ∥x̃_i − x̃_j∥_2 = √([x_i(1) − x_j(1)]^2) = |x_i(1) − x_j(1)| = ∥x_i − x_j∥_2.
     • An embedding with no distortion from any d into d′ = 1.
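A quick numerical check of this easy case (illustrative; it reuses the max_distortion helper sketched a few slides above): place points on the first axis of R^d, keep only coordinate 1, and observe zero distortion.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 1000
X = np.zeros((n, d))
X[:, 0] = rng.standard_normal(n)      # all points lie on the 1st axis of R^d

X_tilde = X[:, :1]                    # x̃_i = x_i(1): keep only the first coordinate
print(max_distortion(X, X_tilde))     # 0.0 -- an embedding with no distortion
```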

  18. embedding with assumptions
     An easy case: Assume that x_1, . . . , x_n lie in any k-dimensional subspace V of R^d.
     • Let v_1, v_2, . . . , v_k be an orthonormal basis for V and V ∈ R^{d×k} be the matrix with these vectors as its columns.
     • For all i, j, we have x_i − x_j ∈ V and (a good exercise to show)
       ∥x_i − x_j∥_2 = √(∑_{ℓ=1}^k ⟨v_ℓ, x_i − x_j⟩^2) = ∥V^T(x_i − x_j)∥_2.
     • If we set x̃_i ∈ R^k to x̃_i = V^T x_i we have:
       ∥x̃_i − x̃_j∥_2 = ∥V^T x_i − V^T x_j∥_2 = ∥V^T(x_i − x_j)∥_2 = ∥x_i − x_j∥_2.
     • An embedding with no distortion from any d into d′ = k.
     • V^T : R^d → R^k is a linear map giving our dimension reduction.
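An illustrative numpy sketch of this construction (sizes are arbitrary, and max_distortion is the checking helper sketched earlier): generate points inside a random k-dimensional subspace, build an orthonormal basis V for it via QR, and confirm that x̃_i = V^T x_i preserves all pairwise distances.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 30, 1000, 5

# Points in a k-dimensional subspace V of R^d: combinations of k spanning directions.
B = rng.standard_normal((k, d))            # rows of B span the subspace V
X = rng.standard_normal((n, k)) @ B        # each row x_i lies in V

# Orthonormal basis v_1, ..., v_k for V as the columns of a d x k matrix V.
V, _ = np.linalg.qr(B.T)                   # reduced QR of B^T gives orthonormal columns

X_tilde = X @ V                            # row i is (V^T x_i)^T, i.e. x̃_i ∈ R^k
print(max_distortion(X, X_tilde))          # ~0: no distortion, up to floating point error
```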

  19. embedding with no assumptions
     What about when we don't make any assumptions on x_1, . . . , x_n? I.e., they can be scattered arbitrarily around d-dimensional space.
     • Can we find a no-distortion embedding into d′ ≪ d dimensions? No! Requires d′ = d.
     • Can we find an ϵ-distortion embedding into d′ ≪ d dimensions for ϵ > 0, i.e., for all i, j: (1 − ϵ) ∥x_i − x_j∥_2 ≤ ∥x̃_i − x̃_j∥_2 ≤ (1 + ϵ) ∥x_i − x_j∥_2? Yes! Always, with d′ depending on ϵ.
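This is what the Johnson-Lindenstrauss Lemma, previewed in the summary slide, provides via random projection. Below is a minimal illustrative sketch (parameter choices are arbitrary, and max_distortion is the helper sketched earlier): compress arbitrary points with a random Gaussian matrix scaled by 1/√d′ and measure the distortion ϵ that results.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_prime = 50, 10_000, 500       # arbitrary sizes; the lemma ties d' to ϵ and n

X = rng.standard_normal((n, d))       # arbitrary points: no subspace assumption

# Random projection: Pi has i.i.d. N(0, 1/d') entries and x̃_i = Pi @ x_i ∈ R^{d'}.
Pi = rng.standard_normal((d_prime, d)) / np.sqrt(d_prime)
X_tilde = X @ Pi.T                    # rows are the compressed points x̃_i

print(max_distortion(X, X_tilde))     # prints a small empirical ϵ (on the order of 0.1 here)
```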
