B669 Sublinear Algorithms for Big Data
Qin Zhang
An overview of problems
Statistics

Denote the stream by A = a_1, a_2, ..., a_m, where m is the length of the stream, which is unknown at the beginning. Let [n] be the item universe, and let f_i be the frequency of item i in the stream. On seeing a stream update (i, ∆), update f_i ← f_i + ∆ (special case: ∆ ∈ {1, −1}, corresponding to insertions/deletions).

Entropy: the empirical entropy of the data set is
H(A) = Σ_{i ∈ [n]} (f_i / m) log(m / f_i).
App: very useful in "change" (e.g., anomalous events) detection.

Frequency moments: F_p = Σ_{i ∈ [n]} f_i^p.
• F_0: number of distinct items.
• F_1: total number of items.
• F_2: size of the self-join.
General F_p (p > 1) is a good measure of the skewness of the data.
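To pin down these definitions, here is a minimal non-streaming sketch in Python (not from the slides): it keeps exact frequencies for a small universe and evaluates H(A) and F_p from them, whereas the algorithms in this course approximate these quantities in sublinear space.

import math
from collections import defaultdict

def process_stream(stream):
    """stream: iterable of (i, delta) updates; returns the frequency vector f."""
    f = defaultdict(int)
    for i, delta in stream:          # on seeing (i, delta): f_i <- f_i + delta
        f[i] += delta
    return f

def empirical_entropy(f):
    m = sum(f.values())              # total number of items, m = F_1
    return sum((fi / m) * math.log(m / fi) for fi in f.values() if fi > 0)

def frequency_moment(f, p):
    if p == 0:
        return sum(1 for fi in f.values() if fi != 0)   # F_0: number of distinct items
    return sum(fi ** p for fi in f.values())            # F_p = sum_i f_i^p

stream = [(1, 1), (2, 1), (1, 1), (3, 1), (1, -1)]       # insertions and one deletion
f = process_stream(stream)
print(empirical_entropy(f), frequency_moment(f, 0), frequency_moment(f, 2))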
Statistics (cont.)

Heavy hitters: a set of items whose frequency is at least a threshold.
App: popular IP destinations, ...
[Figure: frequency histogram over items 1, ..., 8 of a stream with |A| = m; items with frequency ≥ 0.01m are included as heavy hitters.]

Quantile: the φ-quantile of A is some x such that there are at most φm items of A that are smaller than x and at most (1 − φ)m items of A that are greater than x.

All-quantiles: a data structure from which the φ-quantile for any 0 ≤ φ ≤ 1 can be extracted.
App: distribution of packet sizes, ...
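As a concrete insertion-only baseline for heavy hitters, here is a sketch of the classic Misra-Gries algorithm; it is a standard technique rather than anything stated on this slide, and the number of counters k is a user-chosen parameter. With k counters, every item of frequency greater than m/k survives, and each kept estimate undercounts by at most m/k.

def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement all counters; drop the ones that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = [1, 2, 1, 3, 1, 1, 4, 1, 5, 1]
print(misra_gries(stream, k=3))   # item 1 (frequency 6 > m/k = 10/3) survives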
Statistics (cont.)

L_p sampling: Let x ∈ R^n be a non-zero vector. For p > 0, the L_p distribution corresponding to x is the distribution on [n] that takes i with probability |x_i|^p / ‖x‖_p^p, where ‖x‖_p = (Σ_{i ∈ [n]} |x_i|^p)^{1/p}. In particular, for p = 0, L_0 sampling selects an element uniformly at random from the non-zero coordinates of x.
App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
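A minimal sketch of sampling from the L_p distribution offline, directly from an explicit vector x; the point of streaming L_p samplers is to achieve this without storing x.

import random

def lp_sample(x, p):
    support = [i for i, xi in enumerate(x) if xi != 0]
    if p == 0:
        return random.choice(support)              # uniform over non-zero coordinates
    weights = [abs(x[i]) ** p for i in support]    # Pr[i] = |x_i|^p / ||x||_p^p
    return random.choices(support, weights=weights, k=1)[0]

x = [0.0, 3.0, 0.0, -1.0, 2.0]
print(lp_sample(x, p=1), lp_sample(x, p=0))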
Graphs

Denote the stream by A = a_1, a_2, ..., a_m, where a_i = ((u_i, v_i), insert/delete) and (u_i, v_i) is an edge.

Connectivity: test whether the graph is connected.

Matching: estimate the size of a maximum matching of the graph.

Diameter: compute the diameter of the graph (that is, the maximum distance between two nodes).

Triangle counting: compute the number of triangles in the graph.
App: useful for finding communities in a social network (e.g., the fraction of v's neighbors who are themselves neighbors).
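For concreteness, a brute-force offline triangle counter over an adjacency-set representation (far from sublinear); it only fixes the quantity that the streaming algorithms estimate.

from itertools import combinations

def count_triangles(adj):
    """adj: dict mapping each vertex to the set of its neighbors (undirected)."""
    count = 0
    for v, nbrs in adj.items():
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:                 # u, v, w form a triangle
                count += 1
    return count // 3                       # each triangle is counted at every vertex

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
print(count_triangles(adj))                 # one triangle: {1, 2, 3}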
Graphs (cont.)

Spanner: Given a graph G = (V, E), a subgraph H = (V, E′) is an α-spanner for G if
∀ u, v ∈ V, d_G(u, v) ≤ d_H(u, v) ≤ α · d_G(u, v).
That is, a subgraph that (approximately) maintains pairwise distances.

Graph sparsification: Given a graph G = (V, E), denote the minimum cut of G by λ(G), and by λ_A(G) the capacity of the cut (A, V \ A). A weighted subgraph H = (V, E′, w) is an ε-sparsification for G if
∀ A ⊂ V, (1 − ε) λ_A(G) ≤ λ_A(H) ≤ (1 + ε) λ_A(G).
App: synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that preserves properties of the original graph.
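A small sketch that checks the α-spanner condition by comparing BFS distances in G and in a candidate subgraph H (unweighted graphs assumed); this is a definition check, not a streaming spanner construction.

from collections import deque

def bfs_dist(adj, s):
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def is_alpha_spanner(adj_G, adj_H, alpha):
    for u in adj_G:
        dG, dH = bfs_dist(adj_G, u), bfs_dist(adj_H, u)
        for v, d in dG.items():
            # since H is a subgraph, d_H >= d_G holds automatically;
            # check only d_H(u,v) <= alpha * d_G(u,v) (and reachability)
            if v not in dH or dH[v] > alpha * d:
                return False
    return True

G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}       # a triangle
H = {1: {2}, 2: {1, 3}, 3: {2}}             # path 1-2-3 (edge (1,3) removed)
print(is_alpha_spanner(G, H, alpha=2))      # True: distance 1 stretches to at most 2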
Geometry

Denote the stream by A = a_1, a_2, ..., a_m, where a_i = (location, ins/del).

Earth-mover distance: Given two multisets A, B in the grid [∆]^2 of the same size, the earth-mover distance is the minimum cost of a perfect matching between the points of A and B:
EMD(A, B) = min_{π : A → B a bijection} Σ_{a ∈ A} ‖a − π(a)‖.
App: a good measure of the similarity of two images.

Clustering (k-center): cluster a set of points X = (x_1, x_2, ..., x_m) into clusters c_1, c_2, ..., c_k with representatives r_1 ∈ c_1, r_2 ∈ c_2, ..., r_k ∈ c_k so as to minimize
max_i min_j d(x_i, r_j).
App: (see the wiki page)
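For k-center, a sketch of the classic greedy farthest-point (Gonzalez) 2-approximation, included as an offline baseline; it is not the streaming clustering algorithm of the course, and the Euclidean distance below is only one choice of metric.

def k_center_greedy(points, k, d):
    """points: list of points; d: distance function; returns k representatives."""
    reps = [points[0]]                       # start from an arbitrary point
    while len(reps) < k:
        # pick the point farthest from its nearest chosen representative
        farthest = max(points, key=lambda x: min(d(x, r) for r in reps))
        reps.append(farthest)
    return reps

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
print(k_center_greedy(pts, k=2, d=euclid))   # roughly one center per far-apart cluster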
Strings

Denote the stream by A = a_1, a_2, ..., a_m, where a_i = (i, ins/del).

Distance to sortedness: LIS(A) = length of the longest increasing subsequence of A. DistSort(A) = minimum number of elements that need to be deleted from A to obtain a sorted sequence = |A| − LIS(A).
App: a good measure of network latency.

Edit distance: Given two strings A and B, the minimum number of insertions/deletions/substitutions needed to convert A into B.
App: a standard measure of the similarity of two strings/documents.
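A sketch of computing LIS(A) offline via patience sorting in O(n log n), and DistSort(A) = |A| − LIS(A); the streaming problem asks to approximate these in small space.

import bisect

def lis_length(seq):
    tails = []                                # tails[j] = smallest tail of an increasing subsequence of length j+1
    for x in seq:
        j = bisect.bisect_left(tails, x)      # strictly increasing subsequences
        if j == len(tails):
            tails.append(x)
        else:
            tails[j] = x
    return len(tails)

def dist_to_sorted(seq):
    return len(seq) - lis_length(seq)         # minimum number of deletions

A = [3, 1, 2, 5, 4]
print(lis_length(A), dist_to_sorted(A))       # 3 and 2 (e.g., delete 3 and 5)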
Numerical linear algebra

Denote the stream by A = a_1, a_2, ..., a_n, where a_k = (i, j, ∆) denotes the update M[i, j] ← M[i, j] + ∆, and M[i, j] is the cell in the i-th row, j-th column of the matrix M.

Regression: Given an n × d matrix M and an n × 1 vector b, one seeks x* = argmin_x ‖Mx − b‖_p, for some p ∈ [1, ∞).

Low-rank approximation: Given an n × m matrix M, find an orthonormal n × k matrix L, an orthonormal m × k matrix W, and a diagonal k × k matrix D (k < min{n, m}) such that ‖M − LDW^T‖_F is minimized, where ‖·‖_F is the Frobenius norm.
App: a fundamental problem in many areas, including machine learning, recommendation systems, natural language processing, etc.
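A sketch of both problems solved offline with numpy: l2 regression (p = 2) via least squares, and the best rank-k approximation via the SVD (Eckart-Young); the sketching algorithms developed in the course avoid storing and repeatedly reading M.

import numpy as np

def l2_regression(M, b):
    x, *_ = np.linalg.lstsq(M, b, rcond=None)   # argmin_x ||Mx - b||_2
    return x

def low_rank_approx(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # L = U[:, :k], D = diag(s[:k]), W^T = Vt[:k, :] minimize ||M - L D W^T||_F
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

M = np.random.randn(8, 3)
b = np.random.randn(8)
print(l2_regression(M, b))
print(np.linalg.norm(M - low_rank_approx(M, k=2)))   # Frobenius error of the best rank-2 approximation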
Sliding windows

Sometimes we are only interested in recent items in the stream.
• Time-based sliding window: the w most recent time steps.
• Sequence-based sliding window: the w most recent items.
[Figures: in both cases the CPU queries a RAM-resident summary of the current window.]
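A sketch of maintaining a sequence-based sliding window explicitly with a deque and answering a simple statistic (here, the sum) over the w most recent items; storing the window costs O(w) space, which sliding-window algorithms aim to beat.

from collections import deque

class SlidingWindowSum:
    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.total = 0

    def update(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.w:        # expire the oldest item
            self.total -= self.window.popleft()

    def query(self):
        return self.total

sw = SlidingWindowSum(w=3)
for x in [5, 1, 2, 7, 3]:
    sw.update(x)
print(sw.query())                             # sum of the 3 most recent items: 2 + 7 + 3 = 12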
Lower bounds

What is impossible? Or, what is the limit on the space needed to solve a problem? Usually shown by reductions from communication complexity. (Not covered in this course.)