
Welcome to DS504/CS586: Big Data Analytics (Review)



  1. Pick up a handout on the front table

  2. Welcome to DS504/CS586: Big Data Analytics (Review) • Prof. Yanhua Li • Time: 6:00pm–8:50pm R • Location: KH116 • Fall 2017

  3. Next Session: Final Project Presentation
     • 12/14 R
     • 20 min each team (including Q&A)
     • Team 1, Team 2, Team 3, Team 4, Team 5, Team 6, Team 7
     • Snacks and soft drinks will be provided.

  4. Today
     • 1. Review
       – Key topics and techniques discussed in the semester
       – Future opportunities: Big data analytics, Urban Computing
     • 10 min break, 7:20–7:30PM
     • 3. Team 1 presentation and discussion: 7:30PM
     • 4. Course evaluation: 8:15PM–8:30PM
     • 5. Finish at 8:30PM (last week we finished 18 minutes late)

  5. Introduction: What is “Big Data”?

  6. Big Data Analytics: techniques and tools for managing, analyzing, and extracting knowledge from “big data”

  7. CS586/DS504-2017Fall: course overview
     Layers (top to bottom):
     • 5. Applications: Urban Computing, Social Network Analysis, Networking
     • 4. Big Data Mining: Graph Mining, Data Clustering, Recommender systems, Outlier Detection
     • 3. Data Management: Indexing, Query Processing
     • 2. Data Preprocessing/Cleaning: Error Correction, Map-Matching
     • 1. Data Acquisition & Measurement: Representative data collection (Sampling)
     Techniques:
     • Sampling and index: 1. Graph Mining; 3. Index, Query; 4. Data Collection
     • Clustering: 4. K-means, DBSCAN; 4. BFR, DENCLUE; 4. Trajectory Clustering; 5. Urban: Bike sharing
     • More techniques: 2. Map-Matching; 4. Recommender Systems; 4. Outlier Detection

  8. Big Data Mining Topics
     • 1. Graph Mining: Graph Sampling; Node Importance Ranking; Facebook/Social graph estimation; Social influence; Topic-sensitive PageRank
     • 2. Clustering: Hierarchical; K-means, BFR; DBSCAN, DENCLUE; Trajectory clustering
     • 3. Recommender Systems: Content-Based; Collaborative Filtering (User-User Based, Item-Item Based); Location-based recommender systems; Personalized Geo-Social Recommendation
     • 4. Outlier Detection
     • 5. Big Data Integration (Guest Lecture)

  9. Roadmap
     • 1. Sampling & Indexing
       – Random prefix / region / region zoom-in sampling
       – Index structure: B-Tree, Quad-tree, R-tree, etc.
     • 2. Clustering
       – Hierarchical
       – K-means, BFR
       – DBSCAN, DENCLUE
     • 3. Recommender System, Map-Matching, etc.
     • 4. Applications
       – Social networks
       – Location based services
       – Urban computing
       – and more

  10. Sampling Techniques to Count Population
     • German Tank Problem: Panther tanks, 1943, World War II
     • Estimate the number of German tanks ($N$)
     • This is the problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
     • $m$: the maximum serial number observed
     • $k$: total number of tanks observed
     • Estimator: $\hat{N} = m\,(1 + k^{-1}) - 1$
     • That is, the sample maximum plus the average gap between observations in the sample.
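As a minimal sketch of the estimator on this slide, the function below computes $m(1 + 1/k) - 1$ from a list of observed serial numbers. The function name and the sample serial numbers are made up for illustration.

```python
def german_tank_estimate(serials):
    """Estimate the population size N from serial numbers sampled without replacement."""
    k = len(serials)              # number of tanks observed
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)              # largest serial number observed
    return m * (1 + 1 / k) - 1    # sample maximum plus the average gap


# Hypothetical observations: four captured tanks.
observed = [19, 40, 42, 60]
print(german_tank_estimate(observed))  # 60 * 1.25 - 1 = 74.0
```

With a single observation the estimate is $2m - 1$, which matches the intuition that one serial number sits, on average, in the middle of the range.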

  11. Sampling Techniques to Count Population
     • Mark and recapture: a method commonly used in ecology to estimate an animal population’s size $N$.
     • Step 1: A portion of the population, $K$, is captured, marked, and released.
     • Step 2: Later, another portion, $n$, is captured, and the number of marked individuals within the sample, $k$, is counted.
     • Estimation: $\hat{N} = \frac{Kn}{k}$
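The mark-and-recapture estimate can be written in a few lines. This is a sketch only: the function name and the example counts are illustrative. The reasoning is that the marked fraction in the second sample, $k/n$, should roughly equal the marked fraction in the whole population, $K/N$.

```python
def mark_recapture_estimate(K, n, k):
    """Estimate population size N from K marked, n recaptured, k marked among the recaptured."""
    if k == 0:
        raise ValueError("no marked individuals recaptured; the estimate is undefined")
    return K * n / k


# Hypothetical survey: 100 animals marked, 60 caught later, 12 of those carry a mark.
print(mark_recapture_estimate(K=100, n=60, k=12))  # 100 * 60 / 12 = 500.0
```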

  12. Sampling Big Data
     • 1.1 Random sampling (uniform & independent): vertex sampling, edge sampling
     • 1.2 Crawling: BFS sampling, random walk sampling

  13. 1.1 Random Vertex Sampling & Index
     • One-dimensional data
       – YouTube: Random Prefix Sampling
       – Index structure: B-Tree, List Index
     • Two-dimensional data (spatial data)
       – Google map/Foursquare: Random Region Sampling / Random Region Zoom-in
       – Index structure: Grid-based / Quad Tree / R-Tree
     • Three-dimensional data (spatio-temporal data)
       – Trajectory sampling: Random index sampling
       – Index structure (combinations): B-Tree + Quad-tree, 3-D R-tree

  14. Full B-Tree Structure

  15. Grid-based Spatial Indexing
     • Indexing
       – Partition the space into disjoint and uniform grids
       – Build an index between each grid and the points in the grid
     (Figure: grids g1 and g2 with points p1, p3, p4, and the grid-to-point index.)
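A grid index can be sketched as a dictionary from grid cell to the points that fall in it. The cell size, point names, and coordinates below are assumptions chosen for illustration; they are not taken from the figure.

```python
from collections import defaultdict


def cell_of(x, y, cell_size):
    """Map a coordinate to the (column, row) of the uniform grid cell containing it."""
    return (int(x // cell_size), int(y // cell_size))


def build_grid_index(points, cell_size):
    """Build the index: grid cell -> names of the points inside that cell."""
    index = defaultdict(list)
    for name, (x, y) in points.items():
        index[cell_of(x, y, cell_size)].append(name)
    return index


# Hypothetical points on a grid with 10 x 10 cells.
points = {"p1": (2.0, 3.0), "p3": (7.5, 1.0), "p4": (14.0, 6.0)}
index = build_grid_index(points, cell_size=10)

# A point lookup only inspects one cell instead of scanning every point.
print(index[cell_of(5.0, 5.0, 10)])
```

The disjoint, uniform partition is what makes the cell computation a constant-time arithmetic step.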

  16. Quad-Tree
     • Indexing
       – Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
       – Each non-leaf node divides its region into four equal-sized quadrants.
       – Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).
     (Figure: example space split into quadrants 0–3, with quadrants 0 and 3 split further into 00–03 and 30–33, next to the corresponding tree.)
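The three properties above can be sketched as a small point quad-tree. The class and parameter names are illustrative, the regions are treated as half-open rectangles, and a `max_depth` guard (an addition not on the slide) stops endless splitting when several identical points arrive.

```python
class QuadTree:
    """Point quad-tree: each leaf holds at most `capacity` points."""

    def __init__(self, x0, y0, x1, y1, capacity=1, depth=0, max_depth=20):
        self.bounds = (x0, y0, x1, y1)   # region covered by this node
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.points = []                 # only used while the node is a leaf
        self.children = None             # four sub-quadrants after a split

    def contains(self, p):
        x0, y0, x1, y1 = self.bounds
        return x0 <= p[0] < x1 and y0 <= p[1] < y1

    def insert(self, p):
        if not self.contains(p):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append(p)
                return True
            self._split()
        return any(child.insert(p) for child in self.children)

    def _split(self):
        """Divide the region into four equal quadrants and push the points down."""
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(x0, y0, mx, my, *args), QuadTree(mx, y0, x1, my, *args),
            QuadTree(x0, my, mx, y1, *args), QuadTree(mx, my, x1, y1, *args),
        ]
        old, self.points = self.points, []
        for q in old:
            any(child.insert(q) for child in self.children)

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Hypothetical points in a 4 x 4 space, leaf capacity 1 as in the slide's example.
tree = QuadTree(0, 0, 4, 4)
for point in [(1, 1), (3, 1), (3.5, 3.5), (2.5, 2.5)]:
    tree.insert(point)
print(len(tree.leaves()), "leaves")
```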

  17. Random Vertex Sampling: YouTube Comments from other YouTube users

  18. YouTube Video ID Space

  19. Random Prefix Sampling
     • Let $p_L$ denote the probability that a randomly generated id matches a given prefix of length $L$:
       $p_L = 1/|S|^L = 1/64^L$, if $L = 1, \dots, 10$
       $p_L = 1/(|S|^{10}\,|T|) = 1/(64^{10} \cdot 16)$, if $L = 11$
     • Generate $m$ prefixes of length $L$.
     • Let $X_i^L$ be the total number of videos with prefix $i$ of length $L$, and $N$ the total number of videos; then $X_i^L \sim \mathrm{Binomial}(N, p_L)$.

  20. Unbiased Estimator for the Total Number of Videos
     • Given $m$ samples $X_i^L$ obtained by querying randomly generated prefixes of the same length $L \in [1, 11]$, we have the unbiased estimator of the total number of videos:
       $$\hat{N} = \frac{1}{m\,p_L} \sum_{i=1}^{m} X_i^L$$
     (See the paper for the confidence interval and variance.)
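The estimator above can be sketched as follows. The constants $|S| = 64$ and $|T| = 16$ come from the previous slide; the function names and the toy "catalogue" (which only models the first id character) are assumptions for illustration, not the real YouTube id space.

```python
import random

S_SIZE = 64   # |S|: symbols allowed in each of the first 10 positions
T_SIZE = 16   # |T|: symbols allowed in the 11th position


def prefix_probability(L):
    """p_L: probability that a random id matches a given prefix of length L."""
    if not 1 <= L <= 11:
        raise ValueError("prefix length must be in [1, 11]")
    if L <= 10:
        return 1 / S_SIZE ** L
    return 1 / (S_SIZE ** 10 * T_SIZE)


def estimate_total(counts, L):
    """N_hat = (1 / (m * p_L)) * sum of X_i over the m sampled prefixes."""
    m = len(counts)
    return sum(counts) / (m * prefix_probability(L))


# Toy catalogue: N ids, of which only the first symbol is simulated.
rng = random.Random(0)
N = 100_000
per_prefix = [0] * S_SIZE
for _ in range(N):
    per_prefix[rng.randrange(S_SIZE)] += 1

# Query m = 20 random prefixes of length 1 and scale the counts up.
sampled = [per_prefix[rng.randrange(S_SIZE)] for _ in range(20)]
print(round(estimate_total(sampled, L=1)), "estimated vs", N, "actual")
```

Scaling by $1/(m\,p_L)$ is what makes the estimate unbiased: each query returns $N p_L$ videos in expectation.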

  21. Practical Issues

  22. Simple random region sampling
     • Tabulating stage (estimation): unbiased estimator of the total number of venues $N$, built from the number of venues observed in each sampled region.
     • Please refer to the paper for proof of the unbiasedness, confidence interval, and estimator design of other statistics.

  23. Random Region Zoom-in on Maps
     • RRZI($A$): At each step, RRZI divides the current queried region into two sub-regions and randomly selects a non-empty sub-region to zoom in on, as long as the region contains more than $k$ PoIs ($k = 5$).
     (Figure: Steps 1–4 of the zoom-in, annotated with the probability of sampling the sub-region.)

  24. Random Region Zoom-in on Maps
     • RRZI and RRZIC can be viewed as weighted sampling methods.
     • Estimators of sum and distribution aggregates are built from the sampled sub-regions $r_1, \dots, r_m$ and the probability of sampling each sub-region $r_i$.
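The zoom-in procedure and its weighted-sampling view can be sketched as below. This is not the paper's implementation: the splitting rule (halve the longer side), the uniform choice among non-empty halves, and the inverse-probability estimator of the PoI count are assumptions made for this illustration, and the points are assumed to be distinct.

```python
import random


def points_in(points, region):
    x0, y0, x1, y1 = region
    return [p for p in points if x0 <= p[0] < x1 and y0 <= p[1] < y1]


def rrzi_sample(points, region, k, rng):
    """Zoom in until the region holds at most k PoIs; return (PoIs, sampling probability)."""
    inside = points_in(points, region)
    prob = 1.0
    while len(inside) > k:
        x0, y0, x1, y1 = region
        if x1 - x0 >= y1 - y0:                       # split the longer side in half
            mid = (x0 + x1) / 2
            halves = [(x0, y0, mid, y1), (mid, y0, x1, y1)]
        else:
            mid = (y0 + y1) / 2
            halves = [(x0, y0, x1, mid), (x0, mid, x1, y1)]
        non_empty = [(h, points_in(inside, h)) for h in halves]
        non_empty = [(h, pts) for h, pts in non_empty if pts]
        region, inside = rng.choice(non_empty)       # pick a non-empty sub-region
        prob /= len(non_empty)                       # track the sampling probability
    return inside, prob


def estimate_poi_count(points, region, k, m, rng):
    """Average of (PoIs in sampled sub-region) / (its sampling probability) over m samples."""
    samples = [rrzi_sample(points, region, k, rng) for _ in range(m)]
    return sum(len(pts) / prob for pts, prob in samples) / m


# Hypothetical PoIs in a 4 x 4 map, with k = 1.
pois = [(1, 1), (3, 1), (1, 3), (3, 3)]
print(estimate_poi_count(pois, (0, 0, 4, 4), k=1, m=10, rng=random.Random(1)))
```

Dividing each sampled count by its sampling probability is the usual way to correct a weighted sample: sub-regions that are easy to reach are down-weighted accordingly.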

  25. Motivation & Problem Definition
     • Query $q$ covers $n$ index leaf nodes.
     • How can we sample $B$ index leaf nodes to estimate the number of trajectories in $q$ with a guaranteed error bound?

  26. Random Index Sampling
     (Figure, two panels. "Sampling and Estimation": $B$ sampled index leaf nodes, each with its trajectory list (r1, r2, …) and the occurrence time $k$ of each trajectory in query $q$. "Data Indexing Structure": spatio-temporally (ST) indexed data with axes Lat, Lng, and Time, query $q$, and an inverted index from each trajectory r1, r2, r3, … to its index leaf node list.)

  27. Random Index Sampling
     • Stage 1, Sampling: uniformly at random sample $B$ index leaf nodes with replacement.
     • Stage 2, Estimation: an unbiased estimator of the number of trajectories in $q$.
     • Convergence analysis: the bound depends on the maximum number of trajectories in an index leaf node.
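The two stages can be sketched as follows. The slide's estimator formula is not reproduced in the text, so the one used here is an assumption: each trajectory in a sampled leaf node contributes $1/k$, where $k$ is the number of leaf nodes of $q$ it occurs in, and the total is scaled by $n/B$. The occurrence counts are computed by a full scan only to keep the sketch short; in the slides they come from the inverted index.

```python
import random


def estimate_trajectory_count(leaf_nodes, B, rng):
    """Estimate the number of distinct trajectories across the n leaf nodes covered by q.

    leaf_nodes: list of sets of trajectory ids, one set per index leaf node.
    """
    n = len(leaf_nodes)

    # Occurrence count k of each trajectory (stand-in for the inverted index lookup).
    occurrences = {}
    for node in leaf_nodes:
        for traj in node:
            occurrences[traj] = occurrences.get(traj, 0) + 1

    # Stage 1: sample B leaf nodes uniformly at random, with replacement.
    # Stage 2: weight each trajectory by 1/k so it counts once over all its nodes.
    total = 0.0
    for _ in range(B):
        node = rng.choice(leaf_nodes)
        total += sum(1 / occurrences[traj] for traj in node)
    return n * total / B


# Hypothetical query covering four leaf nodes and five distinct trajectories.
nodes = [{"r1", "r2"}, {"r3", "r5"}, {"r1", "r3"}, {"r2", "r4"}]
print(estimate_trajectory_count(nodes, B=200, rng=random.Random(0)))
```

Summed over all $n$ leaf nodes, the $1/k$ weights add up to exactly the number of distinct trajectories, which is why scaling a uniform sample by $n/B$ gives an unbiased estimate under these assumptions.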

  28. 1.2 Crawling-based Sampling
     • Undirected graphs!
     (Figure: example graph with nodes 1–6.)

  29. (1) Breadth-First-Search (BFS)
     • Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively without replacement.
     • BFS leads to bias towards high-degree nodes (Lee et al, “Statistical properties of Sampled Networks”, Phys Review E, 2006).
     • Early measurement studies of OSNs use BFS as the primary sampling technique, e.g. [Mislove et al], [Ahn et al], [Wilson et al.].
     (Figure: example graph with nodes A–H marked as Unexplored, Explored, or Visited. Credit: Minas Gjoka, UC Irvine, Walking in Facebook.)
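A BFS crawl with a fixed sampling budget can be sketched as below. The adjacency-list graph is a small made-up example and the `budget` parameter (assumed to be at least 1) is an illustrative name.

```python
from collections import deque


def bfs_sample(graph, seed, budget):
    """Visit nodes in breadth-first order from `seed` until `budget` nodes are collected."""
    visited = [seed]          # sample, in visiting order
    seen = {seed}             # "without replacement": never revisit a node
    queue = deque([seed])
    while queue and len(visited) < budget:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                visited.append(v)
                queue.append(v)
                if len(visited) == budget:
                    break
    return visited


# Hypothetical undirected graph as adjacency lists.
graph = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
print(bfs_sample(graph, seed=2, budget=3))  # [2, 1, 3]
```

High-degree nodes have more neighbors through which they can be discovered, so they are reached earlier in the crawl.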

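  30. Random Walk (undirected graph on nodes 1–4)
     • Adjacency matrix $A$ (symmetric) and degree matrix $D$:
       $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
     • Transition probability matrix, with $P_{ij} = 1/k_i$ when $j$ is a neighbor of $i$ (each row sums to 1, so the row-stochastic form is $D^{-1}A$):
       $$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$
     • $|E|$: number of links
     • Stationary distribution: $\pi_i = \frac{d_i}{2|E|}$, where $d_i = k_i$ is the degree of node $i$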
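The matrices on this slide can be checked numerically. The sketch below builds the row-stochastic transition matrix from the 4-node adjacency matrix and verifies that the degree-proportional distribution is stationary; the function names are illustrative and the graph is assumed to have no isolated nodes.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node undirected example graph.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def transition_matrix(A):
    """P[i][j] = A[i][j] / k_i: move to a uniformly chosen neighbor."""
    return [[Fraction(a, sum(row)) for a in row] for row in A]


def degree_stationary(A):
    """pi_i = d_i / (2|E|); the degrees sum to 2|E| in an undirected graph."""
    degrees = [sum(row) for row in A]
    return [Fraction(d, sum(degrees)) for d in degrees]


def step(pi, P):
    """One step of the walk on distributions: (pi P)_j = sum_i pi_i P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(A)
pi = degree_stationary(A)
print([str(x) for x in pi])   # degrees 3, 2, 3, 2 over 2|E| = 10
print(step(pi, P) == pi)      # stationarity check: pi P == pi
```

Because the stationary probability grows with the degree, a plain random walk oversamples high-degree nodes, which is the bias the next slide corrects.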

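  31. Metropolis-Hastings Random Walk (same undirected graph on nodes 1–4)
     • Adjacency matrix $A$ (symmetric) and degree matrix $D$:
       $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
     • Transition probability matrix:
       $$P^{MH}_{u,w} = \begin{cases} \frac{1}{k_u} \min\!\left(1, \frac{k_u}{k_w}\right) & \text{if } w \text{ is a neighbor of } u \\ 1 - \sum_{y \neq u} P^{MH}_{u,y} & \text{if } w = u \end{cases}$$
       $$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$
     • $|E|$: number of links
     • Stationary distribution: $\pi_u = \frac{1}{|V|}$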
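The Metropolis-Hastings transition rule can be checked on the same 4-node graph. This is a sketch with illustrative function names; it builds the matrix from the rule (move to neighbor $w$ with probability $\frac{1}{k_u}\min(1, k_u/k_w)$, otherwise stay) and verifies that the uniform distribution is stationary.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node undirected example graph.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def mh_transition_matrix(A):
    """Metropolis-Hastings random walk targeting the uniform distribution over nodes."""
    n = len(A)
    k = [sum(row) for row in A]                       # node degrees
    P = [[Fraction(0)] * n for _ in range(n)]
    for u in range(n):
        for w in range(n):
            if u != w and A[u][w]:
                # Propose neighbor w with prob 1/k_u, accept with min(1, k_u/k_w).
                P[u][w] = Fraction(1, k[u]) * min(1, Fraction(k[u], k[w]))
        P[u][u] = 1 - sum(P[u])                       # rejected proposals stay at u
    return P


def step(pi, P):
    """(pi P)_j = sum_i pi_i P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


P_mh = mh_transition_matrix(A)
uniform = [Fraction(1, len(A))] * len(A)
for row in P_mh:
    print([str(x) for x in row])
print(step(uniform, P_mh) == uniform)                 # uniform is stationary
```

The self-loops on the degree-2 nodes are the rejected moves towards degree-3 neighbors; they are what removes the degree bias of the plain random walk.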

  32. 2. Clustering • 1. Hierarchical • 2. K-means -> BFR • 3. DBSCAN -> DENCLUE

  33. Example: Hierarchical clustering
     • Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3)
     • Centroids (x): (1.5,1.5), (1,1), (4.5,0.5), (4.7,1.3)
     (Figure: scatter plot of the data with the centroids of the merged clusters, next to the dendrogram.)
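The centroids listed on this slide can be reproduced with a small agglomerative procedure. The sketch assumes the merge rule "repeatedly merge the two clusters whose centroids are closest", with ties broken by list position; the function names are illustrative.

```python
def centroid(cluster):
    """Mean of the points in a cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def agglomerate(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the centroid of each merged cluster, in merge order.
    """
    clusters = [[p] for p in points]
    merged_centroids = []
    while len(clusters) > 1:
        _, i, j = min(
            (sq_dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merged_centroids.append(centroid(merged))
    return merged_centroids


data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for c in agglomerate(data):
    print(round(c[0], 1), round(c[1], 1))
```

The first four merges give the centroids on the slide: (1.5, 1.5), (4.5, 0.5), (1, 1), and (4.7, 1.3) after rounding. The fifth merge joins the two remaining clusters and corresponds to the root of the dendrogram.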

  34. Example: K-means
     • Clusters after round 1
     (Figure: scatter plot of data points and centroids. Credit: J. Leskovec, A. Rajaraman, J.)

  35. Example: K-means
     • Clusters after round 2
     (Figure: scatter plot of data points and centroids. Credit: J. Leskovec, A. Rajaraman, J.)
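The rounds shown on these two slides correspond to repeating "assign each point to its nearest centroid, then recompute the centroids". Below is a minimal sketch with made-up data and initial centroids; `rounds` is assumed to be at least 1, and an empty cluster simply keeps its previous centroid.

```python
def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def mean(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def kmeans(points, centroids, rounds=10):
    """Run K-means rounds from the given initial centroids until they stop moving."""
    for _ in range(rounds):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups, K = 2.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids, which is one reason the course also covers variants such as BFR and density-based methods.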
