CS535 Big Data 4/22/2020 Week 13-A Sangmi Lee Pallickara
CS435 Introduction to Big Data - Spring 2016

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 5: ALGORITHMIC TECHNIQUES FOR BIG DATA
Sangmi Lee Pallickara
Computer Science, Colorado State University

FAQs
• Your disk quota is 20GB (per student)
• If you need more space, please let me know ASAP

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

Topics of Today's Class
• Part 1: Counting Triangles (from the last lecture)
• Part 2: Locality Sensitive Hashing

GEAR Session 4. Large Scale Recommendation Systems and Social Media
Lecture 4. Social Network Analysis
Counting Triangles

Why Count Triangles?: Probability for Random Graphs
• If we start with n nodes and add m edges to a graph at random, there will be an expected number of triangles in the graph
• There are $\binom{n}{3}$ sets of three nodes
  • Approximately $n^3/6$ sets of three nodes that might be a triangle
• The probability of an edge between any two given nodes being added
  • $m / \binom{n}{2}$
  • Approximately $2m/n^2$
• The probability that any set of three nodes has edges between each pair
  • If those edges are independently chosen to be present or absent
  • Approximately $(2m/n^2)^3 = 8m^3/n^6$
• Thus, the expected number of triangles in a graph of n nodes and m randomly selected edges
  • Approximately $(8m^3/n^6)(n^3/6) = \frac{4}{3}(m/n)^3$
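
The random-graph estimate can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes, as the slide does, that edges are present independently with probability m / C(n, 2). It compares the exact binomial expression with the (4/3)(m/n)^3 approximation.

```python
from math import comb


def expected_triangles_exact(n: int, m: int) -> float:
    """C(n, 3) triples, each a triangle with probability p**3,
    where p = m / C(n, 2) is the chance that a given edge is present."""
    p = m / comb(n, 2)
    return comb(n, 3) * p ** 3


def expected_triangles_approx(n: int, m: int) -> float:
    """Large-n approximation: (8 m^3 / n^6) * (n^3 / 6) = (4/3) (m/n)^3."""
    return (4.0 / 3.0) * (m / n) ** 3


if __name__ == "__main__":
    n, m = 1000, 5000
    print(expected_triangles_exact(n, m), expected_triangles_approx(n, m))
```

For n much larger than 1 the two values agree closely, which is why the slides use the simpler closed form.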

Why Count Triangles?: How about Social Network Graph?
• If a graph is a social network graph,
  • n nodes (n users)
  • m edges (m pairs of friends)
• Do we expect the number of triangles to be
  a. The same as the value for a random graph
  b. Much greater than the value for a random graph
  c. Much smaller than the value for a random graph
• Answer: b. Why? If A and B are friends, and A is also a friend of C, there should be a much greater chance than average that B and C are also friends
• Counting the number of triangles helps us to measure the extent to which a graph looks like a social network

What Else Can We Do with Counting Triangles?
• Counting the number of triangles helps us to measure the extent to which a graph looks like a social network
• It also shows some characteristics of social networks
  • E.g., the age of a community is related to the density of triangles

An Algorithm for Finding Triangles
• Suppose we have a graph of n nodes and m (≥ n) edges. For convenience, assume the nodes are integers 1, 2, . . . , n
• Heavy hitter
  • A node whose degree is at least $\sqrt{m}$
• Heavy-hitter triangle
  • A triangle all three of whose nodes are heavy hitters
• Note that the number of heavy-hitter nodes is no more than $2\sqrt{m}$
  • Otherwise, the sum of the degrees of the heavy-hitter nodes would be more than 2m
  • Each edge contributes to the degree of only two nodes, so the degrees of all nodes sum to exactly 2m

1. Preparing the Data Structures
• Step 1. Compute the degree of each node
  • Examine each edge and add 1 to the count of each of its two nodes
  • The total time required is O(m)
• Step 2. Create an index on edges, with the pair of nodes at its ends as the key
  • Given two nodes, the index tells us whether the edge between them exists
  • A hash table suffices
  • It can be built in O(m) time
  • The expected time to answer a query about the existence of an edge is constant
• Step 3. Create another index of edges, this one with key equal to a single node
  • Given a node v, we can retrieve the nodes adjacent to v in time proportional to the number of those nodes
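
The three preparation steps can be sketched in a single pass over the edge list. This is a minimal illustration, not the course's reference code: the names `edge_key`, `build_indexes`, and `heavy_hitters` are hypothetical, and the graph is assumed to be simple (undirected, no self-loops, no duplicate edges).

```python
from collections import defaultdict
from math import sqrt


def edge_key(u, v):
    """Canonical key for an undirected edge, so (u, v) and (v, u) match."""
    return (u, v) if u < v else (v, u)


def build_indexes(edges):
    """One O(m) pass that builds the degree counts (Step 1), the edge index
    keyed by node pair (Step 2), and the index keyed by single node (Step 3)."""
    degree = defaultdict(int)
    edge_index = set()            # hash-based: expected O(1) membership test
    adjacency = defaultdict(set)  # node -> set of adjacent nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        edge_index.add(edge_key(u, v))
        adjacency[u].add(v)
        adjacency[v].add(u)
    return degree, edge_index, adjacency


def heavy_hitters(degree, m):
    """Nodes whose degree is at least sqrt(m); compared as d*d >= m
    to avoid floating-point rounding."""
    return {v for v, d in degree.items() if d * d >= m}


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
    degree, edge_index, adjacency = build_indexes(edges)
    hh = heavy_hitters(degree, len(edges))
    # The slide's bound: at most 2 * sqrt(m) heavy hitters.
    print(sorted(hh), len(hh) <= 2 * sqrt(len(edges)))
```

A Python `set` and `dict` play the role of the hash tables mentioned in Steps 2 and 3.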

2. Sorting Nodes
• Sorting nodes
  • First criterion: by degree
  • Second criterion: if v and u have the same degree, recall that both v and u are integers, so order them numerically
• Therefore, we say v ≺ u if and only if either
  1) The degree of v is less than the degree of u, or
  2) The degrees of u and v are the same, and v < u

3. Counting Triangles [1/2]
• Heavy-Hitter Triangles
  • There are only $O(\sqrt{m})$ heavy-hitter nodes
  • There are $O(m^{3/2})$ possible heavy-hitter triangles, and using the index on edges we can check whether all three edges exist in O(1) time. Therefore, $O(m^{3/2})$ time is needed to find all the heavy-hitter triangles
• Other Triangles
  • Consider each edge $(v_1, v_2)$
  • If both $v_1$ and $v_2$ are heavy hitters, ignore this edge
  • Suppose that $v_1$ is not a heavy hitter and moreover $v_1 \prec v_2$
  • Let $u_1, u_2, \ldots, u_k$ be the nodes adjacent to $v_1$
  • Note that $k < \sqrt{m}$
  • We can find these nodes, using the index on nodes, in O(k) time, which is surely $O(\sqrt{m})$ time
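
The sketch below puts the ordering ≺ and both counting phases together, including the per-neighbor check that finishes the "other triangles" step. It is an illustration under stated assumptions, not a definitive implementation: the function name `count_triangles` is hypothetical, and the graph is assumed simple and undirected. One deliberate deviation is labeled in the code: besides v1 ≺ u, the sketch also requires v2 ≺ u, because otherwise a triangle whose smallest node is v1 would be reported from both edges incident to v1.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Count triangles with the heavy-hitter scheme, O(m^(3/2)) overall."""
    edges = list(edges)
    m = len(edges)
    degree = defaultdict(int)
    adjacency = defaultdict(set)
    edge_index = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[u].add(v)
        adjacency[v].add(u)
        edge_index.add((min(u, v), max(u, v)))

    def has_edge(a, b):
        return (min(a, b), max(a, b)) in edge_index

    def precedes(a, b):
        # a ≺ b: lower degree first, ties broken by node number.
        return (degree[a], a) < (degree[b], b)

    # Heavy hitters have degree >= sqrt(m), tested as degree^2 >= m.
    heavy = {v for v in degree if degree[v] ** 2 >= m}

    count = 0
    # Phase 1: triangles whose three nodes are all heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if has_edge(a, b) and has_edge(b, c) and has_edge(a, c):
            count += 1

    # Phase 2: every other triangle has at least one non-heavy node.
    for x, y in edges:
        if x in heavy and y in heavy:
            continue
        v1, v2 = (x, y) if precedes(x, y) else (y, x)  # v1 is not heavy
        for u in adjacency[v1]:
            # Assumption added in this sketch: v2 ≺ u as well as v1 ≺ u,
            # so each triangle is counted from exactly one edge.
            if precedes(v1, u) and precedes(v2, u) and has_edge(u, v2):
                count += 1
    return count


if __name__ == "__main__":
    k4 = list(combinations([1, 2, 3, 4], 2))
    print(count_triangles(k4 + [(1, 5), (2, 5)]))
```

Since v1 ≺ v2 and v1 is not a heavy hitter whenever the edge is not skipped, the inner loop visits fewer than √m neighbors per edge, which gives the O(m^(3/2)) bound for the second phase.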

3. Counting Triangles [2/2]
• Other Triangles (continued)
  • For each $u_i$ we can use the first index to check whether edge $(u_i, v_2)$ exists in O(1) time
  • We can also determine the degree of $u_i$ in O(1) time, because we have counted all the nodes' degrees
  • We count the triangle $\{v_1, v_2, u_i\}$ if and only if the edge $(u_i, v_2)$ exists, and $v_1 \prec u_i$
  • A triangle should be counted only once
    • $v_1$ is the node of the triangle that precedes both other nodes of the triangle according to the ≺ ordering
    • Note: as stated, the test also succeeds from the other edge at $v_1$ (with the roles of $v_2$ and $u_i$ swapped); additionally requiring $v_2 \prec u_i$ leaves exactly one count per triangle
  • Time to process all the nodes adjacent to $v_1$ is $O(\sqrt{m})$
  • Since there are m edges, the total time spent counting other triangles is $O(m^{3/2})$
• The time to find heavy-hitter triangles is $O(m^{3/2})$ and so is the time to find the other triangles
  • Thus, the total time of the algorithm is $O(m^{3/2})$

GEAR Session 5. Algorithmic Techniques for Big Data
Lecture 1. Locality Sensitive Hashing
Introduction

Traditional hash functions
• Cryptographic hash function (e.g. SHA-1)
  • Should be difficult to reverse
• Designed to map a data item to an integer that can be used to look up a particular bucket within a hash table
• Key properties for the non-cryptographic hash functions
  • Efficiently computable
  • Should uniformly distribute the keys
• Two inputs produce hash outputs that are either different or the same, based on key properties of the inputs

Locality-sensitive hash functions
• Hash value collisions are more likely for two input values that are “close” together than for inputs that are far apart
• Many different definitions regarding the “closeness”
  • Neighboring
  • Similarity
  • …
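
The contrast between the two kinds of hash function can be seen in a few lines. The sketch below is only an illustration: SHA-1 comes from Python's standard `hashlib`, while the locality-sensitive side uses random-hyperplane hashing, one common LSH family for angular closeness that is chosen here as an assumption and is not taken from these slides. All function names are hypothetical.

```python
import hashlib
import random


def sha1_hex(text: str) -> str:
    """Traditional hash: a one-character change gives an unrelated digest."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_planes(dim: int, n_bits: int, seed: int = 0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on.
    Vectors separated by a small angle agree on most bits."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(sig_a, sig_b) -> int:
    """Number of bit positions where two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


if __name__ == "__main__":
    print(sha1_hex("big data"))
    print(sha1_hex("big data!"))
    planes = make_planes(dim=3, n_bits=16)
    a = [1.0, 2.0, 3.0]
    near = [1.0, 2.0, 3.1]
    far = [-1.0, -2.0, -3.0]
    print(hamming(signature(a, planes), signature(near, planes)))
    print(hamming(signature(a, planes), signature(far, planes)))
```

Here "closeness" means a small angle between vectors; other definitions of closeness call for other hash families.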

Finding the most similar documents
• Measuring similarity between pairs of documents
  • Extremely expensive
• Example
  • 1M documents, signatures of length 250 (4 bytes each)
  • 1M x 1,000 bytes = 1GB
  • Number of comparisons = $\binom{10^6}{2}$ (about half a trillion pairs)
  • 1 µs per calculation of similarity
  • About 6 days to complete the computation ($5 \times 10^{11}$ pairs x 1 µs ≈ $5 \times 10^5$ s)
• Do we need to calculate the similarity for all of the pairs?

Distance measure
• A distance measure over a space is a function d(x,y) that takes two points in the space as arguments and produces a real number that satisfies the following axioms:
  1. d(x,y) ≥ 0 (no negative distance)
  2. d(x,y) = 0 if and only if x = y
  3. d(x,y) = d(y,x) (distance is symmetric)
  4. d(x,y) ≤ d(x,z) + d(z,y) (the triangle inequality)
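
The four axioms can be spot-checked on a finite set of points. The sketch below is a hypothetical helper, not part of the lecture: it can refute a candidate function on the sample it is given but cannot prove that the function is a distance measure in general. Euclidean distance (via the standard `math.dist`) passes, while squared Euclidean distance is included as an assumed counterexample that breaks the triangle inequality.

```python
from itertools import product
from math import dist, isclose


def satisfies_distance_axioms(d, points, tol=1e-9):
    """Check the four axioms for d on every pair and triple of sample points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                              # 1. non-negativity
            return False
        if (abs(d(x, y)) <= tol) != (x == y):           # 2. zero iff x == y
            return False
        if not isclose(d(x, y), d(y, x), abs_tol=tol):  # 3. symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:           # 4. triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Not a distance measure: it violates the triangle inequality."""
    return dist(x, y) ** 2


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
    print(satisfies_distance_axioms(dist, pts))
    print(satisfies_distance_axioms(squared_euclidean, pts))
```

With the points (0,0), (1,0), (2,0), the squared version gives 4 for the direct path but 1 + 1 through the midpoint, which is the violation the check detects.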