CS535 Big Data 4/22/2020 Week 13-A Sangmi Lee Pallickara
CS435 Introduction to Big Data - Spring 2016

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 5: ALGORITHMIC TECHNIQUES FOR BIG DATA
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• Your disk quota is 20GB (per student)
• If you need more space, please let me know ASAP

Spring 2020 Colorado State University

Topics of Today's Class
• Part 1: Counting Triangles (from the last lecture)
• Part 2: Locality Sensitive Hashing

GEAR Session 4. Large Scale Recommendation Systems and Social Media
Lecture 4. Social Network Analysis
Counting Triangles

Why Count Triangles?: Probability for Random Graphs
• If we start with n nodes and add m edges to a graph at random, there will be an expected number of triangles in the graph
• There are $\binom{n}{3}$ sets of three nodes that might be a triangle
  • Approximately $n^3/6$
• The probability of an edge between any two given nodes being added
  • $m/\binom{n}{2}$
  • Approximately $2m/n^2$
• The probability that any set of three nodes has edges between each pair, if those edges are independently chosen to be present or absent
  • Approximately $(2m/n^2)^3 = 8m^3/n^6$
• Thus, the expected number of triangles in a graph of n nodes and m randomly selected edges
  • Approximately $(8m^3/n^6)(n^3/6) = \frac{4}{3}(m/n)^3$
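The random-graph estimate above can be checked numerically. The following is a minimal sketch, not part of the lecture: the function names and the sample values of n and m are illustrative assumptions. It compares the count obtained from the binomial coefficients with the large-n approximation $\frac{4}{3}(m/n)^3$.

```python
from math import comb


def expected_triangles_exact(n: int, m: int) -> float:
    """Expected triangles when every pair of nodes is an edge with
    probability p = m / C(n, 2), treating edges as independent
    (the simplifying assumption used in the estimate)."""
    p = m / comb(n, 2)
    return comb(n, 3) * p ** 3


def expected_triangles_approx(n: int, m: int) -> float:
    """Large-n approximation: (8 m^3 / n^6) * (n^3 / 6) = (4/3) (m/n)^3."""
    return (4.0 / 3.0) * (m / n) ** 3


# Hypothetical sizes chosen only for illustration.
n, m = 1_000_000, 5_000_000
print(expected_triangles_exact(n, m))
print(expected_triangles_approx(n, m))

# For a tiny graph the approximation is visibly off: the complete graph
# on 4 nodes (n = 4, m = 6) has exactly 4 triangles.
print(expected_triangles_exact(4, 6), expected_triangles_approx(4, 6))
```

For large n the two values agree closely; the approximation is only meant for that regime.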

Why Count Triangles?: How about Social Network Graph?
• If a graph is a social network graph,
  • n nodes (n users)
  • m edges (m pairs of friends)
• Do we expect the number of triangles to be
  a. The same as the value for a random graph
  b. Much greater than the value for a random graph
  c. Much smaller than the value for a random graph
• Answer: b. Why? If A and B are friends, and A is also a friend of C, there should be a much greater chance than average that B and C are also friends
• Counting the number of triangles helps us to measure the extent to which a graph looks like a social network

What Else with Counting Triangles?
• Counting the number of triangles helps us to measure the extent to which a graph looks like a social network
• It also shows some characteristics of social networks
  • E.g. the age of a community is related to the density of triangles

An Algorithm for Finding Triangles
• Suppose we have a graph of n nodes and m (≥ n) edges. For convenience, assume the nodes are integers 1, 2, ..., n
• Heavy hitter
  • A node whose degree is at least $\sqrt{m}$
• Heavy-hitter triangle
  • A triangle all three of whose nodes are heavy hitters
• Note that the number of heavy-hitter nodes is no more than $2\sqrt{m}$
  • Otherwise, the sum of the degrees of the heavy-hitter nodes would be more than 2m
  • Each edge contributes to the degree of only two nodes

1. Preparing the Data Structures
• Step 1. Compute the degree of each node
  • Examine each edge and add 1 to the count of each of its two nodes
  • The total time required is O(m)
• Step 2. Create an index on edges, with the pair of nodes at its ends as the key
  • Given two nodes, the index tells us whether the edge between them exists
  • A hash table suffices
  • Construction takes O(m) time
  • Expected time to answer a query about the existence of an edge is a constant
• Step 3. Create another index of edges, this one with key equal to a single node
  • Given a node v, we can retrieve the nodes adjacent to v in time proportional to the number of those nodes
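The three preparation steps can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the lecture's implementation: the graph is undirected with no self-loops or repeated edges, each edge is listed once as a pair of integers, and the names `build_indexes` and `heavy_hitters` and the sample edge list are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def build_indexes(edges):
    """Return (degree, edge_index, adjacency) for an undirected simple graph."""
    degree = defaultdict(int)        # Step 1: degree of each node
    edge_index = set()               # Step 2: hash-based index keyed by node pair
    adjacency = defaultdict(list)    # Step 3: index keyed by a single node
    for u, v in edges:
        degree[u] += 1               # each edge adds 1 to both of its endpoints
        degree[v] += 1
        edge_index.add(frozenset((u, v)))   # order-independent key for the pair
        adjacency[u].append(v)
        adjacency[v].append(u)
    return degree, edge_index, adjacency


def heavy_hitters(degree, m):
    """Nodes whose degree is at least sqrt(m)."""
    threshold = sqrt(m)
    return {node for node, d in degree.items() if d >= threshold}


# Hypothetical sample graph: node 1 is connected to everyone else.
edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (4, 5)]
m = len(edges)
degree, edge_index, adjacency = build_indexes(edges)
hh = heavy_hitters(degree, m)

print(dict(degree))
print(frozenset((2, 3)) in edge_index)   # expected O(1) edge-existence query
print(adjacency[1])                      # neighbours of node 1
print(hh)
```

A Python `set` of `frozenset` pairs plays the role of the hash table on edges here; any hash map keyed by an ordered pair would serve equally well.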

2. Sorting Nodes
• Sorting nodes
  • First criterion: by degree
  • Second criterion: if v and u have the same degree, order them numerically (recall that both v and u are integers)
• Therefore, we say v ≺ u if and only if either
  1) The degree of v is less than the degree of u, or
  2) The degrees of u and v are the same, and v < u

3. Counting Triangles [1/2]
• Heavy-Hitter Triangles
  • There are only $O(\sqrt{m})$ heavy-hitter nodes
  • There are $O(m^{3/2})$ possible heavy-hitter triangles, and using the index on edges we can check if all three edges exist in O(1) time. Therefore, $O(m^{3/2})$ time is needed to find all the heavy-hitter triangles
• Other Triangles
  • Consider each edge $(v_1, v_2)$
  • If both $v_1$ and $v_2$ are heavy hitters, ignore this edge
  • Suppose that $v_1$ is not a heavy hitter and moreover $v_1 \prec v_2$
  • Let $u_1, u_2, \ldots, u_k$ be the nodes adjacent to $v_1$
  • Note that $k < \sqrt{m}$
  • We can find these nodes, using the index on nodes, in O(k) time, which is surely $O(\sqrt{m})$ time

3. Counting Triangles [2/2]
• Other Triangles (continued)
  • For each $u_i$ we can use the first index to check whether edge $(u_i, v_2)$ exists in O(1) time
  • We can also determine the degree of $u_i$ in O(1) time, because we have counted all the nodes' degrees
  • We count the triangle $\{v_1, v_2, u_i\}$ if and only if the edge $(u_i, v_2)$ exists, and $v_1 \prec u_i$
  • A triangle is counted only once
    • $v_1$ is the node of the triangle that precedes both other nodes of the triangle according to the ≺ ordering
  • Time to process all the nodes adjacent to $v_1$ is $O(\sqrt{m})$
  • Since there are m edges, the total time spent counting other triangles is $O(m^{3/2})$
• The time to find heavy-hitter triangles is $O(m^{3/2})$ and so is the time to find the other triangles
• Thus, the total time of the algorithm is $O(m^{3/2})$

GEAR Session 5. Algorithmic Techniques for Big Data
Lecture 1. Locality Sensitive Hashing
Introduction
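As a recap of the triangle-counting algorithm that closes Part 1, the whole procedure (degrees, the two indexes, the ≺ ordering, heavy-hitter triangles, then the remaining triangles) can be sketched in a few dozen lines. This is a hedged illustration, not the lecture's code: the function name `count_triangles` and the test graphs are hypothetical, and the input is assumed to be an undirected simple graph with each edge listed once. One detail is an assumption of this sketch rather than the slides' wording: when scanning the neighbours u of $v_1$, it requires $v_2 \prec u$ (which implies $v_1 \prec u$, since $v_1 \prec v_2$), so that a triangle is counted from only one of the two edges that touch its lowest-ranked node.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as a list of (u, v) pairs."""
    edges = list(edges)
    m = len(edges)

    # Preparation: degrees, edge index (pair -> exists), node index (adjacency).
    degree = defaultdict(int)
    adjacency = defaultdict(list)
    edge_index = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[u].append(v)
        adjacency[v].append(u)
        edge_index.add(frozenset((u, v)))

    def has_edge(a, b):
        return frozenset((a, b)) in edge_index

    def precedes(a, b):
        # a ≺ b: lower degree first, ties broken by the node number.
        return (degree[a], a) < (degree[b], b)

    heavy = [v for v in degree if degree[v] >= sqrt(m)]
    heavy_set = set(heavy)
    count = 0

    # Heavy-hitter triangles: test every triple of heavy hitters directly.
    for a, b, c in combinations(heavy, 3):
        if has_edge(a, b) and has_edge(b, c) and has_edge(a, c):
            count += 1

    # Other triangles: scan each edge that is not between two heavy hitters.
    for v1, v2 in edges:
        if v1 in heavy_set and v2 in heavy_set:
            continue
        if precedes(v2, v1):
            v1, v2 = v2, v1          # orient the edge so that v1 ≺ v2
        # v1 cannot be heavy here: otherwise v2, with degree >= degree[v1],
        # would be heavy too and the edge would have been skipped.
        for u in adjacency[v1]:      # fewer than sqrt(m) neighbours
            # precedes(v2, u) is False for u == v2, so v2 itself is skipped.
            if precedes(v2, u) and has_edge(u, v2):
                count += 1
    return count


# Hypothetical test graphs.
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
two_fans = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (4, 5)]
print(count_triangles(k4))
print(count_triangles(two_fans))
```

In the complete graph on four nodes every node is a heavy hitter, so all triangles come from the first phase; in the second graph only node 1 is a heavy hitter, so both triangles are found by the edge scan.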

Traditional hash functions
• Cryptographic hash function (e.g. SHA-1)
  • Should be difficult to reverse
• Non-cryptographic hash function (e.g. for hash tables)
  • Designed to map data to an integer that can be used to look up a particular bucket within the hash table
• Key properties for the non-cryptographic hash functions
  • Efficiently computable
  • Should uniformly distribute the keys
• Two inputs will result in hash outputs that are either different or the same based on key properties of the inputs

Locality-sensitive hash functions
• Hash value collisions are more likely for two input values that are "close" together than for inputs that are far apart
• Many different definitions of "closeness"
  • Neighboring
  • Similarity
  • …
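To make the contrast concrete, here is a small sketch of one simple locality-sensitive family for numbers on a line: a grid of fixed-width buckets shifted by a random offset. It is an assumption chosen for illustration, not necessarily the family this lecture goes on to use, and the names `bucket_hash` and `collision_rate`, the bucket width, and the sample points are all hypothetical. For two points at distance d smaller than the width w, the chance of sharing a bucket is 1 - d/w, so closer points collide more often.

```python
import hashlib
import math
import random


def bucket_hash(x: float, width: float, offset: float) -> int:
    """Index of the grid cell containing x after shifting the grid by offset."""
    return math.floor((x + offset) / width)


def collision_rate(x: float, y: float, width: float = 1.0,
                   trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of random offsets under which x and y land in the same bucket."""
    rng = random.Random(seed)       # seeded so the experiment is repeatable
    hits = 0
    for _ in range(trials):
        offset = rng.uniform(0.0, width)
        if bucket_hash(x, width, offset) == bucket_hash(y, width, offset):
            hits += 1
    return hits / trials


near = collision_rate(0.20, 0.25)   # points 0.05 apart
far = collision_rate(0.20, 3.70)    # points 3.5 apart, wider than one bucket
print(near, far)

# A cryptographic hash, by contrast, gives nearby inputs unrelated digests.
digest_a = hashlib.sha1(b"0.20").hexdigest()
digest_b = hashlib.sha1(b"0.25").hexdigest()
print(digest_a[:8], digest_b[:8])
```

The near pair collides in most trials while the far pair never does, which is the behaviour the slide describes; the SHA-1 digests of the two nearby inputs carry no such relationship.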

Finding the most similar documents
• Measuring similarity between pairs of documents
  • Extremely expensive
• Example
  • 1M documents, signatures of length 250 (4 bytes each)
  • 1M x 1,000 bytes = 1GB
  • Number of comparisons = $\binom{1\mathrm{M}}{2}$ (half a trillion pairs)
  • 1 µs per calculation of similarity
  • 6 days to complete computing
• Do we need to calculate the similarity for all of the pairs?

Distance measure
• A distance measure over a space is a function d(x,y) that takes two points in the space as arguments and produces a real number that satisfies the following axioms:
  1. d(x,y) ≥ 0 (no negative distance)
  2. d(x,y) = 0 if and only if x = y
  3. d(x,y) = d(y,x) (distance is symmetric)
  4. d(x,y) ≤ d(x,z) + d(z,y) (the triangle inequality)
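The cost figures in the similar-documents example above can be reproduced with a few lines of arithmetic. This sketch assumes one microsecond per similarity computation, which is the rate at which half a trillion comparisons take about six days; the variable names are illustrative.

```python
from math import comb

num_docs = 1_000_000
signature_length = 250            # values per signature
bytes_per_value = 4
seconds_per_comparison = 1e-6     # assumption: one microsecond per pair

signature_bytes = num_docs * signature_length * bytes_per_value
pairs = comb(num_docs, 2)         # number of unordered document pairs
total_days = pairs * seconds_per_comparison / 86_400

# For comparison only: the same workload at one millisecond per pair.
days_if_millisecond = pairs * 1e-3 / 86_400

print(f"signature storage: {signature_bytes / 1e9:.1f} GB")
print(f"pairs to compare:  {pairs:,}")
print(f"time at 1 us/pair: {total_days:.1f} days")
print(f"time at 1 ms/pair: {days_if_millisecond:,.0f} days")
```

At a microsecond per pair the all-pairs comparison takes roughly 5.8 days; at a millisecond per pair it would run to thousands of days, which is why avoiding the full pairwise comparison matters.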
