CS535 Big Data 4/8/2020 Week 11-B Sangmi Lee Pallickara
CS535 Big Data | Computer Science | Colorado State University
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 4: LARGE SCALE RECOMMENDATION SYSTEMS AND SOCIAL MEDIA
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• Today is the last day of the discussion period for Session III on Piazza
  • Watch the video clips on Canvas → Assignments → Echo360
• Term project phase 1 (Proposal)
  • Feedback is available in Canvas
  • Please arrange a meeting if needed

Topics of Today's Class
• Part 1: Distributed implementation of the Triplets View in GraphX
• Recommendation Systems
  • Part 2: Introduction and content-based recommendation systems
  • Part 3: Collaborative Filtering (case study of Amazon's Item-to-Item model and Netflix's Latent Factor Model)

GEAR Session 3. Big Graph Analysis
Lecture 3. Distributed Large Graph Analysis-II
GraphX: Graph Processing in a Distributed Dataflow Framework
Distributed Implementation of the Triplets View

Efficient lookup of edges
• Edges within a partition are clustered by source vertex id using a compressed sparse row (CSR) representation and hash-indexed by their target id
• CSR with an example
  • A sparse m x n matrix M is stored using three one-dimensional arrays (V, COL_INDEX, ROW_INDEX)
  • M =
      0 0 0 0
      5 8 0 0
      0 0 3 0
      0 6 0 0
  • V = [5 8 3 6]
    • the nonzero values, row by row
  • COL_INDEX = [0 1 2 1]
    • column indices
  • ROW_INDEX = [0 0 2 3 4]
    • index in V where the given row starts
  • Define
      row_start = ROW_INDEX[row]
      row_end = ROW_INDEX[row+1]
    • The entries of that row are V[row_start] through V[row_end - 1]

Index Reuse
• GraphX inherits the immutability of Spark
  • All graph operators logically create new collections rather than destructively modifying existing ones
• Derived vertex and edge collections can often share indices to reduce memory overhead and improve local performance
  • A hash index on vertices enables fast aggregation, and the resulting aggregates share the index with the original vertices
• Faster joins
  • Vertex collections sharing the same index can be joined by a coordinated scan
  • Without requiring any index lookups
• Index reuse reduces the per-iteration runtime of PageRank on the Twitter graph by 59% (GraphX paper)
• Operators that do not modify the graph structure (e.g. mapV) automatically preserve indices
• Operators that restrict the graph structure (e.g. subgraph) rely on bitmasks to construct restricted views
• reindex operator
  • For operators that change the structure heavily (e.g. filtering)
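The CSR lookup from the "Efficient lookup of edges" slide can be sketched in a few lines of Python. This is only an illustration built from the slide's three arrays, not GraphX code. The function names `row_entries` and `to_dense` are hypothetical, and reading a row as a source vertex id and a column as a target vertex id is this sketch's interpretation of how the layout maps onto edges.

```python
# Arrays from the slide's 4 x 4 example matrix.
V = [5, 8, 3, 6]              # nonzero values, row by row
COL_INDEX = [0, 1, 2, 1]      # column of each value in V
ROW_INDEX = [0, 0, 2, 3, 4]   # offset in V where each row starts


def row_entries(row):
    """Return the (column, value) pairs stored for one row.

    Hypothetical helper: if rows are read as source vertex ids, this is
    the list of out-edges of one source vertex.
    """
    row_start = ROW_INDEX[row]
    row_end = ROW_INDEX[row + 1]
    return list(zip(COL_INDEX[row_start:row_end], V[row_start:row_end]))


def to_dense(n_cols):
    """Rebuild the dense matrix, only to check the three arrays."""
    n_rows = len(ROW_INDEX) - 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        for col, value in row_entries(row):
            dense[row][col] = value
    return dense


if __name__ == "__main__":
    for row in range(len(ROW_INDEX) - 1):
        print(row, row_entries(row))
```

Because ROW_INDEX holds one extra trailing entry, every row (including the empty row 0) is located with two array reads and no search.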
Implementing the Triplets View
• Triplets view
  • Three-way join between the source and destination vertex properties and the edge properties
• Three-way join
  • Vertex Mirroring
  • Multicast Join
  • Partial Materialization
  • Incremental View Maintenance

Implementing the Triplets View: Vertex Mirroring
• The join requires data movement
  • Vertex and edge property collections are partitioned independently
• Shipping the vertex properties across the network to the edges
  • Setting the edge partitions as the join sites
• Observation 1: Real-world graphs commonly have orders of magnitude more edges than vertices
• Observation 2: A single vertex may have many edges in the same partition
  • Enabling substantial reuse of the vertex property

Implementing the Triplets View: Multicast Join
• Broadcast join
  • All vertices are sent to each edge partition
• Multicast join
  • Each vertex property is sent only to the edge partitions that contain adjacent edges
• Join site information is stored in the routing table
  • Co-partitioned with the vertex collection
  • The routing table is associated with the edge collection
  • The routing table is constructed lazily upon first instantiation of the triplets view
• Example
  • Per-city partitioning scheme on the Facebook social network graph
  • 50.5% reduction in query time

Implementing the Triplets View: Partial Materialization
• Local joins at the edge partitions
  • Mirrored vertex properties are stored in local hash maps on each edge partition
  • Referenced when the triplets are constructed

Implementing the Triplets View: Incremental View Maintenance
• Iterative graph algorithms often modify only a subset of the vertex properties in each iteration
• Incremental view maintenance
  • To avoid unnecessary movement of unchanged data
• After each graph operation
  • GraphX tracks which vertex properties have changed since the triplets view was last constructed
• When the triplets view is accessed next time
  • Only the changed vertices are re-routed to their edge-partition join sites
  • Local mirrored values of the unchanged vertices are reused

Query Optimizations for the mrTriplets operator
• Filtered Index Scanning
  • The mrTriplets operator logically involves a scan of the triplets view to apply the user-defined map function
  • As iterative graph algorithms converge, the working sets tend to shrink
    • The map function can skip many triplets
• Active set
  • The map function only needs to operate on triplets containing active vertices
  • Defined by an application-specific predicate
    • E.g. connected component analysis
• Indexed scan for the triplets view
  • The application expresses the current active set by restricting the graph using the subgraph operator
  • The triplets are filtered using this vertex predicate
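The mirroring steps above (routing table, multicast shipping, local hash-map joins, re-shipping only changed vertices, and restricting the scan to an active set) can be tied together in a small single-process Python sketch. It is a toy model under stated assumptions, not GraphX's implementation: the data, the function names (`build_routing_table`, `ship`, `triplets`) and the dictionary-based "partitions" are all hypothetical, and the active-set filter here is applied during a plain scan rather than through an index.

```python
from collections import defaultdict

# Hypothetical toy data: each edge partition holds (src, dst, edge_property).
edge_partitions = {
    0: [(1, 2, "a"), (1, 3, "b")],
    1: [(3, 4, "c")],
}
vertex_props = {1: 10, 2: 20, 3: 30, 4: 40}


def build_routing_table(edge_partitions):
    """Map each vertex id to the edge partitions holding an adjacent edge."""
    routing = defaultdict(set)
    for pid, edges in edge_partitions.items():
        for src, dst, _ in edges:
            routing[src].add(pid)
            routing[dst].add(pid)
    return routing


def ship(vertex_props, routing, mirrors, changed):
    """Multicast only the changed vertex properties to their join sites.

    Returns the number of (vertex, partition) messages sent.
    """
    sent = 0
    for vid in changed:
        for pid in routing[vid]:
            mirrors[pid][vid] = vertex_props[vid]  # local hash map per partition
            sent += 1
    return sent


def triplets(edge_partitions, mirrors, active=None):
    """Join locally at each edge partition.

    If an active set is given, keep only triplets touching an active vertex.
    """
    out = []
    for pid, edges in edge_partitions.items():
        local = mirrors[pid]
        for src, dst, prop in edges:
            if active is not None and src not in active and dst not in active:
                continue
            out.append((src, local[src], dst, local[dst], prop))
    return out


routing = build_routing_table(edge_partitions)
mirrors = {pid: {} for pid in edge_partitions}

# First construction of the view: every vertex counts as changed.
first_sent = ship(vertex_props, routing, mirrors, changed=set(vertex_props))

# Next iteration: only vertex 3 changed, so only vertex 3 is re-shipped.
vertex_props[3] = 31
second_sent = ship(vertex_props, routing, mirrors, changed={3})

active_triplets = triplets(edge_partitions, mirrors, active={3})
```

In this toy graph a broadcast join would send 4 vertices to each of 2 partitions (8 messages), while the multicast sends 5 on the first pass and 2 on the second, since unchanged mirrored values stay in place.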
Query Optimizations for the mrTriplets operator
• Automatic Join Elimination
  • Some operations on the triplets view may access only one of the vertex properties, or none at all
    • E.g. counting the degree of each vertex
  • GraphX uses a JVM bytecode analyzer to inspect user-defined functions at runtime
    • It checks whether the source or destination vertex properties are referenced
  • If only one property is referenced and the triplets view has not already been materialized
    • GraphX rewrites the query plan for generating the triplets view
    • From a three-way join to a two-way join
  • If none of the vertex properties are referenced
    • GraphX eliminates the join entirely

Additional Optimizations
• Memory-based Shuffle
  • Spark's default shuffle implementation materializes the temporary data to disk
  • GraphX modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout
• Batching and Columnar Structure
  • In the join code, batch a block of vertices routed to the same target join site and convert the block from row-oriented format to column-oriented format
  • Apply the LZF compression algorithm to these blocks before sending them
• Variable Integer Encoding
  • While GraphX uses 64-bit vertex ids, most ids are much smaller than 2^64
  • GraphX uses a variable-encoding scheme
    • Uses only the first 7 bits of each byte to encode the value; the high-order bit indicates whether another byte follows

GEAR Session 4. Large Scale Recommendation Systems and Social Media
Lecture 1. Large Scale Recommendation Systems
Recommendation Systems: Introduction

“What percentage of the top 10,000 titles in any online media store (Netflix, iTunes, Amazon, or any other) will rent or sell at least once a month?”

The long tail phenomenon [1/2]
• Distribution of numbers with a portion that has a large number of occurrences far from the “head” or central part of the distribution
  • The vertical axis represents popularity
  • The items are ordered on the horizontal axis according to their popularity
• The long-tail phenomenon forces online institutions to recommend items to individual users
Erik Brynjolfsson, Yu (Jeffrey) Hu, and Duncan Simester. 2011. Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales. Manage. Sci. 57, 8 (August 2011), 1373-1386. DOI=http://dx.doi.org/10.1287/mnsc.1110.1371

Recommendation systems
• Seek to predict the “rating” or “preference” that a user would give to an item
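Returning to the variable integer encoding listed under Additional Optimizations: the idea of spending 7 bits of each byte on the value can be sketched in Python as below. This is a generic base-128 varint, not GraphX's actual encoder. The function names are hypothetical, and the choice to emit the least-significant 7-bit group first is an assumption of this sketch; the slides do not state the byte order.

```python
def encode_varint(value):
    """Encode a non-negative integer with 7 value bits per byte.

    Assumption of this sketch: least-significant group first, and the
    high-order bit of a byte is set when another byte follows.
    """
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte, high bit clear
            return bytes(out)


def decode_varint(data):
    """Decode one varint produced by encode_varint."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    for vertex_id in (5, 300, 2**40, 2**64 - 1):
        encoded = encode_varint(vertex_id)
        print(vertex_id, len(encoded), decode_varint(encoded) == vertex_id)
```

A small vertex id such as 5 takes one byte instead of eight, which is where the saving during shuffles comes from; the cost is that an id close to 2^64 needs ten bytes.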