PageRank and recommenders on very large scale
A Big Data perspective through Stratosphere
Márton Balassi
Data Mining and Search Group, Computer and Automation Research Institute of the Hungarian Academy of Sciences
May 8, 2014

Table of Contents
◮ Distributing data-intensive algorithms
◮ Stratosphere Input Contracts
◮ PageRank and recommender systems
◮ Reference

Distributing data-intensive algorithms

Motivation
Let's do a PageRank on this graph...
◮ The soc-LiveJournal1 graph provided by the Stanford LNDC¹
◮ 4.8 · 10^6 nodes
◮ 6.9 · 10^7 edges
◮ 250 MB of compressed data
◮ A "conventional" single-machine solution seems sufficient
¹ Stanford Large Network Dataset Collection

Motivation
Let's do a PageRank on this graph...
◮ A large Portuguese web crawl¹
◮ 3.1 · 10^9 nodes
◮ 1.1 · 10^11 edges
◮ 80 GB of compressed data
◮ Divide and conquer is almost mandatory
¹ A large crawl of the Portuguese Web Archive, obtained from Daniel Gomes

MapReduce

Pregel
Traits
◮ Bulk Synchronous Parallel
◮ "Think like a vertex"
◮ Graph kept in memory
[Figure: Scheme of the BSP system. Wikipedia, public domain]
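The "think like a vertex" idea can be sketched on a single machine. The block below is only an illustration of BSP supersteps applied to PageRank, not Pregel's API: the class name, the array-based graph layout, the damping factor and the even redistribution of rank from vertices without out-links are all assumptions made for this sketch.

```java
import java.util.Arrays;

class VertexCentricPageRank {
    /**
     * outLinks[v] lists the targets of v's out-edges (hypothetical layout).
     * Each loop iteration plays the role of one BSP superstep.
     */
    static double[] pageRank(int[][] outLinks, int supersteps, double damping) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int s = 0; s < supersteps; s++) {
            double[] inbox = new double[n]; // messages delivered at the barrier
            double dangling = 0.0;          // rank held by vertices without out-links
            // Compute phase: every vertex sends rank / outDegree along its out-edges.
            for (int v = 0; v < n; v++) {
                if (outLinks[v].length == 0) {
                    dangling += rank[v];
                    continue;
                }
                double share = rank[v] / outLinks[v].length;
                for (int u : outLinks[v]) {
                    inbox[u] += share;
                }
            }
            // After the barrier: every vertex updates its rank from its inbox only.
            for (int v = 0; v < n; v++) {
                rank[v] = (1.0 - damping) / n + damping * (inbox[v] + dangling / n);
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] graph = {{1, 2}, {2}, {0}}; // 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        System.out.println(Arrays.toString(pageRank(graph, 30, 0.85)));
    }
}
```

The separate inbox array stands in for the synchronisation barrier: no vertex sees a message sent in the current superstep until the next one.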

Counting the number of triangles in a graph
Triangle Counter – Sequential algorithm
Every vertex runs a search for itself, bounded at depth three. Thus every triangle is counted three times.
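A minimal sketch of the sequential counter follows. It assumes an undirected graph and that each vertex checks unordered pairs of its own neighbours, so a triangle is seen once from each of its three corners and the total is divided by three. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SequentialTriangleCounter {
    /** Builds a symmetric adjacency list from undirected edges {u, v}. */
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    /** Counts closed paths v - a - b - v from every vertex v, then divides by three. */
    static long countTriangles(Map<Integer, Set<Integer>> adj) {
        long sightings = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : adj.entrySet()) {
            List<Integer> ns = new ArrayList<>(entry.getValue());
            for (int i = 0; i < ns.size(); i++) {
                for (int j = i + 1; j < ns.size(); j++) {
                    // The path closes into a triangle if the two neighbours are adjacent.
                    if (adj.get(ns.get(i)).contains(ns.get(j))) {
                        sightings++;
                    }
                }
            }
        }
        return sightings / 3; // every triangle was seen from each of its three vertices
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {2, 3}};
        System.out.println(countTriangles(undirected(edges))); // prints 1
    }
}
```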

Triangle Counter – MapReduce algorithm
Representation
[Figure: example graph and its adjacency-list representation]

Triangle Counter – MapReduce algorithm
First Map
Let's send our ID to all of our neighbours possessing a higher ID than ours. Let's send our neighbours to ourselves.
First Reduce
Let's write out the information received.
[Figure: example graph on vertices 0, 1 and 2]

Triangle Counter – MapReduce algorithm
Second Map
If the ID received is smaller than ours, let's pass it on to our neighbours. Let's send our neighbours to ourselves.
Second Reduce
If the ID received is our neighbour, then let's increment a global counter.
[Figure: vertices annotated with the IDs received so far: 0 [], 1 [0], 2 [1]]
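The two rounds can be simulated in memory as a rough sketch. This is plain Java, not Hadoop or Stratosphere code, and all names are hypothetical. One assumption goes beyond the slides: in the second map an ID is forwarded only to neighbours with a higher ID, so that each triangle increments the counter exactly once. Instead of sending the neighbour list to itself, each vertex simply looks it up in the shared adjacency map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MapReduceTriangleCounter {
    /** Builds a symmetric adjacency list from undirected edges {u, v}. */
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    static long countTriangles(Map<Integer, Set<Integer>> adj) {
        // First map: every vertex sends its ID to each neighbour with a higher ID.
        // First reduce: the IDs are grouped by the receiving vertex.
        Map<Integer, List<Integer>> received = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : adj.entrySet()) {
            int v = e.getKey();
            for (int u : e.getValue()) {
                if (u > v) {
                    received.computeIfAbsent(u, k -> new ArrayList<>()).add(v);
                }
            }
        }
        // Second map: a received smaller ID is passed on to the neighbours
        // (assumption of this sketch: only to neighbours with a higher ID).
        Map<Integer, List<Integer>> forwarded = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : received.entrySet()) {
            int v = e.getKey();
            for (int id : e.getValue()) {
                if (id < v) {
                    for (int w : adj.get(v)) {
                        if (w > v) {
                            forwarded.computeIfAbsent(w, k -> new ArrayList<>()).add(id);
                        }
                    }
                }
            }
        }
        // Second reduce: a forwarded ID that is also a neighbour closes a triangle.
        long counter = 0;
        for (Map.Entry<Integer, List<Integer>> e : forwarded.entrySet()) {
            Set<Integer> neighbours = adj.get(e.getKey());
            for (int id : e.getValue()) {
                if (neighbours.contains(id)) {
                    counter++;
                }
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {2, 3}};
        System.out.println(countTriangles(undirected(edges))); // prints 1
    }
}
```

With this ordering rule a triangle a < b < c is counted only at c, when the ID a forwarded by b arrives and a turns out to be a neighbour of c.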

Counting the number of triangles in a graph
Runtime of the three solutions
[Figure: runtime of the three solutions]

Stratosphere Input Contracts

Map
Wordcount Map
For each line of input text, emit (word, 1) for each word.

Map

    public static class TokenizeLine extends MapStub implements Serializable {
        private static final long serialVersionUID = 1L;

        // initialize reusable mutable objects
        private final PactRecord outputRecord = new PactRecord();
        private final PactString word = new PactString();
        private final PactInteger one = new PactInteger(1);

        @Override
        public void map(PactRecord record, Collector<PactRecord> collector) {
            // get the first field (as type PactString) from the record
            PactString line = record.getField(0, PactString.class);

            // normalize the line with AsciiUtils
            ...

            // tokenize the line
            this.tokenizer.setStringToTokenize(line);
            while (tokenizer.next(this.word)) {
                // emit a (word, 1) pair
                this.outputRecord.setField(0, this.word);
                this.outputRecord.setField(1, this.one);
                collector.collect(this.outputRecord);
            }
        }
    }

Reduce
Wordcount Reduce
Given multiple instances of (word, 1), count the frequency of each word.
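The Map and Reduce steps of Wordcount can be sketched together in plain Java. This deliberately avoids the Stratosphere API: the class, the method names and the tokenisation rule (lower-casing and splitting on non-word characters) are assumptions for illustration only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    /** Map: emits a (word, 1) pair for each word of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Reduce: sums all the ones collected for a single word. */
    static int reduce(Iterable<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    /** Runs map on every line, groups the pairs by word, then reduces each group. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```

The grouping done here by the TreeMap is what the framework performs between the two phases in a distributed run.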
