Enhancing MapReduce using MPI and an optimized data exchange policy Hisham Mohamed and Stéphane Marchand-Maillet Viper group, CVML Laboratory, University of Geneva September 10, 2012 Fifth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2012 1
Outline • Motivation • MapReduce • MapReduce overlapping using MPI (MRO-MPI) • Experiments – WordCount – Distributed inverted files • Conclusion 2
Motivation • Cross Modal Search Engine (CMSE) 3
Motivation • Scalability – Multimedia data increases rapidly. • Indexing • Searching – High-dimensional data. 4
Our proposed solution • In CMSE, we need both data and algorithm parallelization. • MapReduce overlapping using MPI (MRO-MPI): – A C/C++ implementation of MapReduce using MPI. – Improves the MapReduce model. – Maintains the usability of the model. 5
Outline • Motivation • MapReduce • MapReduce overlapping using MPI (MRO-MPI) • Experiments – WordCount – Distributed inverted files • Conclusion 6
MapReduce • MapReduce provides a simple and powerful interface for data parallelization by keeping the user away from the details of communication and data exchange. 7
MapReduce • The current MapReduce model has at least three bottlenecks: – Dependence: reducers cannot start until all mappers have finished. – Multiple disk accesses for intermediate data. – All-to-all communication during the shuffling phase. 8
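The slides do not include code; the following minimal sketch (our own, with illustrative names) shows how such a phased MapReduce round typically looks when built directly on MPI, making the three bottlenecks visible:

```cpp
// Conventional phased MapReduce on top of MPI (illustrative only).
// The three bottlenecks are visible: (1) reducers depend on ALL
// mappers finishing, (2) intermediate pairs are staged through disk,
// (3) the shuffle is one big all-to-all exchange.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Phase 1: each process maps its input split and writes the
    // intermediate (key, value) pairs to local disk (not shown).

    // Hard synchronization point: no reducing before all maps end.
    MPI_Barrier(MPI_COMM_WORLD);

    // Phase 2 (shuffle): exchange how much each process will send
    // to every other process, then the data itself (not shown).
    std::vector<int> sendcounts(size, 0), recvcounts(size, 0);
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT,
                 recvcounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // Phase 3: each process reduces the pairs it received (not shown).

    MPI_Finalize();
    return 0;
}
```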
MapReduce Overlapping (MRO) • Send partial intermediate (Km, Vm) pairs to the responsible reducers while the mappers are still working. • This removes the bottlenecks: – No multiple read/write of intermediate data. – The shuffling phase is merged with the mapping phase. – Reducers do not wait until the mappers finish their work. • Difficulties: – Choosing the rate of sending data between the mappers and the reducers. – Choosing the ratio between the mappers and the reducers. 9
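A hedged sketch of the overlapping idea: each mapper keeps a small per-reducer buffer and ships partial (Km, Vm) pairs with a non-blocking send as soon as the buffer fills. The threshold, tag, helper name, and the assumption that reducers occupy ranks 0..R-1 are all ours, not MRO-MPI internals:

```cpp
// Overlapped map/shuffle (a sketch of the idea, not the actual
// MRO-MPI source). Partial (Km, Vm) pairs are pushed to their
// reducer as soon as a per-reducer buffer fills, so reducers can
// start consuming while the mappers are still working.
#include <mpi.h>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

const int TAG_PAIRS = 1;
const std::size_t FLUSH_THRESHOLD = 4096; // assumed "rate of sending data" knob

// outbuf holds one serialized "key\tvalue\n" buffer per reducer;
// reducers are assumed to be ranks 0..outbuf.size()-1.
void emit(std::vector<std::string> &outbuf,
          const std::string &key, const std::string &val) {
    int reducer = static_cast<int>(
        std::hash<std::string>{}(key) % outbuf.size());
    outbuf[reducer] += key + "\t" + val + "\n";
    if (outbuf[reducer].size() >= FLUSH_THRESHOLD) {
        // Non-blocking send: mapping can continue while data moves.
        MPI_Request req;
        MPI_Isend(outbuf[reducer].data(),
                  static_cast<int>(outbuf[reducer].size()), MPI_CHAR,
                  reducer, TAG_PAIRS, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE); // or track and wait later
        outbuf[reducer].clear();
    }
}
```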
MapReduce Overlapping using MPI (MRO-MPI) • MapReduce – Data parallelization. • Message Passing Interface (MPI) – Separate processes, each with a unique rank. – Supports point-to-point, one-to-all, all-to-one, and all-to-all communication between processes. • MapReduce-MPI (MR-MPI) – An existing MapReduce library on top of MPI, based on the original MapReduce model. 10
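For readers less familiar with MPI, a minimal self-contained example of the two concepts mentioned above, unique ranks and point-to-point communication (the collective operations are noted in the comments):

```cpp
// Minimal illustration of the MPI concepts above: every process has
// a unique rank, and processes exchange messages point-to-point.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // unique rank per process
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {        // point-to-point: rank 0 -> rank 1
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg = 0;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d from rank 0\n", msg);
        }
    }
    // One-to-all: MPI_Bcast; all-to-one: MPI_Gather/MPI_Reduce;
    // all-to-all: MPI_Alltoall.
    MPI_Finalize();
    return 0;
}
```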
MRO-MPI [Slides 11-17: a step-by-step timeline figure (x-axis: Time) showing how mapping, the exchange of partial pairs, and reducing overlap; one frame, "Rate of Sending the data", highlights how often partial pairs are shipped to the reducers.]
MRO-MPI – Same simple interface as MapReduce. – Extra parameters: – Rate of sending data. – Ratio of mappers to reducers. – Data type. 18
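The slides do not show the actual MRO-MPI interface, so the following is only a guess at what a user program with these extra parameters could look like; every commented identifier (MroJob, set_send_rate, set_ratio, set_value_type, MRO_STRING) is hypothetical:

```cpp
// Hypothetical user program; mro_mpi.h and every identifier in the
// comments are invented for illustration, not the real MRO-MPI API.
#include <mpi.h>
// #include "mro_mpi.h"  // assumed library header

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // MroJob job(MPI_COMM_WORLD);
    // job.set_send_rate(4096);         // rate of sending data (bytes)
    // job.set_ratio(1, 1);             // mappers : reducers
    // job.set_value_type(MRO_STRING);  // data type of the values
    // job.run("input_dir/", my_map, my_reduce);

    MPI_Finalize();
    return 0;
}
```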
Outline • Motivation • MapReduce • MapReduce overlapping using MPI (MRO-MPI) • Experiments – WordCount – Distributed inverted files • Conclusion 19
WordCount • WordCount: – Reads text files and counts how often each word occurs. – Input data size varies from 0.2 GB to 53 GB, taken from Project Gutenberg. 20
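For concreteness, a minimal serial version of the word-count logic being benchmarked; the parallel runs distribute the (word, 1) pairs across reducers as sketched earlier:

```cpp
// Minimal serial word count: the map step splits text into
// (word, 1) pairs and the reduce step sums them; both are fused
// here for brevity.
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, long> counts;
    std::string word;
    while (std::cin >> word)   // map: emit (word, 1)
        ++counts[word];        // reduce: sum per word
    for (const auto &kv : counts)
        std::cout << kv.first << "\t" << kv.second << "\n";
    return 0;
}
```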
WordCount • MRO-MPI: 24 cores as mappers and 24 as reducers. • MR-MPI: 48 cores are used first as mappers, then as reducers. • Hadoop: 48 reducers; the number of mappers varies with the number of partial input files. • Speedup: 1.9x and 5.3x. [Figure: x-axis: data size in gigabytes; y-axis: log10 of the running time. Values in the table show the running time in seconds; values above the columns show the size of each chunk.] 21
Outline • Motivation • MapReduce • MapReduce overlapping using MPI (MRO-MPI) • Experiments – WordCount – Distributed inverted files • Conclusion 22
Inverted Files • An inverted file is an indexing structure composed of two elements: the vocabulary and the posting lists. [Figure: three example documents (#id=1: "Computer security, known as information security, as applied to computers and networks..."; #id=2: "MapReduce has been used as a framework for distributing larger corpora..."; #id=3: "Protesters have been clashing with security forces. No information...") and the resulting index; the vocabulary (apply, clash, corpora, compute, framework, force, information, large, MapReduce, networks, protest, security, ...) points to posting lists such as security → <1,tf-idf>, <3,tf-idf>.] 23
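In code, the structure in the figure could be represented as a map from vocabulary terms to posting lists; this layout is our illustration, not the implementation used in the talk:

```cpp
// Illustrative in-memory layout of an inverted file (not the
// on-disk format used in the talk).
#include <map>
#include <string>
#include <utility>
#include <vector>

using Posting = std::pair<int, double>;  // (document id, tf-idf weight)
using InvertedFile = std::map<std::string, std::vector<Posting>>;
// e.g. index["security"] == { {1, w1}, {3, w3} }
```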
Inverted Files – tf-idf • tf-idf weighting scheme (SMART system, 1988): – Used to evaluate how important a word in a document is with respect to the other documents in the corpus. – Term frequency: tf_{t,d} = f_{t,d}, the number of occurrences of term t in document d. – Inverse document frequency: idf_t = log(N / n_t), where n_t is the number of documents in which t appears and N is the total number of documents. – Weight: tf-idf_{t,d} = tf_{t,d} × idf_t. 24
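A worked example with invented counts (the slide does not fix the log base; natural log is assumed here): take the term "security" from the previous figure, occurring twice in Doc1 and appearing in 2 of the 3 documents:

```latex
% Hypothetical worked example (counts invented for illustration):
% N = 3 documents; the term t = "security" occurs twice in Doc1
% and appears in n_t = 2 documents (Doc1 and Doc3).
\[
\mathrm{tf}_{t,d} = 2, \qquad
\mathrm{idf}_t = \ln\frac{N}{n_t} = \ln\frac{3}{2} \approx 0.405, \qquad
\text{tf-idf}_{t,d} = 2 \times 0.405 \approx 0.81
\]
```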
MRO-MPI for inverted files • Mappers: – (Km, Vm) = (term, (document name, tf)). • Reducers: – Distribute the data based on its lexicographic order, each reducer being responsible for a certain range of words. – Since similar terms are saved into the same database, reducer nodes can calculate the correct tf-idf value. 25
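A sketch of such a lexicographic partitioning; the equal-width split of the alphabet is our assumption, since the slides do not give the actual range boundaries:

```cpp
// Lexicographic partitioning of terms over reducers (sketch).
// The equal-width alphabet split is an assumption; the real range
// boundaries are not shown in the slides.
#include <cctype>
#include <string>

int reducer_for_term(const std::string &term, int num_reducers) {
    if (term.empty()) return 0;
    int c = std::tolower(static_cast<unsigned char>(term[0]));
    if (c < 'a') return 0;                 // digits/punctuation first
    if (c > 'z') return num_reducers - 1;
    return (c - 'a') * num_reducers / 26;  // spread 26 letters evenly
}
// With 4 reducers: "apply"->0, "information"->1, "security"->2.
```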
Distributed inverted files • 9,319,561 text (XML) excerpts related to 9,319,561 images from the 12-million-image ImageNet corpus. • Data size: 36 GB of XML data. • Hadoop: 40 minutes with 26 reducers. • Double speedup, thanks to sending the data while the map function is working. • The best ratio between the mappers and the reducers is found to be: [value shown in a figure not reproduced here]. 26
Conclusion • We proposed MRO-MPI for intensive data processing. • It maintains the simplicity of MapReduce. • It achieves high speedup with the same number of nodes. 27
Questions ? 28