Rule Based Classification on a Multi Node Scalable Hadoop Cluster

Shashank Gugnani, Devavrat Khanolkar, Tushar Bihany, Nikhil Khadilkar
BITS Pilani, K K Birla Goa Campus

Data Hypergrowth
- Reuters-21578: about 10K docs (ModApte) (Bekkerman et al., SIGIR 2001)
- RCV1: about 807K docs (Bekkerman & Scholz, CIKM 2008)
- LinkedIn job title data: about 100M docs (Bekkerman & Gavish, KDD 2011)
- Common Crawl Corpus: 5 billion docs (Common Crawl Foundation, 2014)
9/29/2014 BITS Pilani, K K Birla Goa Campus

New Age of Data
- The world has gone mobile
  - 5 billion cellphones produce data daily
- Social networks have gone online
  - Twitter produces 200M tweets a day
- The web is growing
  - 1M new websites created every day
Source: mediapost.com, bigdatainsightsgroup.com, bbcnews.com

What is MapReduce?
- Data-parallel programming model for clusters of commodity machines
- Pioneered by Google
  - Processes 20 PB of data per day
- Popularized by the Apache Hadoop project
  - Used by Yahoo!, Facebook, Amazon, …
- Scalable to large data volumes
  - Scan 100 TB on 1 node @ 50 MB/s = 24 days
  - Scan on 1000-node cluster = 35 minutes
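The scan times quoted above can be checked with a few lines of arithmetic. This is a rough sketch that assumes TB and MB are binary units (2^40 and 2^20 bytes) and that the work splits evenly across nodes with no overhead. The function name `scan_seconds` is illustrative only.

```python
# Back-of-the-envelope check of the scan times quoted on the slide.
# Assumption: TB and MB are read as binary units (2**40 and 2**20 bytes).
TB = 2 ** 40
MB = 2 ** 20

def scan_seconds(data_bytes, nodes, mb_per_sec_per_node=50):
    """Time to scan data_bytes when every node reads at the given rate
    and the work splits evenly across nodes (ideal case, no overhead)."""
    return data_bytes / (nodes * mb_per_sec_per_node * MB)

single_node_days = scan_seconds(100 * TB, nodes=1) / 86400
cluster_minutes = scan_seconds(100 * TB, nodes=1000) / 60

print(f"1 node: {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions the single-node scan takes a little over 24 days and the 1000-node scan just under 35 minutes, which matches the rounded figures on the slide.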

What is MapReduce?
- Map function: (K_in, V_in) → list<(K_inter, V_inter)>
- Reduce function: (K_inter, list<V_inter>) → list<(K_out, V_out)>
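To make the two signatures concrete, here is a minimal in-memory sketch in plain Python. It does not use Hadoop; `run_mapreduce` and the word-count functions are illustrative names. It only mimics the map, group-by-key, and reduce phases on a single machine.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (key, value) record, group the intermediate
    pairs by key (the shuffle step), then apply reduce_fn once per key."""
    groups = defaultdict(list)
    for k_in, v_in in records:
        for k_inter, v_inter in map_fn(k_in, v_in):
            groups[k_inter].append(v_inter)
    output = []
    for k_inter in sorted(groups):
        output.extend(reduce_fn(k_inter, groups[k_inter]))
    return output

def word_count_map(doc_id, text):
    # (K_in, V_in) -> list of (K_inter, V_inter)
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    # (K_inter, list of V_inter) -> list of (K_out, V_out)
    return [(word, sum(counts))]

docs = [(1, "hadoop map reduce"), (2, "map reduce map")]
word_counts = run_mapreduce(docs, word_count_map, word_count_reduce)
print(word_counts)
```

In a real cluster the map calls run on different nodes and the grouping happens over the network. The function signatures stay the same.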

Hadoop

Rule Based Classification
- Classification method in which the classifier consists of rules
- Rule: (Condition) → y, where
  - Condition is a conjunction of attribute tests: (A1 = v1) and (A2 = v2) and … and (An = vn)
  - y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
- E.g. (Blood Type = warm) ∧ (Lays Eggs = yes) → Birds
- E.g. (Give Birth = no) ∧ (Live in water = yes) → Fishes
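One way to represent such rules in code is a dictionary of attribute tests paired with a class label. The sketch below assumes rules are tried in order and the first match wins. That is a common convention, not something the slides specify. The records `hawk` and `salmon` are made-up examples.

```python
def rule_covers(condition, record):
    """A condition maps attribute -> required value; it covers a record
    only if every attribute test holds (a conjunction)."""
    return all(record.get(attr) == value for attr, value in condition.items())

def classify(rules, record, default=None):
    """Return the label of the first rule whose condition covers the record.
    First-match ordering is an assumption of this sketch."""
    for condition, label in rules:
        if rule_covers(condition, record):
            return label
    return default

rules = [
    ({"Blood Type": "warm", "Lays Eggs": "yes"}, "Birds"),
    ({"Give Birth": "no", "Live in water": "yes"}, "Fishes"),
]

# Made-up records for illustration.
hawk = {"Blood Type": "warm", "Lays Eggs": "yes",
        "Give Birth": "no", "Live in water": "no"}
salmon = {"Blood Type": "cold", "Lays Eggs": "yes",
          "Give Birth": "no", "Live in water": "yes"}

print(classify(rules, hawk), classify(rules, salmon))
```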

RIPPER
- Repeated Incremental Pruning to Produce Error Reduction
- Builds rules by adding attribute tests one by one to the condition
- Uses FOIL's information gain to select the best attribute test to add
  - FOIL's information gain = p1 × ( log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) )
  - p0, n0: positive and negative examples covered by the rule before adding the test; p1, n1: those covered after adding it
- Rules are pruned using a pruning metric
  - Pruning metric = (p − n) / (p + n)
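The two metrics translate directly into code. This sketch assumes base-2 logarithms, since the slide writes only "log". It also treats a test that covers no positives as having zero gain, which is the limit of the formula as p1 goes to 0.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending a rule with one attribute test.
    p0, n0: positives/negatives covered before adding the test.
    p1, n1: positives/negatives covered after adding the test.
    Assumption: base-2 logarithm; gain is 0 when no positives are covered."""
    if p0 == 0 or p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def pruning_metric(p, n):
    """Pruning metric (p - n) / (p + n) for a rule covering p positives
    and n negatives."""
    return (p - n) / (p + n)

# A test that narrows coverage from 50+/50- to 40+/10-.
gain = foil_gain(50, 50, 40, 10)
metric = pruning_metric(40, 10)
print(round(gain, 2), metric)
```

A positive gain means the extended rule covers a purer set of examples. The factor p1 favors tests that keep many positives covered.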

RIPPER
Rule Building → Rule Pruning → Test Model

RIPPER with Hadoop
- Each step requires calculating p and n values, which means a pass over the whole dataset
- This can take a long time if the dataset is large
- Use Hadoop to calculate p and n values in parallel
- Use p and n as key values in Map and Reduce functions
- Significant time reduction
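The counting step can be sketched as a map and a reduce function, simulated here in plain Python without any Hadoop API. The key/value layout is an assumption of this sketch and may differ from the authors' implementation: the key is a candidate attribute test and the value is a (p, n) contribution. The dataset and the "Mammals" label are toy data.

```python
from collections import defaultdict

def map_counts(record, label, rule_condition, positive_class):
    """Mapper: for a record covered by the current rule, emit one
    ((attribute, value), (p, n)) pair per candidate attribute test."""
    if any(record.get(a) != v for a, v in rule_condition.items()):
        return []  # record not covered by the rule so far
    contribution = (1, 0) if label == positive_class else (0, 1)
    return [((attr, value), contribution)
            for attr, value in record.items() if attr not in rule_condition]

def reduce_counts(test, contributions):
    """Reducer: sum the p and n contributions for one candidate test."""
    p = sum(c[0] for c in contributions)
    n = sum(c[1] for c in contributions)
    return test, (p, n)

def count_p_n(dataset, rule_condition, positive_class):
    """Local stand-in for the cluster: map, group by key, reduce."""
    groups = defaultdict(list)
    for record, label in dataset:
        for key, value in map_counts(record, label, rule_condition, positive_class):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

# Toy data for illustration only.
dataset = [
    ({"Blood Type": "warm", "Lays Eggs": "yes"}, "Birds"),
    ({"Blood Type": "warm", "Lays Eggs": "no"}, "Mammals"),
    ({"Blood Type": "cold", "Lays Eggs": "yes"}, "Fishes"),
]
counts = count_p_n(dataset, rule_condition={}, positive_class="Birds")
print(counts)
```

Because each record is mapped independently and the reducer only sums, one pass over the data yields p and n for every candidate test at once.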

RIPPER with Hadoop
Repeat until all rules complete:
1. Calculate p and n values for all attributes using Hadoop
2. Find MAX FOIL's IG and add attribute test to rule
Repeat until all rules pruned:
1. Calculate p and n values for pruning metric using Hadoop
2. Prune rule if viable
Find model accuracy using Hadoop to calculate p and n values
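The rule-building loop can be sketched for a single rule as follows. This is a simplified, single-machine illustration: `count_p_n` stands in for the Hadoop job, the logarithm is assumed to be base 2, the stopping criteria are assumptions, and pruning and accuracy testing are left out. The dataset is toy data.

```python
import math

def count_p_n(dataset, condition, positive_class):
    """Stand-in for the Hadoop job: count positives (p) and negatives (n)
    covered by `condition`, and by `condition` plus each candidate test."""
    base = [0, 0]
    per_test = {}
    for record, label in dataset:
        if any(record.get(a) != v for a, v in condition.items()):
            continue
        idx = 0 if label == positive_class else 1
        base[idx] += 1
        for attr, value in record.items():
            if attr not in condition:
                per_test.setdefault((attr, value), [0, 0])[idx] += 1
    return tuple(base), per_test

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain (base-2 log assumed)."""
    if p0 == 0 or p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def grow_rule(dataset, positive_class):
    """Greedily add the attribute test with the highest FOIL gain until the
    rule covers no negatives or no test improves it."""
    condition = {}
    while True:
        (p0, n0), per_test = count_p_n(dataset, condition, positive_class)
        if n0 == 0 or not per_test:
            return condition
        best_test, (p1, n1) = max(
            per_test.items(), key=lambda kv: foil_gain(p0, n0, *kv[1]))
        if foil_gain(p0, n0, p1, n1) <= 0:
            return condition
        condition[best_test[0]] = best_test[1]

# Toy data for illustration only.
dataset = [
    ({"Blood Type": "warm", "Lays Eggs": "yes"}, "Birds"),
    ({"Blood Type": "warm", "Lays Eggs": "no"}, "Mammals"),
    ({"Blood Type": "cold", "Lays Eggs": "yes"}, "Fishes"),
    ({"Blood Type": "warm", "Lays Eggs": "yes"}, "Birds"),
]
bird_rule = grow_rule(dataset, "Birds")
print(bird_rule)
```

Each pass of the `while` loop corresponds to one counting job over the dataset. That pass is the expensive step the slides hand to Hadoop.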

Experiments
- Two datasets used
  - Randomly generated dataset: 100M records, 22 attributes, 2 classes
  - Sloan Digital Sky Survey (SDSS) dataset: 2.5M records, 6 attributes, 2 classes
- Cluster configuration
  - 4 nodes
  - Hadoop 1.0
  - Gigabit Ethernet
- Experiments run on both datasets
  - Vary number of nodes in cluster
  - Speedup almost linear with number of nodes
  - Algorithm scalable

Results

References
1. Bekkerman, Ron, et al. "On feature distributional clustering for text categorization." Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2001.
2. Bekkerman, Ron, and Martin Scholz. "Data weaving: Scaling up the state-of-the-art in data clustering." Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008.
3. Bekkerman, Ron, and Matan Gavish. "High-precision phrase-based document classification on a modern scale." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
4. Apache Hadoop. http://hadoop.apache.org/. Accessed 18/09/2014.
5. Cohen, William W. "Fast Effective Rule Induction." Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, California. 1995.
6. Sloan Digital Sky Survey DR 10. http://skyserver.sdss3.org/dr10/en/home.aspx. Accessed 18/09/2014.
