Big Data Architectures @ Facebook • QCon London 2012 • Ashish Thusoo • Thursday, March 8, 2012

Outline • Big Data @ Facebook - Scope & Scale • Evolution of Big Data Architectures @ FB • Past, Present and Future • Questions

Big Data @ FB: Scale • 25 PB of compressed data • equivalent to 300 years of HD-TV video

Big Data @ FB: Scale • 150 PB of uncompressed data • equivalent to 3 x the entire written works of mankind from the beginning of recorded history in all languages

Big Data @ FB: Scale • 400 TB/day (uncompressed) of new data • That is a lot of disks
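The three scale figures above imply a few ratios that are worth making explicit. The following is a back-of-the-envelope sketch only: it assumes decimal units (1 PB = 1000 TB) and, as a further assumption not stated on the slides, that newly arriving data compresses at the same overall ratio as the existing warehouse.

```python
# Figures quoted on the slides.
compressed_pb = 25        # warehouse size, compressed
uncompressed_pb = 150     # the same data, uncompressed
new_tb_per_day = 400      # daily intake, uncompressed

# Overall compression ratio implied by the two warehouse figures.
compression_ratio = uncompressed_pb / compressed_pb

# ASSUMPTION: daily intake compresses at the same ratio as the warehouse.
new_compressed_tb_per_day = new_tb_per_day / compression_ratio

# Days of intake, at the current rate, that equal the current warehouse
# (decimal units: 1 PB = 1000 TB).
days_to_match_warehouse = uncompressed_pb * 1000 / new_tb_per_day

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"new compressed data per day: {new_compressed_tb_per_day:.1f} TB")
print(f"days of intake equal to current warehouse: {days_to_match_warehouse:.0f}")
```

Under these assumptions, roughly a year of intake at the quoted rate would match the size of the whole existing warehouse.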

Big Data @ FB: Scope • Simple reporting • Model generation • Ad hoc analysis + data science • Index generation • Many, many others...

A/B Testing • Email #1

A/B Testing • Email #2

A/B Testing • Email #2 is 3x better

Friend Map • By Paul Butler - https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

Big Data @ FB: Scope • one new job every second • ~15% of the company uses the clusters

Evolution: 2007-2011 • Chart: DW size in TB by year • 2007: 15 • 2008: 250 • 2009: 800 • 2010: 8,000 • 2011: 25,000

2007: Traditional EDW • Diagram components: Web Clusters, MySQL Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, RDBMS Data Warehouse

2007: Pain Points • Diagram components: Web Clusters, MySQL Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, RDBMS Data Warehouse • Annotations: daily ETL > 24 hours • Lots of tuning/indexes etc. • Lots of hardware planning • compute close to storage (early map/reduce)

2007: Limitations • Most use cases were in business metrics - data science, model building etc. not possible • Only summary data was stored online - details archived away

2008: Move to Hadoop • Diagram components: Web Clusters, MySQL Clusters, Scribe Mid-Tier, NAS Filers, Batch copier/loaders, Hadoop/Hive Data Warehouse, RDBMS Data Mart

2008: Immediate Pros • Data science at scale became possible • For the first time all of the instrumented data could be held online • Use cases expanded

2009: Democratizing Data • Nectar: instrumentation & schema aware data collection • Databee & Chronos: Data Pipeline Framework • HiPal: Ad hoc Queries + Data Discovery • Scrapes: Configuration Driven • Hadoop/Hive Data Warehouse

2009: Democratizing Data (Nectar) • Typical Nectar Pipeline • Simple schema evolution built in • JSON encoded short term data • decomposing JSON for long term storage
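The idea of keeping short-term data as JSON and decomposing it for long-term storage can be sketched roughly as follows. This is an illustration only, not Nectar's actual implementation: the `decompose` function, the event fields, and the choice to fill missing fields with `None` are all assumptions made for the example.

```python
import json

def decompose(json_lines):
    """Turn JSON-encoded records into (columns, rows) for columnar storage.

    Columns are collected in first-seen order, so a field that first
    appears later in the stream simply becomes a new column (a simple
    form of schema evolution); records lacking a field get None there.
    """
    records = [json.loads(line) for line in json_lines]
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows

# Hypothetical events; the second one carries a newly added field.
events = [
    '{"user_id": 1, "action": "click"}',
    '{"user_id": 2, "action": "view", "locale": "en_GB"}',
]
columns, rows = decompose(events)
print(columns)
for row in rows:
    print(row)
```

The first row ends with `None` for `locale`, showing how older records remain readable after a field is added.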

2009: Democratizing Data (Tools) • HiPal - data discovery and query authoring • Charting and dashboard generation tools

2009: Democratizing Data (Tools) • Databee: Workflow language • Chronos: Scheduling tool

2009: Cons of Democratization • Isolation to protect against bad jobs • Fair sharing of the cluster - what is a high priority job and how to enforce it

2010: Controlling Chaos • Isolation • Reducing operational overhead • Better resource utilization • Measurement, ownership, accountability

2010: Isolation • Diagram components: Web Clusters, MySQL Clusters, Scribe Mid-Tier, NAS Filers, Platinum Warehouse, Hive Replication, Silver Warehouse • The single Hadoop/Hive Data Warehouse is replaced by the Platinum and Silver Warehouses

2010: Ops Efficiency • Diagram components: Web Clusters, MySQL Clusters, Scribe HDFS, ptail (parallel tail on HDFS), near real time data consumers, Platinum Warehouse, Hive Replication, Silver Warehouse • Scribe HDFS replaces the Scribe Mid-Tier and NAS Filers

2010: Resource Utilization (Disk) • HDFS-RAID: from 3 replicas to 2.2 replicas • RCFile: Row columnar format for compressing Hive tables
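One way to see how a replication factor of 2.2 can arise is XOR parity. The sketch below assumes, as an illustration not taken from the slides, a stripe of 10 data blocks protected by a single XOR parity block, with data and parity each stored twice. The helper names are hypothetical. It also shows that a lost block can be rebuilt from the parity and the surviving blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def effective_replication(stripe_len, data_copies, parity_copies):
    """Stored bytes per logical byte with one parity block per stripe."""
    return (stripe_len * data_copies + parity_copies) / stripe_len

# ASSUMED layout: 10 data blocks per stripe, data and parity each kept twice.
factor = effective_replication(stripe_len=10, data_copies=2, parity_copies=2)
savings = 1 - factor / 3
print(f"effective replication: {factor:.1f}")
print(f"disk saved versus 3 replicas: {savings:.1%}")

# Recovering a lost block: XOR the parity with the surviving blocks.
stripe = [bytes([i] * 4) for i in range(1, 11)]   # 10 toy 4-byte blocks
parity = xor_blocks(stripe)
lost = 3
survivors = stripe[:lost] + stripe[lost + 1:]
recovered = xor_blocks(survivors + [parity])
print(recovered == stripe[lost])
```

Under the assumed layout, going from 3 to 2.2 replicas saves a little over a quarter of the raw disk for the same logical data.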

2010: Resource Utilization (CPU) • Continuous copier/loaders • Incremental scrapes • Hive optimizations to save CPU

2010: Monitoring (SLAs) • Per job statistics rolled up to owner/group/team • Expected time of arrival vs actual time of arrival of data • Simple data quality metrics
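The roll-up of per-job statistics and the comparison of expected against actual arrival times could look roughly like this. It is a minimal sketch under stated assumptions: the job names, owners, timestamps, and the `minutes_late` and `rollup_by_owner` helpers are all hypothetical and do not describe Facebook's monitoring system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical job records: (job, owner, expected arrival, actual arrival).
jobs = [
    ("daily_active_users", "growth", "2012-03-08 06:00", "2012-03-08 05:50"),
    ("ad_revenue_rollup", "ads", "2012-03-08 07:00", "2012-03-08 08:30"),
    ("ad_click_summary", "ads", "2012-03-08 07:30", "2012-03-08 07:45"),
]

def minutes_late(expected, actual, fmt="%Y-%m-%d %H:%M"):
    """Minutes past the expected arrival time; 0 if on time or early."""
    delta = datetime.strptime(actual, fmt) - datetime.strptime(expected, fmt)
    return max(0, int(delta.total_seconds() // 60))

def rollup_by_owner(job_records):
    """Aggregate per-job lateness into per-owner totals."""
    summary = defaultdict(lambda: {"jobs": 0, "late_jobs": 0, "minutes_late": 0})
    for _job, owner, expected, actual in job_records:
        late = minutes_late(expected, actual)
        entry = summary[owner]
        entry["jobs"] += 1
        entry["late_jobs"] += 1 if late > 0 else 0
        entry["minutes_late"] += late
    return dict(summary)

for owner, stats in sorted(rollup_by_owner(jobs).items()):
    print(owner, stats)
```

Grouping by owner rather than by job is what ties a missed arrival time to a team that is accountable for it.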

2011: New Requirements • More real time requirements for aggregations • Optimizing resource utilization

2011: Beyond Hadoop • Puma for real time analytics • Peregrine for simple and fast queries

2010: Puma • Diagram components: Scribe HDFS, ptail (parallel tail on HDFS), Puma Clusters, HBase Cluster

Other Challenges Of HyperGrowth • Moving data centers • Moving sustainably fast

HyperGrowth - Moving Data Centers • Chart: DW size in TB by year • 2007: 15 • 2008: 250 • 2009: 800 • 2010: 8,000 • 2011: 25,000

HyperGrowth - Moving Data Centers • Moved 20 PB of data • Leverage replication with fast switch • 2-3 months to accomplish the entire move • Blog post on FB by Paul Yang: http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920

Questions • Contact information: ashish.thusoo@gmail.com • http://www.linkedin.com/pub/ashish-thusoo/0/5a8/50 • https://www.facebook.com/athusoo • https://twitter.com/ashishthusoo
