
Sunrise or Sunset: Exploring the Design Space of Big Data Software Stacks



  1. Sunrise or Sunset: Exploring the Design Space of Big Data Software Stacks
     HPBDC 2017: 3rd IEEE International Workshop on High-Performance Big Data Computing, May 29, 2017
     gcf@indiana.edu
     http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
     Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

  2. Panel Tasking
  • The panel will discuss the following three issues:
  • Are big data software stacks mature or not?
    – If yes, what is the new technology challenge?
    – If not, what are the main driving forces for new-generation big data software stacks?
    – What opportunities are there for the academic community in exploring the design space of big data software stacks?

  3. Are big data software stacks mature or not?
  • The solutions are numerous and powerful.
    – One can get good (and bad) performance in an understandable, reproducible fashion.
    – They surely need better documentation and packaging.
  • The problems (and users) are definitely not mature, i.e. not understood, with key issues not agreed upon.
    – Many academic fields are just starting to use big data, and some are still restricted to small data.
    – e.g. Deep learning is not understood in many cases outside the well-publicized commercial ones (voice, translation, images).
  • In many areas, applications are pleasingly parallel or involve MapReduce; the performance issues are different from HPC.
    – A common pattern is lots of independent Python or R jobs, as in the sketch below.
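The "lots of independent Python jobs" pattern can be illustrated with a minimal, hedged sketch: each task runs on its own with no inter-task communication, so a simple worker pool is enough (the analyze function and its synthetic workload are stand-ins, not anything from the talk).

```python
# Minimal sketch of the "lots of independent Python jobs" pattern: each task is
# analyzed independently, so no MPI-style communication is needed.
from multiprocessing import Pool

def analyze(seed):
    # Stand-in for a per-dataset analysis job (parsing, statistics, model fitting, ...).
    values = [(seed * k) % 97 for k in range(100_000)]
    return seed, sum(values) / len(values)

if __name__ == "__main__":
    with Pool() as pool:                       # one worker per core by default
        for seed, mean in pool.map(analyze, range(8)):
            print(f"job {seed}: mean = {mean:.2f}")
```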

  4. Why use Spark, Hadoop, or Flink rather than HPC?
  • Yes, if you value ease of programming over performance.
    – This could be the case for most companies, where they can find people who can program in Spark/Hadoop much more easily than people who can program in MPI.
    – Most of the complications, including data and communications, are abstracted away to hide the parallelism, so an average programmer can use Spark/Flink easily and doesn't need to manage state, deal with file systems, etc. (see the sketch below).
    – RDD (Resilient Distributed Dataset) support is very helpful.
  • Yes, for large data problems involving heterogeneous data sources, such as HDFS with unstructured data, databases such as HBase, etc.
  • Yes, if one needs fault tolerance for one's programs.
    – Our 13-node Moe "big data" (Hadoop Twitter analysis) cluster at IU faces such problems around once per month. One can always restart the job, but automatic fault tolerance is convenient.
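A hedged PySpark sketch of the abstraction this slide refers to: the programmer writes sequential-looking map/reduce calls over an RDD, and Spark handles partitioning, shuffling, scheduling, and recovery. The tiny in-memory input is an illustration only; in the heterogeneous-data case above it would come from HDFS instead.

```python
# Hedged sketch: word count with the RDD API.  The map/reduce calls look
# sequential; partitioning, shuffling, and fault recovery are handled by Spark.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")
# In practice the input would come from sc.textFile("hdfs:///path/to/data");
# a small in-memory sample is used here so the sketch runs anywhere.
lines = sc.parallelize(["big data software stacks", "big data and HPC", "sunrise or sunset"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
sc.stop()
```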

  5. Why use HPC and not Spark, Flink, Hadoop?
  • The performance of Spark, Flink, and Hadoop on classic parallel data analytics is poor/dreadful, whereas HPC (MPI) is good.
  • One way to understand this is to note that most Apache systems deliberately support a dataflow programming model.
    – e.g. for Reduce, the Apache systems will launch a bunch of tasks and eventually bring the results back, whereas MPI runs a clever AllReduce as an interleaved "in-place" tree (see the mpi4py sketch below).
  • Maybe one can preserve the Spark/Flink programming model but change the implementation "under the hood" where optimization is important.
  • Note that explicit dataflow is efficient and preferred at coarse grain, as used in workflow systems.
    – One needs to change implementations for different problems.
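The "in-place" collective contrasted with dataflow Reduce can be sketched with mpi4py as below (array size and values are illustrative): every rank contributes to and receives the reduced result directly, without routing partial results through a central driver.

```python
# Hedged sketch: an MPI Allreduce with mpi4py.  The reduction happens in place
# across ranks (tree/ring algorithms inside the MPI library) rather than
# gathering task outputs back to a driver as a dataflow Reduce does.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.full(1000, comm.Get_rank(), dtype=np.float64)   # illustrative local data
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)   # every rank ends up with the global sum
if comm.Get_rank() == 0:
    print(total[:5])
```

Run with, for example, `mpiexec -n 8 python allreduce_sketch.py`; all ranks finish holding the same reduced array.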

  6. HPC Runtime versus ABDS Distributed Computing Model on Data Analytics (benchmark figure): Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does an in-place combined reduce/broadcast and is fastest. The dataflow reduce-then-broadcast pattern is sketched below.
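A hedged sketch of the dataflow pattern the figure describes: partial results are reduced to the driver and then broadcast back to the executors for the next stage, in contrast to the single in-place Allreduce shown above. The variable names and the normalization step are illustrative only.

```python
# Hedged sketch: the dataflow equivalent of an allreduce in PySpark.
# Partial sums are reduced to the driver and the result is broadcast back to
# the executors for the next stage -- two round trips through the driver.
from pyspark import SparkContext

sc = SparkContext("local[*]", "reduce-broadcast-sketch")
partials = sc.parallelize(range(1, 1001), numSlices=8)
global_sum = partials.reduce(lambda a, b: a + b)     # gather to the driver
shared = sc.broadcast(global_sum)                    # push back to executors
normalized = partials.map(lambda x: x / shared.value).sum()
print(global_sum, normalized)
sc.stop()
```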

  7. Multidimensional Scaling (MDS) Results with Flink, Spark and MPI (figures): (left) MDS execution time on 16 nodes, with 20 processes per node, for a varying number of points; (right) MDS execution time with 32000 points on a varying number of nodes, where each node runs 20 parallel tasks. MDS performed poorly on Flink due to its lack of support for nested iterations. In Flink and Spark the algorithm does not scale with the number of nodes. A driver-side iteration sketch follows.
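The nested-iteration issue can be seen in how an iterative algorithm is typically driven from Spark (or Flink): the outer loop lives in the driver and each pass launches a new distributed job, paying scheduling and serialization costs per iteration. The hedged sketch below uses a stand-in update rule, not the actual MDS/SMACOF kernel.

```python
# Hedged sketch of driver-side iteration in PySpark.  Each pass over the cached
# data is a separate job scheduled by the driver; deeply nested iterations
# therefore pay per-pass overheads, whereas an MPI code keeps state in place
# between iterations.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iteration-sketch")
points = sc.parallelize([float(i) for i in range(10_000)]).cache()

estimate = 0.0
for _ in range(20):                                   # outer loop in the driver
    stat = points.map(lambda x, e=estimate: (x - e) ** 2).mean()   # one distributed job
    estimate += 0.01 * stat                           # stand-in for the real update rule
print(estimate)
sc.stop()
```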

  8. Twitter Heron Streaming Software using HPC Hardware (figures, large- and small-message results): (left) Intel Haswell cluster with 2.4GHz processors, 56Gbps Infiniband and 1Gbps Ethernet, parallelism of 2, using 8 nodes; (right) Intel KNL cluster with 1.4GHz processors, 100Gbps Omni-Path and 1Gbps Ethernet, parallelism of 2, using 4 nodes.

  9. Knights Landing (KNL) Data Analytics: Harp, Spark, NOMAD. Single-node and cluster performance on 1.4GHz 68-core KNL nodes (figures). The benchmarks compare Harp-DAAL-Kmeans vs Spark-Kmeans, Harp-DAAL-SGD vs NOMAD-SGD, and Harp-DAAL-ALS vs Spark-ALS, plotting time per iteration and speedup for (top) strong-scaling multi-node parallelism over the Omni-Path interconnect, varying the number of nodes, and (bottom) strong-scaling single-node core parallelism, varying the number of threads. A Spark-Kmeans usage sketch follows.
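For orientation, the Spark-Kmeans side of such comparisons is typically driven through the MLlib clustering API; a hedged usage sketch with synthetic data is given below (the dataset, cluster count, and iteration count are illustrative, not the benchmark configuration from the slide).

```python
# Hedged sketch: K-means via Spark MLlib (RDD API), the kind of Spark-Kmeans
# code such benchmarks compare against Harp-DAAL.  The data here is synthetic.
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "kmeans-sketch")
rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in 3-D.
data = sc.parallelize([rng.normal(loc=c, size=3) for c in (0.0, 5.0) for _ in range(500)])
model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
sc.stop()
```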

  10. Components of the Big Data Stack
  • Google likes to show a timeline:
  • 2002 Google File System (GFS) ~ HDFS
  • 2004 MapReduce ~ Apache Hadoop
  • 2006 BigTable ~ Apache HBase
  • 2008 Dremel ~ Apache Drill
  • 2009 Pregel ~ Apache Giraph
  • 2010 FlumeJava ~ Apache Crunch
  • 2010 Colossus: a better GFS
  • 2012 Spanner: horizontally scalable NewSQL database ~ CockroachDB
  • 2013 F1: horizontally scalable SQL database
  • 2013 MillWheel ~ Apache Storm, Twitter Heron
  • 2015 Cloud Dataflow ~ Apache Beam with a Spark or Flink (dataflow) engine
  • Functionalities not identified: security, data transfer, scheduling, serverless computing

  11. What is the new technology challenge?
  • Integrate systems that offer full capabilities:
    – Scheduling
    – Storage
    – "Database"
    – Programming model (dataflow and/or "in-place" control flow) and the corresponding runtime
    – Analytics
    – Workflow
    – Function as a Service and event-based programming
  • For both batch and streaming
  • Distributed and centralized (grid versus cluster)
  • Pleasingly parallel (local machine learning) and global machine learning (large-scale parallel codes)

  12. What are the main driving forces for new-generation big data software stacks?
  • Applications ought to drive new-generation big data software stacks, but (at many universities) academic applications lag commercial use of big data, and needs are quite modest.
    – This will change, and we can expect big data software stacks to become more important.
  • Note that university compute systems have historically offered HPC expertise, not big data expertise.
    – We could anticipate users moving to public clouds (away from university systems), but
    – users will still want support.
  • We need a requirements analysis that builds in the application changes that might occur as users get more sophisticated.
  • We need to help (train) users to explore big data opportunities.

  13. What opportunities are there for the academic community in exploring the design space of big data software stacks?
  • We need more sophisticated applications to probe some of the most interesting areas.
  • But most near-term opportunities are in pleasingly parallel (often streaming data) areas.
    – Popular technology like DASK (http://dask.pydata.org) for parallel NumPy is pretty inefficient (see the sketch below).
  • Clarify when to use dataflow (and grid technology) and when to use HPC parallel computing (MPI).
    – This is likely to require careful consideration of grain size and data distribution.
    – It is certain to involve multiple mechanisms (hidden from the user) if we want the highest performance, combining HPC and the Apache Big Data Stack.
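The DASK-for-parallel-NumPy usage the slide refers to looks roughly like the hedged sketch below (array and chunk sizes are illustrative): operations over chunked arrays build a task graph that is executed lazily. The convenience comes from that chunked task-graph model; the slide's point is that its scheduling overhead can be significant compared with tightly coupled MPI code.

```python
# Hedged sketch: "parallel NumPy" with dask.array.  The array is split into
# chunks, each operation extends a task graph, and compute() executes the graph
# on a local thread pool (or a distributed scheduler).
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))   # illustrative size
column_means = (x * 2 - 1).mean(axis=0)    # lazy: builds a task graph over chunks
print(column_means[:5].compute())          # triggers parallel execution
```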
