Design Patterns Leveraging Spark in PDI
Chris Skirde, Pentaho Director of Sales Engineering, Hitachi Vantara
Rakesh Saha, Pentaho Senior Product Manager, Hitachi Vantara
Quiz Time! • What is Spark? A. A good way to start a fire. B. Necessary for a well-running internal combustion engine. C. A fast, general-purpose engine for large-scale data processing. D. All of the above. • True or False: Pentaho supports Spark. • Who is using Spark today (with or without Pentaho)?
Agenda • Introduction to Spark • Common design patterns • How to leverage Spark with Pentaho
Introduction to Spark • Why are we interested? • What is it really? • What’s been done?
Spark Application Architecture [architecture diagram; labels: PDI/Server, Daemon]
What Do Those Applications Have in Common?
Common Design Patterns • Filter/Organize • Join • Sum • Transform/Enrich • Query • Machine Learning/Data Science
Filter/Organize
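The slide presumably demonstrated this pattern with PDI steps (e.g. Filter rows, Select values, Sort rows) executing on Spark; purely as a hedged illustration of the same pattern in Spark's native Scala API, here is a minimal spark-shell sketch. The file path, column names, and status value are hypothetical.

    // Runs in spark-shell, where `spark: SparkSession` is predefined.
    import org.apache.spark.sql.functions._

    // Hypothetical sales extract: keep completed orders, trim columns, sort.
    val sales = spark.read.option("header", "true").csv("hdfs:///data/sales.csv")

    val organized = sales
      .filter(col("status") === "COMPLETED")        // Filter: keep matching rows
      .select("order_id", "customer_id", "amount")  // Organize: only the fields we need
      .orderBy("order_id")                          // Organize: deterministic ordering

    organized.write.mode("overwrite").parquet("hdfs:///data/sales_clean")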
Join
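Again a hedged sketch in Spark's Scala API rather than the PDI transformation the slide likely showed; the input paths and the join key are hypothetical.

    // Runs in spark-shell, where `spark: SparkSession` is predefined.
    // Hypothetical inputs: cleaned sales facts plus a customer dimension table.
    val sales     = spark.read.parquet("hdfs:///data/sales_clean")
    val customers = spark.read.parquet("hdfs:///data/customers")

    // A left join keeps every sale even when the customer lookup misses,
    // much like a lookup/merge-join step in a PDI transformation.
    val joined = sales.join(customers, Seq("customer_id"), "left")

    joined.write.mode("overwrite").parquet("hdfs:///data/sales_with_customer")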
Sum (and Other Aggregations)
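As a rough illustration only, this is what a grouped aggregation (the Spark-side counterpart of a Group by style step) looks like in Spark's Scala API; the grouping column, measure, and path are hypothetical.

    // Runs in spark-shell, where `spark: SparkSession` is predefined.
    import org.apache.spark.sql.functions._

    val sales = spark.read.parquet("hdfs:///data/sales_with_customer")

    // Group and aggregate: sum, average, and row count per region.
    val summary = sales
      .groupBy("region")
      .agg(
        sum("amount").as("total_amount"),
        avg("amount").as("avg_amount"),
        count(lit(1)).as("order_count"))

    summary.show()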
Transform/Enrich • Any step you like!
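"Any step you like" covers a lot of ground; as one hedged example of row-level transformation and enrichment expressed in Spark's Scala API (not necessarily how the slide showed it), with hypothetical column names and formats:

    // Runs in spark-shell, where `spark: SparkSession` is predefined.
    import org.apache.spark.sql.functions._

    val sales = spark.read.parquet("hdfs:///data/sales_with_customer")

    // Derive new fields from existing ones, the sort of per-row work a
    // calculator- or expression-style PDI step performs.
    val enriched = sales
      .withColumn("amount_usd", col("amount") * col("exchange_rate"))
      .withColumn("order_year", year(to_date(col("order_date"), "yyyy-MM-dd")))
      .withColumn("high_value", col("amount_usd") > lit(10000))

    enriched.write.mode("overwrite").parquet("hdfs:///data/sales_enriched")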
Query – Easy! • Cloudera uses Hive-on-Spark with Hive2 • Hortonworks uses Spark SQL via Simba
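The slide's point is that querying goes through the distributions' SQL engines (Hive-on-Spark, Spark SQL via the Simba driver). Purely as a hedged illustration of what such a query looks like when issued through Spark SQL directly, here is a minimal spark-shell sketch; the table name sales_enriched is a hypothetical table assumed to be registered in the metastore.

    // Runs in spark-shell; with Hive support enabled, spark.sql can query
    // tables registered in the Hive metastore.
    val topRegions = spark.sql(
      """SELECT region, SUM(amount) AS total_amount
        |FROM sales_enriched
        |GROUP BY region
        |ORDER BY total_amount DESC
        |LIMIT 10""".stripMargin)

    topRegions.show()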
Machine Learning/Data Science
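The original slide presumably covered Pentaho's data science tooling; as a rough, non-authoritative sketch of the pattern in Spark's own MLlib API, assuming a hypothetical churn training set and column names:

    // Runs in spark-shell, where `spark: SparkSession` is predefined.
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Hypothetical training set: numeric features plus a 0/1 label column.
    val training = spark.read.parquet("hdfs:///data/churn_training")

    val assembler = new VectorAssembler()
      .setInputCols(Array("amount_usd", "order_count", "days_since_last_order"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("churned")
      .setFeaturesCol("features")

    // Assemble the feature vector and fit the model as a single pipeline.
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    model.transform(training).select("churned", "prediction").show(5)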
Recap What we covered today: • Reviewed what Spark is and why organizations are adopting it • Discussed several common data integration design patterns • Linked those design patterns to Pentaho features for you to try
Questions?
Next Steps Want to learn more? • “Meet the Experts”: Matt Casters and Mark Hall! • Adaptive Execution Layer: http://www.pentaho.com/blog/introducing-adaptive-execution-layer-spark-architecture • SQL on Spark: http://www.pentaho.com/blog/operationalize-spark-big-data-newest-enhancements