
Big Data: Where It Started, Where It Is, Where We Are Going


  1. Big Data: Where It Started, Where It Is, Where We Are Going Oliver Nielsen Pentaho Director – Services Solutions, Hitachi Vantara

  2. Agenda Big Data / Hadoop History • Throw out all pre-conceived ideas and concepts • The Dawn of Big Data • The Pivot to think bigger, broader, and deliver outcomes! • The landscape of tomorrow

  3. The Space Shuttle and the Chariot • Arguably the world's most advanced transportation system • The solid rocket boosters were built in Utah by Thiokol. The engineers wanted them to be bigger! But… • They had to travel by train to the testing facility, on tracks 4 feet 8.5 inches wide • The tracks were built by engineers from England who had built tramways, so they used the same gauge. That gauge came from the jigs that were used to build wagons.

  4. The Space Shuttle and the Chariot • That gauge came from the jigs that were used to build wagons. • The wagon wheels were made to a standard size so that on long-distance trips they could use the same ruts in the roads. • Those wagons were based on the standard axle sizes of Roman chariots • Roman chariots were built to accommodate the 2 horses that pulled them! • So, the space shuttle rocket boosters were not built to the engineers' specifications because of railroad tracks whose gauge is based on the width of two horses' behinds! • This story is actually UNTRUE. But… the moral of the story is still the same.

  5. The Space Shuttle and the Chariot Sometimes you must throw out everything you know and start with a blank canvas

  6. The Dawn of Hadoop • 2003 – Google File System – Paper published by Google (Ghemawat, Gobioff, and Leung) – Write-once file system – Break everything into chunks (64 MB at the time) – Spread chunks across different servers (data nodes) – Designed to benefit large files only! • 2004 – MapReduce – Simplified data processing across a cluster of servers – Parallel, distributed algorithms on data – Map – Filtering, sorting, and business rules, emitting key-value pairs – Reduce – Aggregating data by key – Uses the Split – Apply – Combine strategy for analysis of data • 2005 – Doug Cutting and Mike Cafarella created the first package and named it Hadoop
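
The chunking and Map/Reduce ideas on this slide can be illustrated with a minimal sketch in plain Python. This is not the Hadoop or Google API: every function name below is hypothetical, the "chunks" are small lists of lines standing in for 64 MB blocks, and everything runs in one process rather than across data nodes.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def split_into_chunks(records: list[str], chunk_size: int) -> list[list[str]]:
    """Break the input into fixed-size chunks (a stand-in for 64 MB blocks)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_phase(chunk: list[str]) -> Iterator[tuple[str, int]]:
    """Map: turn each record into (key, value) pairs; here, (word, 1)."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values that share a key, as happens between Map and Reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: aggregate the values for each key; here, a sum."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records: list[str], chunk_size: int = 2) -> dict[str, int]:
    chunks = split_into_chunks(records, chunk_size)
    # On a real cluster each chunk would be mapped on a different server.
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    return reduce_phase(shuffle(pairs))


lines = ["big data is hard", "big data is still hard", "focus on outcomes"]
counts = word_count(lines)
```

With the three sample lines, `counts["big"]` is 2 and `counts["still"]` is 1. Because each chunk is mapped independently, the map work can be spread across machines; only the grouping by key needs data to move between them.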

  7. The Dawn of Hadoop • 2006 – Hadoop 0.1.0 released – Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours – Yahoo deploys a 300-server Hadoop cluster in May – Yahoo deploys a 600-server Hadoop cluster in October • 2007 – First Adoptions – Yahoo has two 1,000-node clusters by April – By June, 3 companies are “Powered by Hadoop” – HBase introduced in June – Pig introduced in October, built by Yahoo • 2008 – Growth – 20 companies now “Powered by Hadoop” – Limited to MapReduce in Java and Pig scripting (beta). Java developers cheer!

  8. The Middle Ages of Hadoop and Pentaho

  9. Light Bulb Moments, Science Projects, Frustrations • 2012 – Big Data Is Hard! – YARN takes over cluster resource management from MapReduce – Vendors like Pentaho find traction in removing pain points! – SQL on Hadoop (Hive, Impala, others) – It's all SLOW! – Hadoop Summits gain popularity – 500 people attend – 8 different file systems now! – HDFS, GlusterFS, Quantcast, Ceph, etc. • 2014 – Focus on SQL performance – Storm, Spark, Tez – Hadoop Summit – 3,200 attendees in San Jose – Hadoop in the Cloud – AWS, Azure, Google Cloud / BigQuery – Continued “all in with Big Data” outlook by Pentaho! – Data Science – R, Python, Scala

  10. So Many Choices, So Little Time • Big Data Is Still Hard! – File formats, compression algorithms, data ingest, data output for analytics! – All these things have to be considered! • FOCUS ON OUTCOMES!!! – Do not waste time on science projects – Find something that meets the 3 Vs • Volume • Velocity • Variety • The 4th “V” – Vision – You must have a forward-looking vision and an outcome you want to achieve! Without that, you have no business working with Big Data solutions right now.

  11. Hadoop and Hitachi Vantara – What’s Next?

  12. The Future Is Bright • Pentaho Adaptive Execution Layer (AEL) – Separate the logic from the execution engine – Start with Spark! No Scala code, no Python code. • Future-proof your investment with AEL – What’s coming next? – Flink? • Formerly Stratosphere • Processing framework for distributed, high-performing, always-available, and accurate data streaming applications – Apex? • Apex is a Hadoop YARN native platform that unifies stream and batch processing. It processes big data in motion in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable. • Has a high-level API that Pentaho/PDI may be able to leverage
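
The idea of separating logic from the execution engine can be sketched as follows. This is an illustration of the general pattern only, not Pentaho's AEL API: `Transformation`, `LocalEngine`, and `ThreadedEngine` are hypothetical names, and a thread pool stands in for a distributed engine such as Spark.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Protocol

Row = dict
Step = Callable[[Row], Row]


@dataclass(frozen=True)
class Transformation:
    """The logic: an ordered list of row-level steps, with no engine details."""
    steps: tuple[Step, ...]

    def apply(self, row: Row) -> Row:
        for step in self.steps:
            row = step(row)
        return row


class Engine(Protocol):
    """Anything that can run a Transformation over a list of rows."""
    def run(self, transformation: Transformation, rows: list[Row]) -> list[Row]: ...


class LocalEngine:
    """Hypothetical engine: runs everything sequentially in this process."""
    def run(self, transformation: Transformation, rows: list[Row]) -> list[Row]:
        return [transformation.apply(row) for row in rows]


class ThreadedEngine:
    """Hypothetical engine: a thread pool standing in for a cluster engine."""
    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def run(self, transformation: Transformation, rows: list[Row]) -> list[Row]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map preserves input order in its results.
            return list(pool.map(transformation.apply, rows))


def add_total(row: Row) -> Row:
    return {**row, "total": row["qty"] * row["price"]}


def tag_large(row: Row) -> Row:
    return {**row, "large": row["total"] >= 100}


pipeline = Transformation(steps=(add_total, tag_large))
orders = [{"qty": 2, "price": 10}, {"qty": 5, "price": 30}]

local_result = LocalEngine().run(pipeline, orders)
threaded_result = ThreadedEngine().run(pipeline, orders)
```

Both engines return the same rows for the same `pipeline`, which is the point of the pattern: swapping the engine requires no change to the transformation logic.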

  13. What’s Next? • Calcite – Calcite is a framework for writing data management systems. It converts queries, represented in relational algebra, into an efficient executable form using pluggable query transformation rules. It includes a SQL parser and a JDBC driver. Calcite does not store data or have a preferred execution engine. Data formats, execution algorithms, planning rules, operator types, metadata, and cost model are added at runtime as plugins. • Beam – A simple, flexible, and powerful system for distributed data processing at any scale. Beam provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Beam pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Many of the proposals from Beam have been integrated into Spark 2.0
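
To make "relational algebra plus pluggable transformation rules" concrete, here is a toy sketch in plain Python. It does not use Calcite's API: the node classes, the two rules, and the `rewrite` function are all hypothetical, predicates are kept as opaque strings, and the toy `Project` only selects columns (no renames or expressions), which is what makes the pushdown rule safe here.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    predicate: str  # opaque text in this toy; a real planner uses expressions
    input: Node


@dataclass(frozen=True)
class Project:
    columns: tuple[str, ...]
    input: Node


Node = Union[Scan, Filter, Project]
Rule = Callable[[Node], Optional[Node]]  # returns a rewritten node, or None


def merge_filters(node: Node) -> Optional[Node]:
    """Filter(p1, Filter(p2, x)) becomes Filter(p2 AND p1, x)."""
    if isinstance(node, Filter) and isinstance(node.input, Filter):
        inner = node.input
        return Filter(f"({inner.predicate}) AND ({node.predicate})", inner.input)
    return None


def push_filter_below_project(node: Node) -> Optional[Node]:
    """Filter(p, Project(cols, x)) becomes Project(cols, Filter(p, x))."""
    if isinstance(node, Filter) and isinstance(node.input, Project):
        proj = node.input
        return Project(proj.columns, Filter(node.predicate, proj.input))
    return None


def rewrite(node: Node, rules: list[Rule]) -> Node:
    """Rewrite children first, then apply rules here until none fires."""
    if isinstance(node, (Filter, Project)):
        node = replace(node, input=rewrite(node.input, rules))
    for rule in rules:
        result = rule(node)
        if result is not None:
            return rewrite(result, rules)
    return node


# Roughly: SELECT name, age FROM people WHERE country = 'DK', then age > 30.
query = Filter(
    "age > 30",
    Project(("name", "age"), Filter("country = 'DK'", Scan("people"))),
)
plan = rewrite(query, [merge_filters, push_filter_below_project])
```

After rewriting, `plan` is a single `Project` over a single `Filter` over the `Scan`, so both predicates are evaluated in one pass next to the data. The rules are passed in as a list, so new rules can be added without touching `rewrite`, which is the "pluggable" part in miniature.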

  14. New Platforms • Current trends are leaning towards cloud-based Hadoop deployments – Easier to scale – Easier to manage – Easier to tune – Specialized distributions for different workloads (analytic queries, streaming, IoT) • Who Do We Work With Already? – Google Cloud Platform – Azure – AWS • Under consideration by Hitachi Vantara – Cloudera Altus – Snowflake

  15. Hitachi Vantara Will Lead the Way! • As new technologies and Apache projects come through the ecosystem, Hitachi Vantara will evaluate which of them make sense as a new Adaptive Execution Engine, as a plug-in, or as an API integration. • 2061: Io/Europa/Ganymede
