Apache Hadoop Framework: The Nexus of Open Source Innovation

Eric Baldeschwieler, CTO, Hortonworks
Avik Dey, Director, Hadoop Services, Intel
Moderator: Todd Cramer, Director, Product Marketing, Intel
Evolution to Open Source Data Management with Scale-out Storage & Processing

Scale-up RDBMS (1990s)
• Reporting / data mining
• Batch – e.g., sales reports
• Sequential SQL queries
• High cost / isolated use

Scale-out multi-core, NoSQL and RDBMS, proprietary MPP / DW appliances (2000s)
• Model-based discovery
• Batch – e.g., correlated buying patterns
• NoSQL, parallel analysis
• Shared disk/memory across nodes
• High cost / departmental use

Open source software coupled to commodity hardware (Today)
• Real-time – e.g., recommendation engines
• Unbounded MapReduce; processing at the storage node
• Built-in data replication and reliability
• Shared-nothing, in-memory, distributed architecture
• Unlimited linear scale through node addition
• Handles the arrival of vast amounts of unstructured data
• Low cost / enterprise use
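The MapReduce paradigm named above splits work into map tasks that run where the data blocks are stored and reduce tasks that aggregate their output. Below is a minimal word-count sketch against the standard Hadoop Java API; the class name and input/output paths are illustrative assumptions, not taken from the presentation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: mappers run at the storage nodes, reducers aggregate.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input split stored on this node.
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts shuffled in from all mappers for this word.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/in
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the replicated input blocks already live on the worker nodes, the scheduler can move this computation to the data instead of moving the data to the computation.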
Apache Hadoop Evolution (source: Steven Nimmons, 2/24/12)

• 2006: HDFS, MapReduce
• 2008: HBase, ZooKeeper, Pig, Hive, Mahout
• 2009-10: Flume, Avro, Whirr, Sqoop, Oozie
• 2011-12: HCatalog, Bigtop, Ambari, YARN
Hadoop: What will it take to cross the chasm?

• Organizations are looking for use cases and reference architectures
• The ecosystem is evolving to create a pull market
• Enterprises endure a 1-3 year adoption cycle

[Technology adoption curve, relative % of customers over time: innovators and early adopters (technology enthusiasts, visionaries) sit before the chasm and want technology and performance; the early majority, late majority, and laggards (pragmatists, conservatives, skeptics) sit after it and want solutions, convenience, and reliability. Source: Geoffrey Moore, Crossing the Chasm]
Enterprise Big Data Flows

Business transactions and interactions (CRM, ERP, web, mobile, point of sale, social media) and unstructured / exhaust data (log files, sensors and devices, DB data) feed the Big Data Platform, which connects through classic data integration & ETL to business intelligence & analytics (dashboards, reports, visualization).

1. Capture big data – collect data from all sources, structured and unstructured
2. Process – transform, refine, aggregate, analyze, report
3. Exchange – interoperate and share data with applications and analytics
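As one concrete illustration of the capture step, raw files can be landed in HDFS through the standard Hadoop FileSystem API before downstream processing; the paths below are placeholders, and in practice tools such as Flume or Sqoop (from the evolution slide) would handle continuous or relational ingest.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the "capture" step: copy a local log file into HDFS so that
// MapReduce / Hive / Pig jobs can process it at the storage nodes.
public class IngestToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);                  // handle to the cluster filesystem

    Path local = new Path("/var/log/app/events.log");      // illustrative local source
    Path hdfs  = new Path("/data/raw/events/events.log");  // illustrative HDFS target

    fs.mkdirs(hdfs.getParent());                           // ensure the landing directory exists
    fs.copyFromLocalFile(false /* keep source */, true /* overwrite */, local, hdfs);

    System.out.println("Ingested " + local + " to " + fs.getUri() + hdfs);
  }
}

Once the data is in HDFS it is available to the process and exchange steps without further movement.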
What changes from POC to large clusters?

"Small cluster" (5-100 nodes)
• Staff and consultants are the dominant costs
• Redundant networks and hardware reliability features save human capital and support
• Need to focus on simplicity

"Hadoop at Scale" (4,000 nodes)
• Hardware, power, and hosting are the dominant costs
• Hardware optimization
• Failures are inevitable; the Hadoop software handles this
• Hadoop operations expertise
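One reason failures can be tolerated at scale is HDFS block replication. A minimal sketch of inspecting and adjusting the replication factor through the Java client API follows; the path and the factor of 3 are illustrative assumptions, and production clusters normally set dfs.replication cluster-wide in hdfs-site.xml rather than per file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: HDFS keeps multiple copies of every block, so losing a disk or a
// node does not lose data. Replication can be checked or changed per file.
public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/raw/events/events.log");    // illustrative path

    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication factor: " + current);

    // Ask the NameNode to keep 3 copies of each block of this file.
    fs.setReplication(file, (short) 3);
  }
}

When a DataNode fails, the NameNode notices the missing block reports and schedules re-replication from the surviving copies, which is what lets a 4,000-node cluster treat node loss as routine.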
Optimizing Hadoop Deployments: Address Potential Deployment Bottlenecks

NETWORK
• Fast fabric
• 10GbE

STORAGE
• Disk write / memory
• SSDs
• Non-volatile memory

COMPUTE
• Benchmarking and tuning (HiTune, HiBench)
• Security APIs
• Encryption instruction sets
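HiBench and HiTune are the full benchmarking and tuning suites referenced above; as a much cruder illustration of the idea, the hypothetical probe below times a sequential write into HDFS to give a first rough signal of whether disk or network throughput is the bottleneck on a given cluster. The path and payload size are arbitrary assumptions, and a real assessment would use HiBench or the bundled TestDFSIO job instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical micro-benchmark: write 1 GB sequentially to HDFS and report MB/s.
public class WriteThroughputProbe {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/benchmarks/probe.dat");       // illustrative path

    byte[] chunk = new byte[1 << 20];                       // 1 MB buffer of zeros
    long totalMb = 1024;                                    // write 1 GB in total

    long start = System.nanoTime();
    try (FSDataOutputStream out = fs.create(target, true /* overwrite */)) {
      for (long i = 0; i < totalMb; i++) {
        out.write(chunk);
      }
      out.hsync();                                          // flush through the DataNode pipeline
    }
    double seconds = (System.nanoTime() - start) / 1e9;
    System.out.printf("Wrote %d MB in %.1f s (%.1f MB/s)%n",
        totalMb, seconds, totalMb / seconds);

    fs.delete(target, false);                               // clean up the probe file
  }
}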
Talk to an Expert: Question & Answer

Today's experts:
• Eric Baldeschwieler, CTO, Hortonworks - @JERIC14
• Avik Dey, Director, Hadoop Services, Intel - @AvikonHadoop

Submit your questions:
• Ask questions at any time by pressing the Question tab at the top of the player.

Download today's content:
• Located under the Attachment tab at the top of the player.

More information:
• www.intel.com/bigdata