The Apache Hadoop Ecosystem
Doug Cutting, Cloudera & Apache
Context: exponential for decades!
● abundance of
○ computing & storage
○ generated data (8 ZB in '15)
● peta-scale is now affordable (kMGTPEZY)
○ petabytes
○ petahertz
● traditional data tech doesn't scale well
● more data provides greater value
● time for a new approach
New Hardware Approach
Traditional:
● exotic hardware
○ big central servers
○ SAN
○ RAID
● hardware reliability
● expensive
● limited scalability
Big Data:
● commodity HW
○ racks of pizza boxes
○ Ethernet
○ JBOD
● unreliable HW
● cost effective
● scales further
New Software Approach
Traditional:
● monolithic
○ centralized storage
○ RDBMS
● schema first
● proprietary
Big Data:
● distributed
○ storage & compute nodes
● raw data
● open source
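The "schema first" vs "raw data" contrast above is often described as schema-on-write vs schema-on-read: instead of forcing data into a fixed table up front, Hadoop-style systems keep raw records and apply structure at query time. A minimal illustrative sketch (not from the talk; the log format and field names are hypothetical):

```python
# Schema-on-read sketch: store raw lines, apply a schema only when querying.
# The log format and field names below are invented for illustration.

RAW_LOGS = [
    "2015-03-01 GET /index.html 200",
    "2015-03-01 GET /missing 404",
    "2015-03-02 POST /login 200",
]

def read_with_schema(line):
    """Apply a schema at read time, not at load time."""
    date, method, path, status = line.split()
    return {"date": date, "method": method, "path": path, "status": int(status)}

# Query: count requests per status code, parsing raw lines on the fly.
counts = {}
for record in (read_with_schema(l) for l in RAW_LOGS):
    counts[record["status"]] = counts.get(record["status"], 0) + 1

print(counts)  # {200: 2, 404: 1}
```

The trade-off: loading is cheap and nothing is discarded, but every query pays the parsing cost, and a later schema change only requires a new `read_with_schema`.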
The Ecosystem is the System
● Hadoop has become the kernel
○ of the distributed operating system for Big Data
○ a de facto industry standard
● No one uses the kernel alone
● A collection of projects at Apache
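The "kernel" referred to above is Hadoop's HDFS-plus-MapReduce core. As a rough sketch of the programming model it provides, here is a word count in the style of Hadoop Streaming, where mapper and reducer are plain functions exchanging key/value pairs; the shuffle/sort step that Hadoop performs between them is simulated in-process for illustration:

```python
# Word-count sketch of the MapReduce model (not a real Hadoop job).
import itertools

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: Hadoop delivers pairs grouped (sorted) by key."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(v for _, v in group))

data = ["big data big ideas", "data at scale"]
shuffled = sorted(mapper(data))   # stand-in for Hadoop's shuffle/sort
result = dict(reducer(shuffled))
print(result)  # {'at': 1, 'big': 2, 'data': 2, 'ideas': 1, 'scale': 1}
```

In a real cluster the mapper and reducer run on many nodes, with HDFS holding the input and output and the framework handling shuffle, retries, and data locality, which is exactly why the "kernel" is rarely used alone.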
Open Source at Apache
● no strategic agenda
○ quality is emergent
● community based
○ diverse organizations collaborating voluntarily
○ decisions by consensus
○ transparent
● allows competing projects
○ survival of the fittest
● a loose federation of projects
○ permits evolution
● insures against vendor lock-in
○ can't buy Apache
Typical adoption pattern
● Idea that's impractical without Hadoop.
● Build Hadoop-based proof of concept.
● Move initial application to production.
● Add more datasets and users.
○ removing silos in organizations
○ permitting easy experiments on real data
Snowballs into the institution's central repository for
● analysis
● data processing
How can you use Hadoop?
● What data are you ignoring?
○ How can you use it?
● How can you combine your data with others?
Thanks! Questions? Visit Cloudera at booth 700.