
Big Data Compete by asking bigger questions $$$... $ ??? SLA - PowerPoint PPT Presentation



  1. A World of Data • “Gizillions” of mobile “Thingsternet” transactions • Living online • Big Data • Compete by asking bigger questions

  2. $$$... $

  3. ???

  4. SLA

  5. Yaaaay – Hadoop to Save the Daaaay!! • But it’s not always easy to tame an elephant…

  6. Introducing “DataCo”
     • Diagram: Customers → Web Shop Web Client → Web Shop Backend → Database
     • Database: ~100GB of Product and Customer Transaction Data
     • “We don’t really have a big data problem…”

  7. Introducing “DataCo”
     • Diagram: Customers → Web Shop Web Client → Web Shop Backend → Database
     • > 6 months?
     • Data sources: Web App Click Stream Data, Product and Customer Transaction Data, Mobile App Data, IT/Ops and InfoSec Data

  8. Active Archive / Self Serve Ad-hoc BI • Top sold products last 6, 12, and 18 months? SQL Hive Impala HDFS

  9. Using Sqoop to Ingest Data from MySQL
     • Sqoop is a bi-directional structured data ingest tool
     • Simple UI in Hue, more commonly used from the shell

     $ sqoop import -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
         --username=dataco_dba --password=yow!2014 \
         --table my_cool_table --hive-import --as-parquetfile

     $ sqoop import-all-tables -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
         --username=dataco_dba --password=yow!2014 \
         --compression-codec=snappy --as-avrodatafile \
         --warehouse-dir=/user/hive/warehouse

  10. Create Tables in Hive
      • Hive is a batch query tool, but also the keeper of table structures
      • Remember: structure is stored _separate_ from data

      hive> CREATE EXTERNAL TABLE products
          > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
          > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
          > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
          > LOCATION 'hdfs:///user/hive/warehouse/products'
          > TBLPROPERTIES ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');

  11. Use Impala via Hue to Query
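The “top sold products” question from slide 8 boils down to a join, a date filter, and a top-N aggregate. The sketch below shows that query shape only. It is not the query from the talk: the `order_items` table, every column name, and the sample rows are hypothetical, and an in-memory SQLite database stands in for Impala so the example runs anywhere.

```python
import sqlite3

# Hypothetical schema: only the table name `products` appears in the slides.
# `order_items` and all column names are assumptions made for illustration.
TOP_PRODUCTS_SQL = """
SELECT p.product_name, SUM(oi.quantity) AS units_sold
FROM order_items oi
JOIN products p ON p.product_id = oi.product_id
WHERE oi.order_date >= ?
GROUP BY p.product_id, p.product_name
ORDER BY units_sold DESC, p.product_name
LIMIT ?
"""


def top_sold_products(conn, since_date, limit=10):
    """Return (product_name, units_sold) rows for orders on or after since_date."""
    return conn.execute(TOP_PRODUCTS_SQL, (since_date, limit)).fetchall()


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE order_items (product_id INTEGER, quantity INTEGER, order_date TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [(1, "Tent"), (2, "Kayak"), (3, "Helmet")],
    )
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [
            (1, 5, "2014-09-01"),
            (2, 2, "2014-10-15"),
            (1, 1, "2014-11-20"),
            (3, 9, "2013-01-05"),  # falls outside a mid-2014 window
        ],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    # ISO-8601 date strings compare correctly as plain text.
    for name, units in top_sold_products(demo, "2014-06-01", limit=2):
        print(name, units)
```

Changing the `since_date` argument is all it takes to answer the 6, 12, and 18 month variants of the question.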

  12. $$$... $

  13. Correlate Multi-type Data Sets • Top viewed products last 6, 12, and 18 months? SQL Hive Flume Impala HDFS

  14. Ingest Data Using Flume
      • Pub/sub ingest framework
      • Flexible multi-level (mini-transformation) pipeline
      • Diagram: Continuously generated events (e.g. syslog, tweets) → Flume Agent (Flume Source → Optional Logic → Flume Sink) → HDFS, HBase, Solr, or other destination

  15. Create Hive Tables over Log Data
      • New use case, new data
      • Create new tables over semi-structured log data

      CREATE EXTERNAL TABLE intermediate_access_logs (
          ip STRING, date STRING, method STRING, url STRING, http_version STRING,
          code1 STRING, code2 STRING, dash STRING, user_agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
          "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
          "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" )
      LOCATION '/user/hive/warehouse/original_access_logs';

      CREATE EXTERNAL TABLE tokenized_access_logs (
          ip STRING, date STRING, method STRING, url STRING, http_version STRING,
          code1 STRING, code2 STRING, dash STRING, user_agent STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/hive/warehouse/tokenized_access_logs';

      ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

      INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

      exit;
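To see what the `input.regex` on this slide captures, it can help to run the same pattern outside Hive. The sketch below is a Python rendering of that expression with the Hive string escaping removed; the nine capture groups line up with the nine table columns. The sample log line is invented for illustration, and the helper name `parse_access_log_line` is an assumption, not part of Hive or the talk.

```python
import re

# The slide's pattern with the Hive string-literal escaping (doubled
# backslashes, escaped quotes) removed. The regex itself is unchanged.
ACCESS_LOG_PATTERN = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# Column names in the order declared by the CREATE TABLE statements.
COLUMNS = [
    "ip", "date", "method", "url", "http_version",
    "code1", "code2", "dash", "user_agent",
]


def parse_access_log_line(line):
    """Return a dict of column name to captured text, or None if the line does not match."""
    # fullmatch requires the whole line to fit the pattern.
    match = ACCESS_LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line, made up for this example.
SAMPLE_LINE = (
    '10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] '
    '"GET /product/1234 HTTP/1.1" 200 1671 "-" "ExampleBrowser/1.0"'
)

if __name__ == "__main__":
    for column, value in parse_access_log_line(SAMPLE_LINE).items():
        print(f"{column}: {value}")
```

Every captured field is text, which is consistent with the slide declaring all nine columns as `STRING`.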

  16. Use Impala and Hue to Query • Missing!!!

  17. $$$... $

  18. !!!

  19. Multi-Use-Case Data Hub • Why are sales dropping over the last 3 days? Search Queries Solr Flume HDFS

  20. Create your Index
      • Create an empty Solr index configuration directory

      $ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --generate live_logs_dir

      • Edit the Solr Schema file to have the fields you want to search over

      …
      <field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="ip" type="text_general" indexed="true" stored="true"/>
      <field name="request_date" type="date" indexed="true" stored="true"/>
      …

  21. Create your Index cont.
      • Upload your configuration for a collection to ZooKeeper

      $ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --create live_logs ./live_logs_dir

      • Tell Solr to start serving up a collection and start indexing data for it

      $ solrctl --zk <ALL YOUR ZK IPs>/solr collection --create live_logs -s 4

  22. Flume and Morphline Pipeline

  23. Flume with Morphlines Configured
      • Configure Flume to use your Morphlines and post parsed data to Solr

      …
      # Describe solrSink
      agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
      agent1.sinks.solrSink.channel = memoryChannel
      agent1.sinks.solrSink.batchSize = 1000
      agent1.sinks.solrSink.batchDurationMillis = 1000
      agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
      agent1.sinks.solrSink.morphlineId = morphline
      agent1.sinks.solrSink.threadCount = 1
      …
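Once Flume is posting parsed events into the `live_logs` collection, the “last 3 days” question from slide 19 can be phrased as a Solr filter query using date math on the `request_date` field. The sketch below only builds the request URL for Solr's standard `select` handler and sends nothing over the network. The host name `solr.example.com` and the helper `build_solr_select_url` are assumptions for illustration; 8983 is Solr's default port and may differ in a given cluster.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_solr_select_url(base_url, collection, query, filter_query, rows=10):
    """Build a URL for Solr's select handler with one filter query."""
    params = urlencode(
        {"q": query, "fq": filter_query, "rows": rows, "wt": "json"}
    )
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical host. The collection and field names come from slides 20 and 21.
EXAMPLE_URL = build_solr_select_url(
    base_url="http://solr.example.com:8983/solr",
    collection="live_logs",
    query="ip:10.0.0.1",
    # Solr date math: everything from three days ago until now.
    filter_query="request_date:[NOW-3DAYS TO NOW]",
)

if __name__ == "__main__":
    print(EXAMPLE_URL)
    # Decode the query string again to show what Solr would receive.
    print(parse_qs(urlparse(EXAMPLE_URL).query))
```

Keeping the time window in `fq` rather than `q` separates the date restriction from the search terms, so the same window can be reused across different searches.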

  24. Dynamic Search UI in Hue

  25. Shared Storage!!

  26. How Do We Improve Healthcare?
      Challenges
      • Only 3 days’ of monitoring data capacity
      • No ability to correlate large research data sets
      • No ability to ad-hoc study environment impact
      Solution
      • 50GB monitor data per week
      • 2TB capacity
      • Sqoop, Solr, Impala, HDFS
      • Total license fees < 3 processor licenses for EDW
      Benefits
      • Ad-hoc and faster insight
      • Reduced asthma related ICU visits

  27. How Do We Feed The World? Global Warming Changes Conditions How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?

  28. How Do We Feed The World?
      Challenges
      • Time to market for each new product: 5-10 years
      • 1,000+ scientists working in silos
      • Data processing bottlenecks slow development
      Solution
      • PB-scale
      • HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …
      Benefits
      • Streamlined processes
      • Time to results reduced from years to months!!!

  29. Challenges
      • 100-200 B events/month
      • Real-time multi-type event correlation complex
      • No way to do ad-hoc game analytics
      Solution
      • ~20 nodes
      • 256GB RAM servers
      • Flume, Solr, Impala, HDFS
      Benefits
      • Ad-hoc insight on feature trends
      • Significant TTR reduction
      • ROI realized in the 1st week

  30. Learn More? • Stop by the Cloudera booth today!  • Play on your own: cloudera.com/live • Get training: http://cloudera.com/content/cloudera/en/training.html • Join the Community: cdh-user@cloudera.org • Connect with me: @EvaAndreasson

  31. Hope You Enjoyed This Talk! Don’t forget to VOTE!!!

  32. Bonus Track…

  33. My Advice for the Road…

  34. Try Something Simple First…

  35. Decide what to Cook!

  36. Collect All Ingredients

  37. Use the Right Tool for the Right Task

  38. Prepare All Ingredients

  39. Don’t Forget the Importance of Visualization!

  40. Challenges
      • Tons of information locked away in medical records & studies
      • Different sources & systems can’t “talk” to each other
      Solution
      • Integration & storage of multi-structured experimental & scientific data
      • Data access & exploration via Impala, R, HBase, Solr, Hive
      Benefits
      • Faster, cheaper genome sequencing
      • Searchable index of variant call data for biologists to explore

  41. Using Sqoop to Ingest Data from MySQL
      • View your imported “tables”

      $ hadoop fs -ls /user/hive/warehouse/

      • View all Avro files constituting a table

      $ hadoop fs -ls /user/hive/warehouse/mytablename/

  42. Hadoop - A New Approach to Data Management • Distributed Storage • Distributed Processing • Schema on Read • Active Archive • Cost-Efficient Offload • Flexible Analytics

  43. The Birth of the Data Lake Hadoop: Storage & Batch Processing

  44. A Rapidly Growing Ecosystem (timeline figure, 2006 to 2014): starts with Core Hadoop alone in 2006 and grows by 2014 to Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop, Bigtop, Oozie, Hue, Impala, Parquet, Solr, Sentry, Spark, and Kafka
