Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid]
Evert Lammerts, March 27, 2012, EGI Community Forum


  1. Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid] Evert Lammerts March 27, 2012, EGI Community Forum

  2. Who's who?

  3. Who's who? BiG Grid ● Dutch NGI

  4. Who's who? BiG Grid ● Dutch NGI SARA ● National center for academic computing & eScience ● Partner in BiG Grid

  5. Who's who? BiG Grid ● Dutch NGI SARA ● National center for academic computing & eScience ● Partner in BiG Grid Me ● Consultant eScience & Cloud Services ● Lead Hadoop infrastructure ● Tech lead LifeWatch-NL

  6. In this talk ● Working on scale ( @ SARA & BiG Grid ) ● An introduction to Hadoop & MapReduce ● Hadoop @ SARA & BiG Grid

  7. Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid

  8. SARA: the national center for scientific computing. Facilitating science in The Netherlands with equipment for, and expertise on, Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

  9. Different types of computing Parallelism ● Data parallelism ● Task parallelism Architectures ● SIMD: Single Instruction Multiple Data ● MIMD: Multiple Instruction Multiple Data ● MISD: Multiple Instruction Single Data ● SISD: Single Instruction Single Data (Von Neumann)

  10. Parallelism: Amdahl's law
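The point of the Amdahl's law slide can be sketched in a few lines of Python (the function name and example numbers are illustrative, not from the deck): the serial fraction of a job bounds the overall speedup no matter how many machines work on the parallel part.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / N).
    The serial fraction (1 - p) caps the speedup regardless of N."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# Even with 95% of the work parallelizable, 1000 workers
# cannot beat the 1 / 0.05 = 20x ceiling:
print(round(amdahl_speedup(0.95, 1000), 1))
```

This is why "linearly scalable data parallelism" (slide 13) matters: Hadoop targets workloads where the parallel fraction is close to 1.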

  11. Data parallelism

  12. Compute @ SARA & BiG Grid

  13. What's different about Hadoop? ● No more do-it-yourself parallelism – it's hard! ● Instead: linearly scalable data parallelism ● Separating the what from the how ● The datacenter as your computer (NYT, 14/06/2006)

  14. Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid

  15. A bit of history ● 2002: Nutch* ● 2004: MapReduce & GFS papers** ● 2006: Hadoop * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html

  16. 2010 - 2012: A Hype in Production http://wiki.apache.org/hadoop/PoweredBy

  17. Core principles ● Scale out, not up ● Move processing to the data ● Process data sequentially, avoid random reads ● Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)

  18. A typical data-parallel problem in abstraction 1. Iterate over a large number of records 2. Extract something of interest 3. Create an ordering in intermediate results 4. Aggregate intermediate results 5. Generate output MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)

  19. MapReduce Programmer specifies two functions ● map (k, v) → <k', v'>* ● reduce (k', v'*) → <k'', v''>* All values associated with a single key are sent to the same reducer The framework handles the rest
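A minimal sketch of the two-function contract, using the classic word-count example (the toy `run` driver below only mimics the shuffle that Hadoop performs between the map and reduce phases; it is not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

# The programmer writes only these two functions.
def map_fn(key, value):
    # key: record offset; value: one line of text. Emit (word, 1) per word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # All counts for one word arrive together at the same reducer.
    yield (key, sum(values))

def run(records):
    # "Shuffle": sort intermediate pairs so equal keys are adjacent,
    # then group them per key for the reducer.
    intermediate = sorted(
        (kv for i, rec in enumerate(records) for kv in map_fn(i, rec)),
        key=itemgetter(0))
    out = {}
    for k, group in groupby(intermediate, key=itemgetter(0)):
        for k2, v2 in reduce_fn(k, (v for _, v in group)):
            out[k2] = v2
    return out

print(run(["big grid", "big data"]))  # {'big': 2, 'data': 1, 'grid': 1}
```

Everything outside `map_fn` and `reduce_fn` — the part `run` fakes here — is what the next slide calls "the rest".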

  20. The rest? Scheduling, data distribution, ordering, synchronization, error handling...

  21. An overview of a Hadoop cluster

  22. The ecosystem ● Apache Pig: data-flow language ● HBase: key/value store ● Giraph: graph processing ● Hive, HCatalog, Elephant Bird, and many, many others...

  23. Data-processing as a commodity ● Cheap Clusters ● Simple programming models ● Easy-to-learn scripting Anybody with the know-how can generate insights!

  24. Note: “the know-how” = Data Science ● DevOps ● Programming & algorithms ● Domain knowledge

  25. Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid

  26. Timeline ● 2009: Piloting Hadoop on Cloud ● 2010: Test cluster available for scientists – 6 machines * (4 cores / 24 TB storage / 16 GB RAM) – Just me! ● 2011: Funding granted for production service ● 2012: Production cluster available (~March) – 72 machines * (8 cores / 8 TB storage / 64 GB RAM) – Integration with Kerberos for secure multi-tenancy – 3 devops, team of consultants

  27. Architecture

  28. We already offer: Hadoop, Pig. We will offer: HBase, Hive, HCatalog, Oozie, and probably more...

  29. What is it being used for? ● Information Retrieval ● Natural Language Processing ● Machine Learning ● Econometrics ● Bioinformatics ● Ecoinformatics ● Collaboration with industry!

  30. Machine learning: Infrawatch, Hollandse Brug

  31. Structural health monitoring: 145 sensors × 100 Hz × 60 seconds × 60 minutes × 24 hours × 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
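The slide's multiplication can be worked out as a quick back-of-the-envelope (the sensor count and rate are from the slide; the bytes-per-sample figure is my assumption, not from the talk):

```python
# Hollandse Brug monitoring stream, per the slide:
sensors, rate_hz = 145, 100
samples_per_year = sensors * rate_hz * 60 * 60 * 24 * 365
# -> roughly 457 billion sensor readings per year

bytes_per_sample = 4  # ASSUMED: one 32-bit float per reading
tb_per_year = samples_per_year * bytes_per_sample / 1e12
print(samples_per_year, round(tb_per_year, 1))
```

Even at a modest per-sample size, the raw stream lands in the terabytes-per-year range — "large data" indeed.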

  32. NLP & IR ● e.g. ClueWeb: a ~13.4 TB webcrawl ● e.g. Twitter gardenhose data ● e.g. Wikipedia dumps ● e.g. del.icio.us & flickr tags ● Finding named entities: [person, company, place] names ● Creating inverted indexes ● Piloting real-time search ● Personalization ● Semantic web

  33. How do we embrace Hadoop? ● Parallelism has never been easy… so we teach! ● December 2010: hackathon (~50 participants – full) ● April 2011: workshop for bioinformaticians ● November 2011: 2-day PhD course (~60 participants – full) ● June 2012: 1-day PhD course ● The data scientist is still in school... so we fill the gap! ● Devops maintain the system, fix bugs, develop new functionality ● Technical consultants learn how to efficiently implement algorithms ● Users bring domain knowledge ● Methods are developing faster than light (don't quote me :)... so we build the community! ● Netherlands Hadoop User Group

  34. Final thoughts ● Hadoop is the first to provide data processing as a commodity ● Hadoop is not the only one ● Hadoop is probably not the best ● Hadoop has momentum ● What degree of diversification of infrastructure should we embrace? ● MapReduce fits surprisingly well as a programming model for data parallelism ● Where is the data scientist? ● Teach. Teach. Work together.

  35. Any questions? evert.lammerts@sara.nl @eevrt @sara_nl
