Large-scale data processing with Apache Hadoop (and friends) at BiG Grid ● Evert Lammerts ● March 27, 2012, EGI Community Forum
Who's who? BiG Grid ● Dutch NGI SARA ● National center for academic computing & eScience ● Partner in BiG Grid Me ● Consultant eScience & Cloud Services ● Lead Hadoop infrastructure ● Tech lead LifeWatch-NL
In this talk ● Working on scale ( @ SARA & BiG Grid ) ● An introduction to Hadoop & MapReduce ● Hadoop @ SARA & BiG Grid
Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid
SARA, the national center for scientific computing ● Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
Different types of computing Parallelism ● Data parallelism ● Task parallelism Architectures ● SIMD: Single Instruction Multiple Data ● MIMD: Multiple Instruction Multiple Data ● MISD: Multiple Instruction Single Data ● SISD: Single Instruction Single Data (Von Neumann)
Parallelism: Amdahl's law
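The slide behind this title plots Amdahl's law; as a reminder (the standard formula, stated here for reference rather than reproduced from the slide), with a fraction P of the work parallelizable and N processors, the achievable speedup is

$$ S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P} $$

So even at P = 0.95 the speedup never exceeds 20x: the serial fraction, not the cluster size, is usually the limit.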
Data parallelism
Compute @ SARA & BiG Grid
What's different about Hadoop? ● No more do-it-yourself parallelism (it's hard!), but linearly scalable data parallelism ● Separating the what from the how ● The datacenter as your computer (NYT, 14/06/2006)
Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid
A bit of history ● 2002: Nutch* ● 2004: MapReduce & GFS papers** ● 2006: Hadoop * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html and http://labs.google.com/papers/gfs.html
2010 - 2012: A Hype in Production http://wiki.apache.org/hadoop/PoweredBy
Core principles ● Scale out, not up ● Move processing to the data ● Process data sequentially, avoid random reads ● Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)
A typical data-parallel problem in abstraction 1. Iterate over a large number of records 2. Extract something of interest 3. Create an ordering in intermediate results 4. Aggregate intermediate results 5. Generate output MapReduce: functional abstraction of steps 2 & 4 (Jimmy Lin, University of Maryland / Twitter, 2011)
MapReduce Programmer specifies two functions ● map (k, v) → <k', v'>* ● reduce (k', [v']) → <k'', v''>* All values associated with a single key are sent to the same reducer The framework handles the rest (a word-count sketch follows below)
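A minimal sketch of this contract, assuming Hadoop Streaming with one small Python script acting as both mapper and reducer (my own illustration, not code from the talk), on the canonical word-count example:

#!/usr/bin/env python
# Word count for Hadoop Streaming (illustrative sketch; the streaming jar path
# varies by installation):
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -file wordcount.py
import sys
from itertools import groupby

def mapper(stream):
    # map(k, v): emit (word, 1) for every word on every input line
    for line in stream:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer(stream):
    # The framework sorts mapper output by key, so all counts for one word arrive together.
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(count) for _, count in group)))

if __name__ == "__main__":
    (mapper if sys.argv[1:] == ["map"] else reducer)(sys.stdin)

Everything outside these two functions (splitting the input, shipping the script to the nodes, sorting, re-running failed tasks) is the framework's job, which is exactly the point of the slide.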
The rest? Scheduling, data distribution, ordering, synchronization, error handling...
An overview of a Hadoop cluster
The ecosystem ● Apache Pig: data-flow language ● HBase: key/value store ● Giraph: graph processing ● Hive, HCatalog, Elephant Bird, and many, many others...
Data-processing as a commodity ● Cheap Clusters ● Simple programming models ● Easy-to-learn scripting Anybody with the know-how can generate insights!
Note: “the know-how” = Data Science ● DevOps ● Programming & algorithms ● Domain knowledge
Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid
Timeline ● 2009: piloting Hadoop on Cloud ● 2010: test cluster available for scientists (6 machines, each 4 cores / 24 TB storage / 16 GB RAM; just me!) ● 2011: funding granted for production service ● 2012 (~March): production cluster available (72 machines, each 8 cores / 8 TB storage / 64 GB RAM; integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants)
Architecture
We already offer... Hadoop, Pig. We will offer... HBase, Hive, HCatalog, Oozie, and probably more...
What is it being used for? ● Information Retrieval ● Natural Language Processing ● Machine Learning ● Econometrics ● Bioinformatics ● Ecoinformatics ● Collaboration with industry!
Machine learning: Infrawatch, Hollandse Brug
Structural health monitoring: 145 sensors × 100 Hz × 60 seconds × 60 minutes × 24 hours × 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
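Taking the slide's factors at face value, the raw volume is easy to put a number on; the bytes-per-reading figure below is my own assumption for illustration, not from the talk:

# Back-of-the-envelope scale of the Hollandse Brug sensor stream,
# using the factors on the slide: 145 sensors sampled at 100 Hz, year-round.
readings_per_year = 145 * 100 * 60 * 60 * 24 * 365
print("%s readings/year" % format(readings_per_year, ","))   # 457,272,000,000 (~4.6e11)

bytes_per_reading = 4   # assumed, e.g. one 32-bit float per sensor sample
print("~%.1f TB/year raw" % (readings_per_year * bytes_per_reading / 1e12))   # ~1.8 TB/year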
NLP & IR ● e.g. ClueWeb: a ~13.4 TB webcrawl ● e.g. Twitter gardenhose data ● e.g. Wikipedia dumps ● e.g. del.icio.us & flickr tags ● Finding named entities: [person, company, place] names ● Creating inverted indexes (sketch below) ● Piloting real-time search ● Personalization ● Semantic web
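To make the inverted-index item concrete, a toy MapReduce-style sketch in plain Python (my own example; the document format and tokenization are assumptions, not details from these projects):

from collections import defaultdict

def map_doc(doc_id, text):
    # map: emit (term, doc_id) for every distinct term in the document
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_postings(pairs):
    # reduce: collect all doc_ids per term into a sorted postings list
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

if __name__ == "__main__":
    docs = {1: "hadoop scales out", 2: "pig runs on hadoop"}   # toy corpus (assumed)
    pairs = (p for doc_id, text in docs.items() for p in map_doc(doc_id, text))
    print(reduce_postings(pairs))   # {'hadoop': [1, 2], 'out': [1], ...}

On a real cluster the reduce side receives the pairs grouped by term, so each reducer writes the postings lists for its share of the vocabulary.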
How do we embrace Hadoop? ● Parallelism has never been easy… so we teach! December 2010: hackathon (~50 participants, full); April 2011: workshop for bioinformaticians; November 2011: 2-day PhD course (~60 participants, full); June 2012: 1-day PhD course ● The data scientist is still in school... so we fill the gap! Devops maintain the system, fix bugs, and develop new functionality; technical consultants learn how to efficiently implement algorithms; users bring domain knowledge ● Methods are developing faster than light (don't quote me :)... so we build the community! Netherlands Hadoop User Group
Final thoughts ● Hadoop is the first to provide commodity computing ● Hadoop is not the only one ● Hadoop is probably not the best ● Hadoop has momentum ● What degree of diversification of infrastructure should we embrace? ● MapReduce fits surprisingly well as a programming model for data parallelism ● Where is the data scientist? ● Teach. Teach. Work together.
Any questions? evert.lammerts@sara.nl @eevrt @sara_nl