Simplifying Big Data with Apache Crunch - Micah Whitacre (@mkwhit) - PowerPoint PPT Presentation



SLIDE 1

Simplifying Big Data with Apache Crunch

Micah Whitacre @mkwhit

SLIDE 2
SLIDE 3
SLIDE 4
SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8
SLIDE 9
SLIDE 10
SLIDE 11

Semantic Chart Search
Cloud Based EMR
Medical Alerting System
Population Health Management

SLIDE 12

Problem moves from scaling architecture...

SLIDE 13

Problem moves from not only scaling architecture... to how to scale the knowledge

SLIDE 14
SLIDE 15

Battling the 3 V’s

SLIDE 16

Battling the 3 V’s

Daily, weekly, monthly uploads

SLIDE 17

Battling the 3 V’s

Daily, weekly, monthly uploads
60+ different data formats

SLIDE 18

Battling the 3 V’s

Daily, weekly, monthly uploads
60+ different data formats
Constant streams for near real time

SLIDE 19

Battling the 3 V’s

Daily, weekly, monthly uploads
2+ TB of streaming data daily
60+ different data formats
Constant streams for near real time

SLIDE 20

Normalize Data → Apply Algorithms → Load Data for Displays

HBase Solr Vertica Avro CSV Vertica HBase

Population Health

SLIDE 21

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 22

Mapper → Reducer

SLIDE 23

Struggle to fit into single MapReduce job

SLIDE 24

Struggle to fit into single MapReduce job
Integration done through persistence

SLIDE 25

Struggle to fit into single MapReduce job
Integration done through persistence
Custom impls of common patterns

SLIDE 26

Struggle to fit into single MapReduce job
Integration done through persistence
Custom impls of common patterns
Evolving requirements

SLIDE 27

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV Anonymize Data Avro Prep for Bulk Load HBase

SLIDE 28

Easy integration between teams
Focus on processing steps
Shallow learning curve
Ability to tune for performance

SLIDE 29

Apache Crunch

Compose processing into pipelines
Open Source FlumeJava impl
Utilizes POJOs (hides serialization)
Transformation through fns (not jobs)

SLIDE 30

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 31

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

Processing Pipeline

SLIDE 32

Pipeline

Programmatic description of DAG
Supports lazy execution
MapReduce, Spark, Memory
Implementations indicate runtime

SLIDE 33

Pipeline pipeline = MemPipeline.getInstance();
Pipeline pipeline = new MRPipeline(Driver.class, conf);
Pipeline pipeline = new SparkPipeline(sparkContext, "app");

SLIDE 34

Source

Reads various inputs
At least one required per pipeline
Custom implementations
Creates initial collections for processing

SLIDE 35

Source

Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV
Strings, AvroRecords, Results, POJOs, Protobufs, Thrift, Writables

SLIDE 36

pipeline.read(From.textFile(path));

SLIDE 37

pipeline.read(new TextFileSource(path, ptype));

SLIDE 38

PType<String> ptype = …;
pipeline.read(new TextFileSource(path, ptype));

SLIDE 39

PType

Hides serialization
Exposes data in native Java forms
Avro, Thrift, and Protocol Buffers
Supports composing complex types

SLIDE 40

Multiple Serialization Types
Serialization Type = PTypeFamily
Can’t mix families in single type
Avro & Writable available
Can easily convert between families

SLIDE 41

PType<Integer> intTypes = Writables.ints();
PType<String> stringType = Avros.strings();
PType<Person> personType = Avros.records(Person.class);

SLIDE 42

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

SLIDE 43

PTableType<String, Person> tableType = Avros.tableOf(stringType, personType);

SLIDE 44

PType<String> ptype = …;
PCollection<String> strings = pipeline.read(
    new TextFileSource(path, ptype));

SLIDE 45

PCollection

Immutable
Not created, only read or transformed
Unsorted
Represents potential data

SLIDE 46

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 47

Process Reference Data: PCollection<String> → PCollection<RefData>

SLIDE 48

DoFn

Simple API to implement
Location for custom logic
Transforms PCollection between forms
Processes one element at a time

SLIDE 49

For each item emits 0:M items
FilterFn - returns boolean
MapFn - emits 1:1
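The three function shapes can be sketched in plain Java. This is an analogy, not the actual Crunch API; the `FnShapes` class and its method bodies are hypothetical examples of the 0:M, boolean, and 1:1 contracts.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java analogy of the three Crunch function shapes:
// a DoFn may emit 0..M outputs per input, a FilterFn answers yes/no,
// and a MapFn emits exactly one output per input.
public class FnShapes {
    // DoFn-style: split a line into words, emitting zero or more outputs.
    static void process(String line, List<String> emitter) {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                emitter.add(word);
            }
        }
    }

    // FilterFn-style: keep only non-blank lines.
    static boolean accept(String line) {
        return !line.trim().isEmpty();
    }

    // MapFn-style: exactly one output per input.
    static String map(String line) {
        return line.toUpperCase();
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        process("hello crunch", out);
        System.out.println(out);          // [hello, crunch]
        System.out.println(accept("  ")); // false
        System.out.println(map("avro"));  // AVRO
    }
}
```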

SLIDE 50

DoFn API

class ExampleDoFn extends DoFn<String, RefData> { ... }

Type of Data In Type of Data Out

SLIDE 51

public void process(String s, Emitter<RefData> emitter) {
  RefData data = …;
  emitter.emit(data);
}

SLIDE 52

PCollection<String> refStrings = …;
PCollection<RefData> refs = refStrings.parallelDo(fn,
    Avros.records(RefData.class));

SLIDE 53

PCollection<String> dataStrs = …;
PCollection<Data> data = dataStrs.parallelDo(diffFn,
    Avros.records(Data.class));

SLIDE 54

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 55

Hmm, now I need to join... We need a PTable. But they don’t have a common key?

SLIDE 56

PTable<K, V>

Immutable & Unsorted
Variation of PCollection<Pair<K, V>>
Multimap of Keys and Values
Joins, Cogroups, Group By Key

SLIDE 57

class ExampleDoFn extends DoFn<String, RefData> { ... }

SLIDE 58

class ExampleDoFn extends DoFn<String, Pair<String, RefData>> { ... }

SLIDE 59

PCollection<String> refStrings = …;
PTable<String, RefData> refs = refStrings.parallelDo(fn,
    Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

SLIDE 60

PTable<String, RefData> refs = …;
PTable<String, Data> data = …;

SLIDE 61

data.join(refs); // inner join

SLIDE 62

PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);

SLIDE 63

Joins

right, left, inner, outer
Mapside, BloomFilter, Sharded
Eliminates custom impls
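The inner-join semantics above can be sketched in plain Java. This is an analogy, not the Crunch API; the `InnerJoinSketch` class is hypothetical, and the single-valued maps are a simplification since a real PTable is a multimap.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of an inner join: only keys present in BOTH tables
// survive, and each surviving key is paired with its two values,
// mirroring the shape of PTable<String, Pair<Data, RefData>>.
public class InnerJoinSketch {
    static Map<String, String[]> innerJoin(Map<String, String> data,
                                           Map<String, String> refs) {
        Map<String, String[]> joined = new HashMap<>();
        for (Map.Entry<String, String> e : data.entrySet()) {
            String ref = refs.get(e.getKey());
            if (ref != null) { // drop keys missing from either side
                joined.put(e.getKey(), new String[] { e.getValue(), ref });
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> data = Map.of("p1", "d1", "p2", "d2");
        Map<String, String> refs = Map.of("p1", "r1", "p3", "r3");
        // only "p1" appears on both sides
        System.out.println(innerJoin(data, refs).keySet());
    }
}
```

A left or right outer join would instead keep unmatched keys from one side with a null (or absent) partner value.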

SLIDE 64

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 65

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 66

FilterFn API

class MyFilterFn extends FilterFn<...> { ... }

Type of Data In

SLIDE 67

public boolean accept(... value) {
  return value > 3;
}

SLIDE 68

PCollection<Model> values = …;
PCollection<Model> filtered = values.filter(new MyFilterFn());

SLIDE 69

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 70

PTable<String, Model> models = …;

Keyed By PersonId

SLIDE 71

PTable<String, Model> models = …;
PGroupedTable<String, Model> groupedModels = models.groupByKey();

SLIDE 72

PGroupedTable<K, V>

Immutable & Sorted
PCollection<Pair<K, Iterable<V>>>
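The groupByKey step can be sketched in plain Java. This is an analogy, not the Crunch API; the `GroupByKeySketch` class is hypothetical and only illustrates collapsing a multimap of pairs into one entry per key, which is the Pair<K, Iterable<V>> shape a PGroupedTable exposes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of groupByKey: (key, value) pairs with repeated keys
// collapse into one entry per key holding all of that key's values.
public class GroupByKeySketch {
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // two records for person1, one for person2
        List<String[]> models = List.of(
            new String[] { "person1", "modelA" },
            new String[] { "person2", "modelB" },
            new String[] { "person1", "modelC" });
        System.out.println(groupByKey(models));
    }
}
```

In a real MapReduce run this grouping happens in the shuffle, which is also why the grouped values arrive sorted by key.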

SLIDE 73

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 74

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 75

PCollection<Person> persons = …;

SLIDE 76

PCollection<Person> persons = …;
pipeline.write(persons, To.avroFile(path));

SLIDE 77

PCollection<Person> persons = …;
pipeline.write(persons, new AvroFileTarget(path));

SLIDE 78

Target

Persists PCollection
At least one required per pipeline
Custom implementations

SLIDE 79

Target

Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV
Strings, AvroRecords, Results, POJOs, Protobufs, Thrift, Writables

SLIDE 80

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

SLIDE 81

Execution

Pipeline pipeline = …;
...
pipeline.write(...);
PipelineResult result = pipeline.done();

SLIDE 82

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

Map → Reduce → Reduce

SLIDE 83

Tuning

GroupingOptions / ParallelDoOptions
Scale factors
Tweak pipeline for performance

SLIDE 84

Focus on the transformations
Smaller learning curve
Functionality first
Less fragility

SLIDE 85

Extend pipeline for new features
Iterate with confidence
Integration through PCollections

SLIDE 86

Links

http://crunch.apache.org/

http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
http://dl.acm.org/citation.cfm?id=1806596.1806638