Stratosphere v0.4
Stephan Ewen (stephan.ewen@tu-berlin.de)

Release Preview
• Official release coming end of November
• Hands-on sessions today with the latest code snapshot

New Features in a Nutshell
• Declarative Scala Programming API
• Iterative Programs
  o Bulk (batch-to-batch in memory) and Incremental (Delta Updates)
  o Automatic caching and cross-loop optimizations
• Runs on top of YARN (Hadoop Next Gen)
• Various deployment methods
  o VMs, Debian packages, EC2 scripts, ...
• Many usability fixes and bugfixes

Stratosphere System Stack
• APIs: Sky Java API, Sky Scala API, Meteor, ...
• Stratosphere Optimizer
• Stratosphere Runtime
• Cluster Manager: Direct, EC2, YARN
• Storage: Local Files, HDFS, S3, ...

MapReduce
It is nice and good, but...
• Very verbose and low level. Only usable by system programmers.
• Everything slightly more complex must result in a cascade of jobs. Loses performance and optimization potential.
[Figure: a cascade of chained Map and Reduce jobs]

SQL (or Hive or Pig)
It is nice and good, but...
• Allows you to do a subset of the tasks efficiently and elegantly
• What about the cases that do not fit SQL?
  o Custom types
  o Custom non-relational functions (they occur a lot!)
  o Iterative algorithms (machine learning, graph analysis)
• How does it look to mix SQL with MapReduce?

SQL (or Hive or Pig) is nice and good, but...
• Program Fragmentation
• Impedance Mismatch
• Breaks optimization

Hive:
  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS dt, uid
    CLUSTER BY dt) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script'
    AS date, count;

Pig:
  A = load 'WordcountInput.txt';
  B = MAPREDUCE wordcount.jar store A into 'inputDir' load 'outputDir'
      as (word:chararray, count: int) 'org.myorg.WordCount inputDir outputDir';
  C = sort B by count;

Sky Language
• MapReduce style functions (Map, Reduce, Join, CoGroup, Cross, ...)
• Relational Set Operations (filter, map, group, join, aggregate, ...)
• Write like a programming language, execute like a database...
[Figure: stack of Scala Embedded Language, Optimizer, Database / UDF Runtime]

Sky Language
Add a bit of "languages and compilers" sauce to the database stack

Scala API by Example
• The classical word count example

  val input = TextFile(textInput)                           // in-situ data source
  val words = input flatMap { line => line.split("\\W+") }  // transformation function
  val counts = words groupBy { word => word } count()       // group by entire data type (the words), count per group

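To see what the three operators compute without a cluster, the same pipeline can be sketched on plain Scala collections. This is not the Stratosphere API: `WordCountSketch` and `wordCount` are hypothetical names, a `Seq[String]` stands in for `TextFile`, and the filter for empty tokens is an addition that the slide does not show.

```scala
object WordCountSketch {
  // Local stand-in for the slide's pipeline; NOT the Stratosphere API.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val words = lines
      .flatMap(line => line.split("\\W+").toSeq) // same tokenizer as on the slide
      .filter(_.nonEmpty)                        // assumption: drop empty tokens from leading separators
    words
      .groupBy(word => word)                     // group by the word itself
      .map { case (word, occurrences) => word -> occurrences.size } // count per group
  }
}
```

Calling `WordCountSketch.wordCount(Seq("to be or", "not to be"))` groups the six tokens into four words with their counts.
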
Scala API by Example
• Graph Triangles (Friend-of-a-Friend problem)
  o Recommending friends, finding important connections
• 1) Enumerate candidate triads
• 2) Close as triangles

Scala API by Example

  case class Edge(from: Int, to: Int)
  case class Triangle(apex: Int, base1: Int, base2: Int)

  val vertices = DataSource("hdfs:///...", CsvFormat[Edge])

  val byDegree = vertices map { projectToLowerDegree }
  val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }

  val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }

  val triangles = triads join byID
                    where { t => (t.base1, t.base2) }
                    isEqualTo { e => (e.from, e.to) }
                    map { (triangle, edge) => triangle }

Annotations (from the slide builds):
• Custom data types: the case classes Edge and Triangle
• In-situ data source: DataSource("hdfs:///...", CsvFormat[Edge])
• Non-relational (library) functions: the user-defined functions passed to map and reduceGroup
• Relational join: triads join byID
• Key references: the where { ... } and isEqualTo { ... } key selectors

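The two steps (enumerate triads, close them as triangles) can be tried locally with plain Scala collections. The sketch below is an illustration under assumptions, not the Stratosphere API: the slides do not show `projectToLowerDegree` or `buildTriads`, so both are hypothetical stand-ins here, `projectToLowerDegree` works on the whole edge list instead of per record, and each undirected edge is assumed to appear exactly once in the input.

```scala
object TriangleSketch {
  case class Edge(from: Int, to: Int)
  case class Triangle(apex: Int, base1: Int, base2: Int)

  // Hypothetical stand-in: orient every edge so that it starts at the endpoint
  // with the lower degree (ties broken by the smaller id).
  def projectToLowerDegree(edges: Seq[Edge]): Seq[Edge] = {
    val degree = edges.flatMap(e => Seq(e.from, e.to))
      .groupBy(identity).map { case (v, vs) => v -> vs.size }
    edges.map { e =>
      val dFrom = degree(e.from)
      val dTo = degree(e.to)
      if (dFrom < dTo || (dFrom == dTo && e.from < e.to)) e else Edge(e.to, e.from)
    }
  }

  // Hypothetical stand-in: all pairs of neighbors of one apex, with base1 < base2.
  def buildTriads(apex: Int, group: Seq[Edge]): Seq[Triangle] = {
    val neighbors = group.map(_.to).distinct.sorted
    for {
      i <- neighbors.indices
      j <- (i + 1) until neighbors.size
    } yield Triangle(apex, neighbors(i), neighbors(j))
  }

  def triangles(edges: Seq[Edge]): Seq[Triangle] = {
    val byDegree = projectToLowerDegree(edges)
    // Same normalization as the byID step on the slide; a Set replaces the join.
    val byID = byDegree.map(x => if (x.from < x.to) x else Edge(x.to, x.from)).toSet
    // Step 1: enumerate candidate triads per apex.
    val triads = byDegree.groupBy(_.from).toSeq
      .flatMap { case (apex, group) => buildTriads(apex, group) }
    // Step 2: keep the triads whose base edge exists.
    triads.filter(t => byID.contains(Edge(t.base1, t.base2)))
  }
}
```

Orienting edges by degree first keeps the groups of high-degree vertices small, and because the orientation is a total order, every triangle is reported once, at its lowest-ranked vertex.
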
Optimizing Programs
• Program optimization happens in two phases
  1. Data type and function code analysis inside the Scala Compiler
  2. Relational-style optimization of the data flow
[Figure: pipeline from the Scala Compiler (Parser, Type Checker, Analyze Data Types, Generate Glue Code, Code Generation) through the Stratosphere Optimizer to the Run Time (Instantiate, Optimize, Finalize Glue Code, Create Schedule, Execution)]

Type Analysis/Code Gen
• Types and Key Selectors are mapped to flat schema
• Generated code for interaction with runtime

Mapping from Scala types to the flat schema:
• Primitive Types, Arrays, Lists → Single Value
  o Int, Double, Array[String], ...
• Tuples / Classes → Tuples
  o (a: Int, b: Int, c: String) → (a: Int, b: Int, c: String)
  o class T(x: Int, y: Long) → (x: Int, y: Long)
• Nested Types → Recursively flattened
  o class T(x: Int, y: Long) → (x: Int, y: Long)
  o class R(id: String, value: T) → (id: String, x: Int, y: Long)
• Recursive types → Tuples (w/ BLOB for recursion)
  o class Node(id: Int, left: Node, right: Node) → (id: Int, left: BLOB, right: BLOB)

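The flattening of nested types can be illustrated with a small runtime sketch. Note the assumption: the slides describe this analysis as happening inside the Scala compiler with generated code, whereas the hypothetical `flatten` helper below only mimics the resulting field order at runtime through `Product`, and it ignores the BLOB case for recursive types.

```scala
object FlattenSketch {
  case class T(x: Int, y: Long)
  case class R(id: String, value: T)

  // Recursively expand tuples and case classes into one flat list of fields.
  // Caveat: Scala collections such as List are Products too; this sketch is
  // only meant for tuples and case classes.
  def flatten(value: Any): List[Any] = value match {
    case p: Product => p.productIterator.toList.flatMap(flatten)
    case other      => List(other)
  }
}
```

For example, `FlattenSketch.flatten(R("a", T(1, 2L)))` yields the three fields in the order (id, x, y), matching the flattened schema of `R` on the slide.
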
Optimization

  case class Order(id: Int, priority: Int, ...)
  case class Item(id: Int, price: Double, ...)
  case class PricedOrder(id, priority, price)

  val orders = DataSource(...)
  val items = DataSource(...)

  val filtered = orders filter { ... }

  val prio = filtered join items where { _.id } isEqualTo { _.id }
               map { (o, li) => PricedOrder(o.id, o.priority, li.price) }

  val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)

[Figure: two candidate execution plans over Orders and Items (Filter, Join on (0) = (0), Grp/Agg on (0,1)) that differ in where partition(0) and sort are applied]

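Whatever plan the optimizer picks, it must produce the same result as the logical program. A local sketch of that result on plain Scala collections is given below; it is not the Stratosphere API. The filter predicate is elided on the slide, so `priority > 0` is a purely hypothetical placeholder, and the case classes are trimmed to the fields the program uses.

```scala
object OrderSalesSketch {
  case class Order(id: Int, priority: Int)
  case class Item(id: Int, price: Double)
  case class PricedOrder(id: Int, priority: Int, price: Double)

  def sales(orders: Seq[Order], items: Seq[Item]): Map[(Int, Int), Double] = {
    // Hypothetical predicate; the slide leaves the filter body open.
    val filtered = orders.filter(_.priority > 0)
    // Index the items by join key, then probe with the filtered orders.
    val itemsById = items.groupBy(_.id)
    val prio = for {
      o  <- filtered
      li <- itemsById.getOrElse(o.id, Seq.empty)
    } yield PricedOrder(o.id, o.priority, li.price)
    // Group by (id, priority) and sum the prices per group.
    prio.groupBy(p => (p.id, p.priority))
      .map { case (key, group) => key -> group.map(_.price).sum }
  }
}
```

Since the grouping key (id, priority) starts with the join key id, data that is already partitioned on id for the join is also partitioned correctly for the aggregation, which is the kind of reuse the two plans in the figure differ on.
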
Iterative Programs
• Many programs have a loop and make multiple passes over the data
  o Machine Learning algorithms iteratively refine the model
  o Graph algorithms propagate information hop by hop
[Figure: loop outside the system (the client drives each Step) vs. loop inside the system (Iteration)]

Why Iterations
• Algorithms that need iterations
  o Clustering (K-Means, …)
  o Gradient descent
  o PageRank
  o Logistic Regression
  o Path algorithms on graphs (shortest paths, centralities, …)
  o Graph communities / dense sub-components
  o Inference (belief propagation)
  o …
• All the hot algorithms for building predictive models

Two Types of Iterations
• Bulk Iterations: Initial Dataset → Iterative Function → Result
• Incremental Iterations (aka. Workset Iterations): Initial Workset and Initial Solutionset → Iterative Function (with State) → Result

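The difference between the two iteration types can be sketched as two small driver loops in plain Scala. This is a minimal model under assumptions, not the Stratosphere operators: the names `iterateBulk` and `iterateWithWorkset`, the `Map`-based solution set, the `maxIterations` bound, and the termination on an empty workset are all choices made for this illustration.

```scala
object IterationSketch {
  // Bulk iteration: the step function recomputes the whole dataset in every pass.
  @annotation.tailrec
  def iterateBulk[A](initial: A, maxIterations: Int)(step: A => A): A =
    if (maxIterations <= 0) initial
    else iterateBulk(step(initial), maxIterations - 1)(step)

  // Incremental (workset) iteration: the step function returns only the changed
  // entries (delta) and the next workset. The delta is merged into the solution
  // set by key, and the loop stops once the workset is empty.
  @annotation.tailrec
  def iterateWithWorkset[K, V, W](solution: Map[K, V], workset: Seq[W], maxIterations: Int)(
      step: (Map[K, V], Seq[W]) => (Map[K, V], Seq[W])): Map[K, V] =
    if (workset.isEmpty || maxIterations <= 0) solution
    else {
      val (delta, nextWorkset) = step(solution, workset)
      iterateWithWorkset(solution ++ delta, nextWorkset, maxIterations - 1)(step)
    }
}
```

In the bulk variant the cost of each pass is proportional to the full dataset, while in the workset variant it is proportional to the elements that still change.
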
Iterations inside the System
[Figure: two charts comparing Naïve and Incremental iterations: # Vertices (thousands) per Iteration, and Runtime (secs) for Twitter and Webbase (20). Caption: Computations performed in each iteration for connected communities of a social graph]

Iterative Program (Scala)

  // Define step function
  def step = (s: DataSet[Vertex], ws: DataSet[Vertex]) => {

    val minNeighbor = ws groupBy { _.id } reduceGroup { x => x.minBy { _.component } }

    val delta = s join minNeighbor where { _.id } isEqualTo { _.id }
                  flatMap { (c, o) => if (c.component < o.component) Some(c) else None }

    val nextWs = delta join edges where { v => v.id } isEqualTo { e => e.from }
                   map { (v, e) => Vertex(e.to, v.component) }

    // Return delta and next workset
    (delta, nextWs)
  }

  // Invoke iteration
  val components = vertices.iterateWithWorkset(initialWorkset, { _.id }, step)

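The same connected-components computation can be traced locally with plain Scala collections. This is a sketch under assumptions, not the Stratosphere API: `ConnectedComponentsSketch` is a hypothetical name, the solution set is modeled as a `Map` from vertex id to component id, every edge is assumed to be present in both directions, and the initial workset is assumed to hold each vertex's component as a candidate for its neighbors. The update rule used here keeps a candidate only when it is smaller than the stored component.

```scala
object ConnectedComponentsSketch {
  case class Vertex(id: Int, component: Int)
  case class Edge(from: Int, to: Int)

  def components(vertices: Seq[Vertex], edges: Seq[Edge]): Map[Int, Int] = {
    // Solution set: current component id per vertex id.
    var solution: Map[Int, Int] = vertices.map(v => v.id -> v.component).toMap
    // Assumed initial workset: every vertex proposes its component to its neighbors.
    var workset: Seq[Vertex] =
      for { v <- vertices; e <- edges if e.from == v.id } yield Vertex(e.to, v.component)

    while (workset.nonEmpty) {
      // Smallest candidate component per vertex.
      val minNeighbor = workset.groupBy(_.id)
        .map { case (id, candidates) => id -> candidates.map(_.component).min }
      // Delta: only the vertices whose component actually improves.
      val delta = minNeighbor.filter { case (id, c) => solution.get(id).exists(c < _) }
      solution = solution ++ delta
      // Next workset: changed vertices propagate their new component to neighbors.
      workset =
        for { (id, c) <- delta.toSeq; e <- edges if e.from == id } yield Vertex(e.to, c)
    }
    solution
  }
}
```

The loop ends as soon as no vertex changes, so later passes touch only the shrinking set of vertices that are still being updated.
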
Iterative Program (Java)

Graph Processing in Stratosphere