Testing it out: minimum per group example

from map_reduce import runtask

documents = [
    ('drama', 200), ('education', 100), ('action', 20), ('thriller', 20),
    ('drama', 220), ('education', 150), ('action', 10), ('thriller', 160),
    ('drama', 140), ('education', 160), ('action', 20), ('thriller', 30)
]

# Provide a mapping function of the form mapfunc(value)
# Must yield (k,v) pairs
def mapfunc(value):
    genre, pages = value
    yield (genre, pages)

# Provide a reduce function of the form reducefunc(key, list_of_values)
# Must yield (k,v) pairs
def reducefunc(key, values):
    yield (key, min(values))

# Pass your input list, mapping and reduce functions
runtask(documents, mapfunc, reducefunc)
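For the input list above, the reducer keeps the smallest page count it sees per genre, so (assuming runtask simply emits the resulting (key, value) pairs, in no particular order) the result would be:

('action', 10), ('drama', 140), ('education', 100), ('thriller', 20)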
Back to Hadoop

On Hadoop, MapReduce tasks are written using Java. Bindings for Python and other languages exist as well, but Java is the "native" environment. The Java program is packaged as a JAR archive and launched using the command:

hadoop jar myfile.jar ClassToRun [args...]

hadoop jar wordcount.jar RunWordCount /input/dataset.txt /output/
Back to Hadoop

public static class MyReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        IntWritable result = new IntWritable();
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Back to Hadoop

hadoop jar wordcount.jar WordCount /users/me/dataset.txt /users/me/output/
Back to Hadoop

$ hadoop fs -ls /users/me/output
Found 2 items
-rw-r--r--   1 root hdfs    0 2017-05-20 15:11 /users/me/output/_SUCCESS
-rw-r--r--   1 root hdfs 2069 2017-05-20 15:11 /users/me/output/part-r-00000

$ hadoop fs -cat /users/me/output/part-r-00000
and     2
first   1
is      3
line    2
second  1
the     2
this    3
Back to Hadoop

MapReduce tasks can consist of more than mappers and reducers: Partitioners, Combiners, Shufflers, and Sorters
MapReduce

Constructing MapReduce programs requires "a certain skillset" in terms of programming (to put it lightly)
- One does not simply implement Random Forest on MapReduce
- There's a reason why most tutorials don't go much further than counting words
Tradeoffs in terms of speed, memory consumption, and scalability
- Big does not mean fast
- Does your use case really align with a search engine?
YARN

How is a MapReduce program coordinated amongst the different nodes in the cluster?
- In the former Hadoop 1 architecture, the cluster was managed by a service called the JobTracker
- TaskTracker services lived on each node and would launch tasks on behalf of jobs (instructed by the JobTracker)
- The JobTracker would also serve information about completed jobs
- The JobTracker could still become overloaded, however!
In Hadoop 2, MapReduce is split into two components
- The cluster resource management capabilities have become YARN, while the MapReduce-specific capabilities remain MapReduce
YARN

YARN's setup has a couple of advantages
- First, by breaking up the JobTracker into a few different services, it avoids many of the scaling issues facing Hadoop 1
- It also makes it possible to run frameworks other than MapReduce on a Hadoop cluster. For example, Impala can also run on YARN and share resources on a cluster with MapReduce
- In other words, YARN can be used for all sorts of coordination of tasks: a general coordination and resource management framework
- This will be an advantage once we move away from Hadoop (see later)
- Even then, people have also proposed alternatives to YARN (see later)
So… Hadoop?

Standard Hadoop: definitely not a turn-key solution for most environments
- Just a big hard drive and a way to do scalable MapReduce?
- In a way which is not fun to program at all?
As such, many implementations and vendors also mix in a number of additional projects such as:
- HBase: a distributed database which runs on top of the Hadoop core stack (no SQL, just MapReduce)
- Hive: a data warehouse solution with SQL-like query capabilities to handle data in the form of tables
- Pig: a framework to manipulate data stored in HDFS without having to write complex MapReduce programs from scratch
- Cassandra: another distributed database
- Ambari: a web interface for managing Hadoop stacks (managing all these other fancy names)
- Flume: a framework to collect and deal with streaming data intakes
- Oozie: a more advanced job scheduler that cooperates with YARN
- Zookeeper: a centralized service for maintaining configuration information and naming (a cluster on its own)
- Sqoop: a connector to move data between Hadoop and relational databases
- Atlas: a system to govern metadata and its compliance
- Ranger: a centralized platform to define, administer and manage security policies consistently across Hadoop components
- Spark: a computing framework geared towards data analytics
So… Hadoop?
SQL on Hadoop
The first letdown

"From the moment a new distributed data store gets popular, the next question will be how to run SQL on top of it… What do you mean it's a file system? How do we query this thing? We need SQL!"

2008: the first release of Apache Hive, the original SQL-on-Hadoop solution
- Rapidly became one of the de-facto tools included with almost all Hadoop installations
- Hive converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server
- It also offers a command line client, Java APIs and JDBC drivers, which made the project wildly successful and quickly adopted by all organizations which were beginning to realize that they'd taken a step back from their traditional data warehouse setups in their desire to switch to Hadoop as soon as possible

SELECT genre, SUM(nrPages) FROM books   --\
GROUP BY genre                          -- > convert to MapReduce job
ORDER BY genre                          --/
There is (was?) also HBase

The first database on Hadoop
- Native database on top of Hadoop
- No SQL, own get/put/filter operations
- Complex queries as MapReduce jobs

hbase(main):009:0> scan 'users'
ROW    COLUMN+CELL
 seppe column=email:, timestamp=1495293082872, value=seppe.vandenbroucke@kuleuven.be
 seppe column=name:first, timestamp=1495293050816, value=Seppe
 seppe column=name:last, timestamp=1495293067245, value=vanden Broucke
1 row(s) in 0.1170 seconds

hbase(main):011:0> get 'users', 'seppe'
COLUMN      CELL
 email:     timestamp=1495293082872, value=seppe.vandenbroucke@kuleuven.be
 name:first timestamp=1495293050816, value=Seppe
 name:last  timestamp=1495293067245, value=vanden Broucke
4 row(s) in 0.1250 seconds
There is (was?) also Pig

Another way to ease the pain of writing MapReduce programs
- Still not very easy though
- People still wanted good ole SQL

timesheet = LOAD 'timesheet.csv' USING PigStorage(',');
raw_timesheet = FILTER timesheet BY $0 > 1;
timesheet_logged = FOREACH raw_timesheet GENERATE $0 AS driverId, $2 AS hours_logged, $3 AS miles_logged;
grp_logged = GROUP timesheet_logged BY driverId;
sum_logged = FOREACH grp_logged GENERATE group AS driverId,
    SUM(timesheet_logged.hours_logged) AS sum_hourslogged,
    SUM(timesheet_logged.miles_logged) AS sum_mileslogged;
Hive

2008: the first release of Apache Hive, the original SQL-on-Hadoop solution
- Hive converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server

SELECT genre, SUM(nrPages) FROM books   --\
GROUP BY genre                          -- > convert to MapReduce job
ORDER BY genre                          --/

Hive is handy… but SQL-on-Hadoop technologies are not perfect implementations of relational database management systems:
- They sacrifice features such as speed and SQL language compatibility
- Support for complex joins is lacking
- For Hive, the main drawback was its lack of speed
- Because of the overhead incurred by translating each query into a series of map-reduce jobs, even the simplest of queries can consume a large amount of time
- Big does not mean fast
So… without MapReduce?

For a long time, companies such as Hortonworks were pushing behind the development of Hive, mainly by putting efforts behind Apache Tez, which provides a new backend for Hive, no longer based on the map-reduce paradigm but on directed-acyclic-graph pipelines

In 2012, Cloudera, another well-known Hadoop vendor, introduced their own SQL-on-Hadoop technology as part of their "Impala" stack
- Cloudera also opted to forego map-reduce completely, and instead uses its own set of execution daemons, which have to be installed along Hive-compatible datanodes. It offers SQL-92 syntax support, a command line client, and ODBC drivers
- Much faster than a standard Hive installation, allowing for immediate feedback after queries, hence making them more interactive
- Today: Apache Impala is open source

It didn't take long for other vendors to take notice of the need for SQL-on-Hadoop, and in recent years we saw almost every vendor joining the bandwagon and offering their own query engines (IBM's BigSQL platform or Oracle's Big Data SQL, for instance)
- Some better, some worse
- But…
Hype meets reality

"In a tech startup industry that loves its shiny new objects, the term 'Big Data' is in the unenviable position of sounding increasingly '3 years ago'" – Matt Turck

Hadoop was created in 2006! It's now been more than a decade since Google's papers on MapReduce
- Interest in the concept of "Big Data" reached fever pitch sometime between 2011 and 2014
- Big Data was the new "black", "gold" or "oil"
- There's an increasing sense of having reached some kind of plateau
2015 was probably the year when people started moving to AI and its many related concepts and flavors: machine intelligence, deep learning, etc.
- Today, we're in the midst of a new "AI summer" (with its own hype as well)
Hype meets reality

Big Data wasn't a very likely candidate for the type of hype it experienced in the first place
- Big Data, fundamentally, is… plumbing
- There's a reason why most map-reduce examples don't go much further than counting words
The early years of the Big Data phenomenon were propelled by a very symbiotic relationship among a core set of large Internet companies
- Fast forward a few years, and we're now in the thick of the much bigger, but also trickier, opportunity: adoption of Big Data technologies by a broader set of companies
- Those companies do not have the luxury of starting from scratch
- Big Data success is not about implementing one piece of technology (like Hadoop), but instead requires putting together a collection of technologies, people and processes
Today

Today, the data side of the field has stabilized: the storage and querying aspect has found a good marriage between big data techniques, speed, a return to relational databases, and NoSQL-style scalability
- E.g. Amazon Redshift, Snowflake, CockroachDB, Presto…

"Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse"

(We'll revisit this later when talking about NoSQL)
Big Analytics?

"What do you mean it's a file system and some MapReduce? How do we query this thing? We need SQL! Hive is too slow! Can we do it without MapReduce?"

Most managers worth their salt have realized that Hadoop-based solutions might not be the right fit
- Proper cloud-based databases might be
But the big unanswered question right now:

"How to use Hadoop for machine learning and analytics?"

Or rather:

"How to support distributed analytics?"
Big Analytics?

It turns out that MapReduce was never very well suited for analytics
- Extremely hard to convert techniques to a map-reduce paradigm
- Slow due to lots of in-out swapping to HDFS
- Ask the Mahout project, they tried
- Slow for most "online" tasks…
Querying is nice, but… we just end up with business intelligence dashboarding and pretending we have big data?

"2015 was the year of Apache Spark"
- Bye bye, Hadoop!
- Spark has been embraced by a variety of players, from IBM to Cloudera-Hortonworks
- Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop's MapReduce), easier to program, and lends itself well to machine learning
Spark
Time for a spark

Just as Hadoop was perhaps not the right solution to satisfy common querying needs, it was also not the right solution for analytics

In 2015, another project, Apache Spark, entered the scene in full with a radically different approach

Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop's MapReduce), easier to program, and lends itself well to machine learning
Spark

Apache Spark is a top-level project of the Apache Software Foundation. It is an open-source, distributed, general-purpose cluster computing framework with an in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (streaming processing), with rich, concise, high-level APIs for the programming languages Scala, Python, Java, R, and SQL

Spark's speed, simplicity, and broad support for existing development environments and storage systems make it popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time

The project supporting Spark's ongoing development is one of Apache's largest and most vibrant, with over 500 contributors from more than 200 organizations responsible for code in the current software release
So what do we throw out?

The resource manager (YARN)?
- We're still running on a cluster of machines
- Spark can run on top of YARN, but also Mesos (an alternative resource manager), or even in standalone mode

The data storage (HDFS)?
- Again, Spark can work with a variety of storage systems
- Google Cloud
- Amazon S3
- Apache Cassandra
- Apache Hadoop (HDFS)
- Apache HBase
- Apache Hive
- Flat files (JSON, Parquet, CSV, others)
So what do we throw out?

One thing that we do "kick out" is MapReduce
- Spark is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning (aha!)

How? Apache Spark replaces the MapReduce paradigm with an advanced DAG execution engine that supports cyclic data flow and in-memory computing
- A smarter way to distribute jobs over machines!
- Note the similarities with previous projects such as Dask…
Spark's building blocks
Spark core

This is the heart of Spark, responsible for management functions such as task scheduling

Spark core also implements the core abstraction to represent data elements: the Resilient Distributed Dataset (RDD)
- The Resilient Distributed Dataset is the primary data abstraction in Apache Spark
- Represents a collection of data elements
- It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient
- Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data
- Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes

Once data is loaded into an RDD, two types of operations can be carried out:
- Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more
- Actions, such as counts, which measure but do not change the original data
RDDs are distributed, fault-tolerant, efficient
RDDs are distributed, fault-tolerant, efficient

Note that an RDD represents a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
- Any sort of element collection: a collection of text lines, a collection of single words, a collection of objects, a collection of images, a collection of instances, …
- The only feature provided is automatic distribution and task management over this collection
- Through transformations and actions: do things with the RDD

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node
- Transformations are said to be lazily evaluated: they are not executed until a subsequent action has a need for the result (see the sketch below)
- Where possible, these RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes
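A minimal PySpark sketch of lazy evaluation and caching, assuming an already running SparkContext named sc (as in the shell examples a few slides further) and a text file README.md:

lines = sc.textFile("README.md")                    # transformation: nothing is read yet
words = lines.flatMap(lambda l: l.split())          # transformation: only the lineage is recorded
spark_words = words.filter(lambda w: "Spark" in w)  # still lazy, still no execution

spark_words.cache()          # ask Spark to keep this RDD in memory once it is computed
print(spark_words.count())   # action: the whole chain executes now
print(spark_words.take(5))   # second action: served from the cached RDD, no re-read of the file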
RDDs are distributed, fault-tolerant, efficient
Writing Spark-based programs

As with MapReduce, an application using the RDD framework can be coded in Java or Scala and packaged as a JAR file to be launched on the cluster

However, Spark also provides an interactive shell interface (Spark shell) to its cluster environment

And it also exposes APIs to work directly with the RDD concept in a variety of languages:
- Scala
- Java
- Python
- R
- SQL
Spark shell (pyspark)

PySpark is the "driver program": it runs on the client and will set up a "SparkContext" (a connection to the Spark cluster)

>>> textFile = sc.textFile("README.md")  # sc is the SparkContext
# textFile is now an RDD (each element represents a line of text)

>>> textFile.count()  # Number of items in this RDD
126

>>> textFile.first()  # First item in this RDD
u'# Apache Spark'

# Chaining together a transformation and an action:
# How many lines contain "Spark"?
>>> textFile.filter(lambda line: "Spark" in line).count()
15
SparkContext

SparkContext sets up internal services and establishes a connection to a Spark execution environment

Data operations are not executed on your machine: the client sends them to be executed by the Spark cluster!
- No data is loaded in the client… unless you'd perform a .toPandas() (see the sketch below)
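A short sketch of this driver-versus-cluster split, again assuming a live SparkContext sc and the README.md file from the shell example:

rdd = sc.textFile("README.md")   # the data stays distributed on the cluster
n = rdd.count()                  # only a single number travels back to the client
local_lines = rdd.collect()      # now the full content is pulled into the driver: be careful with large data
# Similarly, calling .toPandas() on a Spark DataFrame materializes it entirely in the client's memory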
Deploying an application

Alternative to the interactive mode:

from pyspark import SparkContext

# Set up context ourselves
sc = SparkContext("local", "Simple App")

logData = sc.textFile("README.md")
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()

Execute using:

/bin/spark-submit MyExampleApp.py
Lines with a: 46, lines with b: 23
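The same script can also be submitted to a real cluster by passing resource options to spark-submit. The flags below are standard spark-submit options; the values are purely illustrative and depend on your cluster:

/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2G \
  MyExampleApp.py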
More on the RDD API

So what can we do with RDDs?

Transformations:
- map(func)
- filter(func)
- flatMap(func)
- mapPartitions(func)
- sample(withReplacement, fraction, seed)
- union(otherRDD)
- intersection(otherRDD)
- distinct()
- groupByKey()
- reduceByKey(func)
- sortByKey()
- join(otherRDD)

Actions:
- reduce(func)
- count()
- first()
- take(n)
- takeSample(withReplacement, n)
- saveAsTextFile(path)
- countByKey()
- foreach(func)
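A hedged sketch chaining a few of the operations listed above, reusing the (genre, pages) documents list from the earlier MapReduce example and assuming a live SparkContext sc:

rdd = sc.parallelize(documents)              # pair RDD of (genre, pages) tuples
totals = rdd.filter(lambda kv: kv[1] > 15) \
            .reduceByKey(lambda a, b: a + b) \
            .sortByKey()                     # transformations: all lazy
print(totals.collect())                      # action: triggers the actual execution
print(rdd.countByKey())                      # action: number of documents per genre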
Examples

https://github.com/wdm0006/DummyRDD
- A test class that walks like an RDD and talks like an RDD, but is actually just a list
- No real Spark behind it
- Nice for testing and learning, however

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# Make an RDD from a Python list: a collection of numbers
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())
print(rdd.map(lambda x: x**2).collect())
Examples: word count

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# Make an RDD from a text file: collection of lines
text_file = sc.textFile("kuleuven.txt")

counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

print(counts)
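Note that counts is still an RDD; an action is needed to pull results back. A possible follow-up using the real Spark RDD API (the DummyRDD test class may not implement every method, so treat this as a sketch):

# Trigger execution and fetch the five most frequent words
top5 = counts.takeOrdered(5, key=lambda wc: -wc[1])
print(top5)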
Examples: filtering

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

rdd = sc.parallelize(list(range(1, 21)))

print(rdd.filter(lambda x: x % 3 == 0).collect())
SparkSQL, DataFrames and Datasets
These RDDs still "feel" a lot like MapReduce…

Indeed, many operations are familiar: map, reduce, reduceByKey, …
- But remember: the actual execution is more optimized

However, from the perspective of the user, this is still very low-level
- Nice if you want low-level control to perform transformations and actions on your dataset
- Or when your data is unstructured, such as streams of text
- Or you actually want to manipulate your data with functional programming constructs
- Or you don't care about imposing a schema, such as columnar format

But what if you do want to work with tabular structured data… like a data frame? (compare the two styles in the sketch below)
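A hedged side-by-side sketch of the same aggregation at RDD level versus DataFrame level, assuming a SparkContext sc and a SparkSession spark (the latter is introduced on the next slides):

# RDD level: you manage the (key, value) plumbing yourself
rdd = sc.parallelize([("drama", 200), ("action", 20), ("drama", 140)])
print(rdd.reduceByKey(lambda a, b: a + b).collect())

# DataFrame level: named columns and a declarative API
df = spark.createDataFrame(rdd, ["genre", "pages"])
df.groupBy("genre").sum("pages").show()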
SparkSQL

Like Apache Spark in general, SparkSQL is all about distributed in-memory computations

SparkSQL builds on top of Spark Core with functionality to load and query structured data using queries that can be expressed using SQL, HiveQL, or through high-level APIs similar to e.g. pandas (called the "DataFrame" and "Dataset" APIs in Spark)

At the core of SparkSQL is the Catalyst query optimizer

Since Spark 2.0, Spark SQL is the primary and feature-rich interface to Spark's underlying in-memory distributed platform (hiding Spark Core's RDDs behind higher-level abstractions)
SparkSQL

# Note the difference: SparkSession instead of SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL example") \
    .getOrCreate()

# A Spark "DataFrame"
df = spark.read.json("people.json")

df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
SparkSQL

df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+
SparkSQL

df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+

df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+
SparkSQL

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
DataFrames

Like an RDD, a DataFrame is an immutable distributed collection of data elements
- Extends the "free-form" elements by imposing that every element is organized as a set of values in named columns, e.g. (age=30, name=Seppe)
- Imposes some additional structure on top of RDDs
- Designed to make processing of large data sets easier

This allows for an easier and higher-level abstraction
- Provides a domain specific language API to manipulate your distributed data (see the examples above and the sketch below)
- Makes Spark accessible to a wider audience
- Finally, much more in line with what data scientists are actually used to
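A small sketch of this "named columns" idea, assuming a SparkSession spark: each element becomes a structured Row instead of a free-form tuple.

from pyspark.sql import Row

people = spark.createDataFrame([Row(name="Seppe", age=30),
                                Row(name="Andy", age=19)])
people.printSchema()                              # schema is derived from the named fields
people.filter(people["age"] > 20).select("name").show()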
DataFrames

- pyspark.sql.SparkSession: Main entry point for DataFrame and SQL functionality
- pyspark.sql.DataFrame: A distributed collection of data grouped into named columns
- pyspark.sql.Row: A row of data in a DataFrame
- pyspark.sql.Column: A column expression in a DataFrame
- pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy()
- pyspark.sql.DataFrameNaFunctions: Methods for handling missing data (null values)
- pyspark.sql.DataFrameStatFunctions: Methods for statistics functionality
- pyspark.sql.functions: List of built-in functions available for DataFrame
- pyspark.sql.types: List of data types available
- pyspark.sql.Window: For working with window functions
DataFrames

Class pyspark.sql.DataFrame: A distributed collection of data grouped into named columns:
- agg(*exprs): Aggregate on the entire DataFrame without groups
- columns: Returns all column names as a list
- corr(col1, col2, method=None): Calculates the correlation of two columns
- count(): Returns the number of rows in this DataFrame
- cov(col1, col2): Calculate the sample covariance for the given columns
- crossJoin(other): Returns the cartesian product with another DataFrame
- crosstab(col1, col2): Computes a pair-wise frequency table of the given columns
- describe(*cols): Computes statistics for numeric and string columns
- distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame
- drop(*cols): Returns a new DataFrame that drops the specified column
- dropDuplicates(subset=None): Returns a new DataFrame with duplicate rows removed
- dropna(how='any', thresh=None, subset=None): Returns a new DataFrame omitting rows with null values
- fillna(value, subset=None): Replace null values
DataFrames

Class pyspark.sql.DataFrame: A distributed collection of data grouped into named columns:
- filter(condition): Filters rows using the given condition; where() is an alias for filter()
- first(): Returns the first row as a Row
- foreach(f): Applies the f function to all rows of this DataFrame
- groupBy(*cols): Groups the DataFrame using the specified columns
- head(n=None): Returns the first n rows
- intersect(other): Returns the intersection with another DataFrame
- join(other, on=None, how=None): Joins with another DataFrame, using the given join expression
- orderBy(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s)
- printSchema(): Prints out the schema in the tree format
- randomSplit(weights, seed=None): Randomly splits this DataFrame with the provided weights
- replace(to_replace, value, subset=None): Returns a new DataFrame replacing a value with another value
- select(*cols): Projects a set of expressions and returns a new DataFrame
- toPandas(): Returns the contents of this DataFrame as a Pandas data frame
- union(other): Returns a new DataFrame containing the union of rows in this frame and another frame
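A brief sketch combining a few of the methods listed above, assuming a SparkSession spark (the data values are purely illustrative):

df = spark.createDataFrame([(30, "Andy"), (19, "Justin"), (25, "Michael")],
                           ["age", "name"])
df.describe("age").show()                       # count, mean, stddev, min, max for the column
df.filter(df["age"] > 20).orderBy("age").show() # still executed on the cluster
local = df.toPandas()                           # only now does the data land in the client as a pandas DataFrame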
DataFrames

Can be loaded in from:
- Parquet files
- Hive tables
- JSON files
- CSV files (as of Spark 2)
- JDBC (to connect with a database)
- AVRO files (using the "spark-avro" library, or built-in in Spark 2.4)
- Normal RDDs (given that you specify or infer a "schema")

Can also be converted back to a standard RDD
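A hedged sketch of loading and converting, assuming a SparkSession spark; the file paths and the genre/nrPages column names are hypothetical:

# Reading structured files
df_csv = spark.read.csv("books.csv", header=True, inferSchema=True)
df_parquet = spark.read.parquet("books.parquet")

# A DataFrame can be converted back to a plain RDD of Row objects...
rdd = df_csv.rdd
# ...and an RDD of tuples can be promoted to a DataFrame by supplying column names
df_again = spark.createDataFrame(rdd.map(lambda r: (r["genre"], r["nrPages"])),
                                 ["genre", "nrPages"])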
SparkR

Implementation of the Spark DataFrame API for R
- An R package that provides a light-weight frontend to use Apache Spark from R
- Way of working very similar to dplyr
- Can convert R data frames to SparkDataFrame objects

df <- as.DataFrame(faithful)

groupBy(df, df$waiting) %>%
  summarize(count = n(df$waiting)) %>%
  head(3)

##  waiting count
##1      70     4
##2      67     1
##3      69     2
Datasets

Spark Datasets is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface
- Introduced in Spark 1.6
- Like DataFrames, Datasets take advantage of Spark's optimizer by exposing expressions and data fields to a query planner
- Datasets extend these benefits with compile-time type safety, meaning production applications can be checked for errors before they are run
- A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema
- At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and the tabular representation

Core idea: where a DataFrame represents a collection of Rows (with a number of named Columns), a Dataset represents a collection of typed objects (with their according typed fields) which can be converted from and to table rows
Datasets

Since Spark 2.0, the DataFrame API has been merged with the Dataset API, unifying data processing capabilities across libraries
- Because of this unification, developers now have fewer concepts to learn or remember, and work with a single high-level and type-safe API called Dataset
- However, DataFrame as a name is still used: a DataFrame is a Dataset[Row], i.e. a collection of generic Row objects
Datasets

Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API
- Dataset represents a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
- Consider DataFrame as an alias for Dataset[Row], where a Row represents a generic untyped JVM object
- Since Python and R have no compile-time type-safety, there's only the untyped API, namely DataFrames

Language: Main Abstraction
- Scala: Dataset[T] & DataFrame (= Dataset[Row])
- Java: Dataset[T]
- Python: DataFrame (= Dataset[Row])
- R: DataFrame (= Dataset[Row])
Datasets

Benefits:
- Static typing and runtime type-safety: both syntax and analysis errors can now be caught during compilation of our program
- High-level abstraction and custom view into structured and semi-structured data
- Ease-of-use of APIs with structure
- Performance and optimization

For us R and Python users, we can continue using DataFrames knowing that they are built on Dataset[Row]
- The most common use case anyway
- (A more detailed example will be posted in the background information for those interested)
MLlib
MLlib

MLlib is Spark's machine learning (ML) library
- Its goal is to make practical machine learning scalable and easy
- Think of it as a "scikit-learn"-on-Spark

Provides:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines (see the sketch below)
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.
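A minimal Pipeline sketch in PySpark, assuming DataFrames training and test exist with a "text" column and a numeric "label" column (the column names and data are assumptions for illustration):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)        # fits all stages in order on the training data
predictions = model.transform(test)   # applies the same feature steps to new data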
MLlib

As of Spark 2.0, the primary Machine Learning API for Spark is the DataFrame-based API in the spark.ml package
- Before: spark.mllib was RDD-based
- Not a very helpful way of working
- MLlib still supports the RDD-based API in spark.mllib
- Since Spark 2.3, MLlib's DataFrame-based API has reached feature parity with the RDD-based API
- After reaching feature parity, the RDD-based API will be deprecated
- The RDD-based API is expected to be removed in Spark 3.0
- Why: DataFrames provide a more user-friendly API than RDDs
MLlib

Classification:
- Logistic regression
- Decision tree classifier
- Random forest classifier
- Gradient-boosted tree classifier
- Multilayer perceptron classifier
- One-vs-Rest classifier (One-vs-All)
- Naive Bayes

Regression:
- Linear regression
- Generalized linear regression
- Decision tree regression
- Random forest regression
- Gradient-boosted tree regression
- Survival regression
- Isotonic regression

Clustering:
- K-means
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Gaussian Mixture Model (GMM)

Recommender systems:
- Collaborative filtering

Validation routines
MLlib example

from pyspark.ml.classification import LogisticRegression

training = spark.read.format("libsvm").load("data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)

print("Coefs: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

from pyspark.ml.clustering import KMeans

dataset = spark.read.format("libsvm").load("data/data.txt")

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
predictions = model.transform(dataset)

centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
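Continuing the classification example above, a hedged sketch of holding out a test set and evaluating the model (the split ratio and seed are arbitrary choices):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = training.randomSplit([0.8, 0.2], seed=42)
model = lr.fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Accuracy: %g" % evaluator.evaluate(predictions))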
Conclusions so Far
Spark versus…

Spark: a high-performance in-memory data-processing framework
- Has been widely adopted and is still one of the main computing platforms today

Versus:

MapReduce (a mature batch-processing platform for the petabyte scale):
- Spark is faster, better suited in an online, analytics setting, and implements data frame and ML concepts and algorithms

Apache Tez: "aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data"
- Hortonworks: Spark is a general purpose engine with APIs for mainstream developers, while Tez is a framework for purpose-built tools such as Hive and Pig
- Cloudera was rooting for Spark, Hortonworks for Tez (a few years ago…)
- Today: Tez is out! (Hortonworks had to also adopt Spark, and merged with Cloudera)

Apache Mahout: "the goal is to build an environment for quickly creating scalable performant machine learning applications"
- A simple and extensible programming environment and framework for building scalable algorithms
- A wide variety of premade algorithms for Apache Spark, H2O, Apache Flink
- Before: also a "MapReduce all things" approach
- Kind of an extension to Spark
- Though most of the algorithms are also in MLlib… so not that widely used any more!
Spark versus…

One contender that is still very much in the market is H2O (http://www.h2o.ai/)

"H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform"

- The core is written in Java. Inside H2O, a distributed key/value store is used to access and reference data, models, objects, etc., across all nodes and machines
- The algorithms are implemented on top of H2O's distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading
- The data is read in parallel, distributed across the cluster, and stored in memory in a columnar format in a compressed way
- H2O's REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP

They also had the idea of coming up with a better "MapReduce" engine
- Based on a distributed key-value store
- In-memory map/reduce
- Can work on top of Hadoop (YARN) or standalone
- Though not as efficient as Spark's engine
H2O
H2O

However, H2O was quick to realize the benefits of Spark, and the role they could play: "customers want to use Spark SQL to make a query, feed the results into H2O Deep Learning to build a model, make their predictions, and then use the results again in Spark"

"Sparkling Water"