Distributing Matrix Computations with Spark MLlib Reza Zadeh

A General Platform
Standard libraries included with Spark, all built on Spark Core:
- Spark SQL (structured)
- Spark Streaming (real-time)
- MLlib (machine learning)
- GraphX (graph)
- …

Outline
- Introduction to MLlib
- Example Invocations
- Benefits of Iterations: Optimization
- Singular Value Decomposition
- All-pairs Similarity Computation
- MLlib + {Streaming, GraphX, SQL}

Introduction

MLlib History
- MLlib is a Spark subproject providing machine learning primitives
- Initial contribution from AMPLab, UC Berkeley
- Shipped with Spark since Sept 2013

MLlib: Available algorithms
- classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
- regression: generalized linear models (GLMs), regression tree
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means||
- decomposition: SVD, PCA
- optimization: stochastic gradient descent, L-BFGS

Example Invocations

Example: K-means

Example: PCA

Example: ALS

Benefits of fast iterations

Optimization
At least two large classes of optimization problems humans can solve:
- Convex Programs
- Spectral Problems (SVD)

Optimization - LR
    import numpy
    from math import exp

    # Load the points once and cache them so every iteration reuses the in-memory RDD
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        # Logistic-loss gradient summed over all points (labels p.y assumed to be -1 or +1);
        # the "- 1" term is what makes "w -= gradient" a descent step
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print("Final w: %s" % w)
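The loop above needs a cluster, a `readPoint` parser, and values for `D` and `iterations`. As a minimal local sketch of the same map/reduce pattern, the version below swaps the RDD for a plain Python list. The toy data set `POINTS` and the helper names are hypothetical, labels are assumed to be -1 or +1, and the step size is left at 1 as in the loop above.

```python
from functools import reduce
from math import exp

# Hypothetical toy data: (features, label) pairs with labels in {-1, +1}.
POINTS = [
    ((1.0, 2.0), 1),
    ((2.0, 1.5), 1),
    ((-1.0, -1.5), -1),
    ((-2.0, -1.0), -1),
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def point_gradient(w, point):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    x, y = point
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def train(points, iterations=10):
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        # map: one gradient per point; reduce: element-wise sum of those gradients
        gradient = reduce(add, map(lambda p: point_gradient(w, p), points))
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w


weights = train(POINTS)
```

Because the data is reused on every pass, this is the access pattern that benefits from `cache()` when the list is replaced by an RDD.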

Spark PageRank
- Using cache(), keep neighbor lists in RAM
- Using partitioning, avoid repeated hashing
[Diagram: Neighbors (id, edges), partitioned with partitionBy, joined with Ranks (id, rank) on each iteration]

PageRank Results
[Bar chart: time per iteration (s). Hadoop: 171; Basic Spark: 72; Spark + Controlled Partitioning: 23]

Spark PageRank
Generalizes to Matrix Multiplication, opening many algorithms from Numerical Linear Algebra
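To see why PageRank generalizes to matrix multiplication, note that each iteration multiplies the (sparse) link matrix by the current rank vector. The sketch below is a single-machine illustration of that step, not the Spark implementation: the `pagerank` function, the damping factor of 0.85, and the three-page graph are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(neighbors, iterations=20, damping=0.85):
    """neighbors maps each page id to the list of page ids it links to."""
    n = len(neighbors)
    ranks = {page: 1.0 / n for page in neighbors}
    for _ in range(iterations):
        # Pair each page's rank with its neighbor list (the "join" in Spark)
        # and send an equal share of the rank along every outgoing edge.
        contribs = {page: 0.0 for page in neighbors}
        for page, edges in neighbors.items():
            share = ranks[page] / len(edges)
            for dest in edges:
                contribs[dest] += share
        # Summing the shares per destination is one sparse matrix-vector product.
        ranks = {page: (1 - damping) / n + damping * c
                 for page, c in contribs.items()}
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The neighbor lists never change between iterations, which is why keeping them cached and partitioned pays off; only the rank vector is rewritten each round.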

Deep Dive: Singular Value Decomposition

Singular Value Decomposition
- Two cases: Tall and Skinny vs. roughly Square
- The computeSVD function takes care of which one to call, so you don't have to

SVD selection

Tall and Skinny SVD

Tall and Skinny SVD
- The eigendecomposition A^T A = V Σ^2 V^T gets us V and the singular values
- U = A V Σ^{-1} gets us U by one matrix multiplication
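A rough sketch of the tall-and-skinny route, assuming the usual identities A^T A = V Σ^2 V^T and U = A V Σ^{-1}: the small n x n Gramian is accumulated row by row (the part that would be distributed), and the rest is local. To stay dependency-free, this version recovers only the leading singular triplet and swaps in plain power iteration for a full eigensolver. All function names are hypothetical.

```python
from math import sqrt


def gramian(rows):
    """Sum of outer products a_i a_i^T over the rows: the n x n matrix A^T A."""
    n = len(rows[0])
    g = [[0.0] * n for _ in range(n)]
    for row in rows:  # on a cluster: one map per row, then a reduce
        for j in range(n):
            for k in range(n):
                g[j][k] += row[j] * row[k]
    return g


def top_eigenpair(g, iterations=200):
    """Leading eigenpair by power iteration (stand-in for a local eigensolver)."""
    n = len(g)
    v = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(g[j][k] * v[k] for k in range(n)) for j in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    gv = [sum(g[j][k] * v[k] for k in range(n)) for j in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, gv))
    return eigenvalue, v


def top_singular_triplet(rows):
    eigenvalue, v = top_eigenpair(gramian(rows))
    sigma = sqrt(eigenvalue)
    # U = A V Sigma^{-1}, restricted to the leading column
    u = [sum(a * b for a, b in zip(row, v)) / sigma for row in rows]
    return u, sigma, v


A = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
u, sigma, v = top_singular_triplet(A)
```

The point of the design is that only the n x n Gramian ever has to fit on one machine, which is cheap when the matrix is tall and skinny.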

Square SVD via ARPACK
- Very mature Fortran77 package for computing eigenvalue decompositions
- JNI interface available via netlib-java
- Distributed using Spark

Square SVD via ARPACK
- Only needs to compute matrix vector multiplies to build Krylov subspaces
- The result of matrix-vector multiply is small
- The multiplication can be distributed
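The distributed piece can be sketched as follows, under the assumption that the operator handed to the eigensolver is v -> (A^T A) v. Each partition of rows contributes the sum of a_i * (a_i . v) over its rows, and only the short result vectors are combined, so the Gramian itself is never formed. The function names and the two-partition layout are hypothetical; the Krylov iteration that would call this operator is not shown.

```python
def partition_product(partition, v):
    """One partition's contribution to (A^T A) v: sum of a_i * (a_i . v)."""
    out = [0.0] * len(v)
    for row in partition:
        scale = sum(a * b for a, b in zip(row, v))
        for j, a in enumerate(row):
            out[j] += scale * a
    return out


def gramian_times_vector(partitions, v):
    """Combine the per-partition vectors; each one has only len(v) entries."""
    total = [0.0] * len(v)
    for partial in map(lambda part: partition_product(part, v), partitions):
        total = [t + p for t, p in zip(total, partial)]
    return total


# Rows of A = [[1, 2], [3, 4], [5, 6]] split across two hypothetical partitions.
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
result = gramian_times_vector(partitions, [1.0, 1.0])
```

Here A v = [3, 7, 11], so A^T (A v) = [79, 100], which is what the two partial sums [24, 34] and [55, 66] add up to.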

Deep Dive: All pairs Similarity

Deep Dive: All pairs Similarity
- Compute via DIMSUM: "Dimension Independent Similarity Computation using MapReduce"
- Will be in Spark 1.2 as a method in RowMatrix
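The details of the sampling scheme are not spelled out in the text here, so the following is only a sketch of the general idea as it is commonly described: all-pairs cosine similarity between columns is a sum of per-row products, and pairs involving high-norm columns can be sampled and rescaled so the estimate stays unbiased. The parameter name `gamma`, the sparse-row representation, and all function names are assumptions. With `gamma` left at infinity nothing is sampled and the result is the exact cosine similarity.

```python
import random
from math import sqrt, inf


def column_norms(rows, n):
    sums = [0.0] * n
    for row in rows:
        for j, value in row.items():
            sums[j] += value * value
    return [sqrt(s) if s > 0 else 1.0 for s in sums]


def column_similarities(rows, n, gamma=inf, seed=0):
    """rows: sparse rows as {column: value}. Returns {(j, k): estimate} for j < k."""
    rng = random.Random(seed)
    norms = column_norms(rows, n)
    sg = sqrt(gamma)
    keep = [sg / c for c in norms]        # keep-probability per column (>= 1: always)
    scale = [min(sg, c) for c in norms]   # divisor that keeps the estimate unbiased
    sims = {}
    for row in rows:                      # "map": emit pair products per row
        cols = sorted(row)
        for a, j in enumerate(cols):
            if keep[j] < 1 and rng.random() >= keep[j]:
                continue
            for k in cols[a + 1:]:
                if keep[k] < 1 and rng.random() >= keep[k]:
                    continue
                contrib = (row[j] / scale[j]) * (row[k] / scale[k])
                sims[(j, k)] = sims.get((j, k), 0.0) + contrib  # "reduce": sum per pair
    return sims


rows = [
    {0: 1.0, 1: 1.0},
    {0: 1.0, 2: 1.0},
    {0: 1.0, 1: 1.0, 2: 1.0},
]
exact = column_similarities(rows, 3)
sampled = column_similarities(rows, 3, gamma=1.0, seed=42)
```

A smaller `gamma` emits fewer pairs for heavy columns at the cost of a noisier estimate, which is the trade-off the sampling is meant to expose.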

All-pairs similarity computation

Naïve Approach

Naïve approach: analysis

DIMSUM Sampling

DIMSUM Analysis

Spark implementation

Ongoing Work in MLlib
- stats library (e.g. stratified sampling, ScaSRS)
- ADMM
- LDA
- General Convex Optimization

MLlib + {Streaming, GraphX, SQL}

MLlib + Streaming
- As of Spark 1.1, you can train linear models in a streaming fashion
- Model weights are updated via SGD, thus amenable to streaming
- More work needed for decision trees
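Why SGD makes a model amenable to streaming can be sketched in a few lines: the weights are the only state, and each mini-batch updates them once and can then be discarded. The example below is an illustration under assumptions, not the MLlib streaming API: it uses squared loss, a fixed step size of 0.1, and a hypothetical generator `make_stream` that stands in for incoming batches drawn from y = 2 * x1 - x2.

```python
def sgd_update(weights, batch, step_size=0.1):
    """One SGD step on the mean squared-error gradient of a mini-batch."""
    gradient = [0.0] * len(weights)
    for features, label in batch:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for j, x in enumerate(features):
            gradient[j] += error * x / len(batch)
    return [w - step_size * g for w, g in zip(weights, gradient)]


def train_on_stream(batches, num_features):
    weights = [0.0] * num_features
    for batch in batches:  # each batch is seen once and then dropped
        weights = sgd_update(weights, batch)
    return weights


def make_stream(n_batches=1000):
    """Hypothetical stream of noise-free mini-batches from y = 2 * x1 - 1 * x2."""
    for t in range(n_batches):
        xs = [(float(t % 3), 1.0), (1.0, float(t % 2)), (0.5, 2.0)]
        yield [(x, 2.0 * x[0] - 1.0 * x[1]) for x in xs]


weights = train_on_stream(make_stream(), 2)
```

A decision tree has no such small running state to update in place, which is one way to read the remark that more work is needed there.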

MLlib + SQL
    points = context.sql("select latitude, longitude from tweets")
    model = KMeans.train(points, 10)

MLlib + GraphX

Future of MLlib

General Linear Algebra
- CoordinateMatrix, RowMatrix, BlockMatrix (Goal: version 1.2)
- Local and distributed versions, with operations in-between (Goal: version 1.3)
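To make the three layouts concrete, here is a small sketch that stores the same sparse matrix as coordinate entries, as rows, and as blocks. These are plain Python stand-ins for illustration, not the MLlib classes; the function names and the block size of 2 are assumptions.

```python
from collections import defaultdict

# Coordinate layout: one (row, column, value) triple per nonzero entry.
entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (3, 0, 4.0)]


def to_rows(entries):
    """Row layout: group the entries by row index into sparse rows."""
    rows = defaultdict(dict)
    for i, j, value in entries:
        rows[i][j] = value
    return dict(rows)


def to_blocks(entries, block_size):
    """Block layout: key each entry by the block it falls in, with local indices."""
    blocks = defaultdict(dict)
    for i, j, value in entries:
        key = (i // block_size, j // block_size)
        blocks[key][(i % block_size, j % block_size)] = value
    return dict(blocks)


rows = to_rows(entries)
blocks = to_blocks(entries, 2)
```

Which layout is convenient depends on the operation: row-wise sums such as a Gramian suit the row layout, while multiplying two distributed matrices is easier to express block by block.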

Research Goal: General Convex Optimization
- Distribute CVX by backing CVXPY with PySpark
- Easy-to-express distributable convex programs
- Need to know less math to optimize complicated objectives

Spark and ML
Spark has all its roots in research, so we hope to keep incorporating new ideas!
