AI and Predictive Analytics in Data-Center Environments
Distributed Computing using Spark: An Introduction to Spark Environments
Josep Ll. Berral @BSC
Intel Academic Education Mindshare Initiative for AI
Presentation
Distributed computing using Apache Spark!
• Apache Spark is a framework for processing data in a distributed manner
• For distributing our experiments and analytics
Introduction
“Describe what to execute, and let Spark distribute it for execution”
Introduction to Spark
• What is Apache Spark?
• Cluster Computing Framework
• Programming clusters with data parallelism and fault tolerance
• Programmable in Java, Scala, Python and R (see the Python sketch below)
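As a quick taste of the Python API, here is a minimal sketch of a local PySpark session (the app name and the choice of 4 local threads are arbitrary):

    # Minimal local PySpark session: Spark splits the work across 4 local threads.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("intro-demo") \
        .master("local[4]") \
        .getOrCreate()

    # Distribute a small collection and compute a sum in parallel.
    nums = spark.sparkContext.parallelize(range(1_000_000))
    print(nums.sum())  # 499999500000

    spark.stop()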
Motivation for using Spark
• Spark schedules data parallelism
• The user defines the set of operations to be performed
• Spark performs an orchestrated execution (see the sketch below)
• Libraries of distributed algorithms: ML, Graphs, Streaming, DB queries
[Diagram: input data is split into partitions d1, d2, d3, and the same experiment runs on each partition in parallel]
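A small sketch of that model: transformations only describe the operations to perform, and Spark orchestrates the actual distributed execution when an action asks for a result (same minimal local session as above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").master("local[4]").getOrCreate()

    rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
    squared = rdd.map(lambda x: x * x)             # transformation: nothing runs yet
    evens = squared.filter(lambda x: x % 2 == 0)   # still only building the plan

    # The action triggers the orchestrated, distributed execution.
    print(evens.count())  # 50

    spark.stop()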
Motivation for using Spark
• It works with the Hadoop Distributed File System (HDFS)
• Taking advantage of Distributed File Systems
• Bring execution to where the data is distributed (sketched below)
[Diagram: data already distributed in HDFS as d1, d2, d3; the experiment runs where each block resides]
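A hedged sketch of reading directly from HDFS; the namenode host, port and path are placeholders for your own deployment:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

    # Each HDFS block becomes (roughly) one partition, and Spark schedules
    # tasks on or near the nodes holding those blocks: execution goes to the data.
    lines = spark.read.text("hdfs://namenode:9000/datasets/logs/*.log")
    print(lines.count())

    spark.stop()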
Introduction to Apache Spark
• Cluster Computing Framework
1. Define your cluster (master and workers)
2. Link it to your distributed File System
3. Start a session / Create an app
4. Let Spark plan and execute the workflow and data-flow (sketched below)
[Diagram: a local session submits an app ("Run!") to "My Cluster", which executes it against the DFS]
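The four steps, as a hedged sketch against a standalone cluster; the master URL, the HDFS path and the "experiment" column are assumptions for illustration:

    from pyspark.sql import SparkSession

    # Steps 1-2 happen outside the app: the cluster is up and linked to the DFS.
    # Step 3: start a session / create an app against the cluster's master.
    spark = SparkSession.builder \
        .appName("my-analytics-app") \
        .master("spark://my-cluster-master:7077") \
        .getOrCreate()

    # Step 4: declare the work; Spark plans and executes it across the workers.
    df = spark.read.csv("hdfs://namenode:9000/data/experiments.csv", header=True)
    df.groupBy("experiment").count().show()

    spark.stop()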
Introduction to Apache Spark
• Distributed Data and Shuffling
• Spark takes advantage of data distribution
• If operations need to cross data from different places...
• Shuffling: data must be exchanged among workers
• We must keep this in mind when preparing the analytics (see the sketch below)
[Diagram: partitions d1, d2 are processed locally, records are then exchanged between workers into r1, r2, and processing continues]
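A small sketch of where the shuffle appears: per-record operations stay local to each partition, while key-based aggregations force data to move between workers (the key/value data is made up, reusing the d1/d2/d3 naming from the diagram):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-demo").master("local[2]").getOrCreate()

    pairs = spark.sparkContext.parallelize(
        [("d1", 1), ("d2", 2), ("d1", 3), ("d3", 4)], numSlices=2)

    local = pairs.mapValues(lambda v: v * 10)       # partition-local: no data moves
    totals = local.reduceByKey(lambda a, b: a + b)  # shuffle: equal keys must meet

    print(sorted(totals.collect()))  # [('d1', 40), ('d2', 20), ('d3', 40)]

    spark.stop()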
Virtualized Environments
• Cloud environments
• Take advantage of Virtualization/Containers
• Worker Image: 2 CPU, 16GB Mem, 1TB Disk
• Master Image: 4 CPU, 32GB Mem, 2TB Disk
• VM/Container manager (sketched below):
  • “Deploy N workers and 1 master”
  • “Create a virtual network to let them see each other”
  • “Give them a common configuration (master can find the workers, workers can find the DFS or find the files, ...)”
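A hedged sketch of such a deployment, driving Docker from Python; the image name and the Spark install path inside it are assumptions (port 7077 is Spark's default standalone master port, and SPARK_NO_DAEMONIZE is the standard way to keep Spark's start scripts in the foreground):

    import subprocess

    N_WORKERS = 3
    IMAGE = "my-spark-image"  # hypothetical image with Spark under /opt/spark

    # "Create a virtual network to let them see each other"
    subprocess.run(["docker", "network", "create", "spark-net"], check=True)

    # "Deploy ... 1 master": keep the script in the foreground so the
    # container does not exit once the daemon is launched.
    subprocess.run(["docker", "run", "-d", "--name", "spark-master",
                    "--network", "spark-net", "-e", "SPARK_NO_DAEMONIZE=1",
                    IMAGE, "/opt/spark/sbin/start-master.sh"], check=True)

    # "Deploy N workers", sized like the Worker Image above, each given the
    # common configuration: the master's well-known name on the shared network.
    for i in range(N_WORKERS):
        subprocess.run(["docker", "run", "-d", "--name", f"spark-worker-{i}",
                        "--network", "spark-net", "-e", "SPARK_NO_DAEMONIZE=1",
                        "--cpus", "2", "--memory", "16g",
                        IMAGE, "/opt/spark/sbin/start-worker.sh",
                        "spark://spark-master:7077"], check=True)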
Summary
• What is Spark
• Distributed Computing Framework
• Spark's distributed architecture
• Master and Workers
• Distributing experiments and data
• Leverage Virtualization
• How we can deploy/scale using VMs and Containers