
AI and Predictive Analytics in Data-Center Environments: Distributed Computing using Spark



  1. AI and Predictive Analytics in Data-Center Environments. Distributed Computing using Spark: An Introduction to Spark Environments. Josep Ll. Berral @BSC. Intel Academic Education Mindshare Initiative for AI.

  2. Presentation: Distributed computing using Apache Spark! • Apache Spark is a framework for processing data in a distributed manner • We use it for distributing our experiments and analytics

  3. Introduction: “Describe what to execute and let Spark distribute it for execution”

  4. Introduction to Spark • What is Apache Spark • Cluster Computing Framework • Programming clusters with data parallelism and fault tolerance • Programmable in Java, Scala, Python and R
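To make slide 3's idea concrete, here is a minimal sketch in Python (PySpark), assuming pyspark is installed; the app name and the local[*] master are illustrative choices, not part of the original slides:

    from pyspark.sql import SparkSession

    # Start a session that runs locally on all available cores.
    spark = (SparkSession.builder
             .appName("intro-example")
             .master("local[*]")
             .getOrCreate())

    # Describe the computation: distribute a collection over 4 partitions
    # and map a function; nothing runs until an action (collect) is called.
    rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
    print(rdd.map(lambda x: x * x).collect())  # [0, 1, 4, ..., 81]

    spark.stop()

The same script works unchanged on a real cluster by pointing the master at the cluster's URL, which is exactly the “describe what to execute and let Spark distribute it” idea.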

  5. Motivation for using Spark • Spark schedules data parallelism • The user defines the set of operations to be performed • Spark performs an orchestrated execution • Libraries of distributed algorithms: ML, graphs, streaming, DB queries (see the sketch below) [Figure: input data split into partitions d1, d2, d3, each processed by an “exp” (experiment) task in parallel]
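As a sketch of the partition-per-experiment pattern in the figure, the following Python (PySpark) snippet runs one task per data partition; run_experiment is a hypothetical placeholder for whatever analytic each chunk should receive:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("per-partition-experiments")
             .master("local[4]")
             .getOrCreate())

    def run_experiment(rows):
        # Placeholder analytic: the mean of this partition's values.
        rows = list(rows)
        yield sum(rows) / len(rows) if rows else 0.0

    # The 4 partitions play the role of d1..d4; Spark orchestrates the tasks.
    data = spark.sparkContext.parallelize(range(100), numSlices=4)
    print(data.mapPartitions(run_experiment).collect())  # one result per partition

    spark.stop()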

  6. Motivation for using Spark • It works with the Hadoop Distributed File System (HDFS) • Takes advantage of distributed file systems • Brings the execution to where the data is distributed [Figure: partitions d1, d2, d3 stored in HDFS, with an “exp” task shipped to each partition]
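A sketch of reading data that already lives in HDFS, in Python (PySpark); the namenode address and file path are hypothetical placeholders, and the master URL is assumed to be supplied at submit time (e.g. through spark-submit):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

    # Spark schedules each task close to the HDFS blocks of its split,
    # bringing the execution to the data rather than the data to us.
    df = spark.read.csv("hdfs://namenode:9000/data/experiments.csv", header=True)
    print(df.count())

    spark.stop()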

  7. Introduction to Apache Spark • Cluster Computing Framework 1. Define your cluster (master and workers)

  8. Introduction to Apache Spark • Cluster Computing Framework 1. Define your cluster (master and workers) 2. Link it to your distributed file system (DFS)

  9. Introduction to Apache Spark • Cluster Computing Framework 1. Define your cluster (master and workers) 2. Link it to your distributed file system (DFS) 3. Start a session / create an app

  10. Introduction to Apache Spark • Cluster Computing Framework 1. Define your cluster (master and workers) 2. Link it to your distributed file system (DFS) 3. Start a session / create an app 4. Let Spark plan and execute the workflow and data-flow
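Steps 3 and 4 look like the following in Python (PySpark) against a standalone cluster; the master hostname and port are hypothetical placeholders for a cluster defined in steps 1 and 2:

    from pyspark.sql import SparkSession

    # Step 3: start a session / create an app on the cluster.
    spark = (SparkSession.builder
             .appName("my-app")
             .master("spark://master-host:7077")
             .getOrCreate())

    # Step 4: describe the work; Spark plans it as a DAG of stages
    # and executes the resulting tasks on the workers.
    print(spark.sparkContext.parallelize(range(1000)).sum())  # 499500

    spark.stop()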

  11. Introduction to Apache Spark • Distributed Data and Shuffling • Spark takes advantage of data distribution • If operations need to combine data held in different places, shuffling occurs: data must be exchanged among workers • We must keep this in mind when designing the analytics [Figure: partitions d1, d2 are processed locally, then records are exchanged (shuffled) into new partitions r1, r2 before processing continues]
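A small Python (PySpark) sketch of a shuffle: reduceByKey is a wide transformation, so matching keys from different partitions must be brought together over the network (after local pre-aggregation):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("shuffle-demo")
             .master("local[4]")
             .getOrCreate())

    pairs = spark.sparkContext.parallelize(
        [("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=4)

    # map is narrow (no data movement); reduceByKey is wide (shuffle):
    # partial sums are computed per partition, then exchanged by key.
    print(pairs.reduceByKey(lambda x, y: x + y).collect())  # e.g. [('a', 4), ('b', 6)]

    spark.stop()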

  12. Virtualized Environments • Cloud environments • Take advantage of Virtualization/Containers

  13. Virtualized Environments • Cloud environments • Take advantage of Virtualization/Containers • Worker image: 2 CPU, 16 GB mem, 1 TB disk • Master image: 4 CPU, 32 GB mem, 2 TB disk • Instructions to the VM/Container manager: “Deploy N workers and 1 master”, “Create a virtual network to let them see each other”, “Give them a common configuration (master can find the workers, workers can find the DFS or find the files, ...)”
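In Python (PySpark), the “common configuration” shared by the containers might look like the following sketch; the network aliases spark-master and namenode are hypothetical names resolved on the virtual network, not values from the slides:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("containerized-app")
             .master("spark://spark-master:7077")  # the master's network alias
             .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")  # shared DFS
             .getOrCreate())

With one configuration like this baked into the images, the app can find the master, and the workers can resolve the same DFS and find the files.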

  14. Summary • What is Spark: a distributed computing framework • Spark's distributed architecture: master and workers • Distributing experiments and data • Leveraging virtualization: how we can deploy/scale using VMs and containers
