AI and Predictive Analytics in Data-Center Environments
Performance & Executing Experiments
Josep Ll. Berral @BSC
Intel Academic Education Mindshare Initiative for AI
Introduction “We have to choose where/how to run AI algorithms & experiments”
Introduction • Algorithms have computing/data requirements • Computation resources (machines to run our algorithms): CPU, memory, GPUs, accelerators, storage, ... • Time to run: train models, infer new data, ... • Data (data to feed our algorithms): what we are modeling and imitating
Resources • Algorithms have computing requirements [Diagram: algorithm requirements (CPU, memory, disk, time) mapped onto a machine's resources]
Resources • Algorithms also have data requirements [Diagram: data reaching the algorithm from memory, from disks, and from users over the network]
Environment We need a COMPUTING ENVIRONMENT!
Environment • Local Machines • Own computer • Workstations at work • ...
Environment • Cluster Machines • Data-Center at work • Data-Center at labs • Scientific grids • ... [Diagram: jobs submitted from your machine to a data-center/cluster for execution]
Environment • The Cloud • Data-Centers from Resource Providers [Diagram: experiments sent to a resource provider's data-centers]
Environment • Choosing the environment [Diagrams: experiments with different CPU/memory requirements matched against candidate machines, from small configurations (1–4 CPUs, a few GB of memory) to large servers (64–128 CPUs, 128 GB–1 TB of memory) and seemingly unlimited cloud capacity]
Performance “Work per unit of time” “Capacity to progress”
Performance • Performance is (usually) linked to resources: CPU, memory, disk, ...
Performance • Performance is (usually) linked to resources, but only up to a point, not always, and not necessarily linearly
Performance • Computing Environment: a pool of resources • Algorithms/Apps/Experiments: require resources [Diagram: an experiment needing 1 CPU and 32 GB of memory matched against the environment's CPUs, memory and disks]
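A toy sketch of this matching between requirements and the resource pool (the machine list, field names, and requirement values below are hypothetical, purely for illustration):

```python
# Check whether an experiment's resource requirements fit any machine in the
# environment's resource pool (hypothetical pool and field names).

pool = [
    {"name": "workstation", "cpus": 4,  "mem_gb": 16},
    {"name": "server-a",    "cpus": 32, "mem_gb": 128},
    {"name": "server-b",    "cpus": 64, "mem_gb": 256},
]

experiment = {"cpus": 1, "mem_gb": 32}   # e.g. 1 CPU, 32 GB of memory

def fits(machine, req):
    """True if the machine has enough CPUs and memory for the request."""
    return machine["cpus"] >= req["cpus"] and machine["mem_gb"] >= req["mem_gb"]

candidates = [m["name"] for m in pool if fits(m, experiment)]
print("Machines that can run the experiment:", candidates)
```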
Some Theory: Little’s Law • Little’s Law: L = λ · W • “Arrival rate × dedicated time per input = average load”
Some Theory: Little’s Law • Relation between • Received load • Resources/time required • Average load // required resources [Diagram: incoming load vs. the CPUs, memory and disk it keeps busy]
Little’s Law: L = λ W • Example • We submit λ = 100 experiments / hour • Experiments take an average of W = 0.5 hours [with 1 CPU per exp.] • Average number of experiments in our system: L = 100 × 0.5 = 50 experiments [avg. 50 CPUs in use: what we expect, and what we need]
Little’s Law Demo!
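A minimal demo sketch of the example above in Python (the uniformly random arrival pattern used in the simulation part is an assumption; the slides only give λ and W):

```python
import random
from bisect import bisect_left, bisect_right

arrival_rate = 100   # lambda: experiments submitted per hour
service_time = 0.5   # W: hours per experiment (1 CPU each)

# Direct application of Little's Law: L = lambda * W.
print("L =", arrival_rate * service_time, "experiments (and CPUs) busy on average")

# Quick simulation check: spread arrivals over many hours and count how many
# experiments are still running at randomly sampled instants.
hours = 1000
arrivals = sorted(random.uniform(0, hours) for _ in range(arrival_rate * hours))
samples = [random.uniform(1, hours - 1) for _ in range(10_000)]
in_system = [bisect_right(arrivals, t) - bisect_left(arrivals, t - service_time)
             for t in samples]
print("Simulated average in the system:", sum(in_system) / len(in_system))
```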
Resource Limits • 100 CPUs • Arrival rate: λ = 100 experiments / hour • Exps. take: W = 0.5 hours [1 CPU per exp.] • Average exps. in: L = 50 exps [Timeline: with the hourly burst of 100 jobs, all 100 run during the first 30 minutes (100 CPUs busy) and nothing runs during the next 30 (0 CPUs busy); no queue builds up]
Resource Limits • 50 CPUs (what Little’s Law indicated) • Arrival rate: λ = 100 experiments / hour • Exps. take: W = 0.5 hours [1 CPU per exp.] • Average exps. in: L = 50 exps [Timeline: 50 jobs run and 50 wait in the queue during the first 30 minutes; the queued 50 run during the next 30 minutes; the queue is empty again by the end of the hour]
Resource Limits • 25 CPUs • Less than needed! • Arrival rate: λ = 100 experiments / hour • Exps. take: W = 0.5 hours [1 CPU per exp.] • Average exps. in: L = 50 exps [Timeline: only 25 jobs run at a time; 75 jobs are queued at minute 0 and 50 are still queued at minute 30]
Resource Limits • 25 CPUs • Less than needed! [Timeline continued: the 25 CPUs stay busy while the queue keeps growing; after the next hourly burst of 100 jobs, 125 jobs are waiting at minute 60, and the backlog grows without bound]
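A small discrete-time sketch of the three scenarios above (it assumes, as the slide timelines do, that the 100 experiments arrive as a burst at the start of each hour):

```python
# Simulate a CPU pool in half-hour steps: each experiment occupies one CPU for
# one step (0.5 h). With fewer CPUs than L = 50, the queue grows without bound.

def simulate(cpus, hours=3, jobs_per_hour=100):
    queue = 0
    for step in range(hours * 2):
        if step % 2 == 0:              # start of each hour: burst of arrivals
            queue += jobs_per_hour
        running = min(cpus, queue)     # as many queued jobs as there are CPUs
        queue -= running               # they finish within this half-hour step
        print(f"t={step * 30:>3} min  cpus={cpus:>3}  running={running:>3}  queued={queue:>3}")
    print()

for cpus in (100, 50, 25):             # the three slide scenarios
    simulate(cpus)
```

With 100 or 50 CPUs the queue empties every hour; with 25 CPUs the queued jobs reach 125 by minute 60 and keep accumulating.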
Throughput • Throughput: outcome per time unit • E.g. experiments finished per hour • E.g. data-sets processed per minute • E.g. data points trained per second [Plot: throughput vs. load, growing with the load until it reaches the resource limit and then flattening]
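Under the same simplified model as above (C CPUs, W hours per experiment, so at most C / W experiments can finish per hour), that saturation behaviour can be sketched as:

```python
# Throughput grows with the offered load until it hits the resource limit
# C / W experiments per hour, then flattens (steady-state, simplified model).

cpus = 50            # C: available CPUs
service_time = 0.5   # W: hours per experiment
limit = cpus / service_time              # maximum sustainable throughput: 100 exps/hour

for load in (25, 50, 75, 100, 125, 150): # offered load: experiments submitted per hour
    throughput = min(load, limit)        # experiments finished per hour
    print(f"load={load:>3}/h  throughput={throughput:>5.1f}/h"
          + ("  (saturated)" if load > limit else ""))
```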
Resource Competition • Systems do not always have “queues” • Processes (applications) compete in the system • For using the CPU • For getting some memory • For accessing the disk and network (I/O)