PICKS AND SHOVELS: AI DATA PIPELINES IN THE REAL WORLD
Paolo Faraboschi, VP and HPE Fellow
Artificial Intelligence Lab, Hewlett Packard Labs
ML4HPC Workshop - March 2020
PICKS AND SHOVELS?
During the California Gold Rush (160 years ago), only a few miners struck it rich. The most consistent business came from providing the picks and shovels to the miners.
- Data prep tools (good old ETL: clean up the mess)
- Auto-ML tools (help pick the right model)
- Workflow tools, including complex edge-to-cloud pipelines
- Post-deployment ML/Ops tools (model verification, profiling, drift detection, explainability, etc.)
- Underlying data layers (file systems, burst buffers, ...)
Source: historynotes.info
TALK OUTLINE
- Edge-to-cloud AI example: Autonomous Driving
- AI for IT operations: AI/Ops
- I/O challenges for next-generation accelerators
Aurora:
- 2021 delivery
- More than 1 EF/s
- Future Intel Xeon CPU and Intel Xe GPU architecture
- Slingshot interconnect
- Mixed AI and HPC workload
Frontier:
- 2021 delivery
- More than 1.5 EF/s
- Future AMD EPYC CPU and Radeon GPU architecture
- Slingshot interconnect
- Mixed AI and HPC workload
El Capitan:
- Early 2023 delivery
- More than 2.0 EF/s
- Future AMD EPYC CPU and Radeon GPU architecture
- Slingshot interconnect
- AI-assisted mission HPC workload
BEYOND EXASCALE: AI FOR SCIENCE
Source: AI for Science DOE Town Hall

SCIENCE CHALLENGES EVERY ASPECT OF AI
Source: AI for Science DOE Town Hall
AI AT HEWLETT PACKARD LABS
- How to navigate the AI ecosystem today? Enterprise customers' AI journey: AI models, platforms, data pipelines
- How to enable AI at the edge? Edge-to-core AI computing, AI for IT operations, Swarm Learning (federated)
- How to approach data-driven science? AI for Science (US DoE), combining AI + simulation, exascale computing
- Can compute keep up with AI? Unconventional accelerators (DPE, PUMA - analog computing)
THE ENTERPRISE AI JOURNEY
THE ENTERPRISE AI JOURNEY
Maturity of image and voice applications: Early 50%, Proof of Concept 30%, Production 20%.
[Chart: how to decide and scale, across concern areas: data, infrastructure, ethics, people, use cases, strategy, other]
FROM POC TO PRODUCTION
Time from PoC to production: weeks, months, years, or never. Challenges span:
- Inference optimization; pipeline deployment; infrastructure integration; model management
- Compute analysis; edge monitoring; model testing; optimization and performance tools
- Data transformation; ML model code; model retraining; data verification; software accelerators
- Data movement; data needed; user access; scalability; data federation; data ingestion
- Tools and libraries; machine resource management; container orchestration; storage security and performance
BARRIERS TO MOVING AT AI SPEED
- Unprecedented volumes of data
- Data across multiple silos
- Unpredictable costs and capacity needs
- Operationalizing AI is complex
- Lack of AI talent and skilled resources
- Ever-changing, expanding open-source ecosystem
- Scalable infrastructure optimized for AI and ML
- Data protection, security, governance, and privacy
A TYPICAL AI JOURNEY
1. The user selects AI models/applications (system integrator, ISV, or customer-built)
2. Requests the orchestration plane to implement: cluster and pipeline deployment; model/container/resource management; infrastructure requirements summary
3. Compute/networking/storage can be on-premise or cloud (multi-cloud) based
4. Infrastructure SW implements changes and deployment of HW
5. Orchestration plane leverages infrastructure SW for monitoring
6. Application is ready to use
7. User work can begin
A. The orchestration plane interacts with the infrastructure plane to determine if HW can be on-premise and multi-cloud optimized
[Diagram: content delivery (AI models and applications; ISVs, integrators, users; support and services) → orchestration plane (bare metal, container) → infrastructure software → compute, storage, networking]
DELIVERING ON AI
Solving the business problem (content delivery: ISVs, integrators, users — data to solve a business problem):
- App/Toolchain Plane: model tools, data transformation tools, machine learning code, analysis tools, vertical use cases, libraries, monitoring
Enabling the business solution:
- Data Plane: inference optimization, model optimization tools, scalability, model testing, model management, model retraining, data verification, data movement, data ingestion, data federation, user access, security
- Orchestration Plane: application pipeline deployment automation, pipeline buildup, bare-metal and containerized environments, pipeline deployment, container orchestration, intelligent service management, resource management
- Infrastructure Software Plane: compute, accelerators, storage, infrastructure and edge integration, performance
- Compute, Storage, Fabric
EDGE-TO-CLOUD AI PIPELINES: A HAD (HIGHLY AUTOMATED DRIVING) EXAMPLE
AUTONOMOUS DRIVING
THE GENERIC AD VEHICLE AND ENVIRONMENT
Levels of autonomy:
- Level 1: Hands On + Eyes On — steering or acceleration/deceleration assisted
- Level 2: Hands On + Eyes On — steering + acceleration/deceleration
- Level 2+: Hands Off + Eyes On — adds lane changes
- Level 3: Hands Off + Eyes Off (available to take over) — steering, acceleration/deceleration, lane changes
- Level 4: Hands Off + Eyes Off — fully automated driving under limited conditions and places
- Level 5: Mind Off, no steering wheel — fully automated driving under ALL conditions
[Diagram: the generic AD vehicle — data system (GPS, RF uplink, Bluetooth, Wi-Fi, OOB storage management, input data processing, security, black-box); autonomous vehicle operation (destination and route planning, road and lane detection, surroundings detection, in-vehicle monitoring); vehicle control and communication (ABS/brake controller, powertrain controller, steering controller); safety and convenience (climate control, infotainment, vehicle logistics); cameras and radar; vehicle-manufacturer-specific systems]
AUTONOMOUS DRIVING: AI TRAINING DATA PIPELINE
Pipeline stages (edge → data center → cloud): collecting data, moving data, ingestion, ground truth station, simulation, single events.

Vehicle Data Rate (Gbps) | Data per 8h Shift (TB) | Fleet Size | Data per Shift (PB)
5  | 18  | 80 | 1.4
10 | 36  | 80 | 2.8
20 | 72  | 80 | 5.7
30 | 108 | 80 | 8.6
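The per-shift volumes follow directly from the per-vehicle data rates; a quick sketch of the arithmetic (the 80-vehicle fleet and 8-hour shift are the slide's figures, the helper names are ours):

```python
# Reproduce the slide's data-volume math: data captured per 8-hour shift,
# per vehicle (TB) and per fleet (PB). 1 TB = 1e12 bytes here; the slide
# appears to truncate rather than round the PB column.
def shift_volume_tb(rate_gbps: float, hours: float = 8.0) -> float:
    """TB captured by one vehicle in one shift at the given sensor rate."""
    bytes_total = rate_gbps * 1e9 / 8 * hours * 3600  # Gbps -> bytes/s -> bytes
    return bytes_total / 1e12

def fleet_volume_pb(rate_gbps: float, fleet_size: int = 80) -> float:
    """PB captured by the whole fleet in one shift."""
    return shift_volume_tb(rate_gbps) * fleet_size / 1000

for rate in (5, 10, 20, 30):
    print(f"{rate:2d} Gbps -> {shift_volume_tb(rate):5.0f} TB/vehicle, "
          f"{fleet_volume_pb(rate):4.2f} PB/fleet")
```

At 5 Gbps this gives 18 TB per vehicle and 1.44 PB per fleet per shift, matching the table.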
DATACENTER OPERATIONAL INTELLIGENCE: AI/OPS
AI-OPS: DATACENTER OPERATIONAL INTELLIGENCE
- Large number of metrics (thousands): operators do not know where to look
- Large numbers of threshold-based rules are not manageable and produce many false positives
- Some anomalies are identifiable only in a high-dimensional space (multiple metrics)
- Broad range of problems beyond anomaly detection: anomaly detection (single-metric/multi-metric), preventive maintenance, performance prediction, optimization (digital twin)
PROBLEMS WE ARE TACKLING
- #Metrics: thousands
- #Data points: millions per minute
- Metric diversity: stationarity, modality, irregularities, sparseness
- Pre-processing: trend removal, normalization
- Algorithm selection
- Post-processing: information fusion, correlation, root-cause analysis
- Optimization: Power Usage Effectiveness
Solution: build automated end-to-end anomaly detection at scale
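As a toy illustration of the recipe above (not HPE's production pipeline): detrend and normalize a single metric against a trailing window, then emit an anomaly score. Window size and the synthetic signal are arbitrary choices for the sketch.

```python
# Minimal single-metric anomaly scorer: each point is scored by its
# deviation from a trailing window (a rolling z-score). The trailing mean
# removes slow trend; dividing by the trailing stdev normalizes the scale.
from statistics import mean, stdev

def anomaly_scores(samples, window=60):
    """Return one score per sample; higher means more anomalous."""
    scores = []
    for i, x in enumerate(samples):
        hist = samples[max(0, i - window):i]
        if len(hist) < 2:
            scores.append(0.0)  # not enough history yet
            continue
        mu, sigma = mean(hist), stdev(hist)
        scores.append(abs(x - mu) / sigma if sigma > 0 else 0.0)
    return scores

# A gently oscillating signal with one spike injected at index 100:
signal = ([20.0 + 0.1 * (i % 2) for i in range(100)]
          + [25.0]
          + [20.0 + 0.1 * (i % 2) for i in range(20)])
scores = anomaly_scores(signal)
print(max(range(len(scores)), key=scores.__getitem__))  # -> 100, the spike
```

A real multi-metric system would replace the z-score with a learned model and add the fusion/correlation post-processing the slide lists, but the pre-process/score/threshold structure is the same.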
EXAMPLE: NREL COOLING TOWER VALVE FAILURE
[Chart: facility inlet temperature (C) measurements, anomaly score, and anomaly decision threshold]
- Anomaly decision threshold on facility inlet temperature: 19.8 C
- Event detected: 2015-05-27 10:06:00 (anomaly score 12.42, temperature 19.80 C)
- Event reported: 2015-05-27 10:11:00
AIOPS SUMMARY
• Improve data center resiliency and efficiency; collaboration with NREL
• Today: anomaly detection for single/multiple metrics
• Tomorrow: predictive analytics, autonomous control
• Even with multiple dashboards and monitoring, events can still be missed
• Advanced AI/ML data analytics: no need for hand-set thresholds or fixed temporal resolutions
• Events were detected 5 minutes before NREL identified them
FEEDING THE BEAST: IO CHALLENGES IN ACCELERATED AI
FEEDING THE DATA TO AI TRAINING WILL BE A BOTTLENECK
• Economically infeasible to keep the whole dataset in DRAM during the training process
  o training over billions of samples means traversing PBs of data
• Increasing computation requires a lot of I/O bandwidth
  o the challenge is to provide enough bandwidth without overprovisioning capacity
[Figures: internal data flow for a training task; the life-cycle of a dataset for supervised training]
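The bandwidth-vs-capacity tension can be made concrete with back-of-envelope math (all numbers below are illustrative assumptions, not figures from the slide):

```python
# The aggregate read bandwidth needed to keep accelerators busy is the
# sample consumption rate of training times the size of each sample --
# independent of how large the dataset's total capacity is.
def required_read_bw_gbs(samples_per_sec: float, sample_bytes: float) -> float:
    """Aggregate read bandwidth (GB/s) needed to feed the trainers."""
    return samples_per_sec * sample_bytes / 1e9

# e.g. 64 accelerators each consuming 1000 images/s of ~150 KB JPEGs:
bw = required_read_bw_gbs(64 * 1000, 150e3)
print(f"{bw:.1f} GB/s of sustained reads")  # -> 9.6 GB/s
```

Under those assumptions the storage tier must sustain ~9.6 GB/s of reads whether the dataset is 1 TB or 1 PB, which is why provisioning bandwidth separately from capacity matters.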
SHARED STORAGE
• Inspired by the burst-buffer architecture in HPC
• Compute nodes (CNs) and I/O nodes (IONs) connected via a high-speed network (such as Slingshot)
• Fast SSDs in the IONs serve as a performance tier
• Size the ratio of CNs to IONs to optimize cost/performance
Pros: flexibility in configuration, allocation of resources, provisioning of capacity and bandwidth, and ease of data management and sharing across nodes
Cons: interference on the network, caused by
  o the I/O traffic between CNs and IONs, and
  o the communication traffic between CNs when synchronizing at the end of each training iteration
Cons: long-tail effect — lower achievable I/O bandwidth on some CNs can slow down overall training
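Sizing the CN:ION ratio can be sketched as a simple capacity-planning calculation (the per-node bandwidth figures and the headroom factor below are hypothetical, not measured values):

```python
# Pick the number of I/O nodes (IONs) so their aggregate SSD bandwidth
# covers the compute nodes' (CNs') demand, with headroom against the
# long-tail effect and network interference.
import math

def ions_needed(num_cns: int, bw_per_cn_gbs: float, bw_per_ion_gbs: float,
                headroom: float = 1.2) -> int:
    """Smallest ION count whose aggregate bandwidth meets CN demand."""
    demand = num_cns * bw_per_cn_gbs * headroom
    return math.ceil(demand / bw_per_ion_gbs)

# e.g. 512 CNs pulling 2 GB/s each, IONs sustaining 40 GB/s of SSD reads:
print(ions_needed(512, 2.0, 40.0))  # -> 31, i.e. a CN:ION ratio of ~16:1
```

Raising the headroom factor trades cost for resilience to stragglers; the right value depends on how bursty the training I/O actually is.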