Data in the Cloud
Happy 10th ACM SoCC!
Raghu Ramakrishnan, CTO for Data, Technical Fellow
ACM SoCC Topics Over the Past 10 Years
[Figure: word clouds for 2010 through 2019. Word clouds courtesy Carlo Curino]
ACM SoCC Topics After Filtering “data” and “cloud”
[Figure: word clouds for 2010 through 2019. Word clouds courtesy Carlo Curino]
Carbon Footprint of Cloud Computing
Efficiency of each cloud service relative to a traditional datacenter:
• Azure Compute: 52% more efficient (electricity per core-hour) than virtualized servers in a high-end deployment
• Azure Storage: 71% more efficient (electricity per TB-year) than dedicated storage in a high-end deployment
• Exchange Online: 77% more efficient (electricity per mailbox-year) than a large, high-end deployment
• SharePoint Online: 22% more efficient (electricity per user-year) than a large, high-end deployment
Source: http://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf
AI in Operation & Optimization
IoT and big data platforms make it increasingly easy to optimize datacenters
• IoT telemetry, analytics and ML optimization
• Predictive maintenance
• Capacity planning and workload placement
Microsoft Confidential
Ubiquitous Data
Scale
• Network latency
• Elastic compute
• Elastic storage
Heterogeneity
• Data silos
• Many engines
• Many workloads
[Figure: data services laid out on two axes, OPERATIONAL vs. ANALYTICS and RELATIONAL vs. NON-RELATIONAL]
RELATIONAL
• Azure SQL DB: update-optimized
• Azure SQL DW: analytics-optimized
NON-RELATIONAL
• Azure Cosmos DB: document model
• Spark, Hive, ML …: Data Lake
Labels repeated on the service boxes: Meta data, XACT_STATE, Governance
Big Picture: Separation of Compute and State
[Figure: compute engines drawn separately from the state they operate on]
• Spark, Hive, ML…: Data Lake
• Azure SQL DW: analytics-optimized
• Azure SQL DB: update-optimized
• Azure Cosmos DB: document model
Labels repeated on the state layer: Meta data, XACT_STATE
Microsoft’s Internal Big Data Service
Microsoft’s internal data lake
• A data lake for all teams @Microsoft
• Good developer tools
• Batch, Interactive, Streaming, ML
• Used across Office, Xbox, Azure, Windows, Ads, Bing, Skype, …
• Production jobs and experimentation
• Migrated to ADLS
Enabling business growth:
• Office productivity revenue (45% YoY)*
• Intelligent Cloud (100% YoY)*
• Bing search share doubles
By the numbers
• 9+ Exabytes of data, 8+ billion files
• 100Ks of physical servers
• Millions of interactive queries
• Huge streaming pipelines
• 100Ks of daily batch jobs
• 15K+ developers
• 300+ teams
Azure Data Lake Store
• HDFS as a PaaS cloud service
• Microsoft’s serverless Big Data platform
• Fully aligned with Hadoop ecosystem and standards, with full support for Hadoop tools and engines as well as unique Microsoft capabilities
• 1P = 3P
References
• J. Zhou et al., SCOPE: parallel databases meet MapReduce, VLDBJ 21(5)
• R. Ramakrishnan et al., Azure Data Lake Store, SIGMOD 2017
• Apache YARN Federation
MSR/GSL Collaboration
Traditional MPP DW Architecture
[Figure: architecture diagram with the following labels]
• Meta data, Transactions
• DQE communication channel
• Data movement channel
• Compute: DMS, SQL, SSD Cache (adaptive cache)
• Remote storage: Snapshot backups, Data, Log (Premium and Standard tiers)
Cloud-Native Scale-Out, Data Heterogeneity
▪ Data and state separated from compute
▪ Fault-tolerant scale-out
▪ Online scaling
▪ Data heterogeneity
➢ Converge DW and Lake
[Figure: architecture diagram with the following labels]
• Data movement channel
• Compute: ES, SQL, SSD Cache
• Centralized services: Meta data, Transactions
• Remote storage: Standard, Distribution-less, Columnar files
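The idea of separating data and state from compute can be illustrated with a toy model. This is only a sketch under stated assumptions: `RemoteStore`, `ComputeNode`, and `total` are hypothetical names invented for illustration, not part of any Azure service or API. Compute nodes hold nothing but a cache, so nodes can be added or replaced without moving data.

```python
class RemoteStore:
    """Hypothetical stand-in for durable remote storage shared by all compute nodes."""

    def __init__(self):
        self._files = {}
        self.reads = 0  # counts remote reads, to make cache hits visible

    def put(self, path, rows):
        self._files[path] = list(rows)

    def get(self, path):
        self.reads += 1
        return list(self._files[path])


class ComputeNode:
    """Hypothetical compute node: holds only a cache, so losing it loses no data."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def scan(self, path):
        # Read-through cache: fetch from remote storage only on a miss.
        if path not in self.cache:
            self.cache[path] = self.store.get(path)
        return self.cache[path]


def total(nodes, paths):
    """Spread file scans round-robin over the available nodes and sum all rows."""
    return sum(sum(nodes[i % len(nodes)].scan(p)) for i, p in enumerate(paths))


if __name__ == "__main__":
    store = RemoteStore()
    for i, rows in enumerate([[1, 2], [3], [4, 5], [6]]):
        store.put(f"part-{i}", rows)
    paths = [f"part-{i}" for i in range(4)]

    nodes = [ComputeNode(store), ComputeNode(store)]
    print(total(nodes, paths))          # first run reads from remote storage
    print(total(nodes, paths))          # second run is served from the caches
    nodes.append(ComputeNode(store))    # scale out: no data has to be moved
    nodes[0] = ComputeNode(store)       # replace a node: only its cache is lost
    print(total(nodes, paths))
```

Because every file lives in the shared store, the result is the same for any number of nodes; only the cache hit rate changes after scaling or a node replacement.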
Polaris Concurrency – Workload Aware Scheduling
A next generation distributed query engine (blends massive-scale batch QP with scale-up (small scale-out) interactive QP)
State Machines:
• Guarantee that precedence constraints are satisfied
• Define a formal model of how we recover from failures
Workload Task Graph:
• Nodes are tasks (e.g. Agg) of Query 1 and Query 2, each annotated with a resource demand (%)
• Edges are precedence constraints
• A Global Workload Graph enables workload optimizations across queries
State Machine states: Waiting Execution, Executing, Failed, Completed
[Figure: Workload Task Graph, Task Resource Demand %, State Machine, Execution State Transition]
Task-cost Driven Scheduling
Resource Aware Task Placement
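The combination of a task graph, precedence edges, per-task resource demand, and a task state machine can be sketched as follows. This is an illustrative toy, not the Polaris scheduler: the names `State`, `Task`, and `schedule`, the transition table, the retry-by-requeue rule for failed tasks, and the "largest demand first" packing order are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    WAITING = "waiting"
    EXECUTING = "executing"
    FAILED = "failed"
    COMPLETED = "completed"


# Assumed legal transitions of the per-task state machine.
TRANSITIONS = {
    State.WAITING: {State.EXECUTING},
    State.EXECUTING: {State.COMPLETED, State.FAILED},
    State.FAILED: {State.WAITING},  # assumption: a failed task is re-queued
    State.COMPLETED: set(),
}


@dataclass
class Task:
    name: str
    demand: int                                # percent of total resources needed
    deps: list = field(default_factory=list)   # names of predecessor tasks
    state: State = State.WAITING

    def move(self, new_state):
        # Reject any transition the state machine does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.name}: illegal {self.state.name} -> {new_state.name}"
            )
        self.state = new_state


def schedule(tasks, capacity=100):
    """Run all tasks in rounds; return the task names executed in each round."""
    by_name = {t.name: t for t in tasks}
    rounds = []
    while any(t.state is not State.COMPLETED for t in tasks):
        free = capacity
        batch = []
        # Assumed policy: consider the most demanding tasks first.
        for t in sorted(tasks, key=lambda t: -t.demand):
            ready = t.state is State.WAITING and all(
                by_name[d].state is State.COMPLETED for d in t.deps
            )
            if ready and t.demand <= free:
                t.move(State.EXECUTING)
                free -= t.demand
                batch.append(t)
        if not batch:
            raise RuntimeError("no runnable task: cycle or demand exceeds capacity")
        for t in batch:
            t.move(State.COMPLETED)
        rounds.append([t.name for t in batch])
    return rounds


if __name__ == "__main__":
    tasks = [
        Task("scan_a", 40),
        Task("scan_b", 40),
        Task("scan_c", 40),
        Task("agg", 10, deps=["scan_a", "scan_b", "scan_c"]),
    ]
    print(schedule(tasks))
```

In the demo, only two of the three scans fit into the first round, and the aggregation waits until every scan it depends on has completed, which is the precedence guarantee the slide attributes to the state machines.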
Scalability: All TPC-H Queries at 1PB Scale! Elastic DQP – Unlimited Scale
P. Antonopoulos et al., Socrates: The New SQL Server in the Cloud. ACM SIGMOD 2019
[Figure: diagram with three formula annotations, not recoverable from the extracted text]
MSR Collaboration
OLTP | Data Warehouses | Big Data / Lake | NoSQL
Unified Data Suite and Governance
[Figure: engines over their state stores, with Governance spanning all of them]
• Global apps
• Spark, Hive, ML…: Data Lake
• SQL: update-optimized and analytics-optimized
• Azure Cosmos DB: document storage
Labels repeated on the state layer: Meta data, XACT_STATE
Big Picture: Must Simplify Usability and Governance
Cloud
• Elastic compute and storage are transformative
• But compute-storage latency and bandwidth are a key challenge
• Edge blurs cloud/on-prem separation
ML
• An integral part of data processing, with a rapidly growing community of its own
Implications for Data Management
• Rethink what belongs in a “DBMS”: ML, data governance
• Rethink data architectures from the ground up: OLTP/Analytics/HTAP
A. Agrawal et al., Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML, CIDR 2020.
[Figure: enterprise ML lifecycle diagram with the following labels]
• Model Development / Training: data, other data, offline featurization, Model Training, Model Tracking & Provenance, Model optimization
• Governance: Data Catalogs, Logs & Telemetry, Access Control, policies
• Deployment: offline and online deployment, orchestration, App logic
• Model Scoring: Live Data, online Featurization, Model, policies, Decisions
GSL Collaboration
Unified Governance
A single pane of glass to…
• Manage data lifecycle (collect, clean, publish, discover, curate, …)
• Ensure Data Quality & Correctness
• Assess data compliance, privacy & protection
• Author & manage data policy (access, use, retention, location, sharing)
Across Cloud, Edge, On-Prem
[Figure: governance spanning Data & Operational Systems, Big Data and Data warehousing, BI, AI and ML]