DATA WAREHOUSE BUILT FOR THE CLOUD
QCon San Francisco, November 2019
Thierry Cruanes, Co-Founder & CTO
THE DREAM DATA WAREHOUSE (CIRCA 2012)
Unlimited and instant scaling
  § No data silos
  § 10x faster for the same price, no over-provisioning
Store all your data
  § Structured and semi-structured
  § Petabyte scale at very low cost
Extreme simplicity
  § No management tasks, offered as a service
  § Fast out of the box with no tuning knobs
No compromises: a full-fledged data warehouse
  § Full support for ACID transactions with read consistency
  § ANSI SQL, RBAC
WHY THEN? OUR VIEW OF THE CLOUD…
§ Storage became dirt cheap
§ Flat network offered uniform bandwidth
§ Single core performance stalled
§ Data warehouse and analytic workloads are mostly CPU bound
Design for abundance and not scarcity of resources
THREE PILLARS
Multi-cluster shared data architecture
  § Leverage cloud elasticity and pay only for what you use
  § Instant scale
  § Performance isolation
  § Real-time data sharing
Immutable scalable storage
  § Extremely fast response time at scale
  § Fine-grained vertical and horizontal pruning on any column
  § Automatically applied to any data (structured and semi-structured)
Multi-tenant service
  § Self-tuning, self-healing
  § Transparent upgrade
  § Service architecture designed for availability, durability and security
ARCHITECTURE
AN ARCHITECTURE BUILT FOR THE CLOUD
Traditional architectures
  § Shared-disk: shared storage, single cluster
  § Shared-nothing: decentralized, local storage, single cluster
Multi-cluster, shared data
  § Centralized, scale-out storage
  § Multiple, independent compute clusters
MULTI-CLUSTER, SHARED DATA ARCHITECTURE
§ No data silos: storage decoupled from compute
§ Any data: native for structured & semi-structured
§ Unlimited scalability: along many dimensions
§ Low cost: compute on demand
§ Instant cloning: isolate prod from dev & QA
§ Highly available: 11 9's durability, 4 9's availability
[Figure: shared databases, including a clone, accessed by separate virtual warehouses for ETL & Data Loading, Data Science, Finance, Marketing, Dev/Test/QA, and Dashboards]
VIRTUAL WAREHOUSE
How can concurrent workloads run without impacting each other?
§ One or more MPP compute clusters
§ Unit of fault and performance isolation
§ Use multiple warehouses to segregate workloads
§ Resizable on the fly
§ Able to access data in any database
§ Transparently caches data accessed
§ Transaction manager synchronizes data access
§ Automatic suspend when idle and resume when needed
[Figure: virtual warehouses A, B, C and D, each with its own SSD/RAM cache, serving ETL, Transformation, SQL and BI workloads]
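The "automatic suspend when idle and resume when needed" behavior can be pictured as a small state machine. The following is a minimal sketch under stated assumptions, not Snowflake's implementation: the class name, the `auto_suspend_after` setting and the explicit `now` clock argument are all hypothetical and exist only for illustration.

```python
class VirtualWarehouse:
    """Toy model of a warehouse that suspends when idle and resumes on demand."""

    def __init__(self, auto_suspend_after):
        # Hypothetical setting: idle seconds before the warehouse suspends.
        self.auto_suspend_after = auto_suspend_after
        self.running = False
        self.last_used = None
        self.resumes = 0  # how many times compute had to be (re)started

    def run_query(self, now):
        # Resume transparently if the warehouse is suspended.
        if not self.running:
            self.running = True
            self.resumes += 1
        self.last_used = now

    def tick(self, now):
        # Called periodically; suspends the warehouse once it has been idle long enough.
        if self.running and now - self.last_used >= self.auto_suspend_after:
            self.running = False


wh = VirtualWarehouse(auto_suspend_after=60)
wh.run_query(now=0)
wh.tick(now=30)
state_at_30 = wh.running   # still within the idle window
wh.tick(now=60)
state_at_60 = wh.running   # idle for 60s, so suspended
wh.run_query(now=100)      # a new query resumes the warehouse
```

Passing the clock in explicitly keeps the sketch deterministic; a real service would rely on its own timers.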
MULTI-CLUSTER WAREHOUSE
LEVERAGE ABUNDANCE OF COMPUTE RESOURCES
§ Automatically scales compute resources based on concurrent usage
§ Single virtual warehouse of multiple compute clusters
§ Queries are load balanced across the clusters in a virtual warehouse
§ Split across availability zones for high availability
[Figure: incoming queries pass through a query scheduler to Cluster 1, Cluster 2 and Cluster 3 of a virtual warehouse group]
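The scheduling idea on this slide (balance queries across clusters, add a cluster when concurrency demands it) can be sketched as follows. This is an illustrative model only: the least-loaded routing policy, the `slots_per_cluster` limit and the `min_clusters`/`max_clusters` names are assumptions made for the example, not details taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    running: int = 0  # queries currently executing on this cluster


@dataclass
class MultiClusterWarehouse:
    min_clusters: int = 1
    max_clusters: int = 3
    slots_per_cluster: int = 2  # hypothetical per-cluster concurrency limit
    clusters: list = field(default_factory=list)
    queued: int = 0

    def __post_init__(self):
        for _ in range(self.min_clusters):
            self._add_cluster()

    def _add_cluster(self):
        self.clusters.append(Cluster(f"cluster-{len(self.clusters) + 1}"))

    def submit(self):
        """Route one query; return the chosen cluster name, or None if it is queued."""
        target = min(self.clusters, key=lambda c: c.running)  # least-loaded cluster
        if target.running >= self.slots_per_cluster:
            if len(self.clusters) < self.max_clusters:
                self._add_cluster()          # scale out on concurrent usage
                target = self.clusters[-1]
            else:
                self.queued += 1             # at the cluster cap: the query waits
                return None
        target.running += 1
        return target.name


wh = MultiClusterWarehouse()
placements = [wh.submit() for _ in range(7)]
print(placements)
```

With these assumed limits, seven concurrent queries fill three clusters and the seventh waits in the queue. Scaling back in when load drops is left out to keep the sketch short.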
IN THE REAL WORLD
§ Interactive dashboard: 50% < 1s, 85% < 2s, 95% < 5s
§ Continuous loading from S3: 4TB/day, <5min SLA
§ Reporting (segmented)
§ ETL & maintenance
§ Virtual warehouses shown: Auto Scale (X-Large x 5), Medium, 2X-Large, Large
§ Prod DB: 4 trillion rows, 3+ petabytes raw, 8x compression ratio, 25M+ micro-partitions
SCALABLE IMMUTABLE STORAGE
STORAGE IMMUTABILITY
§ Accumulates immutable data over time
  Well supported by all cloud vendor object stores
§ Allows separation of storage and compute resources
  Enables workload scalability
§ Heavily optimized for read-mostly workloads
  Natural fit for analytic systems
§ Transaction management becomes a metadata problem
  Multi-version concurrency control and snapshot isolation semantics
§ Transaction coordination separated from storage and compute
  Allows for consistent access across compute resources
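To see why immutability turns transaction management into a metadata problem, consider a toy table whose data files are written once and whose versions are just sets of file names. This is a minimal sketch under stated assumptions: the `ImmutableTable` class, the file names and the in-memory dictionary standing in for an object store are all hypothetical.

```python
class ImmutableTable:
    """Toy table: data files are write-once, each version is a set of file names."""

    def __init__(self):
        self.objects = {}               # file name -> tuple of rows (stand-in for an object store)
        self.versions = [frozenset()]   # versions[i] = files visible at version i

    def commit(self, new_files=None, dropped=()):
        """Publish a new version by editing metadata only; existing files are never modified."""
        new_files = new_files or {}
        for name, rows in new_files.items():
            assert name not in self.objects, "files are write-once"
            self.objects[name] = tuple(rows)
        current = self.versions[-1]
        self.versions.append((current - frozenset(dropped)) | frozenset(new_files))
        return len(self.versions) - 1

    def read(self, version):
        """Snapshot read: sees exactly the files listed in the chosen version."""
        rows = []
        for name in sorted(self.versions[version]):
            rows.extend(self.objects[name])
        return rows


t = ImmutableTable()
v1 = t.commit({"f1": [("alice", 10)], "f2": [("bob", 20)]})
# "Updating" bob means writing a replacement file and swapping it in the metadata.
v2 = t.commit({"f3": [("bob", 25)]}, dropped=["f2"])
print(t.read(v1))
print(t.read(v2))
```

A reader holding version `v1` keeps seeing the old row even after `v2` commits, which is the snapshot isolation behavior the slide refers to. Conflict detection between concurrent writers is omitted.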
SCALABLE STORAGE
AUTOMATIC MICRO-PARTITIONING
§ Data is automatically partitioned at load time
  Storage decoupled from compute
§ Columnar organization in each micro-partition
  Enables both horizontal and vertical pruning
§ Micro-partitions are only a few tens of MB each
  Fine-grained pruning, no skew
§ Metadata structure tracks data distribution
  Very fast pruning at optimization time
§ Applied to both structured and semi-structured data
  Very fast response time for both
[Figure: a table split into partitions, each stored in columnar form]
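Pruning from metadata can be illustrated with per-partition minimum and maximum values for each column. The sketch below assumes min/max statistics as the tracked "data distribution"; the function names, the partition names and the sample rows are hypothetical and not taken from the talk.

```python
def build_metadata(partitions):
    """For each partition, record (min, max) per column.

    partitions: dict mapping partition name -> non-empty list of row dicts.
    """
    meta = {}
    for name, rows in partitions.items():
        columns = rows[0].keys()
        meta[name] = {
            c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns
        }
    return meta


def prune(meta, column, low, high):
    """Horizontal pruning: keep partitions whose [min, max] overlaps [low, high]."""
    return [
        name
        for name, stats in meta.items()
        if stats[column][0] <= high and stats[column][1] >= low
    ]


partitions = {
    "p1": [{"day": 1, "amount": 50}, {"day": 3, "amount": 20}],
    "p2": [{"day": 4, "amount": 70}, {"day": 6, "amount": 10}],
    "p3": [{"day": 7, "amount": 30}, {"day": 9, "amount": 90}],
}
meta = build_metadata(partitions)
selected = prune(meta, "day", 5, 7)  # e.g. WHERE day BETWEEN 5 AND 7
print(selected)
```

Only the metadata is consulted, so the decision can be made at optimization time without reading any data. Vertical pruning would additionally read only the columns the query references inside each surviving partition.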
AUTOMATICALLY APPLIED TO SEMI-STRUCTURED DATA
§ Native support: loaded in raw form (e.g. JSON, Avro, XML)
§ Optimized storage: optimized data type, no fixed schema or transformation required
§ Optimized SQL querying: full benefit of database optimizations (pruning, filtering, …)
[Figure: semi-structured data (JSON, Avro, XML, Parquet, ORC) and structured data (e.g., CSV, TSV, …) queried through the same interface: > SELECT … FROM …]
EXAMPLE
[Figure: client applications (Web UI, JDBC driver, ODBC driver) connect over HTTPS (JDBC/ODBC/Python) to the cloud services layer (query optimization, warehouse management, security, metadata). The compute layer shows virtual warehouses sized XL, L and L, labeled Custom Reports, Campaign Analysis and Loading WH; the node boxes in each warehouse are annotated with micro-partition labels. The storage layer on S3 holds micro-partitions 1-9 and A-W for the Sales and Marketing data.]
ENABLE DATA SHARING
Providers
  § Secure and integrated with Snowflake's access control model
  § Only pay normal storage costs for shared data
  § No limit to the number of consumer accounts with which a dataset may be shared
Consumers
  § Get access to the data without any need to move or transform it
  § Query and combine shared data with existing data, or join together data from multiple publishers
[Figure: data providers sharing with data consumers]
ENABLE GLOBAL REPLICATION
[Figure: world map of deployment regions, including Azure (Frankfurt), AWS (Ireland), Azure (US East), AWS (Frankfurt), AWS (US West), AWS (US East), and further Azure and AWS regions, one of them in Sydney]
MULTI-TENANT SERVICE
DATA WAREHOUSE AS A SERVICE
Multi-tenant service
  § No administration, self-tuning and healing, transparent upgrade
  § Service architecture designed for high availability and durability
  § Security is at the core
Availability
  § All tiers distributed over multiple datacenters with active-active data replication
  § No maintenance downtime, fully transparent software & hardware upgrades
  § Automatic repair of any failed servers with transparent re-execution of any failed queries
  § Persistent sessions for load balancing and transparent fail-over
Durability
  § Synchronous replication of data over multiple data centers
  § Automatic data retention and fail-safe technology to guard against any data removal
SNOWFLAKE SERVICE
Three independent layers
  § Cloud services: authentication & access control, infrastructure management, transaction management, compilation and optimization, security, metadata
  § Data processing: virtual warehouses, each with its own cache
  § Storage: databases
MANAGED SERVICE
BUILT-IN DISASTER RECOVERY AND HIGH AVAILABILITY
§ Scale-out of all tiers: metadata, compute, storage
§ Resiliency across multiple availability zones: geographic separation, separate power grids, built for synchronous replication
§ Fully online updates & patches: zero downtime
§ Back pressure and throttling: all the way back to the client
[Figure: cloud services, metadata, virtual warehouses and database storage tiers]
ADAPTIVE ALL THE WAY TO THE CORE
SELF TUNING & SELF HEALING INTERNALS
Principles
  § Adaptive
  § Self-tuning
  § Do no harm!
  § Automatic
  § Default
Applied to
  § Automatic memory management
  § Automatic distribution method
  § Automatic degree of parallelism
  § Automatic fault handling
  § Automatic workload management
  § No vacuuming
  § No statistics
EXAMPLE: AUTOMATIC SKEW AVOIDANCE
1 Detect popular values on the build side of the join
2 Use broadcast for those and directed join for the others
§ Adaptive: popular values detected at runtime
§ Self-tuning: number of values
§ Do no harm!: no performance degradation
§ Automatic: kicks in when needed
§ Default: enabled by default for all joins
[Figure: execution plan with a join over a filtered scan and a second scan, with steps 1 and 2 marked on the join inputs]
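The two steps above can be sketched as a routing decision over the build-side rows. This is only an illustration of the idea, not the engine's actual algorithm: the `threshold` parameter, the modulo-based directed routing and the use of integer join keys are assumptions made for the example.

```python
from collections import Counter


def route_build_rows(build_rows, num_workers, threshold):
    """Distribute build-side rows of a join across workers.

    build_rows: list of (key, payload) pairs; keys are assumed to be integers.
    threshold: hypothetical cutoff; keys seen at least this often count as popular.
    """
    # Step 1: detect popular values on the build side.
    counts = Counter(key for key, _ in build_rows)
    popular = {key for key, n in counts.items() if n >= threshold}

    # Step 2: broadcast rows with popular keys, direct the rest by key.
    workers = [[] for _ in range(num_workers)]
    for key, payload in build_rows:
        if key in popular:
            for rows in workers:          # broadcast: every worker gets a copy
                rows.append((key, payload))
        else:
            workers[key % num_workers].append((key, payload))  # directed
    return popular, workers


build_rows = [(1, "a"), (1, "b"), (1, "c"), (1, "d"), (2, "e"), (3, "f")]
popular, workers = route_build_rows(build_rows, num_workers=2, threshold=3)
print(popular)
print([len(rows) for rows in workers])
```

Because rows for the popular key are present on every worker, the matching probe rows do not all need to be sent to a single worker, which is where the skew would otherwise concentrate. In the talk the detection happens at runtime; here it is a single pass over an in-memory list.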
WHAT’S NEXT? SERVERLESS DATA SERVICES
§ Target predictable, well-identified database workloads
§ Horizontal scaling is automatic
§ Fine-grained units of work allow the degree of parallelism to be arbitrarily small or large
§ Secure since handled by the service
§ Transparent retry on failures
§ Service state entirely managed by the service
§ Monitoring and observability of the service
CLOUD NATIVE ARCHITECTURE
A GIFT THAT KEEPS ON GIVING