Delta Lake: Making Cloud Data Lakes Transactional and Scalable
Reynold Xin (@rxin)
Stanford University, 2019-05-15
About Me
Databricks co-founder & Chief Architect
- Designed most major things in "modern day" Apache Spark
- #1 contributor to Spark by commits and net lines deleted
PhD in databases from Berkeley
Building a data analytics platform is hard
[Diagram: Data streams -> ??? -> Insights]
Traditional Data Warehouses
[Diagram: OLTP databases -> ETL -> Data Warehouse -> SQL -> Insights]
Challenges with Data Warehouses
ETL pipelines are often complex and slow
- Ad-hoc pipelines to process data and ingest it into the warehouse
- No insights until daily data dumps have been processed
Workloads often limited to SQL and BI tools
- Data in proprietary formats
- Hard to integrate streaming, ML, and AI workloads
Performance is expensive
- Scaling up/out usually comes at a high cost
Dream of Data Lakes
[Diagram: Data streams -> scalable ETL -> Data Lake -> SQL / streaming / ML, AI -> Insights]
Data Lakes + Spark = Awesome!
[Diagram: Data streams -> Apache Spark, the 1st Unified Analytics Engine -> Data Lake -> SQL, ML, Structured Streaming -> Insights]
Advantages of Data Lakes
ETL pipelines are simpler and fast (no longer complex and slow)
- Unified Spark API between batch and streaming simplifies ETL
- Raw unstructured data available as structured data in minutes
Workloads not limited to anything!
- Data in files with open formats
- Integrates with data processing and BI tools
- Integrates with ML and AI workloads and tools
Performance is cheaper (no longer expensive)
- Easy and cost-effective to scale out compute and storage
Challenges of Data Lakes in practice
Evolution of a Cutting-Edge Data Pipeline
[Diagram: Events -> Streaming -> Data Lake -> Analytics / Reporting]
Challenge #1: Historical Queries?
[Diagram: a λ-arch (1) is added: Events flow both through Streaming and into the Data Lake, so Analytics and Reporting can also cover history]
Challenge #2: Messy Data?
[Diagram: Validation (2) is added on both the streaming path and the Data Lake path, on top of the λ-arch (1)]
Challenge #3: Mistakes and Failures?
[Diagram: Reprocessing (3) over partitioned data in the Data Lake is added, on top of the λ-arch (1) and Validation (2)]
Challenge #4: Query Performance?
[Diagram: Compaction of small files (4) is added and has to be scheduled around Reprocessing, on top of the λ-arch (1), Validation (2), and Reprocessing (3)]
Data Lake Reliability Challenges
- Failed production jobs leave data in a corrupt state, requiring tedious recovery
- Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads
- Lack of schema enforcement creates inconsistent and low-quality data
Data Lake Performance Challenges
- Too many small or very big files: more time is spent opening & closing files than reading content (worse with streaming)
- Partitioning, aka "poor man's indexing", breaks down when data has many dimensions and/or high-cardinality columns
- Neither storage systems nor processing engines are great at handling very large numbers of subdirectories/files
Figuring out what to read is too slow
Data integrity is hard
Band-aid solutions made it worse!
Everyone has the same problems
THE GOOD OF DATA LAKES
• Massive scale out
• Open formats
• Mixed workloads
THE GOOD OF DATA WAREHOUSES
• Pristine data
• Transactional reliability
• Fast SQL queries
DELTA
The RELIABILITY & SCALE of the data lake
The LOW LATENCY of streaming
The PERFORMANCE of the data warehouse
DELTA = Scalable storage + Transactional log
DELTA
pathToTable/
+---- 000.parquet        table data stored as Parquet files
+---- 001.parquet        on HDFS, AWS S3, or Azure Blob Stores
+---- 002.parquet
+ ...
|
+---- _delta_log/        transactional log: a sequence of metadata
      +---- 000.json     files that track the operations made on the
      +---- 001.json     table, stored in the same scalable storage
      ...                as the table itself
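A minimal PySpark sketch of this layout, assuming the open-source delta-spark package is installed and configured as in recent releases, and using the hypothetical path /data/pathToTable: writing produces Parquet data files plus JSON commits under _delta_log/, and reading resolves the log into a consistent snapshot.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-layout-demo")
    # These two configs enable Delta Lake on open-source Spark.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/data/pathToTable"  # hypothetical location

# Writing creates Parquet data files plus JSON commits under _delta_log/.
spark.range(0, 1000).write.format("delta").mode("overwrite").save(path)

# Reading resolves the log into a consistent snapshot of the table.
df = spark.read.format("delta").load(path)
print(df.count())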
Log Structured Storage
Changes to the table are stored as ordered, atomic commits
Each commit is a set of actions saved as a file in the directory _delta_log/

_delta_log/000.json (INSERT actions)      _delta_log/001.json (UPDATE actions)
  Add 001.parquet                           Remove 001.parquet
  Add 002.parquet                           Remove 002.parquet
                                            Add 003.parquet
Log Structured Storage
Readers read the log in atomic units, and thus read consistent snapshots

_delta_log/000.json (INSERT actions)      _delta_log/001.json (UPDATE actions)
  Add 001.parquet                           Remove 001.parquet
  Add 002.parquet                           Remove 002.parquet
                                            Add 003.parquet

Readers will see either [001+002].parquet or 003.parquet, and nothing in between
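Since each commit file is just newline-delimited JSON, the actions can be inspected directly. A minimal sketch, assuming the hypothetical table above lives on the local filesystem:

import json, os

log_dir = "/data/pathToTable/_delta_log"   # hypothetical path
for name in sorted(os.listdir(log_dir)):
    if not name.endswith(".json"):
        continue
    print("-- commit", name, "--")
    with open(os.path.join(log_dir, name)) as f:
        for line in f:                      # one JSON action per line
            action = json.loads(line)
            # typical keys: "commitInfo", "metaData", "add", "remove"
            print(list(action.keys()))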
Mutual Exclusion
Concurrent writers need to agree on the order of changes
New commit files must be created mutually exclusively:
exactly one of the writers trying to concurrently write 002.json must succeed

  000.json
  001.json
  002.json   <- Writer 1 vs. Writer 2
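A minimal sketch of the put-if-absent primitive this relies on, using the local filesystem (a hypothetical helper, not Delta's actual implementation); on cloud storage the equivalent atomic operation of the underlying store is used instead, as the next slide shows.

import json, os

def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Atomically create <version>.json; fail if another writer got there first."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one writer wins.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False   # lost the race; the caller must re-read the log and retry
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True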
Challenges with Cloud Storage
Different cloud storage systems have different semantics for providing atomic guarantees

Cloud Storage                       Atomic visibility   Atomic put-if-absent   Solution
Azure Blob Store, Azure Data Lake   ✘                   ✔                      Write to a temp file, then rename to the final file if not present
AWS S3                              ✔                   ✘                      Separate service performs all writes directly (single writer)
Concurrency Control
Pessimistic Concurrency
  Block others from writing anything
  Hold a lock, write the data files, commit to the log
  ✔ Avoids wasted work
  ✘ Requires distributed locks
Optimistic Concurrency
  Assume it will be okay and write the data files
  Try to commit to the log, fail on conflict
  ✔ Mutual exclusion is enough!
  ✘ Breaks down if there are a lot of conflicts
Optimistic concurrency is enough, as write concurrency is usually low
Solving Conflicts Optimistically
1. Record the start version
2. Record reads/writes
3. If someone else wins, check whether anything you read has changed
4. Try again

  User 1: R: A, W: B        User 2: R: A, W: C        (both start from 000000.json)

New file C does not conflict with new file B, so the losing writer retries and
commits successfully as 000002.json
Solving Conflicts Optimistically
1. Record the start version
2. Record reads/writes
3. If someone else wins, check whether anything you read has changed
4. Try again

  User 1: R: A, W: A,B      User 2: R: A, W: A,C      (both start from 000000.json)

The deletion of file A by User 1 conflicts with the deletion by User 2,
so User 2's operation fails
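Putting the earlier sketches together, the optimistic commit loop looks roughly like this (hypothetical helpers; try_commit is from the mutual-exclusion sketch, and conflict detection is simplified to "did a later commit remove a file I read?"):

import json, os

def read_log(log_dir):
    """Return (latest version, actions per version) from the JSON commits."""
    commits = {}
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            with open(os.path.join(log_dir, name)) as f:
                commits[int(name.split(".")[0])] = [json.loads(line) for line in f]
    return (max(commits) if commits else -1), commits

def commit_optimistically(log_dir, files_read, actions, max_retries=10):
    start_version, _ = read_log(log_dir)            # 1. record the start version
    for _ in range(max_retries):
        target = start_version + 1
        if try_commit(log_dir, target, actions):    # mutual exclusion picks the winner
            return target
        # Someone else wrote `target` first: re-read the log and check for conflicts.
        start_version, commits = read_log(log_dir)
        for version, acts in commits.items():
            if version < target:
                continue
            removed = {a["remove"]["path"] for a in acts if "remove" in a}
            if removed & set(files_read):           # 3. did anything we read change?
                raise RuntimeError("conflict: a file we read was removed")
        # No conflict with our reads, so retry on top of the new version (4. try again).
    raise RuntimeError("gave up after too many retries")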
Metadata/Checkpoints as Data
Large tables can have millions of files in them!
Even pulling them out of the Hive metastore [MySQL] would be a bottleneck.

The log is periodically rolled up into a checkpoint that captures the net state
of the table, so readers do not have to replay every commit:

  Add 1.parquet
  Add 2.parquet
  Remove 1.parquet     ->   Checkpoint
  Remove 2.parquet
  Add 3.parquet
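Because checkpoints are plain Parquet, the rolled-up table state can be read back with Spark like any other dataset. A sketch, reusing the SparkSession from the earlier sketch and assuming a checkpoint was written at version 10 of the hypothetical table (file naming as in open-source Delta Lake):

checkpoint = spark.read.parquet(
    "/data/pathToTable/_delta_log/00000000000000000010.checkpoint.parquet"
)
# The checkpoint holds the same kinds of actions as the JSON commits, rolled up.
checkpoint.select("add.path", "add.size").where("add IS NOT NULL").show()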
Challenges solved: Reliability
Problem: Failed production jobs leave data in a corrupt state, requiring tedious recovery
Solution: Failed write jobs do not update the commit log,
hence partial / corrupt files are not visible to readers
Challenges solved: Reliability
Challenge: Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads
Solution:
- All reads have full snapshot consistency
- All successful writes are consistent
- In practice, most writes don't conflict
- Tunable isolation levels (serializability by default)
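Snapshot reads also enable time travel: every commit defines a complete snapshot, so the table can be read as of any past version. A sketch against the hypothetical table used earlier (versionAsOf is the Delta Lake reader option for picking a snapshot):

current = spark.read.format("delta").load("/data/pathToTable")
as_of_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)      # the snapshot right after the first commit
    .load("/data/pathToTable")
)
print(current.count(), as_of_v0.count())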
Challenges solved: Reliability
Challenge: Lack of schema enforcement creates inconsistent and low-quality data
Solution:
- Schema is recorded in the log
- Attempts to commit data with an incorrect schema fail
- Explicit schema evolution is allowed
- Invariant and constraint checks are allowed (high data quality)
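A minimal sketch of enforcement versus explicit evolution, reusing the SparkSession from the earlier sketch on a fresh hypothetical path /data/eventsTable; mergeSchema is the open-source Delta Lake option for opting into new columns:

from pyspark.sql.functions import lit

events = "/data/eventsTable"   # hypothetical path
base = spark.range(0, 10).withColumn("category", lit("a"))
base.write.format("delta").mode("overwrite").save(events)

# Appending a DataFrame with an extra column is rejected by schema enforcement ...
evolved = base.withColumn("score", lit(1.0))
try:
    evolved.write.format("delta").mode("append").save(events)
except Exception as e:
    print("rejected:", type(e).__name__)

# ... unless schema evolution is requested explicitly.
evolved.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(events)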
Challenges solved: Performance
Challenge: Too many small files increase resource usage significantly
Solution: Transactionally performed compaction using OPTIMIZE

  OPTIMIZE table WHERE date = '2019-04-04'
Challenges solved: Performance
Challenge: Partitioning breaks down with many dimensions and/or high-cardinality columns
Solution: Optimize using multi-dimensional clustering on multiple columns

  OPTIMIZE conns WHERE date = '2019-04-04'
  ZORDER BY (srcIP, destIP)
Querying connection data at Apple
Ad-hoc queries of connection data based on different columns

Connections: date, srcIp, dstIp
  > PBs of data
  > trillions of rows
Partitioning is a poor fit because the cardinality of these columns is high

  SELECT count(*) FROM conns
  WHERE date = '2019-04-04' AND srcIp = '1.1.1.1'

  SELECT count(*) FROM conns
  WHERE date = '2019-04-04' AND dstIp = '1.1.1.1'
Multidimensional Sorting
[Figure: an 8x8 grid of (srcIp, dstIp) values illustrating multidimensional
sorting; with an ideal file size of 4 rows, the srcIp query below is annotated
as touching only 2 files]

  SELECT count(*) FROM conns
  WHERE date = '2019-04-04' AND srcIp = '1.1.1.1'

  SELECT count(*) FROM conns
  WHERE date = '2019-04-04' AND dstIp = '1.1.1.1'
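For intuition, the bit interleaving behind Z-ordering can be sketched in a few lines of Python (illustrative only; Delta's actual implementation z-orders range-partition ids rather than raw values):

def z_value(x: int, y: int, bits: int = 3) -> int:
    """Interleave the low `bits` bits of x and y into a single sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sorting an 8x8 grid of (srcIp, dstIp) indices by z_value groups cells into
# 2x2 blocks, so a file of 4 rows covers a small range of BOTH columns and
# either query above can skip most files.
cells = sorted(
    ((src, dst) for src in range(8) for dst in range(8)),
    key=lambda c: z_value(c[0], c[1]),
)
files = [cells[i:i + 4] for i in range(0, len(cells), 4)]   # ideal file = 4 rows
print(files[0])   # [(0, 0), (1, 0), (0, 1), (1, 1)]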