Innovation at AWS
Eric Ferreira (ericfe@amazon.com), Principal Database Engineer, Amazon Redshift
The Amazon Flywheel. Focus on things that stay the same: Price, Selection, Delivery.
Applying this at AWS
Amazon Redshift: focus on things that stay the same: Performance, Value, Simplicity.
Adopt a retail mindset
Customers have choice. Delight them and they'll stay. Earn their business one hour at a time.
Start with the Customer. Work Backwards.
What Do Customers Want? • What problems are customers facing? • How will my service alleviate this pain? • Why will this idea delight customers? • Why can I do this better than anyone else?
What we heard from customers about DW • Complicated to install, maintain, operate • Require large upfront payments • Too expensive • Always running out of capacity
Press Release: describe the product in terms of customer value. Why will customers care? Is it newsworthy? How is this differentiated?
FAQ: answer customer questions. How does this help me? How do I get started? How will this work with my ETL/BI tools? When should I use this vs. Hadoop?
2 Pizza Teams
• An individual team should be no larger than can be fed by two pizzas.
• Beyond this size, you define contracts and interfaces with other teams.
• Attention is a scarce resource. Time is a scarce resource.
• Apply attention and time to changing reality, not communicating status.
Build the Product: Assemble a Team → Internal Beta → Private Beta → Build → Launch → Iterate
Iterate
Add features that matter → Raise value → Increase adoption → Get feedback → (repeat)
Redshift pushes a new DB version every two weeks; 120+ features since launch:
• Service Launch (2/14)
• PDX (4/2)
• Temp Credentials (4/11)
• DUB (4/25)
• Distributed Tables, Audit SOC1/2/3 (5/8)
• NRT (6/5)
• Unload logs (7/5)
• Sharing snapshots (7/18)
• CRC32 Builtin, CSV, Restore Progress (8/9)
• PCI (8/22)
• JSON, Regex, Cursors (9/10)
• Split_part, Audit tables (10/3)
• SIN/SYD (10/8)
• HSM Support (11/11)
• Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
• Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
• EIP Support for VPC Clusters (12/28)
• New query monitoring system tables and diststyle all (1/13)
• Redshift on DW2 (SSD) Nodes (1/23)
• Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
• Resize progress indicator & Cluster Version (3/21)
• Regex_Substr, COPY from JSON (3/25)
• 50 slots, COPY from EMR, ECDHE ciphers (4/22)
• 3 new regex features, Unload to single file, FedRAMP (5/6)
• JDBC Fetch Size (6/27)
• Kinesis EMR/HDFS/SSH copy, SHA1 Builtin (7/15)
• 4 byte UTF-8 (7/18)
• Statement Timeout (7/22)
• Timezone, Epoch, Autoformat (7/25)
• WLM Timeout/Wildcards (8/1)
• Resource Level IAM (8/9)
• UTF-8 Substitution (8/29)
• Unload Encrypted Files
The AWS analytics portfolio, from collection to analysis: Collect (Kinesis, AWS IoT, AWS Import/Export Snowball, Direct Connect, AWS Database Migration Service), Store (S3, Glacier, DynamoDB), Analyze (Athena, EMR, Redshift, Machine Learning, Elasticsearch, QuickSight, EC2, Lambda, AWS Glue).
Collection & Storage: Amazon S3
• Store anything; object storage
• Designed for 99.999999999% durability
• Scalable & cost effective; $0.023/GB-month
• Integrated with Amazon Glacier
• Support for multiple encryption methods; integrated with AWS KMS, with support for external HSMs
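As a minimal boto3 sketch of the encryption integration mentioned above: uploading an object with server-side encryption under a KMS key. The bucket name, object key, and key alias are hypothetical placeholders.

import boto3

s3 = boto3.client("s3")

# Upload an object, encrypting it at rest with a customer-managed KMS key.
# "my-analytics-bucket" and "alias/my-data-key" are placeholder names.
s3.put_object(
    Bucket="my-analytics-bucket",
    Key="raw/events/2017-06-01.json",
    Body=open("events.json", "rb"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-key",
)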
Data Management & ETL: AWS Glue
• Hive Metastore-compatible data catalog with integrated crawlers for schema, data type, and partition inference
• Generates Python code to move data from source to destination
• Edit jobs using your favorite IDE and share snippets via Git
• Runs jobs in Spark containers that auto-scale based on SLA
• Serverless with no infrastructure to manage; pay only for the resources you consume
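A minimal sketch of the kind of PySpark job Glue generates, assuming a catalog database "sales_db" with a crawled table "orders"; the database, table, and S3 path are placeholders.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog and write curated Parquet back to S3.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/curated/orders/"},
    format="parquet",
)
job.commit()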
Amazon RDS for Aurora
• MySQL compatible with up to 5x better performance on the same hardware: 100,000 writes/sec & 500,000 reads/sec
• Scalable with up to 64 TB in a single database, up to 15 read replicas
• Highly available, durable, and fault-tolerant custom SSD storage layer: 6-way replicated across 3 Availability Zones
• Transparent encryption for data at rest using AWS KMS
• Stored procedures in Aurora can invoke AWS Lambda functions
• MySQL & PostgreSQL compatible engines
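Because Aurora is MySQL compatible, standard drivers work unchanged; a minimal PyMySQL sketch, where the cluster endpoint, credentials, database, and table are all placeholders.

import pymysql

# Connect to the Aurora cluster endpoint exactly as you would to MySQL.
conn = pymysql.connect(
    host="mycluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="...",  # prefer IAM auth or a secrets store in practice
    database="appdb",
)
with conn.cursor() as cur:
    cur.execute("SELECT id, total FROM orders WHERE dt = %s", ("2017-06-01",))
    for row in cur.fetchall():
        print(row)
conn.close()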
Structured Data Processing: Amazon Redshift
• Petabyte-scale relational, MPP data warehousing clusters with the ability to join across exabytes of data in S3 using Redshift Spectrum, a serverless scale-out query layer that charges $5/TB scanned
• Fully managed with SSD and HDD platforms
• Built-in end-to-end security, including customer-managed keys
• Fault tolerant: automatically recovers from disk and node failures
• Data automatically backed up to Amazon S3, with cross-region backup capability for global disaster recovery
• $1,000/TB/year; start at $0.25/hour. Provision in minutes; scale from 160 GB to 2 PB of compressed data with just a few clicks
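A sketch of querying S3-resident data through Redshift Spectrum from Python with psycopg2. The cluster endpoint, credentials, and the external schema/table (assumed already defined over the Glue/Athena catalog) are placeholders.

import psycopg2

conn = psycopg2.connect(
    host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...")
cur = conn.cursor()
# spectrum_schema.clickstream is a hypothetical external table over S3;
# the join-to-exabytes capability comes from scanning S3 at query time.
cur.execute("""
    SELECT loc, COUNT(*)
    FROM spectrum_schema.clickstream
    GROUP BY loc
    ORDER BY 2 DESC
    LIMIT 10;
""")
print(cur.fetchall())
conn.close()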
Semi-structured / Unstructured Data Processing: Amazon EMR
• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
• Release 5.3: Hadoop 2.7.3, Hive 2.1, Spark 2.1, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink
  – New applications added within 30 days of their open source release
• Fully managed, autoscaling clusters with support for on-demand and spot pricing
• Support for HDFS and S3 filesystems, enabling separated compute and storage; multiple clusters can run against the same data in S3
• HIPAA-eligible. Support for end-to-end encryption, IAM/VPC, S3 client-side encryption with customer managed keys and AWS KMS
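A sketch of launching a small release-5.3 cluster with Spark and Hive via boto3; the cluster name, instance sizing, and log bucket are placeholder choices.

import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.3.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://my-analytics-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])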
Serverless Query Processing: Amazon Athena
• Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Use standard ANSI SQL queries with support for joins, JSON, and window functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, Parquet
• Pay per query, only when you're running queries, based on data scanned. If you compress your data, you pay less and your queries run faster
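A minimal boto3 sketch of the query-in-place flow: submit SQL, poll until it finishes, then read results. The database, table, and S3 output location are placeholders.

import time
import boto3

athena = boto3.client("athena")
qid = athena.start_query_execution(
    QueryString="SELECT loc, COUNT(*) FROM audience GROUP BY loc",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)["QueryExecutionId"]

# Athena runs asynchronously; poll the execution state until terminal.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])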
Serverless Event Processing: AWS Lambda
• Serverless compute service that runs your code in response to events
• Extend AWS services with user-defined custom logic
• Write custom code in Node.js, Python, and Java
• Pay only for the requests served and compute time required; billing in increments of 100 milliseconds
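A sketch of a Python handler reacting to S3 "object created" events; the event wiring (which bucket and prefix trigger the function) is configured outside this code.

import json

def lambda_handler(event, context):
    # Each S3 notification can batch multiple records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print("New object: s3://{}/{}".format(bucket, key))
    return {"statusCode": 200, "body": json.dumps("ok")}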
Stream Processing: Amazon Kinesis
• Real-time stream processing
• High throughput; elastic
• Highly available; data replicated across multiple Availability Zones with configurable retention
• S3, Redshift, DynamoDB integrations
• Kinesis Streams for custom streaming applications; Kinesis Firehose for easy integration with Amazon S3 and Redshift; Kinesis Analytics for streaming SQL
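A minimal producer sketch for Kinesis Streams with boto3; the stream name is a placeholder. Records sharing a partition key land on the same shard, preserving their relative order.

import json
import boto3

kinesis = boto3.client("kinesis")
event = {"aid": 1, "loc": "SFO", "dt": "2016-09-01"}
# "clickstream" is a hypothetical, pre-created stream.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event),
    PartitionKey=str(event["aid"]),
)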
Search and Operational Analytics: Amazon Elasticsearch Service
• Distributed search and analytics engine
• Managed service using Elasticsearch and Kibana
• Fully managed; zero admin
• Highly available and reliable
• Tightly integrated with other AWS services
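A sketch of indexing a document into a domain using the elasticsearch and requests-aws4auth packages to sign requests with SigV4; the domain endpoint and index name are placeholders.

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

session = boto3.Session()
creds = session.get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key,
                   session.region_name, "es", session_token=creds.token)

# "search-mydomain-abc123..." is a hypothetical domain endpoint.
es = Elasticsearch(
    hosts=[{"host": "search-mydomain-abc123.us-east-1.es.amazonaws.com",
            "port": 443}],
    http_auth=awsauth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection,
)
es.index(index="audience", doc_type="event",
         body={"aid": 1, "loc": "SFO", "dt": "2016-09-01"})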
Predictive Applications: Amazon ML
• Easy-to-use, managed service built for developers; deploy models in seconds
• Robust, powerful technology based on Amazon's internal systems
• Create models using your data already stored in the AWS cloud; deploy models in batch and real-time modes
• Spark on Amazon EMR also available for custom machine learning applications
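A sketch of the real-time mode with boto3: look up a model's endpoint and request a single prediction. The model ID and record fields are placeholders, and a real-time endpoint is assumed to have been created already.

import boto3

ml = boto3.client("machinelearning")
model = ml.get_ml_model(MLModelId="ml-abc123DEF")  # hypothetical model ID
prediction = ml.predict(
    MLModelId="ml-abc123DEF",
    Record={"loc": "SFO", "dt": "2017-04-01"},
    PredictEndpoint=model["EndpointInfo"]["EndpointUrl"],
)
print(prediction["Prediction"])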
Business Intelligence: Amazon QuickSight
• Fast and cloud-powered
• Easy to use; no infrastructure to manage
• Scales to hundreds of thousands of users
• Quick calculations with SPICE
• 1/10th the cost of legacy BI software
Amazon Redshift
Amazon Redshift: OLAP, MPP, columnar, PostgreSQL-based. Built on AWS services: Amazon SWF, Amazon VPC, Amazon EC2, AWS IAM, Amazon S3, Amazon CloudWatch, Amazon Route 53, AWS KMS.
Redshift Cluster Architecture
• Massively parallel, shared nothing
• Leader node (128GB RAM, 16 cores, 16TB disk)
  – SQL endpoint (JDBC/ODBC for SQL clients/BI tools)
  – Stores metadata
  – Coordinates parallel SQL processing
• Compute nodes (128GB RAM, 16 cores, 16TB disk each), interconnected by 10 GigE (HPC)
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore
• Ingestion and backup/restore via S3 / EMR / DynamoDB / SSH
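A sketch of the ingestion path: issue a COPY through the leader node and the compute nodes load slices from S3 in parallel. The table, bucket path, and IAM role ARN are placeholders.

import psycopg2

conn = psycopg2.connect(
    host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...")
conn.autocommit = True
cur = conn.cursor()
# COPY is coordinated by the leader node; each compute node pulls its
# share of the files directly from S3.
cur.execute("""
    COPY audience
    FROM 's3://my-analytics-bucket/raw/audience/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
""")
conn.close()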
Brute force only takes you so far…
Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps

CREATE TABLE audience (
   aid INT        --audience_id
  ,loc CHAR(3)    --location
  ,dt  DATE       --date
);

aid | loc | dt
  1 | SFO | 2016-09-01
  2 | JFK | 2016-09-14
  3 | SFO | 2017-04-01
  4 | JFK | 2017-05-14

• Accessing dt with row storage:
  – Need to read everything
  – Unnecessary I/O
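A sketch of the same table declared with explicit column compression encodings and a sort key, so zone maps can skip blocks on dt-range predicates; the encoding choices, distribution key, and connection details are illustrative assumptions, not the slide's prescription.

import psycopg2

ddl = """
CREATE TABLE audience (
     aid INT      ENCODE delta     -- audience_id
    ,loc CHAR(3)  ENCODE bytedict  -- location
    ,dt  DATE     ENCODE delta32k  -- date
)
DISTKEY (aid)   -- spread rows across compute nodes by audience_id
SORTKEY (dt);   -- zone maps prune blocks for date-range filters
"""
conn = psycopg2.connect(
    host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute(ddl)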