Innovation at AWS: Eric Ferreira, ericfe@amazon.com


  1. Innovation at AWS • Eric Ferreira, ericfe@amazon.com • Principal Database Engineer, Amazon Redshift

  2. The Amazon Flywheel • Focus on things that stay the same: Price, Selection, Delivery

  3. Applying this at AWS

  4. Focus on things that stay the same • Amazon Redshift: Performance, Value, Simplicity

  5. Adopt a retail mindset

  6. Customers have choice • Delight them and they’ll stay • Earn their business one hour at a time

  7. Start with the Customer • Work Backwards

  8. What Do Customers Want? • What problems are customers facing? • How will my service alleviate this pain? • Why will this idea delight customers? • Why can I do this better than anyone else?

  9. What we heard from customers about DW • Complicated to install, maintain, operate • Require large upfront payments • Too expensive • Always running out of capacity

  10. Press Release • Describe the product in terms of customer value • Why will customers care? • Is it newsworthy? • How is this differentiated?

  11. FAQ • Answer customer questions • How does this help me? • How do I get started? • How will this work with my ETL/BI tools? • When should I use this vs. Hadoop?

  12. 2 pizza teams • An individual team should be no larger than can be fed by two pizzas. • Beyond this size, you define contracts and interfaces with other teams. • Attention is a scarce resource. Time is a scarce resource. • Apply attention and time to changing reality, not communicating status.

  13. Build the Product • Assemble a Team • Build • Internal Beta • Private Beta • Launch • Iterate

  14. Iterate

  15. Add Features that matter • Get Feedback • Increase Value • Raise Adoption

  16. Redshift pushes a new DB version every two weeks. 120+ features since launch. [Timeline figure of feature releases by date (month/day). Legible entries include: Unload logs (7/5) • Temp Credentials (4/11) • Sharing snapshots (7/18) • DUB (4/25) • Resource Level IAM (8/9) • SOC1/2/3 (5/8) • FedRAMP (5/6) • Statement Timeout (7/22) • WLM Timeout/Wildcards (8/1) • Cross Region Backup (11/13) • UTF-8 Substitution (8/29) • JDBC Fetch Size (6/27) • Resize progress indicator & Cluster Version (3/21) • Service Launch (2/14) • diststyle all (1/13) • EIP Support for VPC Clusters (12/28) • ECDHE ciphers (4/22) • Redshift on DW2 (SSD) Nodes (1/23) • PCI (8/22) • COPY from JSON (3/25) • SIN/SYD (10/8) • PDX (4/2) • JSON, Regex, Cursors (9/10) • NRT (6/5) • HSM Support (11/11) • Timezone, Epoch, Autoformat (7/25) • 4 byte UTF-8 (7/18)]

  17. Collect • Store • Analyze [Diagram of services: Athena, EMR, AWS Import/Export Snowball, Direct Connect, S3, Glacier, Machine Learning, Redshift, AWS IoT, Kinesis, DynamoDB, Elasticsearch, QuickSight, EC2, Lambda, AWS Glue, AWS Database Migration Service]

  18. Collection & Storage: Amazon S3 • Store anything • Object storage • Designed for 99.999999999% durability • Scalable & cost effective; $0.023/GB-Mo • Integrated with Amazon Glacier • Support for multiple encryption methods; integrated with AWS KMS, with support for external HSMs

  19. Data Management & ETL: AWS Glue • Hive Metastore-compatible data catalog with integrated crawlers for schema, data type, and partition inference • Generates Python code to move data from source to destination • Edit jobs using your favorite IDE and share snippets via Git • Runs jobs in Spark containers that auto-scale based on SLA • Serverless with no infrastructure to manage; pay only for the resources you consume

  20. Amazon RDS for Aurora • MySQL compatible with up to 5x better performance on the same hardware: 100,000 writes/sec & 500,000 reads/sec • Scalable with up to 64 TB in a single database, up to 15 read replicas • Highly available, durable, and fault-tolerant custom SSD storage layer: 6-way replicated across 3 Availability Zones • Transparent encryption for data at rest using AWS KMS • Stored procedures in Aurora can invoke AWS Lambda functions • MySQL & PostgreSQL compatible engines

  21. Structured Data Processing: Amazon Redshift • Petabyte-scale relational, MPP, data warehousing clusters with the ability to join across exabytes of data in S3 using Redshift Spectrum, a serverless scale-out query layer that charges $5/TB scanned • Fully managed with SSD and HDD platforms • Built-in end-to-end security, including customer-managed keys • Fault tolerant. Automatically recovers from disk and node failures • Data automatically backed up to Amazon S3 with cross-region backup capability for global disaster recovery • $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale from 160GB to 2PB of compressed data with just a few clicks
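The per-scan pricing quoted on this slide lends itself to a quick back-of-the-envelope estimate. The sketch below is only illustrative: the $5/TB figure comes from the slide, while the function name, the decimal definition of a terabyte, and the 10% pruning scenario are assumptions made for the example, not details from the talk.

```python
# Price per terabyte scanned, as quoted on the slide.
PRICE_PER_TB_USD = 5.0
# Assumption: 1 TB = 10**12 bytes; the unit actually used for billing may differ.
BYTES_PER_TB = 10**12


def scan_cost_usd(bytes_scanned: int) -> float:
    """Estimate the charge for a query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical scenario: a query over 2 TB of data, versus the same query
# when partitioning or a columnar format lets it read only 10% of the bytes.
full_scan = scan_cost_usd(2 * 10**12)
pruned_scan = scan_cost_usd(2 * 10**11)
print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

Because the charge is proportional to bytes scanned, anything that reduces the data read (compression, partitioning, columnar formats) reduces the bill by the same factor.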

  22. Semi-structured / Unstructured Data Processing: Amazon EMR • Hadoop, Hive, Presto, Spark, Tez, Impala etc. – Release 5.3: Hadoop 2.7.3, Hive 2.1, Spark 2.1, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink – New applications added within 30 days of their open source release • Fully managed, autoscaling clusters with support for on-demand and spot pricing • Support for HDFS and S3 filesystems enabling separated compute and storage; multiple clusters can run against the same data in S3 • HIPAA-eligible. Support for end-to-end encryption, IAM/VPC, S3 client-side encryption with customer managed keys and AWS KMS

  23. Serverless Query Processing: Amazon Athena • Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage • No data loading required; query directly from Amazon S3 • Use standard ANSI SQL queries with support for joins, JSON, and window functions • Support for multiple data formats including text, CSV, TSV, JSON, Avro, ORC, Parquet • Pay per query only when you’re running queries based on data scanned. If you compress your data, you pay less and your queries run faster

  24. Serverless Event Processing: AWS Lambda • Serverless compute service that runs your code in response to events • Extend AWS services with user-defined custom logic • Write custom code in Node.js, Python, and Java • Pay only for the requests served and compute time required; billing in increments of 100 milliseconds
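Billing in fixed increments means a run's duration is rounded up to the next increment boundary. A minimal sketch of that rounding, using the 100 ms increment stated on the slide (the function name is hypothetical, and no per-request or per-millisecond prices are modeled because the slide gives none):

```python
import math

# Billing increment in milliseconds, as stated on the slide.
BILLING_INCREMENT_MS = 100


def billed_duration_ms(actual_ms: float) -> int:
    """Round an execution time up to the next billing increment."""
    if actual_ms < 0:
        raise ValueError("actual_ms must be non-negative")
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


for actual in (1, 100, 101, 250):
    print(actual, "ms runs are billed as", billed_duration_ms(actual), "ms")
```

Under this rounding, a function that finishes in 101 ms is billed the same as one that takes 200 ms, which is why shaving a few milliseconds only matters near an increment boundary.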

  25. Stream Processing: Amazon Kinesis • Real-time stream processing • High throughput; elastic • Highly available; data replicated across multiple Availability Zones with configurable retention • S3, Redshift, DynamoDB integrations • Kinesis Streams for custom streaming applications; Kinesis Firehose for easy integration with Amazon S3 and Redshift; Kinesis Analytics for streaming SQL

  26. Search and Operational Analytics: Amazon Elasticsearch Service • Distributed search and analytics engine • Managed service using Elasticsearch and Kibana • Fully managed; zero admin • Highly available and reliable • Tightly integrated with other AWS services

  27. Predictive Applications: Amazon ML • Easy to use, managed service built for developers; deploy models in seconds • Robust, powerful technology based on Amazon’s internal systems • Create models using your data already stored in the AWS cloud; deploy models in batch and real-time modes • Spark on Amazon EMR also available for custom machine learning applications

  28. Business Intelligence: Amazon QuickSight • Fast and cloud-powered • Easy to use, no infrastructure to manage • Scales to 100s of thousands of users • Quick calculations with SPICE • 1/10th the cost of legacy BI software

  29. Amazon Redshift

  30. Amazon Redshift: OLAP, MPP, Columnar, PostgreSQL [Diagram of related AWS services: Amazon SWF, Amazon VPC, Amazon EC2, AWS IAM, Amazon S3, AWS KMS, Amazon CloudWatch, Amazon Route 53]

  31. Redshift Cluster Architecture • Massively parallel, shared nothing • Leader node – SQL endpoint – Stores metadata – Coordinates parallel SQL processing • Compute nodes – Local, columnar storage – Executes queries in parallel – Load, backup, restore [Diagram: SQL Clients/BI Tools connect over JDBC/ODBC to the Leader Node (128GB RAM, 16 cores, 16TB disk), which connects over 10 GigE (HPC) to three Compute Nodes (each 128GB RAM, 16 cores, 16TB disk); Ingestion, Backup, and Restore go through S3 / EMR / DynamoDB / SSH]
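The leader/compute split on this slide follows a general scatter-gather pattern: data is spread across nodes, each node aggregates its own slice, and the leader combines the partial results. The toy sketch below illustrates that pattern only; it is not Redshift's implementation. The round-robin distribution, the thread pool standing in for compute nodes, and all function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_nodes):
    """Spread rows round-robin across compute nodes (a stand-in distribution scheme)."""
    slices = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        slices[i % num_nodes].append(row)
    return slices


def compute_node_partial(local_rows):
    """Each compute node aggregates only the rows it stores locally."""
    return sum(local_rows), len(local_rows)


def leader_average(rows, num_nodes=3):
    """Leader: fan the work out, then merge per-node (sum, count) pairs."""
    slices = distribute(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(compute_node_partial, slices))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


print(leader_average([10, 20, 30, 40, 50, 60]))  # 35.0
```

The leader never touches individual rows here; it only merges small partial results, which is what lets this style of processing scale with the number of compute nodes.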

  32. Brute force only takes you so far…

  33. Designed for I/O Reduction • Columnar storage • Data compression • Zone maps

      CREATE TABLE audience (
         aid INT     --audience_id
        ,loc CHAR(3) --location
        ,dt  DATE    --date
      );

      aid  loc  dt
      1    SFO  2016-09-01
      2    JFK  2016-09-14
      3    SFO  2017-04-01
      4    JFK  2017-05-14

      • Accessing dt with row storage: – Need to read everything – Unnecessary I/O
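To make the I/O argument concrete, the sketch below counts the bytes that must be read to access only dt under row storage versus columnar storage, and then shows how a zone map (per-block min/max values) can skip blocks entirely. The four rows are those on the slide; the byte widths, the block size of two rows, and all function names are assumptions for illustration, not details from the talk.

```python
# The four example rows from the slide: (aid, loc, dt).
ROWS = [
    (1, "SFO", "2016-09-01"),
    (2, "JFK", "2016-09-14"),
    (3, "SFO", "2017-04-01"),
    (4, "JFK", "2017-05-14"),
]

# Assumed on-disk widths in bytes: INT = 4, CHAR(3) = 3, DATE = 4.
WIDTHS = {"aid": 4, "loc": 3, "dt": 4}


def bytes_read_row_store(num_rows):
    """Row storage: every row is read in full, whichever column is wanted."""
    return num_rows * sum(WIDTHS.values())


def bytes_read_column_store(num_rows, column):
    """Columnar storage: only the requested column's values are read."""
    return num_rows * WIDTHS[column]


def build_zone_map(values, block_size):
    """Record (min, max) for each block of `block_size` consecutive values."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, lo, hi):
    """Indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zone_map) if bmax >= lo and bmin <= hi]


dt_column = [row[2] for row in ROWS]
zone_map = build_zone_map(dt_column, block_size=2)

print(bytes_read_row_store(len(ROWS)))                        # 44
print(bytes_read_column_store(len(ROWS), "dt"))               # 16
print(blocks_to_scan(zone_map, "2017-01-01", "2017-12-31"))   # [1]
```

ISO-formatted date strings compare correctly as plain strings, which keeps the zone-map check to a simple range-overlap test. In this example a filter on 2017 dates reads only the second block of the dt column.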
