Cloud Big Data Architectures
Lynn Langit
QCon Sao Paulo, Brazil 2016
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1. Big Data Solution Types
2. Data Pipelines
3. ETL and Visualization
Bonus …(if time allows) 4.
Save ALL of your Data
What is the ACTUAL Cost of
✘ Saving all Data
✘ Using newer technologies
✘ Going beyond Relational
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
Bonus …(if time allows) 4.
1. Big Data – Yes! But what kind?
Pattern 1
✘ Which type(s) of Big Data work best?
-- when to use Hadoop
-- when to use NoSQL and which type, i.e. key-value, document, graph, etc.
-- when to use Big Relational and what type of workload for hot, warm or cold data
Choice … is good, right?
When do I use …?
✘ Hadoop
✘ NoSQL
✘ Big Relational
Size Matters
One Vendor’s View
Where is Hadoop Used?
Hadoop is your LAST CHOICE
✘ Volume
-- 10 TB or greater to start
-- Growth of 25% YOY
-- Where FROM
-- Where TO
✘ Velocity and Variety
-- Spark over Hive
-- Kafka and Samza
-- Complexity of ecosystem
-- Cloudera knows best
✘ Veracity
-- Pay, train and hire team
-- Top $$$ for talent
-- IF you can find it
-- WATCH OUT for Cloud Vendors who promise ‘easy access’
When do I use …?
✘ Hadoop
✘ NoSQL
✘ Big Relational
225 NoSQL Database Types to Choose From
Let’s review some NoSQL concepts
✘ Key-Value: Redis, Riak, Aerospike
✘ Graph: Neo4j
✘ Document: MongoDB
✘ Wide-Column: Cassandra, HBase
Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc.
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
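The Volume question ("how much now, what growth rate?") comes down to compound-growth arithmetic. A minimal sketch, assuming yearly compounding; the function name and the sample inputs (10 TB, 25% per year, 3 years) are illustrative assumptions, not measurements from any project.

```python
def project_volume_tb(current_tb: float, yearly_growth: float, years: int) -> list[float]:
    """Return the projected volume in TB at the end of each year.

    Assumes growth compounds once per year at a constant rate,
    which is a simplification of real-world data growth.
    """
    volumes = []
    volume = current_tb
    for _ in range(years):
        volume *= 1.0 + yearly_growth  # apply one year of growth
        volumes.append(volume)
    return volumes


# Hypothetical inputs: 10 TB today, growing 25% per year, projected over 3 years.
for year, volume in enumerate(project_volume_tb(10.0, 0.25, 3), start=1):
    print(f"year {year}: {volume:.2f} TB")
```

Running the projection before choosing a store makes it easier to see whether the data will stay within relational limits or outgrow them.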
NoSQL Example
✘ Open Source is Free
-- Rapid iteration, innovation
-- Can start up for free (on premise)
-- Can ‘rent’ for cheap or free on the cloud
-- Can use with the command line for free
-- Some vendors offer free online training
✘ Not Free
-- Constant releases
-- Can be deceptively hard to set up (time is money)
-- Don’t forget to turn it off if on the cloud!
-- GUI tools, support, training cost $$$
Ex. www.neo4j.com
Ex. www.neo4j.org
Practice Applying Concepts - NoSQL
NoSQL Applied
Line-of-Business
• ???
Social aggregators
• ???
Social Games
• ???
Product Catalogs
• ???
Log Files
• ???
NoSQL Applied
Line-of-Business
• RDBMS
• SQL Server
Social aggregators
• Graph
• Neo4j
Social Games
• Document
• MongoDB
Product Catalogs
• Key/Value
• Redis
Log Files
• Columnstore
• HBase
More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
Focus
How Best to Store your Data?
          Developer Complexity    Scalability   Cost
RDBMS     easy                    medium        low
NoSQL     medium                  big           high
Hadoop    hard                    huge          very high
Real World Big Data -- When do I use what?
✘ RDBMS 65%
✘ NoSQL 30%
✘ Hadoop 5%
Do the Cloud Vendors Understand Big Data Realities?
Cloud Big Data Vendors - Storage
AWS
✘ 5-10X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: Query as a Service
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: On-premise integration
AWS Console 17 Data Services
GCP Console 8 Data Services
Azure Console 15 Data Services
Cloud Offerings – Big Data
                                AWS                 Google                      Microsoft
Managed RDBMS                   RDS, Aurora         Cloud SQL                   Azure SQL
Data Warehouse                  Redshift            BigQuery                    Azure SQL Data Warehouse
NoSQL buckets                   S3, Glacier         Cloud Storage, Nearline     Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column   DynamoDB            Big Table, Cloud Datastore  Azure Tables
NoSQL Document                  MongoDB on EC2      MongoDB on GCE              DocumentDB
NoSQL Graph                     Neo4j on EC2        Neo4j on GCE                Neo4j on Azure
Hadoop                          Elastic MapReduce   DataProc                    Data Lake, HDInsight
Practice Applying Concepts – Real Cost of Storage Types
Cloud NoSQL Applied – AWS
Line-of-Business
Social aggregators
Social Games
Product Catalogs
Log Files
Cloud NoSQL Applied – AWS
Line-of-Business
• RDBMS
• RDS
Social aggregators
• Graph
• Neo4j
Social Games
• Document
• MongoDB
Product Catalogs
• Key/Value
• DynamoDB
Log Files
• Stream or Hadoop
• Kinesis or EMR
The fastest growing cloud-based Big Data products are … ???
The fastest growing cloud-based Big Data products are … Relational
When do I use …?
✘ Hadoop
✘ NoSQL
✘ Big Relational
Practice Applying Concepts – Real Cost of Storage Types
Reasons to use Big Relational Cloud Services
Cloud Vendors – AWS / Developers / DevOps
Cloud Vendors – GCP / Developers / DevOps
Reasons to use Big Relational Cloud Services
Cloud Vendors – AWS
✘ Aurora – RDBMS up to 64 TB
✘ Redshift – $1k USD / 1 TB / year
✘ Rich partner ecosystem – ETL
✘ Integration with AWS products
Developers
✘ Most know RDBMS query patterns
✘ Many know basic administration
✘ Many know query optimization
DevOps
✘ Most know RDBMS administration
✘ Many know basic RDBMS queries
Cloud Vendors – GCP
✘ BigQuery – familiar SQL queries
✘ No hassle streaming ingest
✘ No hassle pay-as-you-go
✘ Zero administration
Developers
✘ Most know coding language patterns to interact with RDBMS systems
DevOps
✘ Familiar RDBMS security patterns
✘ Familiar auditing
✘ Partner tooling integration
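The two AWS figures on this slide (Aurora up to 64 TB, Redshift at roughly $1k USD per TB per year) lend themselves to a quick back-of-envelope check. A minimal sketch: the constants are the slide's 2016 ballparks, not current pricing, and the function names and sample dataset sizes are hypothetical.

```python
# Ballpark figures quoted on the slide (2016); verify against current vendor pricing.
AURORA_MAX_TB = 64                 # "Aurora – RDBMS up to 64 TB"
REDSHIFT_USD_PER_TB_YEAR = 1_000   # "Redshift – $1k USD / 1 TB / year"


def fits_in_aurora(volume_tb: float) -> bool:
    """True if the dataset is within the Aurora size limit quoted on the slide."""
    return volume_tb <= AURORA_MAX_TB


def redshift_yearly_cost_usd(volume_tb: float) -> float:
    """Rough yearly Redshift cost, assuming a flat per-TB rate."""
    return volume_tb * REDSHIFT_USD_PER_TB_YEAR


# Hypothetical dataset sizes, in TB.
for volume_tb in (5, 64, 200):
    print(
        f"{volume_tb} TB: fits in Aurora = {fits_in_aurora(volume_tb)}, "
        f"Redshift ~ ${redshift_yearly_cost_usd(volume_tb):,.0f}/year"
    )
```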
My top Big Data Cloud Services
ETL is 75% of all Big Data Projects Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
2. Data Pipelines Build vs. Buy
Pattern 2
✘ How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds
Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
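The Veracity question ("verification on ingest needed?") can be made concrete with a small check that runs before records are loaded. A minimal sketch, assuming records arrive as dictionaries; the function name, the field names and the two rules (required fields present, no duplicate ids within a batch) are hypothetical choices, not a prescribed pipeline design.

```python
def verify_batch(records, required_fields, id_field="id"):
    """Split a batch into accepted records and (record, reason) rejections.

    A record is rejected if the id or any required field is missing or empty,
    or if its id was already seen earlier in the same batch.
    """
    accepted, rejected, seen_ids = [], [], set()
    for record in records:
        fields_to_check = [id_field, *required_fields]
        missing = [f for f in fields_to_check if record.get(f) in (None, "")]
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif record[id_field] in seen_ids:
            rejected.append((record, "duplicate id"))
        else:
            seen_ids.add(record[id_field])
            accepted.append(record)
    return accepted, rejected


# Hypothetical batch: one good record, one duplicate, one with an empty field.
batch = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},
    {"id": 2, "name": ""},
]
good, bad = verify_batch(batch, required_fields=["name"])
print(len(good), "accepted;", [reason for _, reason in bad])
```

Keeping the rejection reason next to each rejected record makes it easier to measure data quality during the evaluation phase rather than after load.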
Together How does your data pipeline flow?
Considering …
✘ Initial Load/Transform
✘ Data Quality
✘ Batch vs. Stream
Pipeline Phases
Phase 0 Eval Current Data - Quality & Quantity
Phase 1 Get New Data - Free or Premium
Phase 2 Build MVP & Forecast volume and growth
Phase 3 Load test at scale
Phase 4 Deploy – secure, audit and monitor
Cloud Big Data Vendors - ETL
AWS
✘ 5X market share of next competitor
✘ Notable: Many, strong ETL Partners
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: DataFlow requires Java or Python developers
Azure
✘ Difficulty with scale
✘ Best tooling integration
✘ Notable: Nothing
How Best to Ingest and ETL your Data?
          Developer Complexity    Scalability   Cost
RDBMS     medium                  medium        low
NoSQL     medium                  big           high
Hadoop    hard                    huge          very high
Considering …
✘ Initial Load/Transform
✘ Data Quality
✘ Batch vs. Stream
Building a Streaming Pipeline
✘ Stream
✘ Interval
✘ Window
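One way to read the three terms on this slide: a stream of timestamped events is cut at a fixed interval into windows, and each window is aggregated on its own. A minimal sketch of that idea using fixed, non-overlapping (tumbling) windows in plain Python; the slide only names the terms, so this interpretation, the function name and the sample events are assumptions, and managed services such as Kinesis or DataFlow offer their own windowing APIs.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key inside fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start: {key: count}} ordered by window start.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Every timestamp maps to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return {start: dict(per_key) for start, per_key in sorted(counts.items())}


# Hypothetical stream: (seconds since start, event type), with a 10 second interval.
stream = [(0, "click"), (3, "click"), (5, "view"), (12, "click")]
print(tumbling_window_counts(stream, window_seconds=10))
```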
Near Real-time Streams: Load Test All The Things
Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis Firehose
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Stream Analytics integration with other products
Recommend
More recommend