
Cloud Big Data Architectures Lynn Langit QCon Sao Paulo, Brazil - PowerPoint PPT Presentation

Cloud Big Data Architectures, Lynn Langit, QCon Sao Paulo, Brazil, 2016. About this Workshop: Real-world Cloud Scenarios w/AWS, Azure and GCP. 1. Big Data Solution Types 2. Data Pipelines 3. ETL and Visualization. Bonus (if time allows) 4.


  1. Cloud Big Data Architectures Lynn Langit QCon Sao Paulo, Brazil 2016

  2. About this Workshop: Real-world Cloud Scenarios w/AWS, Azure and GCP
     1. Big Data Solution Types
     2. Data Pipelines
     3. ETL and Visualization
     Bonus … (if time allows) 4.

  3. Save ALL of your Data

  4. What is the ACTUAL cost of:
     ✘ Saving all data
     ✘ Using newer technologies
     ✘ Going beyond relational

  5. About this Workshop: Real-world Cloud Scenarios w/AWS, Azure and GCP
     1. When to use which type of Big Data Solution
     2. The new world of Data Pipelines
     3. ETL and Visualization Practicalities
     Bonus … (if time allows) 4.

  6. 1. Big Data – Yes! But what kind?

  7. Pattern 1
     ✘ Which type(s) of Big Data work best?
       -- when to use Hadoop
       -- when to use NoSQL and which type, e.g. key-value, document, graph
       -- when to use Big Relational and what type of workload for hot, warm or cold data

  8. Choice … is good, right?

  9. When do I use …? ✘ Hadoop ✘ NoSQL ✘ Big Relational

  10. Size Matters

  11. One Vendor’s View

  12. Where is Hadoop Used?

  13. Hadoop is your LAST CHOICE
      ✘ Volume
        -- 10 TB or greater to start
        -- Growth of 25% YOY
        -- Where FROM
        -- Where TO
      ✘ Velocity and Variety
        -- Spark over Hive
        -- Kafka and Samza
        -- Complexity of ecosystem
        -- Cloudera knows best
      ✘ Veracity
        -- Pay, train and hire team
        -- Top $$$ for talent
        -- IF you can find it
        -- WATCH OUT for Cloud Vendors who promise ‘easy access’

  14. When do I use …? ✘ Hadoop ✘ NoSQL ✘ Big Relational

  15. 225 NoSQL Database Types to Choose From

  16. Let’s review some NoSQL concepts
      ✘ Key-Value: Redis, Riak, Aerospike
      ✘ Graph: Neo4j
      ✘ Document: MongoDB
      ✘ Wide-Column: Cassandra, HBase
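To make the four families on this slide concrete, here is a minimal sketch of one hypothetical player record shaped the way each family would typically hold it. Plain Python structures stand in for the databases; no Redis, Neo4j, MongoDB or Cassandra client is used, and every field name and the `friends_of` helper are assumptions made for illustration.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"player:42": json.dumps({"name": "Ana", "score": 1200})}

# Document: a nested, self-describing record that can be queried by field.
document = {
    "_id": 42,
    "name": "Ana",
    "score": 1200,
    "friends": [{"id": 7, "name": "Rui"}],
}

# Graph: nodes plus explicit, typed relationships between them.
nodes = {42: {"name": "Ana"}, 7: {"name": "Rui"}}
edges = [(42, "FRIEND_OF", 7)]

# Wide-column: a row key mapping to sparse columns grouped into families.
wide_column = {
    "player#42": {
        "profile": {"name": "Ana"},
        "stats": {"score": 1200, "last_level": 9},
    }
}


def friends_of(node_id):
    """Follow FRIEND_OF edges out of one node (a one-hop traversal)."""
    return [
        nodes[dst]["name"]
        for src, rel, dst in edges
        if src == node_id and rel == "FRIEND_OF"
    ]
```

The same facts appear in all four shapes; what changes is which question is cheap to ask: a lookup by key, a query on nested fields, a walk along relationships, or a scan of a few columns across many rows.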

  17. Key Questions - Storage
      ✘ Volume – how much now, what growth rate?
      ✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc.
      ✘ Velocity – batches, streams, both, what ingest rate?
      ✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?

  18. NoSQL Example
      ✘ Open Source is Free
        -- Rapid iteration, innovation
        -- Can start up for free (on premise)
        -- Can ‘rent’ for cheap or free on the cloud
        -- Can use with the command line for free
        -- Some vendors offer free online training
      ✘ Not Free
        -- Constant releases
        -- Can be deceptively hard to set up (time is money)
        -- Don’t forget to turn it off if on the cloud!
        -- GUI tools, support, training cost $$$
      Ex. www.neo4j.com, Ex. www.neo4j.org

  19. Practice Applying Concepts - NoSQL

  20. NoSQL Applied
      Use cases: Line-of-Business, Social aggregators, Social Games, Product Catalogs, Log Files
      Which store for each? ???

  21. NoSQL Applied
      Use cases: Line-of-Business, Social aggregators, Social Games, Product Catalogs, Log Files
      Suggested stores, in slide order:
      ✘ RDBMS: SQL Server
      ✘ Graph: Neo4j
      ✘ Document: MongoDB
      ✘ Key/Value, Columnstore: Redis, HBase

  22. More than NoSQL
      NoSQL
      ✘ Non-relational
      ✘ Can be optimized in-memory
      ✘ Eventually consistent
      ✘ Schema on Read
      ✘ Example: Aerospike
      NewSQL
      ✘ Relational plus more
      ✘ Often in-memory
      ✘ Some kind of SQL-layer
      ✘ Schema on Write
      ✘ Example: MemSQL
      U-SQL
      ✘ What???
      ✘ Microsoft’s universal SQL language
      ✘ Example: Azure Data Lake
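The slide contrasts "Schema on Read" with "Schema on Write" without showing the difference. A minimal sketch, assuming a two-field schema and in-memory lists in place of real stores: the `REQUIRED` schema and all three function names are hypothetical and are not the API of any product named on the slide.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED = {"user_id": int, "event": str}


def write_with_schema(table, record):
    """Schema on write: validate before storing; bad records are rejected."""
    for field, ftype in REQUIRED.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)


def write_raw(store, line):
    """Schema on read: keep the raw text exactly as it arrived."""
    store.append(line)


def read_with_schema(store):
    """Apply the schema at query time, skipping rows that do not fit."""
    rows = []
    for line in store:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(rec, dict):
            continue
        if all(isinstance(rec.get(f), t) for f, t in REQUIRED.items()):
            rows.append(rec)
    return rows
```

With schema on write the cost of bad data is paid at ingest, where the writer sees the error. With schema on read ingest never fails, but every reader has to handle rows that do not fit.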

  23. Focus

  24. How Best to Store your Data?
               Developer Complexity   Scalability   Cost
      RDBMS    easy                   medium        low
      NoSQL    medium                 big           high
      Hadoop   hard                   huge          very high

  25. Real World Big Data -- When do I use what? RDBMS 65%, NoSQL 30%, Hadoop 5%

  26. Do the Cloud Vendors Understand Big Data Realities?

  27. Cloud Big Data Vendors - Storage
      AWS
      ✘ 5-10X market share of next competitor
      ✘ Most complete offering
      ✘ Most mature offering
      ✘ Notable: Big Relational
      GCP
      ✘ Lean, mean and cheap
      ✘ Fastest player
      ✘ Requires top developers
      ✘ Notable: Query as a Service
      Azure
      ✘ Catching up
      ✘ Best tooling integration
      ✘ Notable: On-premise integration

  28. AWS Console: 17 Data Services

  29. GCP Console: 8 Data Services

  30. Azure Console: 15 Data Services

  31. Cloud Offerings – Big Data
                          AWS                 Google                    Microsoft
      Managed RDBMS       RDS, Aurora         Cloud SQL                 Azure SQL
      Data Warehouse      Redshift            BigQuery                  Azure SQL Data Warehouse
      NoSQL buckets       S3, Glacier         Cloud Storage, Nearline   Azure Blobs, StorSimple
      NoSQL Key-Value     DynamoDB            Big Table                 Azure Tables
      NoSQL Wide Column   -                   Cloud Datastore           -
      NoSQL Document      MongoDB on EC2      MongoDB on GCE            DocumentDB
      NoSQL Graph         Neo4j on EC2        Neo4j on GCE              Neo4j on Azure
      Hadoop              Elastic MapReduce   DataProc                  Data Lake, HDInsight

  32. Practice Applying Concepts – Real Cost of Storage Types

  33. Cloud NoSQL Applied – AWS
      Use cases: Line-of-Business, Social aggregators, Social Games, Product Catalogs, Log Files

  34. Cloud NoSQL Applied – AWS
      Use cases: Line-of-Business, Social aggregators, Social Games, Product Catalogs, Log Files
      Suggested stores, in slide order:
      ✘ RDBMS: RDS
      ✘ Graph: Neo4j
      ✘ Document: MongoDB
      ✘ Key/Value: DynamoDB
      ✘ Stream or Hadoop: Kinesis or EMR

  35. The fastest growing cloud-based Big Data products are … ???

  36. The fastest growing cloud-based Big Data products are … Relational

  37. When do I use …? ✘ Hadoop ✘ NoSQL ✘ Big Relational

  38. Practice Applying Concepts – Real Cost of Storage Types

  39. Reasons to use Big Relational Cloud Services
      AWS: Cloud Vendors, Developers, DevOps
      GCP: Cloud Vendors, Developers, DevOps

  40. Reasons to use Big Relational Cloud Services
      Cloud Vendors - AWS
      ✘ Aurora – RDBMS up to 64 TB
      ✘ Redshift - $1k USD / 1 TB / year
      ✘ Rich partner ecosystem – ETL
      ✘ Integration with AWS products
      Developers (AWS)
      ✘ Most know RDBMS query patterns
      ✘ Many know basic RDBMS queries
      ✘ Many know query optimization
      DevOps (AWS)
      ✘ Most know RDBMS administration
      ✘ Many know basic administration
      Cloud Vendors - GCP
      ✘ BigQuery – familiar SQL queries
      ✘ No hassle streaming ingest
      ✘ No hassle pay-as-you-go
      Developers (GCP)
      ✘ Most know coding language patterns to interact with RDBMS systems
      ✘ Partner tooling integration
      DevOps (GCP)
      ✘ Familiar RDBMS security patterns
      ✘ Familiar auditing
      ✘ Zero administration
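As a rough worked example of the Redshift figure on this slide (about $1k USD per TB per year), the sketch below projects yearly cost under compound growth. Pairing it with the 10 TB starting size and 25% YOY growth from the Hadoop slide is an assumption made here for illustration. The function name and parameters are hypothetical, and the price is the slide's 2016 figure, not a current quote.

```python
def yearly_storage_cost(start_tb, growth_rate, usd_per_tb_year, years):
    """Return (year, size_tb, cost_usd) tuples under compound yearly growth."""
    rows = []
    size = start_tb
    for year in range(1, years + 1):
        rows.append((year, size, size * usd_per_tb_year))
        size = size * (1 + growth_rate)  # next year's size
    return rows


# Assumed inputs: 10 TB to start, 25% growth per year, $1,000 per TB per year.
for year, size_tb, cost in yearly_storage_cost(10, 0.25, 1000, 3):
    print(f"year {year}: {size_tb} TB -> ${cost:,.0f}")
```

The estimate covers the per-TB charge only. It leaves out ETL, query and staffing costs, which the deck argues make up most of the real bill.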

  41. My top Big Data Cloud Services

  42. ETL is 75% of all Big Data Projects. Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.

  43. About this Workshop: Real-world Cloud Scenarios w/AWS, Azure and GCP
      1. When to use which type of Big Data Solution
      2. The new world of Data Pipelines
      3. ETL and Visualization Practicalities
      Bonus … (if time allows) 4.

  44. 2. Data Pipelines Build vs. Buy

  45. Pattern 2
      ✘ How to build optimized cloud-based data pipelines?
        -- Cloud-based ETL tools and processes
        -- includes load-testing patterns and security practices
        -- including connecting between different vendor clouds

  46. Key Questions – Ingestion and ETL
      ✘ Volume – how much and how fast, now and future?
      ✘ Variety – what type(s) of data, any pre-processing needed?
      ✘ Velocity – batches or streaming?
      ✘ Veracity – verification on ingest needed? new data needed?
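The veracity question above ("verification on ingest needed?") can be sketched as a small ingest step that rejects incomplete records and drops exact duplicates. This is one possible shape, not a method taken from the deck. The `id` and `amount` field names, the content-hash approach and the `ingest` function are all assumptions.

```python
import hashlib


def ingest(records, seen_hashes):
    """Return (accepted_records, stats) after basic checks on each record."""
    stats = {"accepted": 0, "duplicate": 0, "invalid": 0}
    accepted = []
    for rec in records:
        # Completeness check: both hypothetical fields must be present.
        if rec.get("id") is None or rec.get("amount") is None:
            stats["invalid"] += 1
            continue
        # Duplicate check: hash the record content in a stable key order.
        payload = repr(sorted(rec.items())).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen_hashes:
            stats["duplicate"] += 1
            continue
        seen_hashes.add(digest)
        accepted.append(rec)
        stats["accepted"] += 1
    return accepted, stats
```

Passing the same `seen_hashes` set across batches catches duplicates that arrive in different loads. The counters give a cheap data-quality measure to track from Phase 0 onward.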

  47. Together How does your data pipeline flow?

  48. Considering … ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream

  49. Pipeline Phases
      Phase 0: Eval Current Data - Quality & Quantity
      Phase 1: Get New Data - Free or Premium
      Phase 2: Build MVP & Forecast volume and growth
      Phase 3: Load test at scale
      Phase 4: Deploy – secure, audit and monitor

  50. Cloud Big Data Vendors - ETL
      AWS
      ✘ 5X market share of next competitor
      ✘ Notable: Many, strong ETL Partners
      GCP
      ✘ Lean, mean and cheap
      ✘ Fastest player
      ✘ Notable: DataFlow requires Java or Python developers
      Azure
      ✘ Difficulty with scale
      ✘ Best tooling integration
      ✘ Notable: Nothing

  51. How Best to Ingest and ETL your Data?
               Developer Complexity   Scalability   Cost
      RDBMS    medium                 medium        low
      NoSQL    medium                 big           high
      Hadoop   hard                   huge          very high

  52. Considering … ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream

  53. Building a Streaming Pipeline: Stream, Interval, Window
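The slide names stream, interval and window without defining them. A minimal sketch under one common reading: the stream is a sequence of timestamped events, the interval is the window length, and each window is fixed and non-overlapping (a tumbling window). The function name and event shape are assumptions. A managed service such as Kinesis or DataFlow would do this at scale, and this sketch does not use either API.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) tuples.
    Returns {window_start_seconds: {key: count}}.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}
```

A real streaming pipeline also has to decide what to do with late or out-of-order events. This sketch assigns each event to the window of its own timestamp, whatever its arrival order.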

  54. Near Real-time Streams: Load Test All The Things

  55. Key Questions - Streaming
      ✘ Volume – how much data now and predicted over next 12 months?
      ✘ Variety – what types of data now and future?
      ✘ Velocity – volume of input data / time now and near future?
      ✘ Veracity – volume of EXISTING data now

  56. Cloud Big Data Vendors - Streaming
      AWS
      ✘ 5X market share of next competitor
      ✘ Most complete offering
      ✘ Most mature offering
      ✘ Notable: Kinesis Firehose
      GCP
      ✘ Lean, mean and cheap
      ✘ Fastest player
      ✘ Requires top developers
      ✘ Notable: DataFlow flexible
      Azure
      ✘ Catching up
      ✘ Best tooling integration
      ✘ Notable: Stream Analytics integration with other products

  57. AWS Console: 17 Data Services

  58. GCP Console: 8 Data Services

  59. Azure Console: 15 Data Services
