  1. Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

  2. Agenda • End-to-End Data Delivery Platform • Ecosystem of Data Technologies • Mapping an End-to-End Solution • Case Studies • Pentaho Key Capabilities • Summary • Q&A

  3. End-to-End Data Delivery Platform
     • Ingest: Data Agnostic, Metadata Driven Ingestion, Data Orchestration
     • Process: Native Hadoop Integration, Scale Up & Scale Out, Blend Unstructured Data
     • Publish: Streamlined Data Refinery, Data Virtualization, Machine Learning
     • Report: Production Reporting, Custom Dashboards, Self-Service Dashboards, Interactive Analysis, Embedded Analytics

  4. Delivering Insight: diagram of the Ingest, Process, Publish, Report pipeline. Data Integration & Orchestration spans the pipeline; outputs include Custom Dashboards, Self-Service Dashboards, Interactive Analysis, and Production Reporting; the audience ranges from Data Engineers and Data Scientists to Data Analysts and Consumers.

  5. Big Data Ecosystem: (1) Relational Database, (2) Analytical Database, (3) NoSQL Databases, (4) HDFS MapReduce, (5) SQL on Hadoop, (6) Distributed Search, (7) Event Stream Processing (ESP), (8) Message Streaming, (9) Complex Event Processing (CEP)

  6. Data Source Attributes: each technology in the ecosystem is rated against four attributes, each with a three-point scale:
     • Volume (Data Size): Small / Medium / Large
     • Variety (Data Type): Structured / Semi-Structured / Unstructured
     • Velocity (Processing): Batch / Micro-Batch / Real-Time Streaming
     • Latency (Reporting): Scheduled / Prompted / Interactive
     The nine technologies from slide 5 are positioned against these scales on the slides that follow.

  7. Relational Database: MSFT SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2 (on this and the following slides, each attribute is rated Core Competency, Good Fit, Not Optimal, or Not Recommended)
     • Volume (Data Size): Operational databases for OLTP apps that require high transaction loads and user concurrency. They can "scale up" to larger data volumes but lack the ability to easily "scale out" for large data processing.
     • Variety (Data Type): Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.
     • Velocity (Processing): Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.
     • Latency (Reporting): Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.
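
To make the OLTP profile concrete, here is a minimal sketch of the small CRUD transactions described above, using Python's built-in sqlite3 purely as a stand-in for the relational engines named on the slide:

```python
import sqlite3

# SQLite stands in for the engines above; the same CRUD pattern
# applies to SQL Server, Oracle, MySQL, PostgreSQL, or DB2.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL
    )
""")

# Create / read / update / delete: the small, frequent point queries
# relational databases are optimized for.
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("acme", 99.50))
row = conn.execute("SELECT * FROM orders WHERE customer = ?", ("acme",)).fetchone()
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (120.00, row[0]))
conn.execute("DELETE FROM orders WHERE order_id = ?", (row[0],))
conn.commit()
conn.close()
```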

  8. Analytical Database: Columnar, In-Memory, MPP, OLAP; Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica
     • Volume (Data Size): Data warehouse/mart databases that support BI and advanced analytics workloads. MPP architecture gives the ability to "scale out" to large data volumes, at a financial cost.
     • Variety (Data Type): Structured schema of tables containing rows and columns of data, offering improved speed and scalability over RDBMSs but still limited to structured data.
     • Velocity (Processing): Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.
     • Latency (Reporting): All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance on analytic and interactive query workloads over large data.
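
For contrast with the OLTP sketch above, here is the scan-and-aggregate workload analytical databases are built for, again using sqlite3 only as a stand-in (a real MPP or columnar engine would distribute this query across nodes):

```python
import sqlite3

# On Teradata, Exadata, Netezza, Greenplum, or Vertica the same
# aggregate would run against columnar storage or MPP partitions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "widget", 10.0), ("east", "gadget", 25.0), ("west", "widget", 40.0)],
)

# A typical warehouse query: full-table scan plus group-by, the
# opposite of the small point queries OLTP systems are tuned for.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
conn.close()
```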

  9. NoSQL Database: MongoDB, HBase, Cassandra, MarkLogic, Couchbase
     • Volume (Data Size): Good for web applications: less web-app code to write, debug, and maintain. Scales out through horizontal scaling with auto-sharding of data to support millions of web-app users, compromising on consistency (ACID transactions) in favor of scale and uptime.
     • Variety (Data Type): Hierarchical, key-value, or document designs capture all types of data in a single location.
     • Velocity (Processing): Schema-less design allows rapid or continuous ingest at scale. A good storage option for the high-throughput, low-latency requirements of streaming applications that need real-time views of data; seen as a key component of the Lambda architecture.
     • Latency (Reporting): Low-level query languages, scarce skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.
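
A minimal sketch of the schema-less document model, using pymongo (the official Python driver for MongoDB); it assumes a server on localhost:27017, and the appdb database and events collection are hypothetical names:

```python
from pymongo import MongoClient  # pip install pymongo

# Host/port are placeholders for your deployment.
client = MongoClient("mongodb://localhost:27017")
events = client["appdb"]["events"]

# Schema-less documents: each record can carry different fields,
# which is the "variety" strength described on the slide.
events.insert_one({"user": "u42", "action": "click", "meta": {"page": "/home"}})
events.insert_one({"user": "u43", "action": "purchase", "amount": 19.99})

doc = events.find_one({"user": "u43"})
print(doc)
```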

  10. HDFS MapReduce: Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, MSFT HDInsight
     • Volume (Data Size): The Hadoop Distributed File System distributes and replicates file blocks, horizontally scaled across multiple commodity data nodes. MapReduce programming takes the compute to the data for batch processing of large data volumes.
     • Variety (Data Type): The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.
     • Velocity (Processing): HDFS and MapReduce are designed for distributing batch-processing workloads over large datasets, not for micro-batch or streaming use cases.
     • Latency (Reporting): MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.
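
A single-process sketch of the MapReduce programming model (the classic word count); on a real cluster the map and reduce phases run in parallel on the nodes holding the HDFS blocks:

```python
from collections import defaultdict
from itertools import chain

# Toy input standing in for file blocks spread across data nodes.
lines = ["big data on hadoop", "hadoop moves compute to data"]

def mapper(line):
    # Map phase: emit (key, 1) pairs for each word.
    return [(word, 1) for word in line.split()]

# Shuffle phase: group emitted values by key.
grouped = defaultdict(list)
for key, value in chain.from_iterable(mapper(line) for line in lines):
    grouped[key].append(value)

# Reduce phase: aggregate each key's values.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # e.g. {'big': 1, 'data': 2, 'hadoop': 2, ...}
```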

  11. SQL on Hadoop: Batch-oriented, Interactive, and In-Memory; Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL
     • Volume (Data Size): SQL queries run against a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, or Spark jobs and run over different storage formats such as HDFS and HBase.
     • Variety (Data Type): SQL was designed for structured data, but Hadoop files may contain nested, variable, or schema-less data. A SQL-on-Hadoop engine must be able to translate all these forms of data into flat relational data and optimize the queries (Impala/Drill).
     • Velocity (Processing): SQL-on-Hadoop engines require smart, advanced workload managers for multi-user workloads; they are designed for query processing, not stream processing.
     • Latency (Reporting): Suited to ad-hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s versus 1.6 minutes or more.
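
A sketch of issuing SQL against Hadoop from Python, assuming a reachable HiveServer2 endpoint; PyHive is one common client, and the host, port, username, and clickstream table below are all placeholders:

```python
from pyhive import hive  # pip install pyhive

# Connection details are hypothetical; point them at your cluster.
conn = hive.connect(host="hadoop-edge.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Standard SQL over HCatalog-registered tables; the engine (MapReduce,
# Tez, Impala, or Spark) translates it into distributed jobs.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM clickstream
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
cursor.close()
conn.close()
```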

  12. Distributed Search: Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch
     • Volume (Data Size): Search engines deal with large systems holding millions of documents and are designed for index and search query processing at scale, with clustering and a distributed architecture.
     • Variety (Data Type): Connectors index content from XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), FileSystem, Git, JDBC, JMS, Kafka, LDAP, MongoDB, neo4j, RabbitMQ, Redis, and Twitter.
     • Velocity (Processing): Elasticsearch scales to very large clusters with near-real-time search. Real-time web applications demand search results in near real time as new content is generated by users. There is some contention when handling concurrent search and index requests.
     • Latency (Reporting): Both engines use a key-value-pair query language. Solr is much more oriented toward text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not for interactive analytical reporting.
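
A sketch of index-then-search with the Elasticsearch Python client (8.x API), assuming a local single-node cluster; the index name and document are illustrative:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch (8.x)

# Assumes a node at this URL; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

# Index a document; it becomes searchable in near real time
# (after the index refresh interval, ~1 s by default).
es.index(index="articles", document={"title": "Big Data Ecosystem",
                                     "body": "search at scale with Lucene"})

# Full-text match query with Lucene relevance scoring.
resp = es.search(index="articles", query={"match": {"body": "search scale"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```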

  13. Message Streaming: Kafka, JMS, AMQP
     • Volume (Data Size): Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.
     • Variety (Data Type): Data sources include the Internet of Things, sensors, clickstreams, and transactional systems.
     • Velocity (Processing): Real-time streaming with high throughput for both publishing and subscribing, and constant performance even with many terabytes of stored messages. Designed for streaming, with a configurable batch size for brokering micro-batches of messages.
     • Latency (Reporting): Stream topics need to be processed by an additional technology, such as PDI, an ESP or CEP engine, or a query processing engine, for reporting.
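
A sketch of publishing and subscribing with the kafka-python client, assuming a broker on localhost:9092; the topic name and payload are illustrative:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Broker address is a placeholder for your cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Publish a sensor-style event; Kafka persists and brokers the stream.
producer.send("sensor-events", {"device": "d17", "temp_c": 21.4})
producer.flush()

# A downstream consumer (e.g., PDI or an ESP/CEP engine) subscribes
# to the same topic and processes the stream.
consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 s of silence
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```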
