StreamCloud: A Large Scale Data Streaming System Gulisano, Vincenzo - PowerPoint PPT Presentation

StreamCloud: A Large Scale Data Streaming System. Gulisano, Vincenzo; Jimenez-Peris, Ricardo; Patino-Martinez, Marta; Valduriez, Patrick. Rokey Ge

Outline: The need for Data Stream Processing; Current Stream Processing Engines; Introducing StreamCloud; Scalability, transparency, portability; Evaluations; My thoughts

Data Streaming: Applications that require real-time processing of data streams: financial data analysis, sensor network data, military command & control. Store-and-process approaches cannot meet the high-volume, low-latency requirements, so stream processing engines (SPEs) are used instead.

Data Streaming: A data stream is an infinite append-only sequence of tuples. Queries are defined over one or more data streams. Each query is a network of operators. Stateless: filter, map, union. Stateful: join, aggregate (computation over sliding windows of tuples).
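
To make the stateless/stateful distinction concrete, here is a minimal sketch of a stateful aggregate over a count-based sliding window. The function name `sliding_average` and the choice of a count-based (rather than time-based) window are assumptions made for illustration, not details taken from the slides.

```python
from collections import deque

def sliding_average(values, size):
    """Stateful operator: average over the last `size` tuples.

    The deque is the operator's state; a stateless operator such as
    filter or map would need no such memory between tuples.
    """
    window = deque(maxlen=size)  # oldest value is dropped automatically
    results = []
    for v in values:
        window.append(v)
        if len(window) == size:  # emit only once the window is full
            results.append(sum(window) / size)
    return results

averages = sliding_average([1, 2, 3, 4, 5], size=3)
```

Because the result for each tuple depends on earlier tuples, such an operator cannot be split across nodes as freely as a filter or map can.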

Data Streaming: Emerging applications are pushing the limits of SPEs: network monitoring, fraud detection. Distributed SPEs: distribute queries or operators to individual nodes. Parallel SPEs: run the same queries or operators on different nodes in parallel.

SPEs: Aurora [D.J. Abadi et al.]: splits the load across several nodes running the same operator, but data streams still go through single nodes, which become bottlenecks. Flux [M.A. Shah et al.]: Exchange parallel operator, specific to SPEs. Limited evaluations: simulated, limited scope.

StreamCloud: A data stream processing system. Scalability: scales with respect to the data stream volume. Transparency: parallelisation of queries without user intervention. Portability: independent of the underlying SPE.

Scalability: Query-cluster strategy. The full query is allocated to a subcluster of nodes. Nodes execute on a subset of the input. Communication across nodes is needed at least for each stateful operator.

Scalability: Operator-cluster strategy. Each operator is allocated to a set of nodes. Communication goes from the nodes of one subcluster to those of the next.

Scalability: Subquery-cluster strategy. Subquery: a stateful operator followed by stateless operators, or the whole query if there is no stateful operator. Each subquery is allocated to a set of nodes.
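
The subquery definition can be sketched for a simple linear chain of operators. The helper name `split_into_subqueries` and the `STATEFUL` set are hypothetical, and the handling of stateless operators that appear before the first stateful one (grouped into their own leading subquery) is an assumption, since the slide does not cover that case.

```python
# Operators the slides name as stateful; everything else is treated as stateless.
STATEFUL = {"join", "aggregate"}

def split_into_subqueries(operators):
    """Cut a linear operator chain at every stateful operator.

    Each subquery starts with a stateful operator and carries the stateless
    operators that follow it. A query with no stateful operator stays whole.
    """
    subqueries = []
    for op in operators:
        if op in STATEFUL or not subqueries:
            subqueries.append([op])    # start a new subquery
        else:
            subqueries[-1].append(op)  # stateless: stay with the current one
    return subqueries

plan = split_into_subqueries(["filter", "map", "aggregate", "map", "join", "filter"])
```

Each resulting group would then be assigned to its own set of nodes, so tuples only cross the network where one group hands over to the next.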

Scalability: Subquery-cluster strategy. Minimum number of communication steps. Minimum fan-out cost. Parallelization of stateless subqueries: each input tuple can be processed by any node, so the load balancer applies round-robin to distribute tuples.
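
Round-robin distribution for a stateless subquery can be sketched as follows. The function name `round_robin` and the list-per-node representation are illustrative assumptions; a real load balancer would forward tuples over the network rather than collect them in lists.

```python
from itertools import cycle

def round_robin(tuples, n_nodes):
    """Hand each tuple to the next node in turn; returns one list per node."""
    substreams = [[] for _ in range(n_nodes)]
    targets = cycle(range(n_nodes))  # 0, 1, ..., n_nodes - 1, 0, 1, ...
    for t in tuples:
        substreams[next(targets)].append(t)
    return substreams

parts = round_robin(range(7), 3)
```

This only works because a stateless operator gives the same result for a tuple regardless of which node sees it.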

Scalability: Parallelization of stateful subqueries. Join and Aggregate (group-by): each input stream is split by the load balancer (LB) into N substreams, using hash(A) % N to distribute tuples. Cartesian Join: each tuple is sent to M = sqrt(N) nodes, using % M to distribute.
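
Both routing rules can be sketched in a few lines. `hash_route` follows the hash(A) % N rule directly, with `zlib.crc32` swapped in as a hash that is stable across processes. `grid_targets` is only one possible reading of the Cartesian case: it assumes the N nodes form an M x M grid, that tuples of one input go to a whole row and tuples of the other to a whole column, and that a per-stream sequence number modulo M picks the row or column. All names and the grid layout are assumptions, not details confirmed by the slides.

```python
import math
import zlib

def hash_route(key, n_nodes):
    """Join/aggregate routing: hash(A) % N, so equal keys meet on one node."""
    return zlib.crc32(str(key).encode("utf-8")) % n_nodes

def grid_targets(seq_no, n_nodes, side):
    """Cartesian routing sketch: send a tuple to M = sqrt(N) nodes.

    Nodes are numbered row * M + col. A 'left' tuple goes to every node in
    one row, a 'right' tuple to every node in one column, so any left/right
    pair is compared on exactly one node (the row/column intersection).
    """
    m = math.isqrt(n_nodes)
    if m * m != n_nodes:
        raise ValueError("this sketch assumes N is a perfect square")
    index = seq_no % m
    if side == "left":
        return [index * m + col for col in range(m)]
    return [row * m + index for row in range(m)]

left_nodes = grid_targets(4, 9, "left")
right_nodes = grid_targets(2, 9, "right")
```

With N = 9, each tuple is replicated to 3 nodes instead of 9, which is where the M = sqrt(N) fan-out comes from.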

Scalability

Transparency: The result of the parallelized query should equal that of the non-parallel version. Input Merger (IM): takes timestamp-ordered substreams from the LBs and generates a single timestamp-ordered stream. Optimisations: merge stateful subqueries if they share the same aggregation method; merge union with IM, filter with LB.
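
The input merger's job can be sketched as a k-way merge of timestamp-ordered substreams. The tuple layout `(timestamp, payload)` and the function name `input_merge` are assumptions for illustration; the sketch uses Python's standard `heapq.merge` rather than anything specific to StreamCloud.

```python
import heapq

def input_merge(substreams):
    """Merge substreams that are each sorted by timestamp into one sorted stream.

    Tuples are assumed to be (timestamp, payload). heapq.merge is lazy: it
    looks at the head of every substream before emitting the smallest one.
    """
    return heapq.merge(*substreams, key=lambda t: t[0])

from_lb_a = [(1, "a"), (4, "d")]
from_lb_b = [(2, "b"), (3, "c"), (5, "e")]
merged = list(input_merge([from_lb_a, from_lb_b]))
```

On unbounded streams the same idea means the merger cannot emit a tuple until it has seen the head of every incoming substream, which is why ordered delivery from each load balancer matters.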

Evaluation: Targets: measure scalability with respect to the number of processors and the window size. Methodology: increasing input loads for different configurations; StreamCloud instances process tuples until they overload. Metrics: throughput (tuples/comparisons per second), CPU usage, queue length.

Evaluation setup: 60 nodes with 160 cores. Multiple instances of StreamCloud per node for multi-core nodes. Baselines: centralised SPE on one node; two StreamCloud instances on one node.

Evaluation plan: Scalability of each individual operator. Scalability of full queries. Comparison with query-cluster and operator-cluster strategies. Increase system size while maintaining a fixed window size to handle increased input load. Scalability in terms of number of instances per node.

Crazy charts

Crazy charts explained: Operators scale well. Subquery-cluster is 2.5 to 5 times better than query-cluster and operator-cluster. Scales with cores too. Scalability maximised!

My thoughts ++: Subquery-cluster strategy provides better scalability. Load balancer and input merger are implemented with standard stream operators. Detailed evaluations over a real implementation (albeit crazy charts).

My thoughts --: Other operators? (e.g. Bsort, ReSample) How does it handle network imperfections? Delayed, missing, out-of-order data. Broken node. Independence unproven: what about other SPEs? Evaluations do not contain comparison with other systems.

Questions?
