CS 6453: StreamScope


CS 6453: StreamScope Soumya Basu March 7, 2017
Motivation
• Streaming data is everywhere!
  • Updates on Facebook
  • Shopping on Alibaba
  • Singles Day in China: 50 million events per second, 3-second latency
Streaming Problem
• Infinite stream of input events to process
• Want to produce output events in a timely fashion
• Stream processing is rather complex
• There are also key constraints (e.g. per-event state cannot be kept around)
Prior Works
• Many pieces of the StreamScope paper are borrowed from prior work:
  • SQL-like programming interface
  • Compiling and optimizing the program into a DAG
  • Scheduling tasks on a cluster
Related Work
• Extending batch processing systems to streaming:
  • MapReduce Online, S4, Storm
• Different design dimensions explored in stream processing:
  • Photon, Jetstream: geo-distribution
  • Naiad, Flink: dataflows with cycles
Where is this work new?
• Strong consistency, high scalability, and a cleaner abstraction
• The cleaner abstraction makes it easy to reason about many other problems
Model
• Every stream computation can be broken up into two types of components:
  • Streams: ordered lists of events
  • Vertices: read from many input streams and produce one output stream
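The two components above can be sketched in a few lines of Python. This is only an illustration of the model as the slide describes it, not StreamScope's API: the class names `Stream` and `Vertex`, the `run` method, and the fixed order in which inputs are drained are all assumptions made for the example.

```python
class Stream:
    """An ordered, append-only list of events; the index is the sequence number."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # sequence number of the new event

    def read(self, start=0):
        return self.events[start:]


class Vertex:
    """Reads from many input streams and writes to exactly one output stream."""

    def __init__(self, inputs, output, fn):
        self.inputs = inputs
        self.output = output
        self.fn = fn                       # (input index, event) -> list of output events
        self.cursors = [0] * len(inputs)   # next unread position per input

    def run(self):
        # Hypothetical policy: drain the inputs in a fixed order so the result is deterministic.
        for i, stream in enumerate(self.inputs):
            for event in stream.read(self.cursors[i]):
                for out in self.fn(i, event):
                    self.output.append(out)
            self.cursors[i] = len(stream.events)


clicks, views, merged = Stream(), Stream(), Stream()
clicks.append("ad-1")
views.append("page-7")
views.append("page-9")

# Tag every event with the name of the input it came from.
tagger = Vertex([clicks, views], merged, lambda i, e: [(("click", "view")[i], e)])
tagger.run()
first_pass = list(merged.events)

clicks.append("ad-2")
tagger.run()  # only the new event is consumed
```

Because a vertex only tracks a cursor per input, running it again picks up where it left off, which is the property the recovery slides rely on.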
Key Idea: Reliability
• Make both components reliable and consistent
  • Called rVertex and rStream in the paper
• Assumption on rVertex: the programs written are deterministic
• Reliability makes it easy to reason about and solve many other problems
Failure Recovery: rVertex
• Failure recovery has only two cases!
• Option 1: Take periodic snapshots during steady state
  • Upon failure, restore the most recent snapshot and read the next events from the stream
• Option 2: Run multiple copies of the same rVertex
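Option 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `CountingVertex`, its `snapshot`/`restore` methods, and the dictionary used as an idempotent output (keyed by sequence number) are hypothetical names chosen for the example. It relies on the determinism assumption from the previous slide: replaying the same input from the snapshot offset reproduces the same output.

```python
import copy


class CountingVertex:
    """Deterministic vertex: for each event, emits (event, times seen so far)."""

    def __init__(self):
        self.counts = {}
        self.offset = 0  # sequence number of the next input event to read

    def process(self, event):
        self.counts[event] = self.counts.get(event, 0) + 1
        self.offset += 1
        return (event, self.counts[event])

    def snapshot(self):
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    @classmethod
    def restore(cls, snap):
        vertex = cls()
        vertex.counts = dict(snap["counts"])
        vertex.offset = snap["offset"]
        return vertex


input_stream = ["a", "b", "a", "c", "a", "b"]

# Failure-free run, kept only as a reference.
reference = CountingVertex()
expected = [reference.process(e) for e in input_stream]

# Run with one snapshot after 3 events, then "crash" after 5 events.
out = {}  # output keyed by sequence number, so rewriting an entry is harmless
v = CountingVertex()
snap = None
for e in input_stream[:5]:
    seq = v.offset
    out.setdefault(seq, v.process(e))
    if v.offset == 3:
        snap = v.snapshot()

# The in-memory vertex is lost; restore it and replay from the snapshot offset.
v = CountingVertex.restore(snap)
for e in input_stream[v.offset:]:
    seq = v.offset
    out.setdefault(seq, v.process(e))

recovered = [out[i] for i in range(len(input_stream))]
```

Events 3 and 4 are processed twice here, once before the crash and once during replay, but since the vertex is deterministic the second result matches the first and the duplicate write changes nothing.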
Failure Recovery: rStream
• Asynchronously flush stream state to disk
• If the stream fails, recompute the recent events from the upstream rVertex
• Again, the determinism assumption is used heavily here!
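The same idea for streams, as a rough sketch: the `RStream` class below, with its `durable` list standing in for flushed disk state and its `recover` method, is a hypothetical illustration rather than the paper's design. For simplicity the upstream vertex is re-run from the start of its input; a real system would presumably restart it from a snapshot.

```python
class RStream:
    def __init__(self):
        self.durable = []  # stands in for events already flushed to disk
        self.buffer = []   # recent events that are still only in memory

    def append(self, event):
        self.buffer.append(event)

    def flush(self):
        # In a real system this would happen asynchronously.
        self.durable.extend(self.buffer)
        self.buffer.clear()

    def fail(self):
        self.buffer.clear()  # unflushed events are lost

    def recover(self, regenerate):
        # regenerate(start_seq) re-runs the upstream vertex from that position.
        self.buffer = list(regenerate(len(self.durable)))

    def events(self):
        return self.durable + self.buffer


def running_max(source):
    """Deterministic upstream vertex: emits the largest value seen so far."""
    best, out = None, []
    for x in source:
        best = x if best is None else max(best, x)
        out.append(best)
    return out


source = [3, 1, 4, 1, 5, 9]
produced = running_max(source)[:4]  # the upstream vertex has emitted 4 events

stream = RStream()
for event in produced[:2]:
    stream.append(event)
stream.flush()                      # first two events are durable
for event in produced[2:]:
    stream.append(event)            # next two are still in memory

stream.fail()
after_failure = stream.events()

stream.recover(lambda start: running_max(source)[start:4])
after_recovery = stream.events()
```

Recovery only works because `running_max` gives the same answer every time it is re-run on the same input, which is the determinism assumption the slide calls out.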
Stragglers
• A much larger problem in stream processing than in batch processing
• A straggler can keep slowing the pipeline long after its underlying cause is gone
• Handled the same way as failures:
  • Spin up a new rVertex in parallel with the original
  • Kill the slow one after a while
• Benefit: doesn’t sacrifice latency for slow events
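Running a duplicate rVertex next to a slow one is safe only if the output stream ignores whichever copy arrives second. One way to sketch that, assuming writes are tagged with sequence numbers (the `DedupStream` class and its `write` method are hypothetical, not taken from the paper):

```python
class DedupStream:
    """Output stream that accepts each sequence number at most once, in order."""

    def __init__(self):
        self.events = []

    def write(self, seq, event):
        if seq == len(self.events):
            self.events.append(event)
            return True
        return False  # already written by another copy, or out of order


def square(x):
    return x * x  # deterministic vertex logic


inputs = [1, 2, 3, 4, 5]
out = DedupStream()

# The original (slow) copy gets through two events before stalling.
for seq in range(2):
    out.write(seq, square(inputs[seq]))

# A duplicate copy is started from the beginning and runs to completion.
accepted = [out.write(seq, square(x)) for seq, x in enumerate(inputs)]

# The slow copy finally emits its third event; the stream ignores it.
late = out.write(2, square(inputs[2]))
```

Since both copies compute the same value for the same sequence number, it does not matter which one wins, and the slow copy can be killed at any point.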
Other Issues
• Handling bursts with rStream is trivial, since the underlying storage is on disk
• Maintenance is handled like a failure or a straggler
• Time travel and replay are possible by storing old rStream/rVertex state
Evaluation
Limitations
• Nondeterminism
  • Input streams are often nondeterministic (e.g. a click stream)
  • Reliability issues still exist in this system
  • Many consistency issues are folded into the determinism assumption
What Next?
• How do we handle nondeterminism efficiently?
  • Is there a way to capture all nondeterministic sources?
• Can the rVertex and rStream abstractions be extended to cycles as well?
  • What’s the inherent difficulty in doing that?
