
Challenges for Data Driven Systems
Eiko Yoneki
University of Cambridge Computer Laboratory

Quick History of Data Management
- 4000 BC
  - Manual recording
  - From tablets to papyrus… to paper
A. Payberah’2014

1800s - 1940s
- Punched cards (no fault-tolerance)
- Binary data
- 1911: IBM appeared

1940s - 1970s
- Magnetic tapes
- Batch transaction processing
- Hierarchical DBMS
- Network DBMS

1980s
- Relational DBMS (tables) and SQL
- ACID (Atomicity, Consistency, Isolation, Durability)
- Client-server computing
- Parallel processing

1990s - 2000s
- The Internet...

2010s
- NoSQL: BASE instead of ACID
  - Basic Availability, Soft-state, Eventual consistency
- Big Data is emerging!

Emergence of Big Data
- Increase of storage capacity
- Increase of processing capacity
- Availability of data
- Hardware and software technologies can manage an ocean of data

Challenge to Process Big Data
- Integration of complex data processing with programming, networking and storage
- A key vision for future computing

Big Data: Technologies
- Distributed infrastructure
  - Cloud (e.g. Infrastructure as a Service), cf. multi-core (parallel computing)
- Storage
  - Distributed storage (e.g. Amazon S3)
- Data model/indexing
  - High-performance schema-free database (e.g. NoSQL DB)
- Programming model
  - Distributed processing (e.g. MapReduce)
- Operations on big data
  - Analytics

Big Data: Technologies
- Distributed infrastructure
  - Cloud (e.g. Infrastructure as a Service)
- Storage
  - Distributed storage (e.g. Amazon S3)
- Data model/indexing
  - High-performance schema-free database (e.g. NoSQL DB)
- Programming model
  - Distributed processing (e.g. MapReduce)
- Operations on big data
  - Analytics: real-time analytics

Distributed Infrastructure
[Figure: distributed infrastructure diagram; recoverable labels: "manage", "Zookeeper, Chubby"]

Distributed Infrastructure
- Computing + storage, transparently
  - Cloud computing, Web 2.0
  - Scalability and fault tolerance
- Distributed servers
  - Amazon EC2, Google App Engine, Elastic, Azure
  - System? OS, customisations
  - Sizing? RAM/CPU based on tiered model
  - Storage? Quantity, type
- Distributed storage
  - Amazon S3
  - Hadoop Distributed File System (HDFS)
  - Google File System (GFS), BigTable…

Challenges
- Distribute and shard parts over machines
  - Traversal and reads must stay fast, so keep related data together
  - Scale out instead of scale up
- Avoid naïve hashing for sharding
  - Do not depend on the number of nodes
  - But adding/removing nodes is difficult
- Trade-offs: data locality, consistency, availability, read/write/search speed, latency, etc.
- Analytics requires both real-time and post-fact analytics, as well as incremental operation
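The sharding advice above (avoid naïve hashing, do not depend on the number of nodes) is commonly addressed with consistent hashing. The slides do not name a specific scheme, so the following is only a minimal sketch under that assumption. The class name `ConsistentHashRing`, the `replicas` parameter and the node names are hypothetical, chosen for illustration.

```python
import bisect
import hashlib


def _ring_position(key: str) -> int:
    """Map a string to a stable position on the hash ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            p = _ring_position(f"{node}#{i}")
            if p in self._owner:  # ignore the (very unlikely) collision
                continue
            self._owner[p] = node
            bisect.insort(self._points, p)

    def remove_node(self, node):
        for i in range(self.replicas):
            p = _ring_position(f"{node}#{i}")
            if self._owner.get(p) == node:
                del self._owner[p]
                self._points.pop(bisect.bisect_left(self._points, p))

    def lookup(self, key):
        """Return the node owning `key`: the first ring point clockwise of it."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _ring_position(key))
        return self._owner[self._points[idx % len(self._points)]]
```

With `hash(key) % N`, changing N remaps most keys. On the ring, adding a node only moves the keys that now fall to the new node, which is one way to ease the "adding/removing nodes is difficult" point.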

Data Model/Indexing
- Support large data
- Fast and flexible access to data
- Operate on distributed infrastructure
- Is an SQL database sufficient?

NoSQL (Schema Free) Database
- NoSQL database
  - Operate on distributed infrastructure
  - Based on key-value pairs (no predefined schema)
  - Fast and flexible
- Pros: scalable and fast
- Cons: fewer consistency/concurrency guarantees and weaker query support
- Implementations
  - MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase…

Big Data: Technologies
- Distributed infrastructure
  - Cloud (e.g. Infrastructure as a Service)
- Storage
  - Distributed storage (e.g. Amazon S3)
- Data model/indexing
  - High-performance schema-free database (e.g. NoSQL DB)
- Programming model
  - Distributed processing (e.g. MapReduce)
  - Stream processing
- Operations on big data
  - Analytics: real-time analytics

Distributed Processing
- Non-standard programming models
  - Not the traditional parallel programming models (e.g. MPI)
  - e.g. MapReduce
- Data (flow) parallel programming
  - e.g. MapReduce, Dryad/LINQ, NAIAD, Spark

MapReduce
- The target problem needs to be parallelisable
- The problem is split into a set of smaller pieces of code (map)
- These small pieces of code are then executed in parallel
- Results from the map operation are synthesised into a result for the original problem (reduce)
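The map/reduce steps above can be sketched on a single machine with a word count. This is an illustrative toy, not how Hadoop or any particular framework is implemented: the function names (`map_words`, `shuffle`, `reduce_counts`, `map_reduce`) are hypothetical, and a thread pool stands in for the cluster that would run map tasks in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across the map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks, workers=4):
    # Map tasks are independent of each other, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    groups = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in groups.items())
```

For example, `map_reduce(["big data big", "data stream"])` counts each word across both chunks. The problem fits because each chunk can be mapped without looking at any other chunk.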

CIEL: Dynamic Task Graph
- Data-dependent control flow
- CIEL: execution engine for dynamic task graphs (D. Murray et al., CIEL: a universal execution engine for distributed data-flow computing, NSDI 2011)

Stream Data Processing
- Stream data processing
  - Stream: infinite sequence of {tuple, timestamp} pairs
  - Continuous query: result of a query over an unbounded stream
- Database systems and data stream systems
  - Database
    - Mostly static data, ad-hoc one-time queries
    - Store and query
  - Data stream
    - Mostly transient data, continuous queries
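To make the notion of a continuous query concrete, here is a small sketch of one over a stream of (value, timestamp) pairs: a time-based sliding-window average that emits an updated result for every arriving element. The function name `sliding_average` and the `window_seconds` parameter are assumptions for illustration, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


def sliding_average(stream, window_seconds):
    """Continuous query over an unbounded stream.

    For each arriving (value, timestamp) pair, yield
    (timestamp, average of the values seen in the last window_seconds).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window = deque()  # (value, timestamp) pairs currently inside the window
    total = 0.0
    for value, ts in stream:
        window.append((value, ts))
        total += value
        # Evict elements that have fallen out of the time window.
        while window[0][1] <= ts - window_seconds:
            old_value, _ = window.popleft()
            total -= old_value
        yield ts, total / len(window)
```

Because it is a generator, the query never needs the whole stream in memory. It keeps only the current window and produces results incrementally, unlike a one-time query over stored data.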

Real-Time Data
- Departure from traditional static web pages
- New time-sensitive data is generated continuously
- Rich connections between entities
- Challenges:
  - High rate of updates
  - Continuous data mining: incremental data processing
  - Data consistency

Techniques for Analysis
- Applying these techniques: larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones
  - Classification
  - Cluster analysis
  - Crowd sourcing
  - Data fusion/integration
  - Data mining
  - Ensemble learning
  - Genetic algorithms
  - Machine learning
  - NLP
  - Neural networks
  - Network analysis
  - Optimisation
  - Pattern recognition
  - Predictive modelling
  - Regression
  - Sentiment analysis
  - Signal processing
  - Spatial analysis
  - Statistics
  - Supervised learning
  - Simulation
  - Time series analysis
  - Unsupervised learning
  - Visualisation

Typical Operation with Big Data
- Smart sampling of data
  - Reducing data while maintaining statistical properties
- Find similar items
  - Efficient multidimensional indexing
- Incremental updating of models
- Distributed linear algebra
  - Dealing with large sparse matrices
- Plus usual data mining, machine learning and statistics
  - Supervised (e.g. classification, regression)
  - Non-supervised (e.g. clustering…)
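One simple way to reduce data while keeping it statistically representative is uniform reservoir sampling, which draws a fixed-size sample in a single pass without knowing the data size in advance. The slides do not specify a sampling method, so this is only one possible reading of "smart sampling". The name `reservoir_sample` and its parameters are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable of unknown length.

    Uses one pass and O(k) memory. If the stream has fewer than k items,
    all of them are returned.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)  # uniform over 0..i inclusive
            if j < k:
                sample[j] = item   # keep the new item with probability k/(i+1)
    return sample
```

Passing a seeded `random.Random(seed)` as `rng` makes the sample reproducible, which helps when comparing models trained on the reduced data.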

Do we need new Algorithms?
- Can’t always store all data
  - Online/streaming algorithms
- Memory vs. disk becomes critical
  - Algorithms with limited passes
- N² is impossible
  - Approximate algorithms

Easy Cases
- Sorting
  - Google: 1 trillion items (1 PB) sorted in 6 hours
- Searching
  - Hashing and distributed search
- Random split of data to feed M/R operation
- BUT not all algorithms are parallelisable

More Complex Case: Stream Data
- Have we seen x before?
- Rolling average of previous K items
- Hot list: most frequent items seen so far
  - Probabilistically start tracking a new item
- Querying data streams
  - Continuous query

Big Graph Data
[Figure: examples of big graph data]
- Bipartite graph of appearing phrases in documents
- Airline graph
- Social networks
- Gene expression data
- Protein interactions [genomebiology.com]
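Two of the stream questions listed under "More Complex Case: Stream Data" ("Have we seen x before?" and the hot list) have well-known small-memory approximations. The sketch below uses a Bloom filter for the first and the Misra-Gries summary for the second. Note that Misra-Gries is deterministic and is swapped in here for the probabilistic tracking the slide hints at. The class and function names and the default sizes are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def misra_gries(stream, k):
    """Hot-list candidates using at most k - 1 counters.

    Every item occurring more than n/k times in a stream of length n is
    guaranteed to be among the returned keys; counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Both structures use memory that is fixed in advance rather than growing with the stream, at the price of approximate answers.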

How to Process Big Graph Data?
- Data-parallel (MapReduce, DryadLINQ)
  - Partitioned across several machines and replicated
  - No efficient random access to data
  - Graph algorithms are not fully parallelisable
- Parallel DB
  - Tabular format providing ACID properties
  - Allow data to be partitioned and processed in parallel
  - Graph does not map well to tabular format
- Modern NoSQL
  - Allow flexible structure (e.g. graph)
  - Trinity, Neo4J, HyperGraphDB
  - In-memory graph store for improving latency (e.g. Redis, Scalable Hyperlink Store (SHS))

Big Graph Data Processing
- MapReduce is ill-suited for graph processing
  - Many iterations are needed
  - Intermediate results at every iteration harm performance
- Graph-specific data parallel
  - Vertex-based iterative computation model
  - Iterative algorithms common in ML and graph analysis
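A vertex-based iterative computation can be sketched as a loop of supersteps in which every vertex sends messages along its out-edges and then updates its own value from the messages it received. PageRank is used here purely as an illustration, since the slides do not name an algorithm. The function name, the adjacency-dict input format and the default parameters are assumptions, and the sketch runs on one machine rather than across partitions.

```python
def pagerank_vertex_centric(graph, supersteps=20, damping=0.85):
    """Vertex-centric PageRank on a dict {vertex: [out-neighbours]}.

    Every neighbour must also appear as a key of `graph`.
    """
    n = len(graph)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in graph}  # messages received this superstep
        dangling = 0.0                   # rank held by vertices with no out-edges
        # "Send" phase: each vertex splits its rank among its out-neighbours.
        for v, neighbours in graph.items():
            if neighbours:
                share = rank[v] / len(neighbours)
                for u in neighbours:
                    inbox[u] += share
            else:
                dangling += rank[v]
        # "Update" phase: each vertex computes its new value from its inbox.
        rank = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in graph
        }
    return rank
```

Only the per-vertex state and the messages of the current superstep are kept between iterations, which avoids writing out full intermediate results at every iteration.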

Big Data Analytics Stack
[Figure: Big Data analytics stack, shown over two slides. A. Payberah’2014]

Topic Areas
- Session 1: Introduction
- Session 2: Programming in Data Centric Environment
- Session 3: Processing Models of Large-Scale Graph Data
- Session 4: Map/Reduce Hands-on Tutorial with EC2
- Session 5: Optimisation in Graph Data Processing + Guest lecture
- Session 6: Stream Data Processing + Guest lecture
- Session 7: Scheduling Irregular Tasks
- Session 8: Project study presentation

Summary
- R212 course web page: www.cl.cam.ac.uk/~ey204/teaching/ACS/R212_2014_2015
- Enjoy the course!
