How am I going to skim through these data…?
Trends
• Computers keep getting faster
• But data grows faster yet!
  – Remember? BIG DATA!
• Queries are becoming more complex
  – Remember? ANALYTICS!
Analytic Queries
• Analytic queries typically involve aggregates
• Simple query – report the total sales of each region
    SELECT SUM(S.sales)
    FROM SALES S
    GROUP BY region
• A CUBE operator in OLAP – allows users to drill down or roll up between the nodes of the data cube
    SELECT SUM(S.sales)
    FROM SALES S
    GROUP BY CUBE(pid, locid, timeid)
  Cube lattice: {pid, locid, timeid}
                {pid, locid}  {pid, timeid}  {locid, timeid}
                {pid}  {locid}  {timeid}
                { }
• A query involving multiple aggregates – find the average quantity supplied by suppliers of a particular part
    SELECT AVG(quantity)
    FROM (SELECT supp, part, SUM(quantity) AS quantity
          FROM lineitem
          WHERE part = 10
          GROUP BY supp, part);
System perspective
• Aggregation queries
  – read a large amount of data
  – take a long time to compute
  – return a very small final result
• Answers are, or are derived from, summary data
User perspective
• “Big Picture” analytics
  – Decision makers want to know something about some data quickly
  – Precise answers typically not required; approximate results are ok
  – Real-time interaction and control over processing
  – Visually oriented interface
[Illustration: a raw table of numbers vs. a visual presentation of the same data]
    Time     System 1   System 2   System 3
    1.0000   3.01325    4.32445     7.5654
    2.0000   4.54673    6.56784     8.6562
    3.0000   5.46571    6.87658    10.3343
Challenge
Mismatch between user needs and system functionality
Drawbacks of Current Systems
• Only exact answers are available
  – A losing proposition as data volume grows
  – Hardware improvements not sufficient
• HCI solution: interactive tools don’t do big jobs
  – E.g., spreadsheet programs (1M-row, 16K-column limit)
• Systems solution: big jobs aren’t interactive
  – No user feedback or control in big DBMS queries (“back to the 60’s”)
  – Long processing time
  – Fundamental mismatch with preferred modes of HCI
• Best solutions to date – precompute (store answers of queries beforehand), e.g., OLAP
  – Don’t handle ad hoc queries or data sets well
Desirable features for Big Picture Analytics
• Early (approximate) answers with guarantees!
• Refinement over time
• Interaction and ad-hoc control (human in the loop)
[Chart: percentage of the final answer (up to 100%) vs. time, comparing an “Online” curve that rises toward 100% quickly with a “Traditional” curve that delivers the answer only at completion]
Did you see any problem with “Online”?
Example (Conventional)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 1 second … 30 seconds … 5 minutes: still no answer from the conventional database

Example (Conventional)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 30 minutes: Avg Stock Price = 1000 (conventional database)
Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 1 second, with online aggregation: Avg Stock Price = $2031 ± $523 (confidence: 90%; sampling progress: 5%)

Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 30 seconds, with online aggregation: Avg Stock Price = $1890 ± $420 (confidence: 95%; sampling progress: 15%)

Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 5 minutes, with online aggregation: Avg Stock Price = $1150 ± $210 (confidence: 97%; sampling progress: 40%)

Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 30 minutes, with online aggregation: Avg Stock Price = $1040 ± $70 (confidence: 99%; sampling progress: 95%)
Example: Online Aggregation
• Additional features: Speed up, Slow down, Terminate
Example: Online visualization

Example: Browsing
Key benefit: “Premature” termination
• If an acceptably accurate answer is reached quickly, the query can be aborted
• After 5 minutes with online aggregation: Avg Stock Price = $1150 ± $210 (confidence: 97%; sampling progress: 40%), so the user can stop early while the conventional database is still running
Why Stop Early??
• Save human time (30 min vs 5 min)
  – Precise vs estimated answers
• For exploratory applications
• Save machine time
  – Save cost $$
• Very important when dealing with BIG DATA in the cloud
  – Pay for what you use (users need to justify the cost to the organization)
Analytic queries are costly!
• QphH = Query-per-Hour Performance
Solution: Online aggregation
• Users must get continual feedback on the results of processing
  – Observe the progress of their queries
  – Give continually improving partial results: aggregates have a running output and a confidence interval (see the sketch below)
• Control execution on-the-fly
[Display: running estimate of Average Sales with a confidence level (85%–98%) and a Sampling Progress bar (15%–40%); the interval tightens from $22,131 ± $523 toward $21,712 ± $47 as sampling progresses]
For a retailer, an approximate result such as $21,712 ± $47 can provide a good estimate of its daily sales statistics, and it is more cost-effective.
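A minimal sketch of the running-estimate idea (my illustration, not the system's actual implementation): rows are assumed to arrive in random order, and after each batch the code reports a running average with a plain normal-approximation (CLT) confidence interval, mimicking the display above. The data standing in for the nasdaq-db prices is made up.

import math, random, statistics

def online_average(stream, batch=1_000, z=1.96):   # z = 1.96 ~ 95% confidence
    """Yield (running mean, CI half-width, rows seen) after every `batch` rows."""
    seen = []
    for i, value in enumerate(stream, 1):
        seen.append(value)
        if i % batch == 0:
            mean = statistics.fmean(seen)
            # normal-approximation half-width: z times the standard error of the mean
            half = z * statistics.stdev(seen) / math.sqrt(len(seen))
            yield mean, half, len(seen)

# hypothetical stand-in for the stock prices of company 'abc'
random.seed(0)
prices = [1_000 + random.gauss(0, 200) for _ in range(50_000)]
random.shuffle(prices)              # random order: every prefix is a random sample
for mean, half, n in online_average(prices):
    print(f"after {n:>6} rows: avg = {mean:7.1f} +/- {half:5.1f}")

The half-width shrinks roughly as 1/sqrt(n), which is why the interval tightens quickly at first and then more slowly, as in the example slides.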
Statistical estimation
• Users do not need to specify a stopping condition a priori
• The interface is easier for users with no statistical background
• It requires more powerful statistical estimation techniques (Hoeffding’s inequality versus Chebyshev’s inequality; see the comparison below)
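To see why the choice of bound matters, here is a small comparison (my addition, not from the slides): for values known to lie in a range [lo, hi], Hoeffding's inequality gives a noticeably tighter confidence-interval half-width than a worst-case Chebyshev bound at the same sample size and confidence level.

import math

def hoeffding_half_width(n, lo, hi, delta):
    # P(|mean - mu| >= eps) <= 2*exp(-2*n*eps^2/(hi-lo)^2); solve for eps at probability delta
    return (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))

def chebyshev_half_width(n, lo, hi, delta):
    # P(|mean - mu| >= eps) <= sigma^2/(n*eps^2), using the worst-case sigma = (hi-lo)/2
    return (hi - lo) / (2 * math.sqrt(n * delta))

for n in (100, 10_000, 1_000_000):
    h = hoeffding_half_width(n, 0, 1, delta=0.05)
    c = chebyshev_half_width(n, 0, 1, delta=0.05)
    print(f"n={n:>9}: Hoeffding +/-{h:.4f}  vs  Chebyshev +/-{c:.4f}")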
Usability goals
• Continuous observation
• Control of time/precision
• Control of fairness/partiality
Performance goals
• Minimum time to accuracy
  – produce a useful estimate of the final answer ASAP
• Minimum time to completion
  – secondary goal; assume the user will terminate processing long before the final answer is produced
• Pacing
  – guarantee a smooth and continuously improving display
What are the tools??
Random access to data
• We need to retrieve data in random order to produce meaningful statistical estimates
• At any time, the input to the query is a sample
• The input grows over time until
  – the query is terminated prematurely, or
  – all data has been examined
Sampling – design issues
• Granularity of sample
  – Record-level: high I/O cost
  – Block-level: high variability from clustering (see the mini-experiment below)
• Types of sample
  – Often a simple random sample (SRS)
  – With or without replacement?
• Data structures from which to sample
  – Files or relational tables
  – Indexes (B+ trees, etc.)
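A small illustrative experiment (my addition, with synthetic data) on the granularity trade-off: a block-level sample is cheaper in I/O, but when values are clustered by block the block-level estimate of the mean is much noisier than a record-level SRS of the same number of rows. random.sample draws without replacement; random.choices would give a with-replacement sample instead.

import random, statistics

random.seed(1)
ROWS_PER_BLOCK, NUM_BLOCKS = 100, 1_000
# clustered table: rows within one block are similar (e.g., loaded region by region)
blocks = [[random.gauss(block_mean, 5) for _ in range(ROWS_PER_BLOCK)]
          for block_mean in (random.gauss(100, 30) for _ in range(NUM_BLOCKS))]
rows = [v for b in blocks for v in b]

def row_level_estimate(k):            # SRS of k individual rows
    return statistics.fmean(random.sample(rows, k))

def block_level_estimate(k):          # SRS of k/ROWS_PER_BLOCK whole blocks
    chosen = random.sample(blocks, k // ROWS_PER_BLOCK)
    return statistics.fmean(v for b in chosen for v in b)

trials = 200
row_est = [row_level_estimate(2_000) for _ in range(trials)]
blk_est = [block_level_estimate(2_000) for _ in range(trials)]
print("true mean              :", round(statistics.fmean(rows), 2))
print("row-level   est. stdev :", round(statistics.stdev(row_est), 2))
print("block-level est. stdev :", round(statistics.stdev(blk_est), 2))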
Row-level sampling techniques
• Maintain the file in (pseudo) random order
  – Sampling = scan
  – Is the file initially in random order?
    • Statistical tests needed: e.g., runs test, Kolmogorov-Smirnov test
  – Can start scans from random positions
  – Best I/O behavior
  – Must “freshen” the ordering (online reorganization)
• On-the-fly sampling
  – Index scan via an index on a “random” column
    • Indexed attributes must be different from (and not correlated with) the aggregated attributes, e.g., name is not correlated with salary
  – Else get a random page, then a row within the page (see the sketch below)
    • Less efficient
    • Problem: variable number of records per page
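A sketch of the "random page, then row within page" approach (my illustration; read_page, MAX_ROWS_PER_PAGE, and the toy heap file are hypothetical). The acceptance/rejection step compensates for the variable number of records per page, so rows on half-empty pages are not over-represented.

import random

MAX_ROWS_PER_PAGE = 100   # assumed upper bound on rows that fit on one page

def sample_one_row(read_page, num_pages):
    """Return one uniformly random row; read_page(i) -> list of rows on page i."""
    while True:
        page = read_page(random.randrange(num_pages))     # one random page I/O
        if not page:
            continue                                      # empty page: try again
        # accept this page with probability (#rows on page) / MAX_ROWS_PER_PAGE,
        # which cancels the bias toward rows living on sparsely filled pages
        if random.random() < len(page) / MAX_ROWS_PER_PAGE:
            return random.choice(page)

# toy "heap file" with a variable number of rows per page
random.seed(2)
heap = [[(p, r) for r in range(random.randint(1, MAX_ROWS_PER_PAGE))]
        for p in range(500)]
print([sample_one_row(lambda i: heap[i], len(heap)) for _ in range(5)])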
Sampling from Index
[Figure: a B+-tree with root key 17, internal nodes (5, 13) and (27, 30), and leaf entries 2*, 3*, 5*, 7*, 8*, 14*, 16*, 22*, 24*, 27*, 29*, 33*, 34*, 38*, 39*]
(see the sketch below)
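One way to draw a random record through a B+-tree like the one above (my illustration, in the spirit of Olken-style acceptance/rejection sampling from B+-trees; the node layout and the MAX_FANOUT bound are assumptions): a naive uniform root-to-leaf descent is biased toward records under low-fanout subtrees, so each step is accepted only with probability fanout/MAX_FANOUT and the descent restarts otherwise, which makes the accepted record uniform.

import random

MAX_FANOUT = 5            # assumed upper bound on a node's fanout / leaf size

def sample_from_btree(root):
    """Return one uniformly random leaf entry (rejections keep the draw unbiased)."""
    while True:
        node, rejected = root, False
        while isinstance(node[0], list):                  # still at an internal node
            if random.random() >= len(node) / MAX_FANOUT:
                rejected = True                           # under-full node: reject and
                break                                     # restart the whole descent
            node = random.choice(node)                    # descend into a random child
        if not rejected and random.random() < len(node) / MAX_FANOUT:
            return random.choice(node)                    # random entry within the leaf

# a toy tree using the figure's leaf keys (structure simplified)
tree = [[[2, 3], [5, 7, 8], [14, 16]], [[22, 24, 27, 29], [33, 34, 38, 39]]]
random.seed(3)
print(sorted(sample_from_btree(tree) for _ in range(10)))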
How expensive is row-level sampling?
[Plot: pages fetched (%) as a function of sampling rate (%)]
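The shape of such a curve can be approximated analytically. A rough back-of-the-envelope (my addition, using the standard independence approximation for the expected number of distinct pages touched; the table size is hypothetical) shows that even sampling a few percent of the rows touches most of the table's pages:

# if k rows are sampled at random from a table spread evenly over P pages, the expected
# fraction of distinct pages touched is about 1 - (1 - 1/P)**k
P, ROWS_PER_PAGE = 10_000, 100          # hypothetical table: 1M rows on 10k pages
total_rows = P * ROWS_PER_PAGE
for rate in (0.002, 0.005, 0.01, 0.02, 0.05):
    k = int(rate * total_rows)          # number of sampled rows
    frac_pages = 1 - (1 - 1 / P) ** k
    print(f"sampling rate {rate:5.1%}: ~{frac_pages:6.1%} of pages fetched")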
Group By operation
• Aggregate queries involve “grouping”, e.g., count of grades of students, average sales per month, etc.
• How are “group-by” queries processed?
  – Sort on the “group-by” attributes, e.g., sort by grade, then count
  – Hash on the “group-by” attributes
    • Same attribute values hash to the same bucket, e.g., students with the same grade are grouped together
Non-blocking GROUP BY and DISTINCT
• Blocking operator
  – Cannot produce any output until it has consumed its entire input
• Sorting is a blocking algorithm
  – only one group is computed at a time, and only after sorting completes
• Hashing is non-blocking, but …
  – the hash table needs to fit in memory for good performance
(see the sketch below)
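A minimal sketch (my illustration) of a non-blocking, hash-based GROUP BY: a dictionary of running (sum, count) pairs is updated as tuples stream in, so per-group running averages can be reported at any time, whereas a sort-based plan cannot emit anything until the whole input has been sorted. The data and the reporting interval are made up.

from collections import defaultdict

def running_group_avg(tuples, report_every=3):
    acc = defaultdict(lambda: [0.0, 0])          # group -> [running sum, running count]
    for i, (group, value) in enumerate(tuples, 1):
        acc[group][0] += value
        acc[group][1] += 1
        if i % report_every == 0:                # partial results, refined over time
            yield {g: s / c for g, (s, c) in acc.items()}

sales = [("east", 10), ("west", 40), ("east", 20), ("east", 30), ("west", 60), ("west", 50)]
for snapshot in running_group_avg(sales):
    print(snapshot)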
Index striding
• For a fair GROUP BY:
  – Want to update all groups
    • Challenging for groups with a small number of records
  – Want a random tuple from Group 1, a random tuple from Group 2, ...
  – Idea (see the sketch below)
    • Index gives tuples from a single group
    • Open many cursors in the index, one per group
    • Fetch records in round-robin order
    • Can control speed by weighting the schedule
  – Gives fairness/partiality, info/speed match!
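A toy sketch of index striding (my illustration; the per-group "cursors" are plain Python iterators standing in for index cursors). Groups are visited in a weighted round-robin order, so every group keeps receiving new tuples, and raising a group's weight plays the role of the "Speed up" control mentioned earlier.

import itertools

def index_stride(cursors, weights=None):
    """cursors: {group: iterator over that group's tuples}; yields (group, tuple) fairly."""
    weights = weights or {g: 1 for g in cursors}
    live = dict(cursors)
    while live:
        for group in list(live):
            for _ in range(weights.get(group, 1)):   # weighted round-robin schedule
                try:
                    yield group, next(live[group])
                except StopIteration:
                    del live[group]                  # this group's cursor is exhausted
                    break

# hypothetical per-grade cursors over student records
cursors = {"A": iter(range(3)), "B": iter(range(5)), "C": iter(range(2))}
print(list(itertools.islice(index_stride(cursors, {"A": 1, "B": 2, "C": 1}), 8)))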
Conclusion
• Big data analytics is becoming increasingly important
• Online aggregation (OLA) is a promising direction
• OLA mechanisms
  – Random sampling
  – Non-blocking schemes
  – Index striding
• Online Hadoop is a variant of Hadoop that offers a flavor of “online aggregation”