How am I going to skim through these data…?
Trends
• Computers keep getting faster
• But data grows faster yet!
  – Remember? BIG DATA!
• Queries are becoming more complex
  – Remember? ANALYTICS!
Analytic Queries
• Analytic queries typically involve aggregates
• Simple query – report the total sales of each region
    SELECT SUM(S.sales)
    FROM SALES S
    GROUP BY region
• A CUBE operator in OLAP – allows users to drill down or roll up between the nodes of the data cube
    SELECT SUM(S.sales)
    FROM SALES S
    GROUP BY CUBE(pid, locid, timeid)
  Cube lattice: {pid, locid, timeid}
                {pid, locid}  {pid, timeid}  {locid, timeid}
                {pid}  {locid}  {timeid}
                { }
• A query involving multiple aggregates – find the average quantity supplied by suppliers of a particular part
    SELECT AVG(quantity)
    FROM (SELECT supp, part, SUM(quantity) AS quantity
          FROM lineitem
          WHERE part = 10
          GROUP BY supp, part);
System perspective
• Aggregation queries
  – read a large amount of data
  – take a long time to compute
  – return a very small final result
• Answers are, or are derived from, summary data
User perspective
• “Big Picture” analytics
  – Decision makers want to know something about some data quickly
  – Precise answers typically not required; approximate results are ok
  – Real-time interaction and control over processing
  – Visually oriented interface
[Illustration: a raw table of numbers vs. a visual presentation of the same data]
    Time     System 1   System 2   System 3
    1.0000   3.01325    4.32445     7.5654
    2.0000   4.54673    6.56784     8.6562
    3.0000   5.46571    6.87658    10.3343
Challenge
Mismatch between user needs and system functionality
Drawbacks of Current Systems
• Only exact answers are available
  – A losing proposition as data volume grows
  – Hardware improvements not sufficient
• HCI solution: interactive tools don’t do big jobs
  – E.g., spreadsheet programs (1M-row, 16K-column limit)
• Systems solution: big jobs aren’t interactive
  – No user feedback or control in big DBMS queries (“back to the 60’s”)
  – Long processing time
  – Fundamental mismatch with preferred modes of HCI
• Best solutions to date – precompute (store answers of queries beforehand), e.g., OLAP
  – Don’t handle ad hoc queries or data sets well
Desirable features for Big Picture Analytics
• Early (approximate) answers with guarantees!
• Refinement over time
• Interaction and ad-hoc control (human in the loop)
[Chart: percentage of the final answer (up to 100%) vs. time, comparing an “Online” curve that rises toward 100% quickly with a “Traditional” curve that delivers the answer only at completion]
Did you see any problem with “Online”?
Example (Conventional)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 1 second … 30 seconds … 5 minutes: still no answer from the conventional database

Example (Conventional)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 30 minutes: Avg Stock Price = 1000 (conventional database)
Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 1 second, with online aggregation: Avg Stock Price = $2031 ± $523 (confidence: 90%; sampling progress: 5%)

Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 30 seconds, with online aggregation: Avg Stock Price = $1890 ± $420 (confidence: 95%; sampling progress: 15%)

Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 5 minutes, with online aggregation: Avg Stock Price = $1150 ± $210 (confidence: 97%; sampling progress: 40%)

Example (Online)
• Find the average stock price from the nasdaq-db where company = ‘abc’
• After 30 minutes, with online aggregation: Avg Stock Price = $1040 ± $70 (confidence: 99%; sampling progress: 95%)
Example: Online Aggregation
• Additional features: Speed up, Slow down, Terminate
Example: Online visualization

Example: Browsing
Key benefit: “Premature” termination
• If an acceptably accurate answer is reached quickly, the query can be aborted
• After 5 minutes with online aggregation: Avg Stock Price = $1150 ± $210 (confidence: 97%; sampling progress: 40%), so the user can stop early while the conventional database is still running
Why Stop Early??
• Save human time (30 min vs 5 min)
  – Precise vs estimated answers
• For exploratory applications
• Save machine time
  – Save cost $$
• Very important when dealing with BIG DATA in the cloud
  – Pay for what you use (users need to justify the cost to the organization)
Analytic queries are costly!
• QphH = Query-per-Hour Performance
Solution: Online aggregation
• Users must get continual feedback on the results of processing
  – Observe the progress of their queries
  – Give continually improving partial results: aggregates have a running output and a confidence interval (see the sketch below)
• Control execution on-the-fly
[Display: running estimate of Average Sales with a confidence level (85%–98%) and a Sampling Progress bar (15%–40%); the interval tightens from $22,131 ± $523 toward $21,712 ± $47 as sampling progresses]
For a retailer, an approximate result such as $21,712 ± $47 can provide a good estimate of its daily sales statistics, and it is more cost-effective.
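A minimal sketch of the running-estimate idea (my illustration, not the system's actual implementation): rows are assumed to arrive in random order, and after each batch the code reports a running average with a plain normal-approximation (CLT) confidence interval, mimicking the display above. The data standing in for the nasdaq-db prices is made up.

import math, random, statistics

def online_average(stream, batch=1_000, z=1.96):   # z = 1.96 ~ 95% confidence
    """Yield (running mean, CI half-width, rows seen) after every `batch` rows."""
    seen = []
    for i, value in enumerate(stream, 1):
        seen.append(value)
        if i % batch == 0:
            mean = statistics.fmean(seen)
            # normal-approximation half-width: z times the standard error of the mean
            half = z * statistics.stdev(seen) / math.sqrt(len(seen))
            yield mean, half, len(seen)

# hypothetical stand-in for the stock prices of company 'abc'
random.seed(0)
prices = [1_000 + random.gauss(0, 200) for _ in range(50_000)]
random.shuffle(prices)              # random order: every prefix is a random sample
for mean, half, n in online_average(prices):
    print(f"after {n:>6} rows: avg = {mean:7.1f} +/- {half:5.1f}")

The half-width shrinks roughly as 1/sqrt(n), which is why the interval tightens quickly at first and then more slowly, as in the example slides.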
Statistical estimation
• Users do not need to specify a stopping condition a priori
• The interface is easier for users with no statistical background
• It requires more powerful statistical estimation techniques (Hoeffding’s inequality versus Chebyshev’s inequality; see the comparison below)
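To see why the choice of bound matters, here is a small comparison (my addition, not from the slides): for values known to lie in a range [lo, hi], Hoeffding's inequality gives a noticeably tighter confidence-interval half-width than a worst-case Chebyshev bound at the same sample size and confidence level.

import math

def hoeffding_half_width(n, lo, hi, delta):
    # P(|mean - mu| >= eps) <= 2*exp(-2*n*eps^2/(hi-lo)^2); solve for eps at probability delta
    return (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))

def chebyshev_half_width(n, lo, hi, delta):
    # P(|mean - mu| >= eps) <= sigma^2/(n*eps^2), using the worst-case sigma = (hi-lo)/2
    return (hi - lo) / (2 * math.sqrt(n * delta))

for n in (100, 10_000, 1_000_000):
    h = hoeffding_half_width(n, 0, 1, delta=0.05)
    c = chebyshev_half_width(n, 0, 1, delta=0.05)
    print(f"n={n:>9}: Hoeffding +/-{h:.4f}  vs  Chebyshev +/-{c:.4f}")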
Usability goals
• Continuous observation
• Control of time/precision
• Control of fairness/partiality
Performance goals
• Minimum time to accuracy
  – produce a useful estimate of the final answer ASAP
• Minimum time to completion
  – secondary goal; assume the user will terminate processing long before the final answer is produced
• Pacing
  – guarantee a smooth and continuously improving display
What are the tools??
Random access to data
• We need to retrieve data in random order to produce meaningful statistical estimates
• At any time, the input to the query is a sample
• The input grows over time until
  – the query is terminated prematurely, or
  – all data has been examined
Sampling – design issues
• Granularity of sample
  – Record-level: high I/O cost
  – Block-level: high variability from clustering (see the mini-experiment below)
• Types of sample
  – Often a simple random sample (SRS)
  – With or without replacement?
• Data structures from which to sample
  – Files or relational tables
  – Indexes (B+ trees, etc.)
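A small illustrative experiment (my addition, with synthetic data) on the granularity trade-off: a block-level sample is cheaper in I/O, but when values are clustered by block the block-level estimate of the mean is much noisier than a record-level SRS of the same number of rows. random.sample draws without replacement; random.choices would give a with-replacement sample instead.

import random, statistics

random.seed(1)
ROWS_PER_BLOCK, NUM_BLOCKS = 100, 1_000
# clustered table: rows within one block are similar (e.g., loaded region by region)
blocks = [[random.gauss(block_mean, 5) for _ in range(ROWS_PER_BLOCK)]
          for block_mean in (random.gauss(100, 30) for _ in range(NUM_BLOCKS))]
rows = [v for b in blocks for v in b]

def row_level_estimate(k):            # SRS of k individual rows
    return statistics.fmean(random.sample(rows, k))

def block_level_estimate(k):          # SRS of k/ROWS_PER_BLOCK whole blocks
    chosen = random.sample(blocks, k // ROWS_PER_BLOCK)
    return statistics.fmean(v for b in chosen for v in b)

trials = 200
row_est = [row_level_estimate(2_000) for _ in range(trials)]
blk_est = [block_level_estimate(2_000) for _ in range(trials)]
print("true mean              :", round(statistics.fmean(rows), 2))
print("row-level   est. stdev :", round(statistics.stdev(row_est), 2))
print("block-level est. stdev :", round(statistics.stdev(blk_est), 2))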
Row-level sampling techniques
• Maintain the file in (pseudo) random order
  – Sampling = scan
  – Is the file initially in random order?
    • Statistical tests needed: e.g., runs test, Kolmogorov-Smirnov test
  – Can start scans from random positions
  – Best I/O behavior
  – Must “freshen” the ordering (online reorganization)
• On-the-fly sampling
  – Index scan via an index on a “random” column
    • Indexed attributes must be different from (and not correlated with) the aggregated attributes, e.g., name is not correlated with salary
  – Else get a random page, then a row within the page (see the sketch below)
    • Less efficient
    • Problem: variable number of records per page
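A sketch of the "random page, then row within page" approach (my illustration; read_page, MAX_ROWS_PER_PAGE, and the toy heap file are hypothetical). The acceptance/rejection step compensates for the variable number of records per page, so rows on half-empty pages are not over-represented.

import random

MAX_ROWS_PER_PAGE = 100   # assumed upper bound on rows that fit on one page

def sample_one_row(read_page, num_pages):
    """Return one uniformly random row; read_page(i) -> list of rows on page i."""
    while True:
        page = read_page(random.randrange(num_pages))     # one random page I/O
        if not page:
            continue                                      # empty page: try again
        # accept this page with probability (#rows on page) / MAX_ROWS_PER_PAGE,
        # which cancels the bias toward rows living on sparsely filled pages
        if random.random() < len(page) / MAX_ROWS_PER_PAGE:
            return random.choice(page)

# toy "heap file" with a variable number of rows per page
random.seed(2)
heap = [[(p, r) for r in range(random.randint(1, MAX_ROWS_PER_PAGE))]
        for p in range(500)]
print([sample_one_row(lambda i: heap[i], len(heap)) for _ in range(5)])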
Sampling from Index
[Figure: a B+-tree with root key 17, internal nodes (5, 13) and (27, 30), and leaf entries 2*, 3*, 5*, 7*, 8*, 14*, 16*, 22*, 24*, 27*, 29*, 33*, 34*, 38*, 39*]
(see the sketch below)
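One way to draw a random record through a B+-tree like the one above (my illustration, in the spirit of Olken-style acceptance/rejection sampling from B+-trees; the node layout and the MAX_FANOUT bound are assumptions): a naive uniform root-to-leaf descent is biased toward records under low-fanout subtrees, so each step is accepted only with probability fanout/MAX_FANOUT and the descent restarts otherwise, which makes the accepted record uniform.

import random

MAX_FANOUT = 5            # assumed upper bound on a node's fanout / leaf size

def sample_from_btree(root):
    """Return one uniformly random leaf entry (rejections keep the draw unbiased)."""
    while True:
        node, rejected = root, False
        while isinstance(node[0], list):                  # still at an internal node
            if random.random() >= len(node) / MAX_FANOUT:
                rejected = True                           # under-full node: reject and
                break                                     # restart the whole descent
            node = random.choice(node)                    # descend into a random child
        if not rejected and random.random() < len(node) / MAX_FANOUT:
            return random.choice(node)                    # random entry within the leaf

# a toy tree using the figure's leaf keys (structure simplified)
tree = [[[2, 3], [5, 7, 8], [14, 16]], [[22, 24, 27, 29], [33, 34, 38, 39]]]
random.seed(3)
print(sorted(sample_from_btree(tree) for _ in range(10)))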
How expensive is row-level sampling?
[Plot: pages fetched (%) as a function of sampling rate (%)]
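The shape of such a curve can be approximated analytically. A rough back-of-the-envelope (my addition, using the standard independence approximation for the expected number of distinct pages touched; the table size is hypothetical) shows that even sampling a few percent of the rows touches most of the table's pages:

# if k rows are sampled at random from a table spread evenly over P pages, the expected
# fraction of distinct pages touched is about 1 - (1 - 1/P)**k
P, ROWS_PER_PAGE = 10_000, 100          # hypothetical table: 1M rows on 10k pages
total_rows = P * ROWS_PER_PAGE
for rate in (0.002, 0.005, 0.01, 0.02, 0.05):
    k = int(rate * total_rows)          # number of sampled rows
    frac_pages = 1 - (1 - 1 / P) ** k
    print(f"sampling rate {rate:5.1%}: ~{frac_pages:6.1%} of pages fetched")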
Group By operation
• Aggregate queries involve “grouping”, e.g., count of grades of students, average sales per month, etc.
• How are “group-by” queries processed?
  – Sort on the “group-by” attributes, e.g., sort by grade, then count
  – Hash on the “group-by” attributes
    • Same attribute values hash to the same bucket, e.g., students with the same grade are grouped together
Non-blocking GROUP BY and DISTINCT
• Blocking operator
  – Cannot produce any output until it has consumed its entire input
• Sorting is a blocking algorithm
  – only one group is computed at a time, and only after sorting completes
• Hashing is non-blocking, but …
  – the hash table needs to fit in memory for good performance
(see the sketch below)
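A minimal sketch (my illustration) of a non-blocking, hash-based GROUP BY: a dictionary of running (sum, count) pairs is updated as tuples stream in, so per-group running averages can be reported at any time, whereas a sort-based plan cannot emit anything until the whole input has been sorted. The data and the reporting interval are made up.

from collections import defaultdict

def running_group_avg(tuples, report_every=3):
    acc = defaultdict(lambda: [0.0, 0])          # group -> [running sum, running count]
    for i, (group, value) in enumerate(tuples, 1):
        acc[group][0] += value
        acc[group][1] += 1
        if i % report_every == 0:                # partial results, refined over time
            yield {g: s / c for g, (s, c) in acc.items()}

sales = [("east", 10), ("west", 40), ("east", 20), ("east", 30), ("west", 60), ("west", 50)]
for snapshot in running_group_avg(sales):
    print(snapshot)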
Index striding
• For a fair GROUP BY:
  – Want to update all groups
    • Challenging for groups with a small number of records
  – Want a random tuple from Group 1, a random tuple from Group 2, ...
  – Idea (see the sketch below)
    • Index gives tuples from a single group
    • Open many cursors in the index, one per group
    • Fetch records in round-robin order
    • Can control speed by weighting the schedule
  – Gives fairness/partiality, info/speed match!
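A toy sketch of index striding (my illustration; the per-group "cursors" are plain Python iterators standing in for index cursors). Groups are visited in a weighted round-robin order, so every group keeps receiving new tuples, and raising a group's weight plays the role of the "Speed up" control mentioned earlier.

import itertools

def index_stride(cursors, weights=None):
    """cursors: {group: iterator over that group's tuples}; yields (group, tuple) fairly."""
    weights = weights or {g: 1 for g in cursors}
    live = dict(cursors)
    while live:
        for group in list(live):
            for _ in range(weights.get(group, 1)):   # weighted round-robin schedule
                try:
                    yield group, next(live[group])
                except StopIteration:
                    del live[group]                  # this group's cursor is exhausted
                    break

# hypothetical per-grade cursors over student records
cursors = {"A": iter(range(3)), "B": iter(range(5)), "C": iter(range(2))}
print(list(itertools.islice(index_stride(cursors, {"A": 1, "B": 2, "C": 1}), 8)))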
Conclusion
• Big data analytics is becoming increasingly important
• Online aggregation (OLA) is a promising direction
• OLA mechanisms
  – Random sampling
  – Non-blocking schemes
  – Index striding
• Online Hadoop is a variant of Hadoop that offers a flavor of “online aggregation”