C OLUMN D ATABASES A N D R E W C R O T T Y & A L E X G A L A K AT O S
O UTLINE • RDBMS • SQL • Row Store • Column Store • C-Store • Vertica • MonetDB • Hardware Optimizations
F ACULTY M EMBER V ERSION
E XPERIMENT • Question : How does time spent as a faculty member affect one's attention span ?
H YPOTHESIS Attention Span vs Time as Faculty Attention Span 0 < 1 ≈ 10 > 100 Time as Faculty (years)
W HAT IS AN RDBMS? • Relational database management system • Standard for information storage/retrieval • Based on Codd's relational model • Data structured as relations • Manipulated using relational operators • Structure specified by a schema
E XAMPLE id name dept salary 1756 Scott Physics 50k 1757 Bob Math 60k 1758 John CS 80k
W HAT IS SQL? • Structured Query Language • Used to operate on relations • Four main operations: • select • insert • update • delete
S ELECT S TATEMENT select <column(s)> from <table> where <column> = <val>
E XAMPLE id name dept salary 1756 Scott Physics 50k 1757 Bob Math 60k 1758 John CS 80k
R OW S TORE id name dept salary 1756 Scott Physics 50k id name dept salary id name dept salary 1756 Scott Physics 50k 1757 Bob Math 60k 1757 Bob Math 60k 1758 John CS 80k id name dept salary 1758 John CS 80k
R OW S TORE • What is the name, department, and salary of the employee with id 1757 ? select name, dept, salary from employee where id = 1757
R OW S TORE id name dept salary 1756 Scott Physics 50k select name, dept, salary id name dept salary from employee 1757 Bob Math 60k where id = 1757 id name dept salary 1758 John CS 80k
R OW S TORE • What is the average salary of all employees ? select avg(salary) from employee
R OW S TORE id name dept salary 1756 Scott Physics 50k id name dept salary select avg(salary) 1757 Bob Math 60k from employee id name dept salary 1758 John CS 80k
R OW S TORE id name dept salary 1756 Scott Physics 50k id name dept salary select avg(salary) 1757 Bob Math 60k from employee id name dept salary 1758 John CS 80k
C OLUMN S TORE id name 1756 Scott id name dept salary 1757 Bob 1758 John 1756 Scott Physics 50k 1757 Bob Math 60k dept salary 1758 John CS 80k Physics 50k Math 60k CS 80k
C OLUMN S TORE • What is the average salary of all employees ? select avg(salary) from employee
C OLUMN S TORE id name 1756 Scott 1757 Bob 1758 John select avg(salary) from employee dept salary Physics 50k Math 60k CS 80k
H ISTORY • TAXIR • 1969 • Biology information retrieval • RAPID • 1976 • Statistics Canada • Canadian Census of Population and Housing • Sybase IQ • Early 1990s • Sybase (later SAP) • Only commercial column DB for many years
C-S TORE • Project from Brown, Brandeis, MIT, and UMass Boston • Read-optimized • Contributions: • Hybrid architecture • Novel data structures • Advanced compression 2005 : Winner "Best Logo" Award
A RCHITECTURE 2005 : Winner "Most Intricate Architecture" Award
T UPLE M OVER • What does it do ???
T UPLE M OVER • What does it do ??? • Moves tuples 2005 : Winner "Most Creative Name" Award
P ROJECTIONS name dept John CS id name dept salary Bob Math Scott Physics 1756 Scott Physics 50k 1757 Bob Math 60k name salary 1758 John CS 80k Scott 50k Bob 60k John 80k
J OIN I NDEXES name dept John CS Bob Math Scott Physics id name dept salary id 1756 Scott Physics 50k 1756 1757 Bob Math 60k 1757 1758 John CS 80k 1758 name salary Scott 50k Bob 60k John 80k
C OMPRESSION • Similar data types • Less information = less I/O • Operate on compressed data • Compression strategies: • Run length encoding (v,f,n) • Delta encoding • Bitmap encoding (v,b) • Distinct value encoding
R UN L ENGTH E NCODING Original gender female Result male gender male (f, 0, 1) male (m, 1, 4) male (f, 5, 3) female female female
D ELTA E NCODING Result Original salary salary 70,000 70,000 70,500 500 70,900 400 71,250 350 75,000 4,250 79,000 3,500 81,500 2,500 82,000 500
B ITMAP E NCODING Original dept Result CS dept bitmap Math CS 10001100 Math Math 01100001 English English 00010000 CS Biology 00000010 CS Biology Math
D ISTINCT V ALUE E NCODING Stage 1 Stage 2 Original gender age gender age gender/age female 5 0 00 00 male 4 1 01 01 male 4 1 01 01 male 8 1 10 10 female 4 0 01 11 female 5 0 00 00
L ATE M ATERIALIZATION • Tuple reconstruction is: • Costly • Often unnecessary • Solution : • Apply predicates early per column • Reconstruct tuples as late as possible • Bring in only relevant columns • Also improves cache performance
V ERTICA • Commercialization of C-Store • BI and analytics market • Acquired by HP in 2011 • Differences from C-Store • Automatic layout tool • No join indexes • Instead requires super projection
M ONET DB • Developed at CWI (Netherlands) • High performance on complex queries • Column store architecture • Uses demand paging • Exploits CPU caches • Automatic/self-tuning indexes • Open source!
H ARDWARE O PTIMIZATIONS • Ocelot • Hardware-oblivious abstraction • Built on top of MonetDB • Easily and efficiently support operations on: • CPU • GPU • FPGA • Take advantage of optimal memory access patterns • CPU: prefetching, cache awareness • GPU: coalesce memory accesses
H ARDWARE O PTIMIZATIONS
S UMMARY • RDBMS • SQL • Row Store • Column Store • C-Store • Vertica • MonetDB • Hardware Optimizations Questions ?
Recommend
More recommend