Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep Chakraborty VMware Greenplum 1
Agenda ● Goals ● Design internals with demo ● Performance ● Open areas of work ● Get involved! 2
Goals ● A column store that every Postgres user can use ● Fully leverage the on-disk adjacency of column-wise storage ○ Efficient and extensive compression ○ Queries on subsets of columns should be fast ● Optimized for OLAP workloads and bulk data ingestion ● Reasonable OLTP performance and feature parity with heap ● Fully MVCC, crash-safe and supports replication 3
Design 4
Design ● Leverages the Table AM API ● Leverages PostgreSQL infrastructure ○ 8KB (BLCKSZ) fixed block size ○ Buffer manager ○ WAL logging - with custom WAL records ● Each column is a B-Tree 5
Forest of B-Trees Attribute Tree Attribute Tree Attribute Tree Tid Tree attnum = 1 attnum = 2 attnum = n attnum = 0 A block on disk 6
ZSTid & B-Tree pages ● 64-bit tuple identifier, key of the B-Trees block 9 (lokey=1, hikey=MaxZSTid) ● 1-1 mapping with ItemPointer: 1 15922 24616 ○ ZSTidFromItemPointer(blk, off) = blk * 128 + off ○ Only 48 bits is used ● Purely logical ○ Does not tie tuple to physical location block 6 block 8 block 12 (lokey=1, (lokey=15922, (lokey=24616, ○ A tuple’s tid never changes hikey=15922) hikey=24616) hikey=MaxZSTid)
Demo 8
Demo 9
Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 10
Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 11
Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 12
Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 13
INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 14
INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) zsbt_tid_multi_insert(..,1,..) 5 Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 15
INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 16
INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) data cursor len tuple_buffer attstream_buffer2 tid firsttid, lasttid, attlen, relid (hash reservation attbuffers attbyval key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 17
INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) data cursor len tuple_buffer attstream_buffer2 tid firsttid, lasttid, attlen, relid (hash reservation attbuffers attbyval key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums zsbt_attbuffer_flush() zsbt_attbuffer_flush() Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 18
SELECT i, j FROM foo zedstoream_beginscan(...) zedstoream_beginscan_with_column_projection(...) zedstoream_getnextslot(...) zsbt_tid_begin_scan(...) zsbt_attr_begin_scan(...) zsbt_tid_scan_next(...) For all attributes in col list: zsbt_attr_fetch(...) /* fill in visibility info */ return slot 19
SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() zsbt_tid_begin_scan(...) zsbt_attr_begin_scan(...) zsbt_tid_scan_next(...) For all attributes in col list: zsbt_attr_fetch(...) /* fill in visibility info */ return slot 20
SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() Attr tree zsbt_tid_begin_scan(...) Attnum = 1 ‘a’ Attname = i zsbt_attr_begin_scan(...) Type = char(1) zsbt_tid_scan_next(...) zsbt_attr_fetch(5, 1) For all attributes in col list: 4-6 ‘a’ zsbt_attr_fetch(...) /* fill in visibility info */ return slot 21
SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() Attr tree zsbt_tid_begin_scan(...) Attnum = 1 ‘a’ Attname = i zsbt_attr_begin_scan(...) Type = char(1) 42 zsbt_tid_scan_next(...) zsbt_attr_fetch(5, 1) For all attributes in col list: 4-6 ‘a’ zsbt_attr_fetch(...) /* fill in visibility info */ return slot Attr tree zsbt_attr_fetch(5, 7) Attnum = 7 Attname = j Type = int 4-6 42 22
TID tree leaf page layout ZSTidArrayItem PageHeaderData Header TIDs: UNDO slots UNDO slotwords ItemId ItemId ItemId ... - t_firsttid, t_lasttid - 64 bit simple-8b codeword - 0-2 unique UNDO - 64 bit ints packed with pd_lower - Sizes of other sections - Delta encoded pointers to UNDO pages each TID’s UNDO pointer - size of item itself - Typically 1-10 bits per TID - 2 bits per TID Free Space pd_upper ... Item Item Item Logical content: Physical content: pd_special ZSBtreePageOpaque TID | UNDO pointer TIDs | UNDO pointers | UNDO indexes 119 | ALL_VISIBLE 119, 1, 1, 2, 1, 1 | 7599293, 8000222 | V, 0, D, 0, 1, 1 Tid leaf page 120 | 7599293 121 | DEAD 123 | 7599293 124 | 8000222 125 | 8000222 23
Attribute tree leaf page layout ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Header compressed chunks Header Content: - size - flags (compressed or not) - decompressed_size - t_lasttid Attr leaf page 24
Attribute tree leaf page layout ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Header compressed chunks TIDs: Header Content: - 1-60 TIDs - size Datum 0 Datum 1 ... Datum N - 4-bit mode selector encodes - flags (compressed or not) nullity, TID lengths & datum lengths - decompressed_size - t_lasttid Logical content: Physical content: TID | column values TIDs | column values 31 | 31 31, 1, 1, 1,..., 1 | 31, 32, 33, 34, … 60 Attr leaf page 32 | 32 33 | 33 34 | 34 … 60 | 60 25
Demo 26
Attribute tree leaf w/ oversized datum ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Raw datum cannot fit in a block An inline-compressed chunk Datum 0 toast_compress_datum() Yes 1 48-bit zstid zs_toast_header_inline mode selector: toast PGLZ compressed - 32-bit compressed_size (varlena_modes[14]) datum - 32-bit rawsize toast type: inline PGLZ compressed datum fit in a block? An externally toasted chunk Datum 0 1 48-bit zstid No mode selector: toast zs_toast_header_external Toast Pages (varlena_modes[14]) (first toast block number) toast type: toast page 27
Recommend
More recommend