histograms in mariadb mysql and postgresql
play

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, - PowerPoint PPT Presentation

Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th 27th, 2017 Santa Clara, California | April 24th 27th, 2017 What this talk is about Data statistics


  1. Histograms in MariaDB, MySQL and PostgreSQL Sergei Petrunia, MariaDB Sergei Petrunia, MariaDB Santa Clara, California | April 24th – 27th, 2017 Santa Clara, California | April 24th – 27th, 2017

  2. What this talk is about ● Data statistics histograms in – MariaDB – MySQL (status so far) – PostgreSQL ● This is not a competitive comparison – Rather, a survey 2

  3. Histograms and query optimizers Click to add text Click to add text

  4. Query optimizer needs data statistics select * from customers join orders on customer.cust_id=orders.customer_id where customers.balance<1000 and orders.total>10K ● Which query plan enumerates fewer rows – orders->customers or customers->orders? ● It depends on row counts and condition selectivities ● Condition selectivity has a big impact on query speed 4

  5. Data statistics has a big impact on optimizer ● A paper " H ow good are query optimizers, really?" – Leis et al, VLDB 2015 ● Conclusions section: – "In contrast to cardinality estimation, the contribution of the cost model to the overall query performance is limited." ● This matches our experience 5

  6. Data statistics usage ● Need a *cheap* way to answer questions about – Numbers of rows in the table – Condition selectivity – Column widths – Number of distinct values – … ● Condition selectivity is the most challenging 6

  7. Histogram as a compact data summary ● Partition the value space into buckets ● Keep an array of (bucket_bounds, n_values) – Takes O(#buckets) space 7

  8. Histogram and condition selectivity col BETWEEN ‘a’ AND ‘b’ ● Sum row counts in the covered buckets ● Partially covered bucket? – Assume a fraction of rows match – This is a source of inaccuracy ● More buckets – more accurate estimates 8

  9. Histogram types ● Different strategies for choosing buckets – Equi-width – Equi-height – Most Common Values – ... 9

  10. Equi-width histogram ● Bucket bounds pre-defined – Equal, log-scale, etc ● Easy to understand, easy to collect. ● Not very efficient – Densely and sparsely-populated regions have the same #buckets – What if densely-populated regions had more buckets? 10

  11. Equi-height histogram ● Pick the bucket bounds such that each bucket has the same #rows – Densely populated areas get more buckets – Sparsely populated get fewer buckets ● Estimation error is limited by bucket size – Which is now limited. 11

  12. Most Common Values histogram ● Suitable for enum-type value1 count1 domains ● All possible values fit in the value2 count2 histogram value3 count3 ● Just a list of values and ... ... frequencies 12

  13. Histogram collection algorithms ● Equi-width – Find (or guess) min and max value – For each value ● Find which histogram bin it falls into ● Increment bin’s counter ● Equi-height – Sort the values – First value starts bin #0 – Value at n_values * (1/n_bins) starts bin #2 – Value at n_values * (2/n_bins) starts bin #3 – ... 13

  14. Histogram collection strategies ● Scan the whole dataset – Used by MariaDB – Produces a “perfect” histogram – May be expensive ● Do random sampling – Used by PostgreSQL (MySQL going to do it, too?) – Produces imprecise histograms – Non-deterministic results ● Incremental updates – hard to do, not used 14

  15. Summary so far ● Query optimizers need condition selectivities ● These are provided by histograms ● Histograms are compact data summaries ● Histogram types – Width-balanced – Height-balanced (better) – Most-Common-Values ● Histogram collection methods – Scan the whole dataset – Do random sampling. 15

  16. Histograms in MariaDB Click to add text Click to add text

  17. Histograms in MariaDB ● Available in MariaDB 10.0 – (Stable since March, 2014) ● Used in the real world ● Good for common use cases – has some limitations ● Sometimes are called “Engine-Independent Table Statistics” – Although being engine-independent is not the primary point. 17

  18. Histogram storage in MariaDB ● Are stored in mysql.column_stats table CREATE TABLE mysql.column_stats ( db_name varchar(64) NOT NULL, table_name varchar(64) NOT NULL, column_name varchar(64) NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint unsigned, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), histogram varbinary(255), PRIMARY KEY (db_name,table_name,column_name) ); ● Very compact: max 255 bytes (per column) 18

  19. Collecting a histogram ● Manual collection only set histogram_size=255; set histogram_type='DOUBLE_PREC_HB'; analyze table tbl persistent for all; analyze table tbl persistent for columns (col1, col2) indexes (); +----------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------+---------+----------+-----------------------------------------+ | test.tbl | analyze | status | Engine-independent statistics collected | | test.tbl | analyze | status | OK | +----------+---------+----------+-----------------------------------------+ ● Make the optimizer use it set use_stat_tables='preferably'; set optimizer_use_condition_selectivity=4; <query>; 19

  20. Examining a histogram select * from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** db_name: babynames table_name: pop1980_cp column_name: firstname min_value: Aaliyah max_value: Zvi nulls_ratio: 0.0000 avg_length: 6.0551 avg_frequency: 194.4642 hist_size: 32 hist_type: DOUBLE_PREC_HB histogram: � ���C)�G�[j\�\�fzz�z]����3� select decode_histogram(hist_type,histogram) from mysql.column_stats where table_name='pop1980_cp' and column_name='firstname' *************************** 1. row *************************** decode_histogram(hist_type,histogram): 0.00201,0.04048,0.03833,0.03877,0.04158,0.11852,0.07912,0.00218,0.00093,0.03940, 0.07710,0.00124,0.08035,0.11992,0.03877,0.03989,0.24140 20

  21. Histograms in MariaDB - summary ● Available since MariaDB 10.0 ● Special ANALYZE command to collect stats – Does a full table scan – May require a lot of space for big VARCHARs: MDEV-6529 “EITS ANALYZE uses disk space inefficiently for VARCHAR columns” ● Not used by the optimizer by default – Special settings to get optimizer to use them. 21

  22. Histograms in PostgreSQL Click to add text Click to add text

  23. Histograms in PostgreSQL ● Data statistics – Fraction of NULL-values – Most common value (MCV) list – Height-balanced histogram (excludes MCV values) – A few other parameters ● avg_length ● n_distinct_values ● ... ● Collection algorithm – One-pass random sampling 23

  24. Collecting histograms in PostgreSQL -- Global parameter specifying number of buckets -- the default is 100 set default_statistics_target=N; -- Can also override for specific columns alter table tbl alter column_name set statistics N; -- Collect the statistics analyze tablename; postgresql.conf, or per-table # number of inserted/updated/deleted tuples to trigger an ANALYZE autovacuum_analyze_threshold = N # fraction of the table size to add to autovacuum_analyze_threshold # when deciding whether to trigger ANALYZE autovacuum_analyze_scale_factor=N.N 24

  25. Examining the histogram select * from pg_stats where tablename='pop1980'; tablename | pop1980 attname | firstname inherited | f null_frac | 0 avg_width | 7 n_distinct | 9320 most_common_vals | {Michael,Jennifer,Christopher,Jason,David,James, Matthew,John,Joshua,Amanda} most_common_freqs | {0.0201067,0.0172667,0.0149067,0.0139,0.0124533, 0.01164,0.0109667,0.0107133,0.0106067,0.01028} histogram_bounds | {Aaliyah,Belinda,Christine,Elsie,Jaron,Kamia, Lindsay,Natasha,Robin,Steven,Zuriel} correlation | 0.0066454 most_common_elems | 25

  26. Histograms are collected by doing sampling src/backend/commands/analyze.c, std_typanalyze() refers to ● "Random Sampling for Histogram Construction: How much is enough?” ● – Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, ACM SIGMOD, 1998. Rows in table (=10^6) Histogram size Error probability (=0.01) Random Max relative error in bin (=0.5) sample size ● 100 buckets = 30,000 rows sample 26

  27. Histogram sampling in PostgreSQL ● 30K rows are sampled from random locations in the table – Does a skip scan forward – “Randomly chosen rows in randomly chosen blocks” ● Choice of Most Common Values – Sample values that are 25% more common than average – Values that would take more than one histogram bucket. – All seen values are MCVs? No histogram is built. 27

  28. Beyond single-column histograms ● Conditions can be correlated select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit' ● Correlation can have a big effect – MIN(1/n, 1/m) – (1/n) * (1/m) – 0 ● Multi-column “histograms” are hard ● “Possible PostgreSQL 10.0 feature: multivariate statistics” 28

Recommend


More recommend