PGCon 2020 Tatsuro Yamada Julien Rouhaud
Who we are
Tatsuro Yamada
⁃ Works for NTT Comware as a Database Engineer
⁃ PostgreSQL Contributor
⁃ Oracle_fdw Committer
⁃ pg_plan_advsr Author
⁃ Member of the PGConf.Asia organizers since 2016
Julien Rouhaud
⁃ Works for VMware
⁃ PostgreSQL Contributor
⁃ HypoPG Author
⁃ pg_qualstats Co-Author
Agenda
1. What is a Query plan?
2. Why can an inefficient plan be chosen?
3. Our Challenges to get an efficient plan
4. Conclusion
1. What is a Query plan?
1. What is a Query plan?
SQL
• declarative language
Query Plan
• the planner's steps to retrieve the wanted data
• contains multiple plan nodes (e.g. scan, join, aggregate, …)
• tree structure
Planner
• selects the query plan it thinks is the most efficient, based on calculated costs
Does the planner select an efficient plan every time?
Inefficient plan
• It tends to cause long query execution times
• The EXPLAIN command allows checking a query plan
  e.g. EXPLAIN (ANALYZE, BUFFERS) SELECT …
Inefficient plan
In this example, using an index can reduce the execution time:

QUERY PLAN
-----------------------------------------------------------------------------------
Nested Loop ([...] rows=313475000 width=16) ([...] rows=0 loops=1)
  -> Seq Scan on pgqs t2 ([...] rows=25078 width=8) ([...] rows=25000 loops=1)
       Filter: (val1 = 0)
       Rows Removed by Filter: 25000
  -> Materialize ([...] rows=12500 width=8) ([...] rows=0 loops=25000)
       -> Seq Scan on pgqs t1 ([...] rows=12500 width=8) ([...] rows=0 loops=1)
            Filter: ((val1 = 0) AND (val2 = 0))
            Rows Removed by Filter: 50000

Why can an inefficient plan be chosen?
https://rjuju.github.io/postgresql/2020/02/28/pg_qualstats-2-selectivity-error.html
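A minimal sketch of the fix, assuming the table and column names shown in the plan above: indexing the columns used in the inner Seq Scan's filter lets the planner switch to an index scan.

```sql
-- Cover the inner scan's filter ((val1 = 0) AND (val2 = 0)) with an index
CREATE INDEX ON pgqs (val1, val2);

-- Then re-check the plan with:
--   EXPLAIN (ANALYZE, BUFFERS) <the original query>
```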
2. Why can an inefficient plan be chosen?
2. Why can an inefficient plan be chosen?

Problem: Inefficient plan
Reasons: Bad scan method / Bad join method / Bad join order
Root causes: Index deficiency / Inaccurate statistics / Planner specification
2.1. Index deficiency or not working
Problems
• No suitable index available
• A good candidate index can't be used
Solutions
• Create a suitable index and check how the query is written
• It's better to create the minimum number of indexes needed to achieve good performance
• If you have thousands of queries, it's tough work
2.2. Inaccurate statistics
What are statistics?
• Sampled data from a table, e.g.:
  • Number of rows
  • Number of unique values
  • Most common values
How does the planner use statistics?
• To calculate the cardinality/selectivity of plan nodes, and the cost of query plans
• The planner assumes the lowest-cost plan is the most efficient plan
In what cases can statistics be inaccurate?
2.2. Inaccurate statistics Cases a. Autovacuum not properly tuned b. Usage of temporary tables c. Massive imports directly followed by queries or in the same transaction
2.2. Inaccurate statistics
a. autovacuum not properly tuned
Problem
• autovacuum doesn't run ANALYZE often enough
Solution
• Tune the thresholds for triggering analyze using GUC parameters:
  • autovacuum_analyze_threshold
  • autovacuum_analyze_scale_factor
• And ideally on a per-table basis:
  • ALTER TABLE tbl SET (autovacuum_analyze_... = ...)
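The per-table form can be sketched as follows; the table name and values are hypothetical and should be tuned to the table's churn rate.

```sql
-- Hypothetical table and values: trigger ANALYZE after ~1% of the rows
-- have changed, instead of the default 10%
ALTER TABLE orders SET (
    autovacuum_analyze_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.01
);
```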
2.2. Inaccurate statistics
b. Usage of temporary tables
Problem
• Temporary tables have no statistics by default, as autovacuum won't see them
• The planner therefore uses a default row count to calculate costs
Solution
• Run a manual ANALYZE on temporary tables
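A minimal sketch (table name hypothetical): run ANALYZE after populating the temporary table and before the queries that use it.

```sql
CREATE TEMPORARY TABLE tmp_ids (id int);
INSERT INTO tmp_ids SELECT generate_series(1, 100000);
-- Without this, the planner has no statistics and falls back to defaults
ANALYZE tmp_ids;
```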
2.2. Inaccurate statistics
c. Massive imports followed by queries in the same transaction
Problem
• A query executed just after importing a large number of rows can lead to an inefficient plan, since the planner can't use up-to-date statistics
Solutions
• Executing ANALYZE manually is better than waiting for autovacuum
• If possible, divide the transaction into three processing steps: import, analyze, then query processing
  • i.e. turn "imports + query" into "imports → analyze → query"
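The three-step split can be sketched as follows (table name and file path are hypothetical; note that plain ANALYZE, unlike VACUUM, is allowed inside a transaction block).

```sql
BEGIN;
-- Step 1: import
COPY orders FROM '/path/to/orders.csv' WITH (FORMAT csv);
-- Step 2: refresh statistics so the planner sees the new rows
ANALYZE orders;
-- Step 3: run the queries that depend on the imported data
SELECT count(*) FROM orders;
COMMIT;
```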
2.3. Planner specification
Selectivity estimation for AND-ed quals
Problem
• By default, the planner assumes that columns are totally independent for selectivity estimation
Example: WHERE zipcode = 75 AND city = 'Paris'
• The Paris zipcode is 75, so both predicates are actually redundant
• If each predicate retains 1% of the rows, applying both should also retain 1% of the rows
• But the planner will by default multiply the selectivities, assuming that only 0.01% of the rows will be retained
2.3. Planner specification Solution • Detect those correlated columns • Create extended statistics (CREATE STATISTICS)
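A sketch of the fix for the zipcode/city example (the table name addresses is hypothetical): functional-dependency statistics let the planner see that city is determined by zipcode.

```sql
-- Hypothetical table holding the correlated zipcode/city columns
CREATE STATISTICS addr_dep (dependencies) ON zipcode, city FROM addresses;
-- Extended statistics are only populated by ANALYZE
ANALYZE addresses;
```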
Causes and Solutions

Problem: Inefficient plan
Reasons: Bad scan method / Bad join method / Bad join order
Root causes: Index deficiency / Inaccurate statistics / Planner specification
Solutions: Create Index / Appropriate settings / Create Extended Stats / Plan Tuning

Is it possible to apply the solutions easily?
3. Our Challenges to get an efficient plan
3. Our Challenges to get an efficient plan 3.1. Automatic indexing tool • pg_qualstats 3.2. Auto query plan tuning tool • pg_plan_advsr 3.3. What's next?
3.1. Auto index and extended statistics advising tool: pg_qualstats 3.1.a What is it? 3.1.b How does index suggestion work? 3.1.c Demo 3.1.d Summary
3.1.a What is pg_qualstats
• PostgreSQL extension that keeps track of statistics about predicates (WHERE/JOIN clauses) during query execution
  • Selectivity, error in selectivity estimation, number of executions, type of scan…
  • Aggregated per unique query (same as pg_stat_statements)
• And a global index advisor using those statistics
  • Takes the server's actual workload into account
  • Uses multiple heuristics to suggest the minimum number of new indexes
    • E.g. preference for multi-column indexes
  • Returns new index suggestions and lists predicates that can't be automatically optimized
    • E.g. operators that don't have any index AM compatibility
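Basic usage can be sketched as follows; pg_qualstats must be loaded via shared_preload_libraries, and the advisor function shown is the one provided by pg_qualstats 2.x (check the extension's documentation for its exact parameters).

```sql
CREATE EXTENSION pg_qualstats;

-- ... let the production workload run for a while ...

-- Ask the global advisor for suggestions based on the collected predicates
SELECT pg_qualstats_index_advisor();
```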
3.1.b How does index suggestion work
• Retrieve the list of "interesting" predicates
  • Filtering more than 50% of the rows
  • Executed through a sequential scan
  • Belonging to queries run more than 10 times
  • …
3.1.b How does index suggestion work
• Build the full set of paths, with each combination of AND-ed predicates
3.1.b How does index suggestion work
• Compute a score for each path, giving a weight to each predicate corresponding to its number of "simple" predicates
3.1.b How does index suggestion work
• The highest score is the "best" index
  • A single index optimizing the highest number of predicates
• Generate the index DDL given the predicate list
  • The correct column order corresponds to the ascending weights in the path!
• Remove all optimized predicates from the list, and start again until all predicates are optimized
3.1.c Demo – the queries
3.1.c Demo – optimizable predicates
3.1.c Demo – non optimizable predicates • The ILIKE (~~*) can’t be optimized automatically. DBA attention is required here to know how to properly handle this.
3.1.d Summary
• Useful extension to extract knowledge about your production workload
• Helps the DBA focus on the "complex" optimization part, thanks to the global index suggestion
• But also gives a lot of other information that can be used for other parts of optimization
  • Automatically find correlated columns
  • Automatically find problematic statistics
  • …
3.2. Auto query plan tuning tool: pg_plan_advsr 3.2.a What is this? 3.2.b How it works 3.2.c Demo 3.2.d Summary
3.2.a What is pg_plan_advsr and what are its merits
pg_plan_advsr
• allows automatic tuning of a query plan by using a feedback loop
Advantages
- The machine performs the tuning, rather than humans
  - Tuning is possible even when there is no expert
  - For experts, it is possible to gain new insights
- Can find an optimal plan
  - Since it controls each plan node, it can create an optimal plan that might not be selected by vanilla PostgreSQL
- All plan history from the tuning process is available
  - Moreover, any plan from the history can be reproduced
3.2.a What is pg_plan_advsr and what are its merits
Use-cases
• System development field
  • Tuning during performance tests and performance troubles
• Regular reporting processes
  • Analytical processing
  • Batch processing
• Field of study
  • Verify the planner's capability
  • You can check the performance of the selected plan when there is no estimation error
How did this tool achieve automatic tuning?
Note:
- pg_plan_advsr is in the POC phase, so we recommend using it in a verification environment
- Refer to the manual for details on its functional restrictions
3.2.b How does auto plan tuning work
- Adds a feedback loop to query execution (a PDCA cycle):
  1. Detect estimation errors [1] in the EXPLAIN ANALYZE result
  2. Record the error info as feedback info
  3. Correct the query estimation error by applying the feedback info on the next query execution (Parser → Planner → Executor)
[1] The difference between estimated rows and actual rows
3.2.b Rough concept of pg_plan_advsr
How to correct an estimation error
• 1st iteration: a Nested Loop joining Seq Scan on A and Index Scan on B (estimate: 1 row, actual: 1000 rows) → the error is recorded as feedback info (hints) fixing the estimate for (A B) to 1000 rows
• 2nd iteration: with the corrected estimate (estimate: 1000, actual: 1000), the planner chooses a Hash Join over A and B instead
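The recorded feedback corresponds to a row-count correction hint; a sketch in pg_hint_plan syntax, with table aliases a and b taken from the figure above — see the pg_plan_advsr manual for the exact hints it generates.

```sql
-- Hypothetical sketch: fix the (a b) join estimate to the observed 1000 rows
/*+ Rows(a b #1000) */
EXPLAIN ANALYZE
SELECT * FROM a JOIN b ON a.id = b.id;
```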