improve your sql workload with observability
play

Improve your SQL workload with observability PostgresOpen 2019 @ - PowerPoint PPT Presentation

Improve your SQL workload with observability PostgresOpen 2019 @ Orlando Wilfried ROSET, Engineering Manager 1 Speaker Wilfried ROSET wilfriedroset @ github Engineering Manager @ OVH 2 WE BUILD A a truly Cloud SMART DIFFERENT


  1. Improve your SQL workload with observability PostgresOpen 2019 @ Orlando Wilfried ROSET, Engineering Manager 1

  2. Speaker • Wilfried ROSET • wilfriedroset @ github • Engineering Manager @ OVH 2

  3. WE BUILD A a truly Cloud SMART DIFFERENT CLOUD S IMPLE AND QUICK TO SET UP M ULTI-LOCAL , CLOSE TO EACH PERSON THROUGHOUT THE WORLD WHOSE COST IS A CCESSIBLE AND PREDICTABLE R EVERSIBLE , OPEN AND INTEROPERABLE T RANSPARENT AND RESPONSIBLE TO GROW AND SUCCEED TOGETHER 3

  4. WE ARE 2 200 IN 18 A GLOBAL CLOUD PROVIDER EMPLOYEES COUNTRIES +1 500 000 IN 132 CUSTOMERS COUNTRIES + 1 million BUILDER OF ITS OWN SERVERS BHS SERVERS built x6 SINCE 2002 Europe since 1999 x18 Hillsboro Vinthill x1 x1 Singapore x1 28 EXISTING SINCE Sydney 1999 x1 DATA CENTERS in 20 12 locations Tbps 34 Point of Presence WE ARE IN 3 CONTINENTS AN INDUSTRIAL PLAYER 4

  5. Everyone uses SQL • Directly or Indirectly: – CEO / CTO / CXO – Dev / DevOps / SRE – Support & Run 5

  6. Is this happening to you? « Our API is slow « We need a bigger DB! » because of the DB! » 6

  7. No Observability No 7

  8. Ugly Observability 8

  9. Good Observability 9

  10. Our internal Databases perimeter 3k Applications 400 Databases 60 Clusters 20TB Disk space 2 à 4 Person 24 Hours per day 10

  11. How to isolate a SLOW query db# SELECT username FROM customers WHERE username LIKE ‘_wilfried%’; Time: 9433.400 ms (00:09.433) Tips: it’s not a missing index ;) 11

  12. Internal DBs Infra Read/Write Prod Primary Replication Read-only Backup Replica Read/Write Archive Dev 24-hour restore Dev Replica point 12

  13. Old request process " Why is my DB slow? " " Can I have an Team DB extract from my logs? " 13

  14. Easier, Better, Faster, Stronger Self-service for complete autonomy • Meaningful KPIs • Capacity planning • Ownership and proactivity • 14

  15. Prerequisites • As-a-service – OVH Logs – OVH Metrics – UI • Open source – Data collection 15

  16. Logs 16

  17. First things first Observability is not about ho how to to co collect ct da data but wha what to to d do wit with it it 17

  18. DBMS Configuration • PostgreSQL – Everything happens in postgresql.conf – Format log to produce report • log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h ' – Log slow queries • log_min_duration_statement = '1000' 18

  19. DBMS Configuration • MySQL / MariaDB – Everything happens in my.cnf – Performance Schema – Log slow queries • slow_query_log = 1 • slow_query_log_file = /var/log/mysql/slow.log 19

  20. Give meaning to logs • Process log files and produce fully readable reports pgbadger for PostgreSQL – pt-query-digest for MySQL/MariaDB – 20

  21. Observability: step 1/5 Reporting " Why is my DB slow? " " Can I have Team DB an extract from my logs? " 21

  22. Output Sample 22

  23. Logs, Logs, Logs… • Logs are useful only if they are usable – <3 grep, less, … – Avoid ssh 23

  24. Logs pipeline DBMS Servers Logstash Clusters ElasticSearch indexes 24

  25. Let’s grok User@Host: ap app_u _user er @ [ 1. 1.2. 2.3. 3.4 ] ^# User@Host: %{ WO WORD:query_user }\s*@\s*\[%{ IP IP:query_ip }?\].* { "query_user": "app_user", "query_ip": "1.2.3.4" } 25

  26. Sending Data • Use whatever is Efficient && Easy to setup for your case Filebeat – Syslog – 26

  27. Filebeat # /etc/filebeat /filebeat.yml output.logstash: hosts: ["graX.logs.ovh.com:6514"] ssl.certificate_authorities: ["/etc/filebeat/cert.pem"] filebeat.prospectors: - type: log enabled: true paths: - /var/log/postgresql/postgresql.log 27

  28. Syslog destination d_pg_ldp { tcp( "graX.logs.ovh.com", port(6514), ts_format("iso"), keep-alive(yes), so_keepalive(yes), log-fifo-size(10000), ); }; 28

  29. Observability: step 2/5 Reporting " Why is my DB slow? " " Can I have Team DB Syslog/Filebeat an extract from my Logs logs? " 29

  30. Graylog 30

  31. Oops, Proxy and Connection Pooler… Load Connection Balancer Pooler With our setup we lost Source IP: Replica Less usefull pg_hba • VIP & Harder to track bad queries • Connection Load Pooler Security audit • Balancer Primary Load Connection Balancer Pooler Replica 31

  32. … Proxy Protocol What about Proxy Protocol? We want to help: Proxy Protocol support @ PostgreSQL Hackers à https://bit.ly/2MN2H8U • PR#390 @ pgbouncer à https://github.com/pgbouncer/pgbouncer/pull/390 • 32

  33. Metrics 33

  34. It’s not about how Metrics is no not ab is about t whi which to tools ls to use But wha what to to d do wi with yo your dat data 34

  35. Collect your metrics • There are several options for doing this: – Homemade à please don’t do that – Collectl, Collectd, Statsd... – Telegraf, Prometheus… 35

  36. Telegraf’s plug-ins • Plug-ins: – System-related • cpu, disk, diskio, system, network ... – DBMS • PostgreSQL, MySQL – Homemade • Exec • There is a lot more plug-ins in telegraf 36

  37. Let’s push some metrics # /etc/telegraf/telegraf.conf [agent] interval = "30s" flush_interval = "30s" [[outputs.influxdb]] urls = ["https://influxdb.graXXX.metrics.ovh.net"] timeout = "15s" username = "telegraf" password = "write.token.from.metrics.manager" 37

  38. Observability: step 3/5 Reporting " Why is my Telegraf Team DB DB slow? " Metrics " Can I have Syslog/Logstash an extract from my logs? " Logs 38

  39. System dashboard 39

  40. I/O dashboard 40

  41. DBMS dashboard 41

  42. Grafana is beautiful! 42

  43. Observability: step 4/5 Reporting Metrics " Why is my Telegraf DB slow? " " Can I have Logs Syslog/Logstash an extract from my logs? " 43

  44. Slow queries count/cumul. time 44

  45. Spot bugs 45

  46. Spot fixes 46

  47. We already knew that! • This is well known methods – U tilization S aturation E rrors – R ate E rrors D urations – Four Golden Signals 47

  48. 80/20 • Don’t try to fix the world… • Focus where it matters – Start by your Top producers and iterate 48

  49. Talk… Hey folks, I’ve open 60 tickets for your database slow queries and errors. XOXO <3 49

  50. Talk, again… Folks, come on! What the F*** 50

  51. Always talk (louder) … That’s it folks! I’m done with you. No more prod for you until you fixe everything. 51

  52. … But do it gently Ok folks, my bad. Let’s try again. 52

  53. Gamification • Make it fun • Make it interesting • Turn it to a game • Each game needs a …. 53

  54. Leader Board 54

  55. Leader Board Weekly Mail Hello, You will find information which can help you identify your queries in our welcome guide: < insert documentation link > Tldr: • Database 33 did great because of blablabla • Database 94 is our #1 producer of slow queries • Blablabla … 55

  56. Engage regularly 56

  57. What have we learned? • Observability can be set up in le less ss tha than a a we week • Choose the right tools for the job – Grafana for dashboard (mix and match sources) – Graylog for search – Reporting tools are not used by developers • 1 year down the road, we can expect… – A widely-adopted gamechanger – Inspire others to do the same kind of report <3 – x4 x4 le less ss slo slow que queries • KPIs 57

  58. What’s next? • The easy part is done, we need to dig deeper – Engage at higher level – Point everyone in the right direction… Broadcast the KPI – Help Developper refactor app & schema • Improve monitoring • Feed the data to Machine Learning • Automatic indexes recommendations – Thanks to Percona blog – Based on pg_qualstats & hypopg 58

  59. Observability: step 5/5 Metrics Machine " Why is my Learning Telegraf DB slow? " " Can I have Syslog/Logstash an extract Logs from my logs? " 59

  60. Remember this SLOW query db# SELECT username FROM customers WHERE username LIKE ‘_wilfried%’; Time: 9433.400 ms (00:09.433) Tips: it’s not a missing index ;) 60

  61. PS: We are hiring! Opensource database expert • Site Reliability Engineers (Private cloud, Openstack, DNS, Deploy, Observability) • Software engineers (containers, baremetal, webhosting) • Back-end developers (go, python) • Engineering manager webhosting • … And a lot more • 61

  62. Questions? 62

Recommend


More recommend