
Why are learned indexes so effective? Paolo Ferragina, Fabrizio Lillo, Giorgio Vinciguerra


  1. Why are learned indexes so effective? Paolo Ferragina¹, Fabrizio Lillo², Giorgio Vinciguerra¹ (¹ University of Pisa, ² University of Bologna)

  2. A classical problem in computer science
     • Given a set of $n$ sorted input keys (e.g. integers)
     • Implement membership and predecessor queries
     • Applications: range queries in databases, conjunctive queries in search engines, IP lookup in routers…
     Example keys (positions 1 to $n$): 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
     member(36) = True, predecessor(50) = 48
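
The two queries can be sketched with a plain binary search over the sorted array. This is a minimal illustration, not taken from the slides: the function names `member` and `predecessor` mirror the slide labels, and the convention that the predecessor of `x` is the largest key less than or equal to `x` is an assumption that happens to agree with the slide example.

```python
from bisect import bisect_left, bisect_right

# Example keys from the slide.
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]


def member(keys, x):
    """Return True if x occurs in the sorted list keys."""
    i = bisect_left(keys, x)
    return i < len(keys) and keys[i] == x


def predecessor(keys, x):
    """Return the largest key <= x (assumed convention), or None if there is none."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None


print(member(KEYS, 36))       # True
print(predecessor(KEYS, 50))  # 48
```

Both functions cost O(log n) comparisons; the indexes discussed in the following slides aim to reduce the part of the array that has to be searched.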

  3. Indexes: key → B-tree → position
     [Figure: a B-tree mapping a key to its position in the sorted array 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95, positions 1 to $n$.]

  4. Input data as pairs (key, position)
     [Figure: the sorted array 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 (positions 1 to $n$), with keys on the x-axis and positions on the y-axis.]

  5. Input data as pairs (key, position), continued
     [Figure: the first points (2, 1), (11, 2), (13, 3), (15, 4) plotted with keys on the x-axis and positions on the y-axis.]

  6. Learned indexes: key → black box → (approximate) position
     • The black box is trained on a dataset of pairs (key, pos): $\mathcal{D} = \{(2, 1), (11, 2), \dots, (95, n)\}$
     • Binary search in $[\mathit{position} - \varepsilon, \mathit{position} + \varepsilon]$
     • E.g. $\varepsilon$ is of the order of 100–1000
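
A minimal sketch of this lookup scheme follows. The slide leaves the model as a black box; here a single least-squares line stands in for it, which is an assumption made only for illustration. The helper names `fit_line`, `max_error` and `lookup` are hypothetical, and positions are 0-based rather than 1-based as on the slides.

```python
from bisect import bisect_left

KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]


def fit_line(keys):
    """Least-squares line pos ~ slope * key + intercept over (key, index) pairs."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k


def max_error(keys, slope, intercept):
    """Smallest integer eps such that every stored key is within eps of its prediction."""
    return max(abs(round(slope * k + intercept) - p) for p, k in enumerate(keys))


def lookup(keys, model, eps, x):
    """Return the index of x in keys, or -1 if x is absent."""
    slope, intercept = model
    guess = round(slope * x + intercept)
    guess = min(max(guess, 0), len(keys) - 1)  # clamp into the array
    lo = max(0, guess - eps)
    hi = min(len(keys), guess + eps + 1)
    i = bisect_left(keys, x, lo, hi)           # binary search in [guess - eps, guess + eps]
    return i if i < len(keys) and keys[i] == x else -1


model = fit_line(KEYS)
eps = max_error(KEYS, *model)
print(lookup(KEYS, model, eps, 36))  # 10
```

The binary search touches at most 2·eps + 1 positions, so the query cost depends on the model error rather than on the array length.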

  7. The knowledge gap in learned indexes
     • Query time. Practice: same query time as traditional tree-based indexes. Theory: same asymptotic query time as traditional tree-based indexes.
     • Space. Practice: space improvements of orders of magnitude, from GBs to a few MBs. Theory: same asymptotic space occupancy as traditional tree-based indexes.

  8. PGM-index: an optimal learned index [Ferragina and Vinciguerra, PVLDB 2020], https://pgm.di.unipi.it
     1. Fix a max error $\varepsilon$, e.g. so that the keys in $[\mathit{pos} - \varepsilon, \mathit{pos} + \varepsilon]$ fit a cache line
     2. Find the smallest Piecewise Linear $\varepsilon$-Approximation (PLA)
     3. Store triples (first key, slope, intercept), one for each segment
     [Figure: a PLA over the keys 1 3 8 11 12 19 22 23 24 28 29 33 38 47 48 53 55 56 57, with keys on the x-axis, positions on the y-axis and the search range $[\mathit{pos} - \varepsilon, \mathit{pos} + \varepsilon]$ highlighted.]
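
The three steps can be sketched as follows. Note the substitution: this is not the optimal algorithm used by the PGM-index but a simpler greedy "shrinking cone" heuristic, so it may produce more segments than the smallest PLA. The names `build_pla` and `predict` are hypothetical, positions are 0-based, and storing the position of the first key as the intercept is an assumed convention.

```python
from bisect import bisect_right


def _pick_slope(lo, hi):
    """Any slope in [lo, hi] is valid; take the midpoint (or lo for a one-key segment)."""
    return lo if hi == float("inf") else (lo + hi) / 2


def build_pla(keys, eps):
    """Greedy eps-approximate PLA over (key, index) pairs.

    keys must be non-empty and strictly increasing. Returns a list of
    (first_key, slope, intercept) triples such that, for every key in a
    segment, |slope * (key - first_key) + intercept - index| <= eps.
    """
    segments = []
    k0, p0 = keys[0], 0
    lo, hi = 0.0, float("inf")  # admissible slopes for the current segment
    for p in range(1, len(keys)):
        dx = keys[p] - k0
        new_lo = max(lo, (p - p0 - eps) / dx)
        new_hi = min(hi, (p - p0 + eps) / dx)
        if new_lo > new_hi:  # cone is empty: close the segment, start a new one
            segments.append((k0, _pick_slope(lo, hi), p0))
            k0, p0 = keys[p], p
            lo, hi = 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    segments.append((k0, _pick_slope(lo, hi), p0))
    return segments


def predict(segments, x):
    """Approximate position of x using the segment with the largest first_key <= x."""
    firsts = [s[0] for s in segments]
    first_key, slope, intercept = segments[max(0, bisect_right(firsts, x) - 1)]
    return slope * (x - first_key) + intercept


SLIDE_KEYS = [1, 3, 8, 11, 12, 19, 22, 23, 24, 28, 29, 33, 38, 47, 48, 53, 55, 56, 57]
for first_key, slope, intercept in build_pla(SLIDE_KEYS, eps=1):
    print(first_key, round(slope, 3), intercept)
```

Each segment costs three numbers regardless of how many keys it covers, which is why the number of segments determines the space of the index.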

  9. What is the space of learned indexes?
     • Space occupancy ∝ number of segments
     • The number of segments depends on:
       • the size of the input dataset
       • how the points (key, pos) map to the plane
       • the value $\varepsilon$, i.e. how precise the approximation is
     [Figure: three plots of positions against keys, comparing an error $\varepsilon_1$ with an error $\varepsilon_2 \ll \varepsilon_1$.]

  10. Model and assumptions
      • Consider the gaps $g_i = k_{i+1} - k_i$ between consecutive input keys
      • Model the gaps as positive iid random variables that follow a distribution with finite mean $\mu$ and variance $\sigma^2$
      [Figure: keys $k_1, \dots, k_5$ on the x-axis, positions 1 to 5 on the y-axis, and the gaps $g_1, \dots, g_4$ between consecutive keys.]
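
As a small illustration of this model, a key set can be generated by accumulating iid gaps, and the gaps can be recovered from the keys to estimate $\mu$ and $\sigma^2$. The helper names and the choice of exponential gaps are assumptions made for the example only.

```python
import random
from itertools import accumulate
from statistics import mean, pvariance


def keys_from_gaps(gaps, first_key=0.0):
    """Sorted keys obtained by summing positive gaps, starting at first_key."""
    return list(accumulate(gaps, initial=first_key))


def gaps_of(keys):
    """Gaps g_i = k_{i+1} - k_i between consecutive keys."""
    return [b - a for a, b in zip(keys, keys[1:])]


rng = random.Random(0)
gaps = [rng.expovariate(1.0) for _ in range(10_000)]  # iid gaps, mean 1, variance 1
keys = keys_from_gaps(gaps)
recovered = gaps_of(keys)
print(mean(recovered), pvariance(recovered))  # empirical estimates of mu and sigma^2
```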

  11. The main result
      Theorem. If $\varepsilon$ is sufficiently larger than $\sigma/\mu$, the expected number of keys covered by a segment with maximum error $\varepsilon$ is $K = \frac{\mu^2}{\sigma^2}\,\varepsilon^2$, and the number of segments on a dataset of size $n$ is $n/K$ with high probability.

  12. The main consequence
      The PGM-index achieves the same asymptotic query performance as a traditional $\varepsilon$-way tree-based index while improving its space from $\Theta(n/\varepsilon)$ to $O(n/\varepsilon^2)$.
      Learned indexes are provably better than traditional indexes (note that $\varepsilon$ is of the order of 100–1000).
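
To get a feel for the bound, the arithmetic can be worked through for one setting. The values of `n`, `eps`, `mu` and `sigma` below are hypothetical, and all hidden constants in the asymptotic bounds are taken to be 1, so the numbers are only indicative.

```python
def keys_per_segment(mu, sigma, eps):
    """Expected keys covered by a segment: K = (mu^2 / sigma^2) * eps^2."""
    return (mu / sigma) ** 2 * eps ** 2


n, eps = 10**9, 128  # hypothetical dataset size and error bound
K = keys_per_segment(mu=1.0, sigma=1.0, eps=eps)

print(K)        # 16384.0 keys per segment
print(n / K)    # number of segments, roughly 61 thousand
print(n / eps)  # 7812500.0 entries for the n / eps baseline
```

With $\mu = \sigma$, the ratio between the two counts is exactly $\varepsilon$, which is the quadratic-versus-linear improvement the slide refers to.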

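  13. Sketch of the proof
      1. Consider a segment on the stream of random gaps and the two parallel lines at distance $\varepsilon$
      2. How many steps before a new segment is needed?
      [Figure: a segment with the two parallel lines at distance $\varepsilon$, keys on the x-axis and positions on the y-axis; the first point outside the strip is marked "Start a new segment from here".]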

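  14. Sketch of the proof (2)
      3. A discrete-time random walk, iid increments with mean $\mu$
      4. Compute the expectation of $i^* = \min\{\, i \in \mathbb{N} \mid (k_i, i) \text{ is outside the red strip} \,\}$, i.e. the Mean Exit Time (MET) of the random walk
      5. Show that the slope $m = 1/\mu$ maximises $E[i^*]$, giving $E[i^*] = \frac{\mu^2}{\sigma^2}\,\varepsilon^2$
      [Figure: the random walk seen in two ways. Left: random walker location against time, with a strip of half-width $\varepsilon$. Right: positions against keys, with a strip of half-width $\varepsilon$ around the line of slope $m$; the exit point $(\mathit{key}_{i^*}, i^*)$ is marked "Start a new segment from here".]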
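
The mean exit time can be estimated with a short Monte Carlo simulation. This is a sketch under stated assumptions: the strip is taken to be the set of points within vertical distance $\varepsilon$ of the line of slope $1/\mu$ through the origin, the gaps are exponential with $\mu = \sigma = 1$, and the function names are hypothetical.

```python
import random


def exit_time(rng, draw_gap, mu, eps, max_steps=10**7):
    """First i such that the point (k_i, i) is more than eps away (vertically)
    from the line of slope 1/mu through the origin."""
    key = 0.0
    for i in range(1, max_steps + 1):
        key += draw_gap(rng)          # k_i = k_{i-1} + g_i
        if abs(i - key / mu) > eps:   # outside the strip
            return i
    raise RuntimeError("walk did not exit within max_steps")


def mean_exit_time(draw_gap, mu, eps, trials=2000, seed=0):
    """Average exit time over independent walks."""
    rng = random.Random(seed)
    return sum(exit_time(rng, draw_gap, mu, eps) for _ in range(trials)) / trials


eps = 20
estimate = mean_exit_time(lambda rng: rng.expovariate(1.0), mu=1.0, eps=eps)
print(estimate, eps ** 2)  # simulated MET next to (mu^2 / sigma^2) * eps^2
```

For small $\varepsilon$ the estimate sits somewhat above $\varepsilon^2$ because the walk overshoots the strip when it exits; the gap shrinks in relative terms as $\varepsilon$ grows, in line with the "sufficiently large $\varepsilon$" condition.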

  15. Simulations
      1. Generate $10^7$ random streams of gaps according to several probability distributions
      2. Compute and average:
         I. the length of a segment found by the algorithm that computes the smallest PLA, adopted in the PGM-index
         II. the exit time of the random walk

  16. Simulations of $(\mu^2/\sigma^2)\,\varepsilon^2$
      • OPT = average segment length in a PGM-index
      • MET = mean exit time of the random walk
      [Figure: two panels plotting mean segment length against $\varepsilon$ (0 to 250), each with curves OPT, MET and Thm 1. Panels: Pareto $k = 3$, $\alpha = 3$ (Thm 1: $3.0\,\varepsilon^2$) and Lognormal $\mu = 1$, $\sigma = 0.5$ (Thm 1: $3.521\,\varepsilon^2$).]
      • Both OPT and MET agree on the slope $1/\mu$, but OPT is more robust
      • More distributions in the paper

  17. Stress test of "$\varepsilon$ sufficiently larger than $\sigma/\mu$"
      [Figure: three panels plotting relative error against $\varepsilon$ (0 to 250), one per value of $\sigma/\mu$.]
      • $\sigma/\mu = 0.15$ (Thm 1: $44.444\,\varepsilon^2$): Pareto $k = 10$, $\alpha = 7.741$; Gamma $\theta = 5$, $k = 44.444$; Lognormal $\mu = 2$, $\sigma = 0.149$
      • $\sigma/\mu = 1.5$ (Thm 1: $0.444\,\varepsilon^2$): Pareto $k = 10$, $\alpha = 2.202$; Gamma $\theta = 5$, $k = 0.444$; Lognormal $\mu = 2$, $\sigma = 1.086$
      • $\sigma/\mu = 15$ (Thm 1: $0.004\,\varepsilon^2$): Pareto $k = 10$, $\alpha = 2.002$; Gamma $\theta = 5$, $k = 0.004$; Lognormal $\mu = 2$, $\sigma = 2.328$
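
The constants in front of $\varepsilon^2$ in the simulation slides can be recomputed from the distribution parameters. The sketch below uses the standard closed forms for the mean and variance of the Pareto, Gamma and Lognormal distributions; the function names are hypothetical, and here $\mu$ and $\sigma$ denote the mean and standard deviation of the gaps, not the Lognormal parameters.

```python
import math


def pareto_ratio(alpha):
    """mu^2 / sigma^2 for a Pareto distribution with shape alpha > 2 (any scale k)."""
    return alpha * (alpha - 2)


def gamma_ratio(k):
    """mu^2 / sigma^2 for a Gamma distribution with shape k (any scale theta)."""
    return k


def lognormal_ratio(sigma):
    """mu^2 / sigma^2 for a Lognormal whose underlying normal has std sigma."""
    return 1 / math.expm1(sigma ** 2)


print(pareto_ratio(3))        # Pareto k = 3, alpha = 3
print(lognormal_ratio(0.5))   # Lognormal with parameters 1 and 0.5
print(pareto_ratio(7.741), gamma_ratio(44.444), lognormal_ratio(0.149))
```

The scale parameters ($k$ for Pareto, $\theta$ for Gamma, the location parameter for Lognormal) cancel in the ratio, which is why only one parameter per distribution changes between the three stress-test panels.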

  18. Conclusions
      • No theoretical grounds for the efficiency of learned indexes were known
      • We have shown that, on data with iid gaps, the mean segment length is $\Theta(\varepsilon^2)$
      • The PGM-index takes $O(n/\varepsilon^2)$ space w.h.p., a quadratic improvement in $\varepsilon$ over traditional indexes ($\varepsilon$ is usually of the order of 100–1000)
      • Open problems:
        1. Do the results still hold without the iid assumption on the gaps?
        2. Is the segment found by the optimal algorithm adopted in the PGM-index a constant factor longer than the one found by the random walker?
