 
Why are learned indexes so effective?
Paolo Ferragina (1), Fabrizio Lillo (2), Giorgio Vinciguerra (1)
(1) University of Pisa, (2) University of Bologna
A classical problem in computer science
• Given a set of n sorted input keys (e.g. integers)
• Implement membership and predecessor queries
• Range queries in databases, conjunctive queries in search engines, IP lookup in routers…
Example keys (positions 1 to n): 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
member(36) = True
predecessor(50) = 48
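The two queries on this slide can be sketched with plain binary search over the sorted keys. This is a minimal illustration, not the index discussed in the talk; the function names and the convention that predecessor(x) is the largest key not greater than x are assumptions.

```python
from bisect import bisect_right

# Example keys taken from the slide.
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def member(keys, x):
    """Return True if x occurs in the sorted list keys."""
    i = bisect_right(keys, x)          # number of keys <= x
    return i > 0 and keys[i - 1] == x

def predecessor(keys, x):
    """Return the largest key <= x (assumed convention), or None if there is none."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None

print(member(KEYS, 36))       # True
print(predecessor(KEYS, 50))  # 48
```

Both queries cost O(log n) comparisons here; an index is useful when the keys are too many for this search to be cache- or I/O-friendly.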
Indexes
[Figure: a B-tree takes a key and returns its position in the sorted keys 2 11 13 … 95 (positions 1 to n)]
Input data as pairs (key, position)
[Figure: the sorted keys 2 11 13 … 95 (positions 1 to n), with keys on the horizontal axis and positions on the vertical axis]
Input data as pairs (key, position), continued
[Figure: the first pairs (2, 1), (11, 2), (13, 3), (15, 4) plotted as points in the key-position plane]
Learned indexes
• A black-box model is trained on a dataset of pairs (key, pos): D = {(2, 1), (11, 2), …, (95, n)}
• Given a key, the model returns an (approximate) position
• Binary search in [position − ε, position + ε], where ε is e.g. of the order of 100–1000
[Figure: the model's prediction curve over the points in the key-position plane]
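The lookup scheme on this slide can be sketched as follows. The "black box" here is assumed to be a single least-squares line, chosen only for brevity; ε is measured as the model's largest error on the data, and positions are 0-based rather than 1-based as on the slide. All function names are hypothetical.

```python
from bisect import bisect_left

# Example keys taken from the slides.
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def fit_line(keys):
    """Least-squares fit of 0-based position as a linear function of the key."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k

def predict(model, key):
    """Approximate position of key according to the model."""
    slope, intercept = model
    return round(slope * key + intercept)

def max_error(keys, model):
    """Largest distance between a predicted and a true position (this is eps)."""
    return max(abs(predict(model, k) - p) for p, k in enumerate(keys))

def lookup(keys, model, eps, x):
    """Binary search restricted to [pos - eps, pos + eps]; None if x is absent."""
    n = len(keys)
    pos = predict(model, x)
    lo = min(max(pos - eps, 0), n)          # clamp the window to the array
    hi = max(lo, min(pos + eps + 1, n))
    i = bisect_left(keys, x, lo, hi)
    return i if i < n and keys[i] == x else None

MODEL = fit_line(KEYS)
EPS = max_error(KEYS, MODEL)
print(lookup(KEYS, MODEL, EPS, 36))  # 10 (0-based position of key 36)
print(lookup(KEYS, MODEL, EPS, 50))  # None (50 is not a key)
```

The final search touches at most 2ε + 1 positions, which is why ε is picked so that the window fits a small, fixed amount of memory.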
The knowledge gap in learned indexes
Practice vs theory:
• Query time. Practice: same query time as traditional tree-based indexes. Theory: same asymptotic query time as traditional tree-based indexes.
• Space. Practice: improvements of orders of magnitude, from GBs to a few MBs. Theory: same asymptotic space occupancy as traditional tree-based indexes.
PGM-index: An optimal learned index [Ferragina and Vinciguerra, PVLDB 2020]
1. Fix a max error ε, e.g. so that the keys in [pos − ε, pos + ε] fit a cache line
2. Find the smallest Piecewise Linear ε-Approximation (PLA)
3. Store triples (first key, slope, intercept) for each segment
[Figure: segments approximating the points of the keys 1 3 8 11 12 19 22 23 24 28 29 33 38 47 48 53 55 56 57, with the search range [pos − ε, pos + ε] highlighted]
https://pgm.di.unipi.it
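The three steps above can be illustrated with a simple greedy construction. Note that this is not the optimal PLA algorithm adopted in the PGM-index: it is a "shrinking cone" heuristic that anchors every segment at its first point, so it may produce more segments than the minimum. It assumes distinct keys and 0-based positions; the key values are read off the slide's figure, and all names are hypothetical.

```python
from bisect import bisect_right

def build_segments(keys, eps):
    """Greedy eps-approximation: returns triples (first_key, slope, intercept)."""
    segments = []
    inf = float("inf")

    def close(start, lo, hi):
        # Any slope in [lo, hi] keeps every covered point within eps.
        slope = lo if hi == inf else (lo + hi) / 2
        segments.append((keys[start], slope, start))

    start, lo, hi = 0, 0.0, inf
    for i in range(1, len(keys)):
        dx, dy = keys[i] - keys[start], i - start
        new_lo = max(lo, (dy - eps) / dx)   # slopes that do not undershoot by > eps
        new_hi = min(hi, (dy + eps) / dx)   # slopes that do not overshoot by > eps
        if new_lo > new_hi:                 # cone is empty: start a new segment
            close(start, lo, hi)
            start, lo, hi = i, 0.0, inf
        else:
            lo, hi = new_lo, new_hi
    if keys:
        close(start, lo, hi)
    return segments

def predict(segments, key):
    """Approximate position from the rightmost segment whose first key is <= key."""
    j = max(bisect_right([s[0] for s in segments], key) - 1, 0)
    first_key, slope, intercept = segments[j]
    return slope * (key - first_key) + intercept

KEYS = [1, 3, 8, 11, 12, 19, 22, 23, 24, 28, 29, 33, 38, 47, 48, 53, 55, 56, 57]
SEGMENTS = build_segments(KEYS, eps=1)
worst = max(abs(predict(SEGMENTS, k) - p) for p, k in enumerate(KEYS))
print(len(SEGMENTS), worst)
```

The space of the resulting index is proportional to the number of triples, which is the quantity the rest of the talk analyses.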
What is the space of learned indexes?
• Space occupancy ∝ number of segments
• The number of segments depends on
  • The size of the input dataset
  • How the points (key, pos) map to the plane
  • The value ε, i.e. how precise the approximation is
[Figure: three key-position plots, comparing an error ε₁ with a much smaller error ε₂ ≪ ε₁]
Model and assumptions
• Consider the gaps g_i = k_{i+1} − k_i between consecutive input keys
• Model the gaps as positive iid random variables that follow a distribution with finite mean μ and variance σ²
[Figure: keys k_1, …, k_5 at positions 1 to 5, with the gaps g_1, …, g_4 between consecutive keys]
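As a small illustration of this model, keys can be generated as running sums of random gaps. The lognormal gap distribution, the seed, and the helper names below are assumptions made for the example only.

```python
import random
from itertools import accumulate
from statistics import fmean, pvariance

def keys_from_gaps(gaps, first_key=0):
    """k_1 = first_key and k_{i+1} = k_i + g_i."""
    return list(accumulate(gaps, initial=first_key))

def gaps_from_keys(keys):
    """g_i = k_{i+1} - k_i for consecutive keys."""
    return [b - a for a, b in zip(keys, keys[1:])]

# The first keys of the running example are recovered from their gaps.
print(keys_from_gaps([9, 2, 2], first_key=2))  # [2, 11, 13, 15]

# Synthetic dataset with positive iid gaps (lognormal is an arbitrary choice).
rng = random.Random(42)
gaps = [rng.lognormvariate(1.0, 0.5) for _ in range(100_000)]
keys = keys_from_gaps(gaps)
print(fmean(gaps), pvariance(gaps))  # empirical estimates of mu and sigma^2
```

Since the gaps are positive, the generated keys are strictly increasing, as the sorted-input setting requires.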
The main result
Theorem. If ε is sufficiently larger than σ/μ, the expected number of keys covered by a segment with maximum error ε is
K = (μ²/σ²) ε²
and the number of segments on a dataset of size n is n/K with high probability.
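To get a feel for the magnitudes in the theorem, the two formulas can be evaluated directly. The values of μ, σ, ε and n below are made up for illustration and the function names are hypothetical.

```python
def expected_segment_length(mu, sigma, eps):
    """K = (mu^2 / sigma^2) * eps^2, the expected number of keys per segment."""
    return (mu / sigma) ** 2 * eps ** 2

def expected_segments(n, mu, sigma, eps):
    """n / K, the number of segments predicted for a dataset of n keys."""
    return n / expected_segment_length(mu, sigma, eps)

# Illustrative values: gaps with mean 2 and standard deviation 1, eps = 128.
print(expected_segment_length(mu=2, sigma=1, eps=128))        # 65536.0
print(expected_segments(n=2**30, mu=2, sigma=1, eps=128))     # 16384.0
```

Doubling ε quadruples K, so the predicted number of segments drops by a factor of four.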
The main consequence
The PGM-index achieves the same asymptotic query performance as a traditional ε-way tree-based index while improving its space from Θ(n/ε) to O(n/ε²).
Learned indexes are provably better than traditional indexes (note that ε is of the order of 100–1000).
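Sketch of the proof
1. Consider a segment on the stream of random gaps and the two parallel lines at distance ε from it
2. How many steps before a new segment is needed?
[Figure: points in the key-position plane inside a strip of half-width ε around the segment; a new segment starts at the first point that falls outside the strip]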
Sketch of the proof (2)
3. The process is a discrete-time random walk with iid increments of mean μ
4. Compute the expectation of i* = min{ i ∈ ℕ : (k_i, i) is outside the red strip }, i.e. the Mean Exit Time (MET) of the random walk
5. Show that the slope m = 1/μ maximises E[i*], giving E[i*] = (μ²/σ²) ε²
[Figure: the same strip of half-width ε drawn twice, once in the key-position plane and once as a random walker's location over time; a new segment starts at the exit point (k_{i*}, i*)]
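The exit-time argument can be checked numerically with a small Monte Carlo sketch. It assumes gamma-distributed gaps (for which μ²/σ² equals the shape parameter), a strip of half-width ε around the line of slope 1/μ through the origin, and far fewer trials than the simulations reported in the talk; the names, seed and parameters are illustrative.

```python
import random

def exit_time(rng, eps, shape, scale):
    """Steps until the point (k_i, i) leaves the strip |i - k_i / mu| <= eps."""
    mu = shape * scale                 # mean of a gamma(shape, scale) gap
    key, i = 0.0, 0
    while True:
        key += rng.gammavariate(shape, scale)   # next iid gap
        i += 1
        if abs(i - key / mu) > eps:
            return i

def mean_exit_time(eps, shape, scale, trials, seed=0):
    """Average exit time over independent random walks."""
    rng = random.Random(seed)
    return sum(exit_time(rng, eps, shape, scale) for _ in range(trials)) / trials

EPS, SHAPE, SCALE = 20, 4.0, 1.0
predicted = SHAPE * EPS ** 2           # (mu^2 / sigma^2) * eps^2 with mu^2/sigma^2 = shape
simulated = mean_exit_time(EPS, SHAPE, SCALE, trials=300)
print(predicted, simulated)
```

With these settings the simulated mean should land near the predicted value, with some excess expected because the walk overshoots the strip boundary by a fraction of a step.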
Simulations
1. Generate 10^7 random streams of gaps according to several probability distributions
2. Compute and average
   I. The length of a segment found by the algorithm that computes the smallest PLA, adopted in the PGM-index
   II. The exit time of the random walk
Simulations of (μ²/σ²) ε²
OPT = average segment length in a PGM-index
MET = mean exit time of the random walk
[Figure: two panels plotting the mean segment length (·10^6) against ε from 0 to 250, with curves OPT, MET and Thm 1. Panel 1: Pareto k = 3, α = 3, where Thm 1 gives 3.0 ε². Panel 2: Lognormal µ = 1, σ = 0.5, where Thm 1 gives 3.521 ε².]
Both OPT and MET agree on the slope 1/µ, but OPT is more robust
More distributions in the paper
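The constants 3.0 and 3.521 in the plot legends follow from the standard mean and variance formulas of the two gap distributions. The sketch below recomputes μ²/σ² from those formulas; the function names are hypothetical.

```python
from math import exp

def pareto_ratio(k, alpha):
    """mu^2 / sigma^2 for a Pareto distribution with scale k and shape alpha > 2."""
    mean = alpha * k / (alpha - 1)
    var = k ** 2 * alpha / ((alpha - 1) ** 2 * (alpha - 2))
    return mean ** 2 / var

def lognormal_ratio(mu, sigma):
    """mu^2 / sigma^2 for a lognormal distribution; it depends only on sigma."""
    mean = exp(mu + sigma ** 2 / 2)
    var = (exp(sigma ** 2) - 1) * exp(2 * mu + sigma ** 2)
    return mean ** 2 / var

print(round(pareto_ratio(3.0, 3.0), 3))     # 3.0
print(round(lognormal_ratio(1.0, 0.5), 3))  # 3.521
```

For the Pareto case the ratio simplifies to α(α − 2), and for the lognormal case to 1/(exp(σ²) − 1), so neither depends on the scale of the keys.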
Stress test of "ε sufficiently larger than σ/μ"
[Figure: three panels plotting the relative error against ε from 0 to 250.
Panel σ/µ = 0.15: Pareto k = 10, α = 7.741; Gamma θ = 5, k = 44.444; Lognormal µ = 2, σ = 0.149; reference 44.444 ε².
Panel σ/µ = 1.5: Pareto k = 10, α = 2.202; Gamma θ = 5, k = 0.444; Lognormal µ = 2, σ = 1.086; reference 0.444 ε².
Panel σ/µ = 15: Pareto k = 10, α = 2.002; Gamma θ = 5, k = 0.004; Lognormal µ = 2, σ = 2.328; reference 0.004 ε².]
Conclusions
• No theoretical grounds for the efficiency of learned indexes were known
• We have shown that on data with iid gaps, the mean segment length is Θ(ε²)
• The PGM-index takes O(n/ε²) space w.h.p., a quadratic improvement in ε over traditional indexes (ε is usually of the order of 100–1000)
• Open problems:
  1. Do the results still hold without the iid assumption on the gaps?
  2. Is the segment found by the optimal algorithm adopted in the PGM-index a constant factor longer than the one found by the random walker?