Depth Estimation for Ranking Query Optimization


  1. Depth Estimation for Ranking Query Optimization
     Karl Schnaitter, UC Santa Cruz
     Joshua Spiegel, BEA Systems, Inc.
     Neoklis Polyzotis, UC Santa Cruz

  2. Relational Ranking Queries
     SELECT h.hid, r.rid, e.eid
     FROM Hotels h, Restaurants r, Events e
     WHERE h.city = r.city AND r.city = e.city
     RANK BY 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e)
     LIMIT 10
     • A base score for each table, in [0,1]
       – Hotels: s_H(h) = 1/h.price
       – Restaurants: s_R(r) = r.rating
       – Events: s_E(e) = isMusic(e)
     • Combined with a scoring function
       – S(s_H, s_R, s_E) = 0.3*s_H + 0.5*s_R + 0.2*s_E
     • Return the top k results based on S
       – In this case, k = 10
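
     To make the scoring concrete, the following minimal Python sketch evaluates the base
     scores and the combined score for one joined (hotel, restaurant, event) combination;
     the helper names are illustrative, not from the talk.

        # Illustrative sketch of the example query's scoring (assumed helper names).
        def base_score_hotel(price):        # s_H(h) = 1/h.price, assumed to lie in [0, 1]
            return 1.0 / price

        def base_score_restaurant(rating):  # s_R(r) = r.rating
            return rating

        def base_score_event(is_music):     # s_E(e) = isMusic(e)
            return 1.0 if is_music else 0.0

        def combined_score(s_h, s_r, s_e):  # S(s_H, s_R, s_E) = 0.3*s_H + 0.5*s_R + 0.2*s_E
            return 0.3 * s_h + 0.5 * s_r + 0.2 * s_e

        # A $5 hotel, a 0.9-rated restaurant, and a music event in the same city:
        print(combined_score(base_score_hotel(5.0),
                             base_score_restaurant(0.9),
                             base_score_event(True)))   # 0.3*0.2 + 0.5*0.9 + 0.2*1.0 = 0.71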

  3. Ranking Query Execution
     SELECT h.hid, r.rid, e.eid
     FROM Hotels h, Restaurants r, Events e
     WHERE h.city = r.city AND r.city = e.city
     RANK BY 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e)
     LIMIT 10
     [Figure: a conventional plan, where regular joins feed a final sort on score followed
      by the limit, versus a rank-aware plan, where rank joins produce results already
      ordered by score.]

  4. Depth Estimation
     • Depth: the number of accessed tuples
       – Indicates execution cost
       – Linked to memory consumption
     [Figure: a rank join annotated with its left depth and right depth, the number of
      tuples consumed from each input.]
     • Problem statement: estimate the depths of each operator in a rank-aware plan

  5. Depth Estimation Methods
     • Ilyas et al. (SIGMOD 2004)
       – Uses a probabilistic model of the data
       – Assumes relations of equal size and a scoring function that sums scores
       – Limited applicability
     • Li et al. (SIGMOD 2005)
       – Samples a subset of rows from each table
       – Independent samples give a poor model of join results

  6. Our Solution: DEEP
     • DEpth Estimation for Physical plans
     • Strengths of DEEP
       – A principled methodology
         • Uses a statistical model of the data distribution
         • Formally computes depth over the statistics
       – Efficient estimation algorithms
       – Widely applicable
         • Works with state-of-the-art physical plans
         • Realizable with common data synopses

  7. Outline
     • Preliminaries
     • DEEP Framework
     • Experimental Results

  8. Outline
     • Preliminaries
     • DEEP Framework
     • Experimental Results

  9. Monotonic Functions
     • A function S(x_1, ..., x_n) is monotonic if
       x_i ≤ y_i for all i  implies  S(x_1, ..., x_n) ≤ S(y_1, ..., y_n)
     [Figure: a monotonically increasing function f(x).]

  10. Monotonic Functions
      • A function S(x_1, ..., x_n) is monotonic if
        x_i ≤ y_i for all i  implies  S(x_1, ..., x_n) ≤ S(y_1, ..., y_n)
      • Most scoring functions are monotonic
        – E.g. sum, product, avg, max, min
      • Monotonicity enables a bound on the score
        – In the example query, the score was 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e)
        – Given a restaurant r, an upper bound is 0.3*1 + 0.5*r.rating + 0.2*1
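
      As a small illustration (a sketch using the example query's weights, not code from
      the paper), the upper bound for a partially seen result substitutes the maximum base
      score 1.0 for the inputs that are still unknown:

        # Sketch: monotonicity lets us bound the score of any join result that
        # involves a given restaurant, by plugging in the maximum base score (1.0)
        # for the hotel and event components that have not been seen yet.
        def score(s_h, s_r, s_e):
            return 0.3 * s_h + 0.5 * s_r + 0.2 * s_e

        def upper_bound_for_restaurant(r_rating):
            return score(1.0, r_rating, 1.0)

        print(upper_bound_for_restaurant(0.8))   # 0.3 + 0.4 + 0.2 = 0.9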

  11. Hash Rank Join [IAE04]
      • The rank join algorithm
        – Joins inputs sorted by score
        – Returns the results with the highest scores
      • Main ideas
        – Alternate between inputs based on a pull strategy
        – Score bounds allow early termination
      [Example: top result of L ⋈ R with scoring function S(b_L, b_R) = b_L + b_R, where
       L = {(x, 1.0), (y, 0.8), (w, 0.7)} and R = {(y, 1.0), (z, 0.9)}, each sorted by
       score; the input bounds reach 1.8 and 1.7, and the result is y with score 1.8.]

  12. HRJN* [IAE04]
      • The HRJN* pull strategy:
        a) Pull from the input with the highest bound
        b) If (a) is a tie, pull from the input with the smaller number of pulls so far
        c) If (b) is a tie, pull from the left input
      [Example continued: both input bounds start at 2.0 and drop as tuples are pulled;
       the top result is y with score 1.8.]
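
      The following Python sketch is a simplified rendering of HRJN* for two inputs (not
      the authors' implementation): inputs are lists of (join_key, base_score) sorted by
      descending score, the scoring function is S(b_L, b_R) = b_L + b_R, and the a/b/c
      rules above decide which input to pull from next.

        import heapq
        from collections import defaultdict

        def hrjn_star(L, R, k):
            # L and R: lists of (join_key, base_score) sorted by descending base_score.
            inputs = [L, R]
            seen = [defaultdict(list), defaultdict(list)]  # join_key -> base scores seen so far
            pos = [0, 0]                                   # number of pulls from each input
            last = [L[0][1], R[0][1]]                      # base score of the most recent pull
            top = [L[0][1], R[0][1]]                       # highest base score of each input
            results = []                                   # min-heap holding the current top-k

            def bound(side):
                # Best possible score of a result that still needs a new tuple from `side`.
                if pos[side] >= len(inputs[side]):
                    return float('-inf')                   # nothing left to pull here
                return last[side] + top[1 - side]

            while True:
                b0, b1 = bound(0), bound(1)
                threshold = max(b0, b1)
                if threshold == float('-inf'):
                    break                                  # both inputs exhausted
                if len(results) >= k and results[0][0] >= threshold:
                    break                                  # the top-k can no longer change
                # (a) pull from the input with the higher bound,
                # (b) on a tie, pull from the input with fewer pulls so far,
                # (c) on a further tie, pull from the left input
                if b0 > b1:
                    side = 0
                elif b1 > b0:
                    side = 1
                else:
                    side = 0 if pos[0] <= pos[1] else 1
                key, score = inputs[side][pos[side]]
                pos[side] += 1
                last[side] = score
                # join the new tuple with everything already seen on the other input
                for other_score in seen[1 - side][key]:
                    heapq.heappush(results, (score + other_score, key))
                    if len(results) > k:
                        heapq.heappop(results)
                seen[side][key].append(score)
            return sorted(results, reverse=True)

        L = [('x', 1.0), ('y', 0.8), ('w', 0.7)]
        R = [('y', 1.0), ('z', 0.9)]
        print(hrjn_star(L, R, 1))   # [(1.8, 'y')], matching the slide's example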

  13. Outline
      • Preliminaries
      • DEEP Framework
      • Experimental Results

  14. Supported Operators
      • Evidence in favor of HRJN*
        – The pull strategy has strong properties
          • Within a constant factor of the optimal cost
          • Optimal for a significant class of inputs
          • More details in the paper
        – Efficient in experiments [IAE04]
      • DEEP explicitly supports HRJN*
        – Easily extended to other join operators
        – Selection operators too

  15. DEEP: Conceptual View
      [Diagram: the depth computation is defined in terms of a statistical model of the
       data; the estimation algorithms are defined in terms of a statistics interface;
       the interface is realized over a data synopsis.]

  16. Statistics Model
      • Statistics yield the distribution of scores for base tables and joins

      F_L (base table L):
      b_L    F_L(b_L)
      1.0    5
      0.9    2
      0.8    3
      0.6    12
      0.4    8

      F_R (base table R):
      b_R    F_R(b_R)
      1.0    3
      0.7    1
      0.5    2

      F_{L⋈R} (join of L and R):
      b_L    b_R    F_{L⋈R}(b_L, b_R)
      1.0    1.0    6
      1.0    0.5    4
      1.0    0.7    3
      0.9    0.7    2
      0.6    0.7    2

  17. Statistics Interface
      • DEEP accesses statistics with two methods
        – getFreq(s): return the frequency of score vector s
        – nextScore(s, i): return the next lowest score on dimension i
      • Example, for s = (1.0, 0.7) in the F_{L⋈R} table above:
        – getFreq(s) = 3
        – nextScore(s, 1) = 0.9
        – nextScore(s, 2) = 0.5
      • The interface allows for efficient algorithms
        – Abstracts the physical statistics format
        – Allows statistics to be generated on-the-fly
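
      A minimal Python sketch of this interface over the F_{L⋈R} distribution above; the
      class name JointScoreStats and the dictionary-backed implementation are assumptions,
      and in DEEP the two calls would be answered from a synopsis rather than an explicit
      table.

        # Sketch of the two-method statistics interface, backed here by an explicit
        # dictionary from score vectors to frequencies.
        class JointScoreStats:
            def __init__(self, freq):
                self.freq = dict(freq)
                # distinct score values per dimension, sorted from highest to lowest
                self.values = [sorted({s[d] for s in freq}, reverse=True) for d in range(2)]

            def getFreq(self, s):
                # frequency of the score vector s (0 if it does not occur)
                return self.freq.get(s, 0)

            def nextScore(self, s, i):
                # next lowest score value on dimension i (1-based), or None at the bottom
                vals = self.values[i - 1]
                pos = vals.index(s[i - 1])
                return vals[pos + 1] if pos + 1 < len(vals) else None

        stats = JointScoreStats({
            (1.0, 1.0): 6, (1.0, 0.5): 4, (1.0, 0.7): 3,
            (0.9, 0.7): 2, (0.6, 0.7): 2,
        })
        s = (1.0, 0.7)
        print(stats.getFreq(s))         # 3
        print(stats.nextScore(s, 1))    # 0.9  (next lowest b_L below 1.0)
        print(stats.nextScore(s, 2))    # 0.5  (next lowest b_R below 0.7)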

  18. Statistics Implementation
      • The interface can be implemented over common types of data synopses
      • Can use a histogram if
        a) the base score function is invertible, or
        b) the base score measures distance
      • Assume uniformity and independence if
        a) the base score function is too complex, or
        b) sufficient statistics are not available
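
      As an illustration of case (a), a histogram on Hotels.price can be turned into a
      score distribution because the base score 1/price is invertible; the bucket
      boundaries and counts below are made up for the sketch.

        # Sketch: turn a value histogram on Hotels.price into a score histogram,
        # using the fact that s_H = 1/price is invertible.
        price_hist = [(1.0, 2.0, 50), (2.0, 5.0, 120), (5.0, 10.0, 30)]   # (low, high, count)

        def score_hist_from_price_hist(buckets):
            out = []
            for low, high, count in buckets:
                # 1/price is decreasing, so prices in [low, high] map to scores in [1/high, 1/low]
                out.append((1.0 / high, 1.0 / low, count))
            return sorted(out, reverse=True)   # highest score ranges first

        for s_low, s_high, count in score_hist_from_price_hist(price_hist):
            print(f"score in [{s_low:.2f}, {s_high:.2f}]: {count} hotels")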

  19. Depth Estimation Overview
      [Diagram: a top-k query plan whose root is rank join operator 1, over rank join
       operator 2 and table C; operator 2 joins tables A and B.]
      Estimates made:
      1. The score of the k-th best tuple out of operator 1        -> s_1
      2. The depths of operator 1 needed to output score s_1       -> l_1 and r_1
      3. The score of the l_1-th best tuple out of operator 2      -> s_2
      4. The depths of operator 2 needed to output score s_2       -> l_2 and r_2

  20. Estimating Terminal Score
      • Suppose we want the 10th best score
      • Idea
        – Sort by total score
        – Sum frequencies

      F_{L⋈R}:
      b_L    b_R    F_{L⋈R}(b_L, b_R)
      1.0    1.0    6
      1.0    0.5    4
      1.0    0.7    3
      0.9    0.7    2
      0.6    0.7    2

      Sorted by total score:
      b_L + b_R    F_{L⋈R}(b_L, b_R)    running sum
      2.0          6                    6
      1.7          3                    9
      1.6          2                    11   (reaches 10)
      1.5          4
      1.3          2

      S_term = 1.6
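
      A direct (unoptimized) Python sketch of this idea over the example's joint
      distribution; the function name terminal_score is illustrative.

        # Sketch: the terminal score S_term is the total score of the k-th best join
        # result; sort the score vectors by total score and accumulate frequencies.
        def terminal_score(freq, scoring, k):
            rows = sorted(freq.items(), key=lambda kv: scoring(kv[0]), reverse=True)
            running = 0
            for score_vec, count in rows:
                running += count
                if running >= k:
                    return scoring(score_vec)
            return None   # fewer than k join results exist

        freq = {
            (1.0, 1.0): 6, (1.0, 0.7): 3, (1.0, 0.5): 4,
            (0.9, 0.7): 2, (0.6, 0.7): 2,
        }
        print(terminal_score(freq, lambda s: s[0] + s[1], 10))   # 1.6, as on the slide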

  21. Estimation Algorithm
      • Idea: only process the necessary statistics
      [Figure: the grid of (b_L, b_R) score combinations, with the algorithm touching only
       the entries needed to reach S_term = 1.6.]
      • The algorithm relies solely on getFreq and nextScore
        – Avoids materializing the complete table
      • Worst-case complexity equivalent to sorting the table
        – More efficient in practice
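
      One way to realize this idea is a frontier traversal of the score vectors, shown in
      the Python sketch below; it is a simplified illustration rather than the paper's
      exact algorithm, and it reuses the JointScoreStats object stats from the
      statistics-interface sketch so that only getFreq and nextScore are used.

        import heapq

        def estimate_terminal_score(stats, scoring, k, top_vector):
            # Enumerate score vectors in descending total-score order, touching only
            # the statistics needed to accumulate k tuples (valid for monotonic scoring).
            heap = [(-scoring(top_vector), top_vector)]    # max-heap via negated scores
            visited = {top_vector}
            running = 0
            while heap:
                neg_score, s = heapq.heappop(heap)
                running += stats.getFreq(s)
                if running >= k:
                    return -neg_score                      # score of the k-th best tuple
                for d in (1, 2):
                    nxt = stats.nextScore(s, d)            # step down one value on dimension d
                    if nxt is None:
                        continue
                    child = (nxt, s[1]) if d == 1 else (s[0], nxt)
                    if child not in visited:
                        visited.add(child)
                        heapq.heappush(heap, (-scoring(child), child))
            return None                                    # fewer than k join results exist

        print(estimate_terminal_score(stats, lambda s: s[0] + s[1], 10, (1.0, 1.0)))  # 1.6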

  22. Depth Estimation Overview
      (Recap of slide 19: estimate the terminal score s_1 and depths l_1, r_1 of the top
       operator, then the score s_2 and depths l_2, r_2 of the operator below it.)

  23. Estimating Depth for HRJN*
      • Property: the depth of HRJN* on an input is bracketed by counts of the input's
        tuples whose score bound is above (respectively, at or above) the terminal score
      [Example: with S_term = 1.6, comparing each input tuple's score bound against S_term
       yields the bracket 11 ≤ depth ≤ 15.]
      • Estimation algorithm
        – Accesses statistics via getFreq and nextScore
        – Similar to the estimation of S_term
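
      The sketch below only computes the two counts such a bracket is built from, for an
      assumed input distribution (the F_L table from the statistics slide) and an assumed
      bound of base score + 1.0 (the maximum score of the other input); it is not the
      talk's exact bracketing formula.

        # Sketch: count how many input tuples have a score bound strictly above vs.
        # at or above the terminal score.  Distribution and bound formula are
        # illustrative; the talk's property turns counts like these into a depth bracket.
        def depth_bracket_counts(score_freqs, other_max, s_term):
            strictly_above = sum(f for b, f in score_freqs if b + other_max > s_term)
            at_or_above = sum(f for b, f in score_freqs if b + other_max >= s_term)
            return strictly_above, at_or_above

        L_freqs = [(1.0, 5), (0.9, 2), (0.8, 3), (0.6, 12), (0.4, 8)]
        print(depth_bracket_counts(L_freqs, 1.0, 1.6))   # (10, 22) for this distribution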

  24. Outline
      • Preliminaries
      • DEEP Framework
      • Experimental Results

  25. Experimental Setting
      • TPC-H data set
        – Total size of 1 GB
        – Varying amount of skew
      • Workloads of 250 queries
        – Top-10, top-100, and top-1000 queries
        – One or two joins per query
      • Error metric: relative error of the estimated depth

  26. Depth Estimation Techniques
      • DEEP
        – Uses a 150 KB TuG synopsis [SP06]
      • Probabilistic [IAE04]
        – Uses the same TuG synopsis
        – Modified to handle single-join queries with varying table sizes
      • Sampling [LCIS05]
        – 5% sample = 4.6 MB
