"Similarity Query Processing using Disk Arrays"

Apostolos N. Papadopoulos & Yannis Manolopoulos
Department of Informatics, Aristotle University
Thessaloniki, Greece
ACM SIGMOD Conference, Seattle, June 1998

Outline
• Introduction (Disk Arrays & Similarity Queries)
• Assumptions & Problem Definition
• Similarity Search Algorithms
• Performance Evaluation
• Concluding Remarks & Future Work

Introduction - Disk Arrays
• Powerful storage media of increasing importance.
• Can handle a large number of requests.
• Exploit I/O parallelism (both interquery & intraquery).
• Fault tolerant (e.g. RAID levels 1, 3, 5).

Introduction - Disk Arrays (cont.)
[Figure: disk array architecture showing the CPU, memory and a DMA controller, with several disk controllers attached to a SCSI bus.]

Introduction - Similarity Queries
We focus on the vector model, where an object is represented by a set of attributes, composing a vector in a multi-dimensional space. There are two fundamental types of similarity queries that can be applied in the vector model:
• Range Query, where the user specifies a shape and asks for all the objects falling inside the corresponding region,
• Nearest-Neighbor Query, where the user gives an object and requests the $k$ nearest objects.
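The two query types can be illustrated with a brute-force sketch over a small point set. This is only meant to pin down the semantics; it assumes Euclidean distance and a spherical range, and the function names `range_query` and `knn_query` are hypothetical, not taken from the slides.

```python
import math

def range_query(points, center, radius):
    """Return all points within `radius` of `center` (a spherical range)."""
    return [p for p in points if math.dist(p, center) <= radius]

def knn_query(points, query, k):
    """Return the k points closest to `query`, nearest first."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
print(range_query(points, (0.0, 0.0), 1.5))
print(knn_query(points, (2.1, 2.1), 2))
```

Both functions scan every point; the algorithms in the rest of the deck aim to answer the same questions while touching as few index pages as possible.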

Assumptions
• The set of n-d points is organized in an R*-tree.
• The R*-tree is partitioned in the disk array by means of the Proximity Index heuristic.
• The partitioning is performed node-wise, i.e. after a split, the new page is assigned to a disk.

Problem Definition
Given: a set of objects (n-d points), a query object $P_q$, and an integer number $k$,
determine: an efficient plan to access the parallel R*-tree, in order to report the $k$ nearest neighbors of $P_q$,
trying to: (i) maximize parallelism, (ii) access as few nodes as possible, and (iii) reduce query response time.

Similarity Search Algorithms - Distances
[Figure: query point $P_q$ and two MBRs $R_1$, $R_2$, with the distances $D_{min}$, $D_{mm}$ and $D_{max}$ marked.]
• $D_{min}(P_q, R_x)$: the min distance between a point and an MBR.
• $D_{mm}(P_q, R_x)$: ensures the existence of at least one point.
• $D_{max}(P_q, R_x)$: the max distance between a point and an MBR.
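The three distances can be sketched for axis-aligned MBRs as follows. The slides do not give formulas, so this assumes Euclidean distance, and it assumes that $D_{mm}$ is the MINMAXDIST metric of Roussopoulos et al.; the MBR representation (`low` and `high` corner tuples) is also an assumption.

```python
import math

def d_min(p, low, high):
    """Smallest possible distance from point p to the box [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def d_max(p, low, high):
    """Distance from p to the farthest corner of the box."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, low, high)))

def d_mm(p, low, high):
    """MINMAXDIST (assumed meaning of D_mm): for a minimal bounding box,
    at least one enclosed object lies within this distance of p."""
    # Squared per-dimension distance to the nearer and the farther face.
    near = [min(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, low, high)]
    far = [max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, low, high)]
    # Nearer face in one dimension k, farther face in all the others.
    return math.sqrt(min(near[k] + sum(f for i, f in enumerate(far) if i != k)
                         for k in range(len(p))))

p, low, high = (0.0, 0.0), (1.0, 1.0), (3.0, 2.0)
print(d_min(p, low, high), d_mm(p, low, high), d_max(p, low, high))
```

For any point and MBR the three values are ordered as $D_{min} \le D_{mm} \le D_{max}$, which is what makes them usable as optimistic and pessimistic bounds.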

Similarity Search Algorithms - BBSS
• Proposed by Roussopoulos et al. for answering NN queries in R-trees.
• It is a branch-and-bound algorithm: in each step, a new node is accessed according to the distance between the query point and the node MBR.
• The distance of the query point to an MBR can be either $D_{min}$ (optimistic) or $D_{mm}$ (pessimistic). Experiments have demonstrated that using $D_{min}$ is more efficient.
Limitation: intraquery parallelism cannot be exploited, since a single node is accessed each time.
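A minimal sketch of a depth-first branch-and-bound k-NN search ordered by $D_{min}$ is given below. It is an illustration under assumptions rather than the published algorithm: the dictionary-based node layout and the name `bbss` are hypothetical, and only $D_{min}$ pruning is shown.

```python
import heapq
import math

def d_min(p, low, high):
    """Smallest possible distance from point p to the box [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def bbss(node, query, k, best=None):
    """Visit one node at a time; `best` is a max-heap of (-distance, point)."""
    if best is None:
        best = []
    if "points" in node:  # leaf: try to improve the k best distances
        for pt in node["points"]:
            d = math.dist(pt, query)
            if len(best) < k:
                heapq.heappush(best, (-d, pt))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, pt))
        return best
    # Internal node: visit children in increasing D_min order.
    children = sorted(node["children"],
                      key=lambda c: d_min(query, c["low"], c["high"]))
    for child in children:
        bound = -best[0][0] if len(best) == k else math.inf
        if d_min(query, child["low"], child["high"]) <= bound:
            bbss(child, query, k, best)  # otherwise the subtree is pruned
    return best

leaf_a = {"low": (0, 0), "high": (2, 2), "points": [(0, 0), (1, 1), (2, 2)]}
leaf_b = {"low": (5, 5), "high": (6, 6), "points": [(5, 5), (6, 6)]}
root = {"low": (0, 0), "high": (6, 6), "children": [leaf_a, leaf_b]}
result = [pt for _, pt in sorted(bbss(root, (1.2, 1.2), 2), reverse=True)]
print(result)
```

Because each recursive call reads exactly one node before deciding what to read next, there is never more than one outstanding page request, which is the limitation noted above.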

Similarity Search Algorithms - FPSS
• It follows a greedy philosophy, trying to access as many nodes as possible in parallel.
• If a node MBR is intersected by the current query hypersphere, then the node is accessed; otherwise it is rejected.
• The algorithm first determines a threshold distance $D_{thres}$ and then descends the R*-tree, fetching the nodes from the corresponding disks.
Limitation: a large number of nodes is accessed, leading to performance degradation.

Similarity Search Algorithms - FPSS (cont.)
[Figure: query point $P$ and three MBRs $R_1$, $R_2$, $R_3$, carrying the numeric labels 10, 10 and 5.]
Let $k = 5$. The circle determined by $P$ and $D_{max}(P, R_2)$ guarantees the existence of $\geq 5$ points. FPSS fetches ALL pages that intersect the circle (i.e. $R_1$, $R_2$ and $R_3$). The process is applied to all R*-tree levels.
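One level of this process can be sketched as below. The threshold rule (accumulate object counts in increasing $D_{max}$ order until at least $k$ objects are covered) is an assumption that is consistent with the example above; the slides do not spell it out. The distances and object counts are made-up values, and the function names are hypothetical.

```python
def fpss_threshold(mbrs, k):
    """mbrs: list of (name, d_min, d_max, object_count).
    Returns a radius guaranteed to enclose at least k objects."""
    covered = 0
    for _name, _dmin, dmax, count in sorted(mbrs, key=lambda m: m[2]):
        covered += count  # every object of this MBR lies within dmax
        if covered >= k:
            return dmax
    raise ValueError("fewer than k objects under these MBRs")

def fpss_select(mbrs, d_th):
    """Fetch every node whose MBR intersects the query hypersphere."""
    return [name for name, dmin, _dmax, _count in mbrs if dmin <= d_th]

# Illustrative values only: (name, D_min, D_max, objects in subtree).
mbrs = [("R1", 2.0, 4.12, 10), ("R2", 1.41, 2.83, 5), ("R3", 2.7, 5.10, 10)]
d_th = fpss_threshold(mbrs, k=5)
fetched = fpss_select(mbrs, d_th)
print(d_th, fetched)
```

With these values the threshold comes from R2 alone, yet all three pages are requested, which illustrates why FPSS tends to access many nodes.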

Similarity Search Algorithms - CRSS
Candidate Reduction Criterion: Given a query point $P_q$, a threshold distance $D_{th}$ and a set of MBRs $R = \{R_1, \ldots, R_m\}$, then for an $R_x \in R$:
• if $D_{th} < D_{min}(P_q, R_x)$, then $R_x$ is rejected.
• if $D_{th} \geq D_{mm}(P_q, R_x)$, then $R_x$ is set active.
• if $D_{th} \geq D_{min}(P_q, R_x)$ and $D_{th} < D_{mm}(P_q, R_x)$, then $R_x$ is saved for possible future reference.

Similarity Search Algorithms - CRSS (cont.)
[Figure: query point $P$ and three MBRs $R_1$, $R_2$, $R_3$, carrying the numeric labels 10, 10 and 5.]
MBRs $R_1$ and $R_2$ will become active, and the corresponding pages will be fetched, whereas MBR $R_3$ will be saved as a candidate for future reference, since $D_{th} > D_{min}(P, R_3)$ and $D_{th} < D_{mm}(P, R_3)$.
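The candidate reduction criterion amounts to a three-way classification, sketched here with precomputed distances. The numeric values are made up for illustration (an extra MBR `R4` is added to show the rejection case), and the name `crss_classify` is hypothetical.

```python
def crss_classify(d_th, d_min, d_mm):
    """Apply the candidate reduction criterion to one MBR.
    Assumes d_min <= d_mm, which holds for any point and MBR."""
    if d_th < d_min:
        return "rejected"   # the hypersphere does not reach the MBR
    if d_th >= d_mm:
        return "active"     # fetch the page now
    return "candidate"      # d_min <= d_th < d_mm: keep for later

d_th = 2.83
# Illustrative (D_min, D_mm) pairs per MBR.
mbrs = {"R1": (2.0, 2.24), "R2": (1.41, 2.24), "R3": (2.7, 2.88), "R4": (3.5, 4.0)}
status = {name: crss_classify(d_th, dmin, dmm) for name, (dmin, dmm) in mbrs.items()}
print(status)
```

Compared with fetching everything that intersects the hypersphere, only the active MBRs cost a page access immediately; candidates are revisited only if the threshold has not shrunk past them.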

Similarity Search Algorithms - CRSS (cont.)
The CRSS algorithm operates in four modes:
1. The algorithm operates in ADAPTIVE mode until the leaf-level is reached for the first time. Distance $D_{th}$ is adapted.
2. Every time the leaf-level is reached, the algorithm passes to UPDATE mode. The best $k$ distances are (possibly) updated.
3. The NORMAL mode refers to cases where the algorithm operates in an intermediate tree-level, but after the ADAPTIVE mode.
4. The TERMINATE mode signals that there are no candidate nodes left, and the $k$ NNs have been determined.

Similarity Search Algorithms - CRSS (cont.)
Important Optimizations:
• Let $ND$ denote the number of disks, and $AN$ the number of active nodes. If $AN > ND$, then only $ND$ pages will be fetched. Thanks to the efficiency of the Proximity Index scheme, we anticipate that these nodes are assigned to different disks. The remaining $AN - ND$ nodes are saved as candidates.
• During the ADAPTIVE mode it is important that the active nodes contain $\geq k$ objects. This guarantees that when the leaf-level is reached for the first time, $\geq k$ distances are available. (In each node, a special field gives the number of objects located under the corresponding subtree.)
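The first optimization can be sketched as a simple cap on the number of pages requested per step. The slides do not say which $ND$ of the active nodes are kept; choosing the ones with the smallest $D_{min}$ is an assumption made here for illustration, as is the name `limit_active`.

```python
def limit_active(active, num_disks):
    """active: list of (node_id, d_min) pairs.
    Returns (nodes to fetch now, nodes saved as candidates)."""
    ordered = sorted(active, key=lambda n: n[1])  # nearest first (assumption)
    return ordered[:num_disks], ordered[num_disks:]

active = [("A", 2.0), ("B", 0.5), ("C", 1.2), ("D", 3.1)]
fetch, saved = limit_active(active, num_disks=3)
print(fetch, saved)
```

When $AN \le ND$ the second list is empty and every active node is fetched in the same step.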

Similarity Search Algorithms - OPTIMAL
Definition:
1. A similarity search algorithm is called strict optimal if exactly $k$ objects are inspected when answering a $k$-NN query.
2. A similarity search algorithm is called weak optimal if the minimum number of pages is retrieved when answering a $k$-NN query.
Observation: Algorithms BBSS, FPSS and CRSS are neither strict optimal nor weak optimal.

Similarity Search Algorithms - OPTIMAL (cont.)
We assume a hypothetical weak optimal algorithm (WOPTSS). Let the distance $D_k$ from the query point $P_q$ to its $k$-th nearest neighbor be known in advance. Then, WOPTSS will retrieve the pages that intersect the hypersphere with center $P_q$ and radius $D_k$.
The number of pages retrieved by this algorithm serves as a lower bound for any similarity search algorithm.
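The lower bound can be sketched by computing $D_k$ by brute force and then counting the pages whose MBR intersects the resulting hypersphere. For brevity the sketch considers leaf pages only, whereas a real tree would also count internal nodes on the way down; the page layout and the name `woptss_pages` are assumptions.

```python
import math

def d_min(p, low, high):
    """Smallest possible distance from point p to the box [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def woptss_pages(pages, query, k):
    """pages: list of (low, high, points).
    Returns (D_k, indices of pages intersecting the D_k hypersphere)."""
    dists = sorted(math.dist(pt, query) for _, _, pts in pages for pt in pts)
    d_k = dists[k - 1]  # distance to the k-th nearest neighbor, known "in advance"
    hit = [i for i, (low, high, _) in enumerate(pages)
           if d_min(query, low, high) <= d_k]
    return d_k, hit

pages = [
    ((0.0, 0.0), (2.0, 2.0), [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]),
    ((5.0, 5.0), (6.0, 6.0), [(5.0, 5.0), (6.0, 6.0)]),
    ((2.0, 0.0), (4.0, 1.0), [(2.5, 0.5), (4.0, 1.0)]),
]
d_k, hit = woptss_pages(pages, (1.2, 1.2), k=2)
print(d_k, hit)
```

In this example the third page is retrieved even though it contributes no answer, because its MBR intersects the hypersphere; the bound is on pages read, not on objects inspected.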

Performance Evaluation
The simulation model is depicted below.
[Figure: simulation model with queues for new queries, pending disk requests and pending bus requests, and the components CPU, RAM, DMA and I/O bus.]

Performance Evaluation (cont.)
[Figure: number of accessed nodes (normalized to WOPTSS) vs. nearest neighbors requested (1 - 700), for BBSS, CRSS and WOPTSS. Set: Gaussian, Population: 80000, Disks: 10, Dimensions: 10.]

Performance Evaluation (cont.)
[Figure: mean response time (sec) vs. queries per second (0.1 - 20), for BBSS, FPSS, CRSS and WOPTSS. Set: California, Population: 62173, Disks: 10, NNs: 100, Dimensions: 2.]

Performance Evaluation (cont.)
[Figure: normalized mean response time vs. number of disks (1 - 30), for BBSS, CRSS and WOPTSS. Set: Gaussian, Population: 50000, Dimensions: 5, NNs: 10.]
