Similarity Query Processing using Disk Arrays
Apostolos N. Papadopoulos & Yannis Manolopoulos
Department of Informatics, Aristotle University, Thessaloniki, Greece
ACM SIGMOD Conference, Seattle, June 1998

Outline
• Introduction (Disk Arrays & Similarity Queries)
• Assumptions & Problem Definition
• Similarity Search Algorithms
• Performance Evaluation
• Concluding Remarks & Future Work

Introduction - Disk Arrays
• Powerful storage media of increasing importance.
• Can handle a large number of requests.
• Exploit I/O parallelism (both interquery & intraquery).
• Fault tolerant (e.g. RAID levels 1, 3, 5).

Introduction - Disk Arrays (cont.)
[Figure: disk array architecture, with the labels CPU, MEMORY, DMA cont., SCSI BUS, and four further units labeled cont.]

Introduction - Similarity Queries
We focus on the vector model, where an object is represented by a set of attributes, composing a vector in a multi-dimensional space.
There are two fundamental types of similarity queries that can be applied in the vector model:
• Range Query, where the user specifies a shape and asks for all the objects falling inside the corresponding region,
• Nearest-Neighbor Query, where the user gives an object and requests the k nearest objects.

Assumptions
• The set of n-d points is organized in an R*-tree.
• The R*-tree is partitioned in the disk array by means of the Proximity Index heuristic.
• The partitioning is performed node-wise, i.e. after a split, the new page is assigned to a disk.

Problem Definition
Given: a set of objects (n-d points), a query object P_q, and an integer number k,
determine: an efficient plan to access the parallel R*-tree, in order to report the k nearest neighbors of P_q,
trying to: (i) maximize parallelism, (ii) access as few nodes as possible, and (iii) reduce query response time.

Similarity Search Algorithms - Distances
[Figure: query point P_q and MBRs R1 and R2, with the distances D_min, D_mm and D_max marked.]
• D_min(P_q, R_x): the min distance between a point and an MBR.
• D_mm(P_q, R_x): the min-max distance; ensures the existence of at least one point within it.
• D_max(P_q, R_x): the max distance between a point and an MBR.

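The three distances above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and an MBR given by its lower and upper corners; the function names and the corner representation are assumptions of this sketch, and D_mm follows the usual MINMAXDIST definition of Roussopoulos et al.

```python
import math

def d_min(p, lo, hi):
    """Smallest distance from point p to the MBR [lo, hi] (0 if p is inside)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_max(p, lo, hi):
    """Distance from p to the farthest corner of the MBR."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_mm(p, lo, hi):
    """MINMAXDIST: for each axis, go to the nearer face on that axis and to
    the farther face on every other axis; take the smallest such distance."""
    far = [max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)]
    near = [min(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)]
    total_far = sum(far)
    return math.sqrt(min(total_far - f + n for f, n in zip(far, near)))

# Illustrative values only (not taken from the slides).
p, lo, hi = (0.0, 0.0), (1.0, 1.0), (3.0, 2.0)
print(d_min(p, lo, hi), d_mm(p, lo, hi), d_max(p, lo, hi))
```

For any point and MBR the three values satisfy D_min ≤ D_mm ≤ D_max, which is what makes D_min the optimistic bound and D_mm the pessimistic one.
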
Similarity Search Algorithms - BBSS
• Proposed by Roussopoulos et al. for answering NN queries in R-trees.
• It is a branch-and-bound algorithm: in each step, a new node is accessed according to the distance between the query point and the node MBR.
• The distance of the query point to an MBR can be either D_min (optimistic) or D_mm (pessimistic). Experiments have demonstrated that using D_min is more efficient.
Limitation: intraquery parallelism cannot be exploited, since only a single node is accessed each time.

Similarity Search Algorithms - FPSS
• It follows a greedy philosophy, trying to access in parallel as many nodes as possible.
• If a node MBR is intersected by the current query hypersphere, then the node is accessed; otherwise it is rejected.
• The algorithm first determines a threshold distance D_thres and then descends the R*-tree, fetching the nodes from the corresponding disks.
Limitation: a large number of nodes is accessed, leading to performance degradation.

Similarity Search Algorithms - FPSS (cont.)
[Figure: query point P and MBRs R1, R2 and R3, annotated with the numbers 10, 5 and 10.]
Let k = 5. The circle determined by P and D_max(P, R_2) guarantees the existence of ≥ 5 points. FPSS fetches ALL pages that intersect the circle (i.e. R_1, R_2 and R_3). The process is applied to all R*-tree levels.

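The FPSS step for one tree level can be sketched as follows. The slides do not spell out how the threshold is chosen in general, so this sketch assumes it is the smallest D_max whose enclosed MBRs together hold at least k objects, which agrees with the k = 5 example above. The tuple layout, the function names and all coordinates are hypothetical.

```python
import math

def d_min(p, lo, hi):
    """Smallest Euclidean distance from p to the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_max(p, lo, hi):
    """Euclidean distance from p to the farthest corner of the MBR."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def fpss_select(p, entries, k):
    """entries: list of (name, lo, hi, object_count) for one tree level.
    Returns the threshold distance and the names of all MBRs to fetch."""
    d_thres = math.inf          # stays infinite if fewer than k objects exist
    covered = 0
    # Assumption: grow the circle over MBRs ordered by D_max until >= k
    # objects are guaranteed to lie inside it.
    for name, lo, hi, count in sorted(entries,
                                      key=lambda e: d_max(p, e[1], e[2])):
        covered += count
        if covered >= k:
            d_thres = d_max(p, lo, hi)
            break
    # Greedy step: fetch every MBR that intersects the circle.
    fetched = [name for name, lo, hi, _ in entries
               if d_min(p, lo, hi) <= d_thres]
    return d_thres, fetched

# Made-up layout loosely mirroring the example: R2 holds 5 objects.
level = [("R1", (1, -1), (3, 1), 10),
         ("R2", (-1, -2), (1, -1), 5),
         ("R3", (-3, 1.5), (-1, 3.5), 10)]
print(fpss_select((0, 0), level, 5))
```

With this layout all three MBRs intersect the circle and are fetched, which illustrates why FPSS tends to access many nodes.
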
Similarity Search Algorithms - CRSS
Candidate Reduction Criterion: Given a query point P_q, a threshold distance D_th and a set of MBRs R = {R_1, ..., R_m}, then for an R_x ∈ R:
• if D_th < D_min(P_q, R_x), then R_x is rejected.
• if D_th ≥ D_mm(P_q, R_x), then R_x is set active.
• if D_th ≥ D_min(P_q, R_x) and D_th < D_mm(P_q, R_x), then R_x is saved for possible future reference.

Similarity Search Algorithms - CRSS (cont.)
[Figure: query point P and MBRs R1, R2 and R3, annotated with the numbers 10, 5 and 10.]
MBRs R_1 and R_2 will become active, and the corresponding pages will be fetched, whereas MBR R_3 will be saved as a candidate for future reference, since D_th > D_min(P, R_3) and D_th < D_mm(P, R_3).

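The candidate reduction criterion translates directly into a three-way classification. The sketch below assumes Euclidean distance and the MINMAXDIST definition of D_mm; the function names, the dictionary result and the coordinates are illustrative assumptions, not taken from the slides.

```python
import math

def d_min(p, lo, hi):
    """Smallest Euclidean distance from p to the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def d_mm(p, lo, hi):
    """MINMAXDIST: nearer face on one axis, farther face on all others."""
    far = [max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)]
    near = [min(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi)]
    total_far = sum(far)
    return math.sqrt(min(total_far - f + n for f, n in zip(far, near)))

def classify(p, mbrs, d_th):
    """mbrs: list of (name, lo, hi). Applies the three rules in order."""
    out = {"active": [], "saved": [], "rejected": []}
    for name, lo, hi in mbrs:
        if d_th < d_min(p, lo, hi):
            out["rejected"].append(name)   # cannot contain an answer
        elif d_th >= d_mm(p, lo, hi):
            out["active"].append(name)     # fetched in this step
        else:
            out["saved"].append(name)      # kept as a candidate
    return out

# Made-up coordinates chosen so that R1 and R2 are active and R3 is saved.
mbrs = [("R1", (1, -1), (3, 1)),
        ("R2", (-1, -2), (1, -1)),
        ("R3", (-3, 1.5), (-1, 3.5))]
print(classify((0, 0), mbrs, 2.5))
```

Because the three conditions partition the possible values of D_th, every MBR lands in exactly one of the three lists.
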
Similarity Search Algorithms - CRSS (cont.)
The CRSS algorithm operates in four modes:
1. The algorithm operates in ADAPTIVE mode until the leaf level is reached for the first time. Distance D_th is adapted.
2. Every time the leaf level is reached, the algorithm passes to UPDATE mode. The best k distances are (possibly) updated.
3. The NORMAL mode refers to cases where the algorithm operates in an intermediate tree level, but after the ADAPTIVE mode.
4. The TERMINATE mode signals that there are no candidate nodes left, and the k NNs have been determined.

Similarity Search Algorithms - CRSS (cont.)
Important Optimizations:
• Let ND denote the number of disks, and AN the number of active nodes. If AN > ND, then only ND pages will be fetched. Thanks to the efficiency of the Proximity Index scheme, we anticipate that these nodes are assigned to different disks. The remaining AN − ND nodes are saved as candidates.
• During the ADAPTIVE mode it is important that the active nodes contain ≥ k objects. This guarantees that when the leaf level is reached for the first time, ≥ k distances are available. (In each node, a special field gives the number of objects located under the corresponding subtree.)

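Both optimizations can be illustrated with a small sketch. The slides do not say which ND of the AN active nodes are fetched, so this sketch assumes the ones nearest to the query point (smallest D_min). The Node type and both function names are hypothetical.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    d_min: float   # distance from the query point to the node MBR
    count: int     # number of objects under the node's subtree

def select_pages(active, num_disks):
    """Fetch at most num_disks (ND) active nodes; the rest become candidates.
    Assumption: nodes closest to the query point are fetched first."""
    ordered = sorted(active, key=lambda n: n.d_min)
    return ordered[:num_disks], ordered[num_disks:]

def covers_k(nodes, k):
    """ADAPTIVE-mode check: do the chosen nodes hold at least k objects?"""
    return sum(n.count for n in nodes) >= k

active = [Node("A", 3.0, 4), Node("B", 1.0, 2), Node("C", 2.0, 5)]
fetch, saved = select_pages(active, num_disks=2)
print([n.name for n in fetch], [n.name for n in saved], covers_k(fetch, 5))
```

The per-node object count plays the role of the special field mentioned above: it lets the check run without descending into the subtree.
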
Similarity Search Algorithms - OPTIMAL
Definition:
1. A similarity search algorithm is called strict optimal if exactly k objects are inspected when answering a k-NN query.
2. A similarity search algorithm is called weak optimal if the minimum number of pages is retrieved when answering a k-NN query.
Observation: Algorithms BBSS, FPSS and CRSS are neither strict optimal nor weak optimal.

Similarity Search Algorithms - OPTIMAL (cont.)
We assume a hypothetical weak optimal algorithm (WOPTSS). Let the distance D_k from the query point P_q to its k-th nearest neighbor be known in advance. Then, WOPTSS will retrieve the pages that intersect the hypersphere with center P_q and radius D_k.
The number of pages retrieved by this algorithm serves as a lower bound for any similarity search algorithm.

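The WOPTSS lower bound can be computed offline once D_k is known. The sketch below is a simplification that considers leaf pages only and obtains D_k by brute force. The page dictionary, the function names and the sample points are assumptions made for illustration.

```python
import math

def d_min(p, lo, hi):
    """Smallest Euclidean distance from p to the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def woptss_pages(q, pages, k):
    """pages: dict mapping a page name to the list of points it stores.
    Returns D_k and the pages whose MBR intersects the sphere (q, D_k)."""
    # Brute force stands in for 'D_k known in advance'.
    dists = sorted(math.dist(q, pt) for pts in pages.values() for pt in pts)
    d_k = dists[k - 1]
    hit = []
    for name, pts in pages.items():
        lo = [min(c) for c in zip(*pts)]   # MBR lower corner
        hi = [max(c) for c in zip(*pts)]   # MBR upper corner
        if d_min(q, lo, hi) <= d_k:
            hit.append(name)
    return d_k, hit

pages = {"A": [(1, 0), (2, 0)],
         "B": [(0, 3), (1, 4)],
         "C": [(10, 10), (11, 12)]}
print(woptss_pages((0, 0), pages, k=2))
```

Dividing the number of nodes accessed by a real algorithm by the size of this set gives a normalized count of the kind plotted in the evaluation.
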
Performance Evaluation
The simulation model is depicted below.
[Figure: simulation model, showing new queries, pending disk requests, CPU, RAM, DMA, pending bus requests and the I/O bus.]

Performance Evaluation (cont.)
[Figure: number of accessed nodes (normalized to WOPTSS) vs. nearest neighbors requested (1 - 700), for BBSS, CRSS and WOPTSS. Set: Gaussian, Population: 80000, Disks: 10, Dimensions: 10.]

Performance Evaluation (cont.)
[Figure: mean response time (sec) vs. queries per second (0.1 - 20), for BBSS, FPSS, CRSS and WOPTSS. Set: California, Population: 62173, Disks: 10, NNs: 100, Dimensions: 2.]

Performance Evaluation (cont.)
[Figure: normalized mean response time vs. number of disks (1 - 30), for BBSS, CRSS and WOPTSS. Set: Gaussian, Population: 50000, Dimensions: 5, NNs: 10.]