TEXAS TECH UNIVERSITY DART: Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems Wei Zhang, Houjun Tang, Suren Byna, Yong Chen November 2nd, 2018 The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT18)
Exponential Data Growth
Mind-blowing Information Explosion
Affix-based Keyword Search
• Keyword: AFFIX
• Prefix: AF*
• Suffix: *FIX
• Infix: *FF*
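The three query shapes on this slide can be made concrete with a small matcher. This is only an illustrative sketch: the function name `matches` and the convention that `*` appears only at the ends of a query are assumptions, not something the slides specify.

```python
def matches(keyword: str, query: str) -> bool:
    """Return True if `keyword` satisfies an affix query written with '*' wildcards.

    Assumption: '*' may appear only as the first and/or last character of `query`.
    """
    if len(query) >= 2 and query.startswith("*") and query.endswith("*"):
        return query[1:-1] in keyword           # infix query, e.g. *FF*
    if query.endswith("*"):
        return keyword.startswith(query[:-1])   # prefix query, e.g. AF*
    if query.startswith("*"):
        return keyword.endswith(query[1:])      # suffix query, e.g. *FIX
    return keyword == query                     # exact query


if __name__ == "__main__":
    for q in ("AF*", "*FIX", "*FF*", "AFFIX", "FF*"):
        print(q, matches("AFFIX", q))
```

Only the last query in the usage loop, `FF*`, fails to match `AFFIX`, since `FF` occurs in the middle of the keyword rather than at its start.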
Document-partitioned Approach
• Documents: "I love apple", "I love avocado", "I love banana"
• Queries: lo*, apple
• Query Broadcasting
Term-partitioned Approach - Full String Hashing
• Documents: "I love apple", "I love avocado", "I love banana"
• Terms: apple, avocado, banana, I, love
• Queries: lo*, apple
• Query Broadcasting
Term-partitioned Approach - Initial Hashing
• Documents: "I love apple", "I love avocado", "I love banana"
• Terms: apple, avocado, banana, I, love
• Queries: lo*, apple
• No Query Broadcasting
• Load Balance
Term-partitioned Approach - Initial Hashing: Imbalance of Keyword Distribution
[Figure: number of keywords per leading character for the UUID, DICT, and WIKI datasets]
Skewness in Keyword Popularity
Requirements of Distributed Affix-based Keyword Search
• Functionality: Prefix Search, Suffix Search, Infix Search, Exact Search
• Efficiency: Avoid Query Broadcasting (Document-partitioned Approach, Full String Hashing)
• Load Balance: Imbalanced Keyword Distribution, Skewness of Keyword Popularity
• Scalability: Functionality, Efficiency, Load Balance
DART: Distributed Adaptive Radix Tree
DART Partition Tree Initialization
• Character set $A$, let $k = |A|$ (radix of DART)
• $M$ = total number of physical machines
• For a partition tree of height $d$, at each level $l \in \{1, \dots, d\}$, each tree node branches out to level $l + 1$ by iterating over each character in the character set $A$ in order. Thus $N_{leaf} = k^d$, and $I_{physical} = I_{leaf} \,\%\, M$
• We need to ensure $N_{leaf} \ge M$, thus $d = \lceil \log_k M + 1 \rceil$
• Client-side arithmetic calculation, $O(1)$ complexity
• Root region: $N_{root} = N_{leaf} / k$ virtual nodes
• Subregion: $N_{sub} = N_{leaf} / k^2$ virtual nodes
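The initialization arithmetic can be sketched in a few lines. The function names `partition_tree` and `physical_node` are hypothetical, and the names `n_root` and `n_sub` for the region sizes are this sketch's own. The loop computes $\lceil \log_k M \rceil$ with integers to avoid floating-point rounding.

```python
def partition_tree(k: int, M: int) -> tuple[int, int, int, int]:
    """Return (d, n_leaf, n_root, n_sub) for radix k >= 2 and M >= 1 physical machines."""
    d = 0
    while k ** d < M:      # smallest exponent with k**d >= M, i.e. ceil(log_k M)
        d += 1
    d += 1                 # tree height d = ceil(log_k M) + 1
    n_leaf = k ** d        # number of leaf (virtual) nodes
    n_root = n_leaf // k   # virtual nodes per root region
    n_sub = n_leaf // k ** 2   # virtual nodes per subregion
    return d, n_leaf, n_root, n_sub


def physical_node(leaf_index: int, M: int) -> int:
    """Map a leaf (virtual) node index to a physical machine by modulo."""
    return leaf_index % M


if __name__ == "__main__":
    print(partition_tree(128, 4))    # (2, 16384, 128, 1)
    print(partition_tree(128, 256))  # (3, 2097152, 16384, 128)
```

With 128 ASCII characters, the two calls give heights 2 and 3 for 4 and 256 servers, which agrees with the range quoted on the experimental setup slide.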
Index Construction - Overview
• For each term, create an index for it and its inverse, e.g., "abc" and "cba"
• Select base virtual node
• Select alternative virtual node
• Select as the eventual virtual node whichever of the two has fewer indexed keywords, and create the index for the keyword there.
• Goal: Balance Keyword Distribution
• Hint: The power of 2-choices
• Randomness can lead to balanced keyword distribution, but will result in query broadcasting.
• Destined keyword placement ensures efficient lookup, but leads to imbalanced keyword distribution.
[Diagram: spectrum from Randomness to Certainty]
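Indexing each term together with its inverse presumably lets a suffix query be answered as a prefix query over the reversed terms. The sketch below illustrates that reading; both function names are hypothetical and the rewriting rule is an assumption, not something stated on the slide.

```python
def index_entries(term: str) -> list[str]:
    """Return the two strings indexed for a term: the term and its inverse."""
    return [term, term[::-1]]


def suffix_to_prefix_query(query: str) -> str:
    """Rewrite a suffix query such as '*FIX' into a prefix query on inverted terms."""
    if not query.startswith("*") or query.endswith("*"):
        raise ValueError("expected a suffix query such as '*FIX'")
    return query[:0:-1] + "*"   # reverse everything after the leading '*'


if __name__ == "__main__":
    print(index_entries("abc"))             # ['abc', 'cba']
    print(suffix_to_prefix_query("*FIX"))   # XIF*
```

The inverse of `AFFIX` is `XIFFA`, which starts with `XIF`, so the rewritten prefix query finds the same keyword the suffix query was looking for.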
Index Construction – Base Virtual Node Selection
• For term $T = (t_1 t_2 \dots t_n)$, let $i_{t_j}$ be the index of character $t_j$ in the character set $A$.
• When $n \ge d$, the client calculates: $I_v = \sum_{j=1}^{d} i_{t_j} \times k^{d-j}$
• E.g. $d = 3$, $A = \{A, B, C\}$, for "CBCBA"
• When $n < d$, the client pads the term with its ending character until $n = d$, then performs the above calculation.
• E.g. $d = 3$, $A = \{A, B, C\}$, for "AA", pad "AA" to "AAA"
Certainty
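The base virtual node calculation reads the first $d$ characters of the term as a base-$k$ number. A minimal sketch follows, assuming character indices start at 0 (the slide does not say whether indexing is 0-based or 1-based) and that the term is non-empty. The name `base_virtual_node` is hypothetical.

```python
def base_virtual_node(term: str, alphabet: str, d: int) -> int:
    """Compute I_v = sum_{j=1..d} i_{t_j} * k**(d - j) for a non-empty term.

    Assumption: character indices are 0-based positions in `alphabet`.
    """
    k = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Pad short terms with their ending character until the length reaches d.
    padded = term + term[-1] * max(0, d - len(term))
    return sum(index[padded[j]] * k ** (d - 1 - j) for j in range(d))


if __name__ == "__main__":
    print(base_virtual_node("CBCBA", "ABC", 3))  # 2*9 + 1*3 + 2 = 23
    print(base_virtual_node("AA", "ABC", 3))     # padded to "AAA" -> 0
```

Under the 0-based assumption, the slide's two examples map to virtual nodes 23 and 0 out of $N_{leaf} = 27$.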
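Index Construction – Alternative Virtual Node Selection
• $I_{alter\_region\_start} = \left[\left(i_{t_1} + \lceil k/2 \rceil\right) \times N_{root}\right] \,\%\, N_{leaf}$, where $N_{root} = N_{leaf} / k$
• E.g. $d = 3$, $A = \{A, B, C\}$, for "CBCBA"
• $w_1 = (i_{t_{d-1}} + i_{t_d} + i_{t_{d+1}}) \,\%\, k$
• $w_2 = (i_{t_{d-1}} - i_{t_d} - i_{t_{d+1}}) \,\%\, k$
• $I_{v'} = I_{alter\_region\_start} + (I_v + w_1 \times N_{sub} + w_2) \,\%\, N_{root}$, where $N_{sub} = N_{leaf} / k^2$
[Diagram: spectrum from Randomness to Certainty]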
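A sketch of the alternative virtual node calculation, under several assumptions that the slide does not confirm: character indices are 0-based, $d \ge 2$, terms shorter than $d + 1$ are padded with their ending character so that $t_{d+1}$ exists, $k/2$ is rounded up, and $w_2$ uses Python's non-negative `%`. The function name is hypothetical.

```python
def alternative_virtual_node(term: str, alphabet: str, d: int) -> int:
    """Compute I_v' for a non-empty term; requires d >= 2."""
    k = len(alphabet)
    n_leaf = k ** d
    n_root = n_leaf // k        # virtual nodes per root region
    n_sub = n_leaf // k ** 2    # virtual nodes per subregion
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Assumption: pad to length d + 1 so that t_{d+1} is defined.
    padded = term + term[-1] * max(0, d + 1 - len(term))
    i = [index[ch] for ch in padded]

    i_v = sum(i[j] * k ** (d - 1 - j) for j in range(d))   # base virtual node
    # Alternative root region lies ceil(k/2) regions away from the base region.
    region_start = ((i[0] + (k + 1) // 2) * n_root) % n_leaf
    w1 = (i[d - 2] + i[d - 1] + i[d]) % k   # major offset
    w2 = (i[d - 2] - i[d - 1] - i[d]) % k   # minor offset (non-negative in Python)
    return region_start + (i_v + w1 * n_sub + w2) % n_root


if __name__ == "__main__":
    print(alternative_virtual_node("CBCBA", "ABC", 3))  # 9
```

For "CBCBA" the base node 23 falls in the root region of "C" (nodes 18 to 26), while the alternative node 9 falls in a different root region (nodes 9 to 17), so the two candidates never share a region in this example.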
Index Construction – Eventual Node Selection
• Select node between $I_v$ and $I_{v'}$
• Let $E_v = \begin{cases} I_v, & |I_v| \le |I_{v'}| \\ I_{v'}, & \text{otherwise} \end{cases}$, where $|\cdot|$ is the number of keywords indexed on a virtual node
Balanced Keyword Distribution
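The selection rule is the classic "power of two choices" placement: of two candidate nodes, pick the one currently holding fewer keywords. A minimal sketch, assuming an in-memory map from virtual node id to its keyword set (the name `insert_keyword` and the data layout are hypothetical):

```python
from collections import defaultdict


def insert_keyword(keyword: str, base: int, alt: int,
                   index: defaultdict) -> int:
    """Place `keyword` on the less loaded of the two candidate virtual nodes.

    `index` maps a virtual node id to the set of keywords stored on it.
    Ties go to the base node, matching the `<=` in the selection rule.
    """
    target = base if len(index[base]) <= len(index[alt]) else alt
    index[target].add(keyword)
    return target


if __name__ == "__main__":
    index = defaultdict(set)
    for word in ("apple", "avocado", "apricot"):
        print(word, "->", insert_keyword(word, base=1, alt=2, index=index))
```

In the usage loop, three keywords that share the same two candidates alternate between nodes 1 and 2 rather than piling up on the base node.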
Index Construction – Index Replication
• To overcome skewness of keyword popularity.
• Replication factor $r$
• The $i$-th replica: $R_i = E_v + \frac{N_{leaf}}{r} \times i$, $i \in [1, r]$
• E.g. $r = 3$
• Replicas will be accessed in round-robin fashion.
Alleviate Excessive Access on Popular Keywords
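Replica placement at a fixed interval, with round-robin access, might look like the sketch below. Two details are assumptions not stated on the slide: the interval uses integer division, and positions wrap around modulo $N_{leaf}$. The name `replica_nodes` is hypothetical.

```python
from itertools import cycle


def replica_nodes(e_v: int, n_leaf: int, r: int) -> list[int]:
    """Return R_i = e_v + (n_leaf / r) * i for i in 1..r.

    Assumptions: integer division for the interval, wrap-around modulo n_leaf.
    """
    interval = n_leaf // r
    return [(e_v + interval * i) % n_leaf for i in range(1, r + 1)]


if __name__ == "__main__":
    replicas = replica_nodes(e_v=9, n_leaf=27, r=3)
    print(replicas)                  # [18, 0, 9]
    next_replica = cycle(replicas)   # round-robin iterator over the replicas
    print([next(next_replica) for _ in range(4)])   # [18, 0, 9, 18]
```

Under these assumptions, when $r$ divides $N_{leaf}$ the $r$-th position wraps back onto $E_v$ itself, so the replicas are spread evenly around the ring of virtual nodes.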
Query Response – Prefix and Suffix Queries
[Diagram: Prefix Query, Suffix Query]
• When $n_{prefix} \ge d$, e.g. for "CBCB*": Base Virtual Node Selection & Alternative Virtual Node Selection
• Access both virtual nodes & take the result from the node which returns a non-empty result
• 2 nodes will be accessed.
Query Response – Prefix and Suffix Queries
[Diagram: Prefix Query, Suffix Query]
• When $n_{prefix} < d$, e.g. for "CB*" or "C*": Base Root Region & Alternative Root Region
• Scan both root regions & collect the results
• 2M/k nodes will be scanned
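For a short prefix, the two root regions to scan follow from the first character alone. The sketch below assumes 0-based character indices, that a base root region starts at $i_{t_1} \times N_{root}$ (which follows from the leading term of the base virtual node sum), and that the alternative region lies $\lceil k/2 \rceil$ regions away. The function name is hypothetical.

```python
def root_regions_to_scan(prefix: str, alphabet: str, d: int) -> tuple[range, range]:
    """Return the virtual node ranges of the base and alternative root regions.

    `prefix` is the query text without the trailing '*', e.g. "CB" for "CB*".
    """
    k = len(alphabet)
    n_leaf = k ** d
    n_root = n_leaf // k
    first = alphabet.index(prefix[0])
    base_start = first * n_root
    alt_start = ((first + (k + 1) // 2) * n_root) % n_leaf
    return (range(base_start, base_start + n_root),
            range(alt_start, alt_start + n_root))


if __name__ == "__main__":
    base, alt = root_regions_to_scan("C", "ABC", 3)
    print(list(base))  # virtual nodes 18..26
    print(list(alt))   # virtual nodes 9..17
```

Each region holds $N_{leaf}/k$ virtual nodes; mapped onto $M$ physical machines by modulo, two regions correspond to roughly the $2M/k$ nodes quoted on the slide.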
Query Response – Infix Query
• The position of a given infix within a keyword is uncertain.
• Query broadcasting is inevitable.
• To the best of our knowledge, no indexing technique can avoid a full scan of the indexed keywords for an infix query.
Complexity of DART Operations
Experimental Setup
• Platform – Cori @ NERSC (2388 nodes in total)
  • 8 – 512 nodes (1/4 nodes occupied)
  • Half client half server
• Dataset –
  • UUID – generated by libuuid
  • DICT – comprehensive keyword set in natural language
  • WIKI – comprehensive real world queries
• Query –
  • 4-letter prefix/suffix/infix and Exact keyword
• DART partition tree height ranges from 2 to 3 for 4 - 256 server nodes, given 128 characters in standard ASCII.
[Figure: number of keywords per leading character for the UUID, DICT, and WIKI datasets]
Query Throughput (TPS)
[Figure panels: Prefix Query, Suffix Query, Infix Query, Exact Query, Insert, Delete]
Latency of DART Operations
Load Balance (Measured by CV)
• Coefficient of Variation (CV): "Normalized Standard Deviation"
• Fair measurement of data dispersion regardless of the size of the dataset
• $C_v = \sigma / \mu$
• $\sigma$ = standard deviation
• $\mu$ = mean
[Figure panels: DICT Keyword Dist., UUID Keyword Dist., WIKI Keyword Dist., WIKI Request Dist. (r=3)]
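The CV of a per-server keyword (or request) count can be computed directly from its definition. This sketch uses the population standard deviation; whether the authors used the population or the sample form is not stated, so that choice is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts: list[float]) -> float:
    """Return sigma / mu for per-server counts (population standard deviation)."""
    return pstdev(counts) / mean(counts)


if __name__ == "__main__":
    print(coefficient_of_variation([10, 10, 10]))  # 0.0, perfectly balanced
    print(coefficient_of_variation([0, 20]))       # 1.0, all load on one server
```

A lower CV means the load is spread more evenly, and because the value is normalized by the mean it can be compared across datasets of different sizes.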
Alleviate Excessive Query Accesses on Popular Keywords
[Figure: CV of WIKI request distribution (y-axis, 1.1 to 1.45) versus replication factor (x-axis: r=1, r=3, r=4, r=5)]
DART: Distributed Adaptive Radix Tree
• Functionality: DART enables affix-based keyword search in a distributed environment.
• Efficiency: DART outperforms full string hashing in search efficiency on prefix search and suffix search.
• Load Balance: DART outperforms initial hashing in keyword distribution and generally alleviates excessive query workload on popular keywords.
• Scalability: Effective at different scales.
• DART can be used in many scenarios, such as serving wildcard queries in:
  • Distributed object-centric storage systems
  • Distributed metadata management systems
  • Distributed graph storage systems (properties of property graph)
  • Distributed databases for information retrieval and knowledge discovery
  • ......
Acknowledgement
• This research is supported in part by the National Science Foundation under grants CNS-1338078, IIP-1362134, CCF-1409946, and CCF-1718336.
• This work is supported in part by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. (Project: Proactive Data Containers, Program manager: Dr. Lucy Nowell).
• This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility.
Scan QR Code to Follow Up
• Paper
• Citation (BibTex, Text)
• Contact Us:
  • DISCL @ TTU: https://discl.cs.ttu.edu/
  • SDM Group @ LBNL: http://sdm.lbl.gov/