compressing and searching xml data via two zips

Compressing and Searching XML Data Via Two Zips Paolo Ferragina - PowerPoint PPT Presentation

Compressing and Searching XML Data Via Two Zips Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan] Paolo Ferragina, Università di Pisa Six years ago... [now, J. ACM 05]


  1. Compressing and Searching XML Data Via Two Zips Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan] Paolo Ferragina, Università di Pisa

  2. Six years ago... [now, J. ACM 05] "Opportunistic Data Structures with Applications", P. Ferragina, G. Manzini. The survey by Navarro and Mäkinen cites more than 50 papers on the subject!

  3. An XML excerpt. It is verbose!

     <dblp>
       <book>
         <author>Donald E. Knuth</author>
         <title>The TeXbook</title>
         <publisher>Addison-Wesley</publisher>
         <year>1986</year>
       </book>
       <article>
         <author>Donald E. Knuth</author>
         <author>Ronald W. Moore</author>
         <title>An Analysis of Alpha-Beta Pruning</title>
         <pages>293-326</pages>
         <year>1975</year>
         <volume>6</volume>
         <journal>Artificial Intelligence</journal>
       </article>
       ...
     </dblp>

  4. A tree interpretation...
     • XML document exploration ≡ Tree navigation
     • XML document search ≡ Labeled subpath searches (subset of XPath [W3C])
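The tree interpretation can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the helper names `to_labeled_tree`, `count_from` and `count_subpath` are hypothetical, the XML string is a shortened, well-formed version of the excerpt on slide 3, and the search is a plain traversal of the uncompressed tree.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the slide's excerpt (assumption: the
# elided "..." part of the original document is simply left out).
XML = """<dblp>
  <book>
    <author>Donald E. Knuth</author>
    <title>The TeXbook</title>
  </book>
  <article>
    <author>Donald E. Knuth</author>
    <author>Ronald W. Moore</author>
    <title>An Analysis of Alpha-Beta Pruning</title>
  </article>
</dblp>"""

def to_labeled_tree(elem):
    """Return (label, children); non-blank text content becomes a leaf child."""
    children = [to_labeled_tree(child) for child in elem]
    text = (elem.text or "").strip()
    if text:
        children.insert(0, (text, []))
    return (elem.tag, children)

def count_from(node, path):
    """Count downward paths that start at `node` and spell out `path`."""
    label, children = node
    if label != path[0]:
        return 0
    if len(path) == 1:
        return 1
    return sum(count_from(child, path[1:]) for child in children)

def count_subpath(node, path):
    """Count occurrences of the label sequence `path` anywhere in the tree."""
    _, children = node
    return count_from(node, path) + sum(count_subpath(c, path) for c in children)

TREE = to_labeled_tree(ET.fromstring(XML))
print(count_subpath(TREE, ["article", "author"]))
```

Here a subpath search is a walk over an explicit tree; the point of the following slides is to answer the same kind of query on a compressed representation instead.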

  5. The Problem. We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
     • Navigational operations: parent(u), child(u, i), child(u, i, c)
     • Subpath searches: given a sequence Π of k labels
     • Content searches: subpath + substring search
     • Visualization operation: given a node, visualize its descending subtree

     • XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...) → need the whole decompression
     • XML-queriable compressors (like XPress, XGrind, XQzip, ...) → poor compression and scan of the whole (compressed) file
     • Summary indexes (like Dataguide, 1-index or 2-index) → large space and do not support “content” searches
     • Theoretically, many solutions exist, starting from [Jacobson, IEEE Focs ’89] → no subpath/content searches, and poor performance on labeled trees
     • XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage

  6. A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]
     • We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings (do you know bzip?).
     • The XBW linearizes the tree T into 2 arrays such that:
         • the compression of T reduces to using any k-th order entropy compressor (gzip, bzip, ...) over these two arrays
         • the indexing of T reduces to implementing simple rank/select query operations over these two arrays
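For readers who have not seen the Burrows-Wheeler Transform on strings, a minimal sketch follows. It uses the textbook "sort all rotations" definition, which is quadratic in space and is not how bzip or any practical tool computes it; the function name `bwt` and the `$` sentinel are choices made for this illustration.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text`.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))  # annb$aa
```

Equal characters tend to end up next to each other in the output, which is the "locally homogeneous" behaviour that the XBW reproduces on trees.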

  7. The XBW-Transform, Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path. S_α is a permutation of the tree nodes; S_π holds their upward labeled paths.

     Example tree: the root C has children B, A, B. The first B has children D (with leaves c, a) and leaf c. A has leaves b, a and child D (with leaf c). The second B has child D (with leaf b) and leaf a.

     S_α  S_π
     C    ε
     B    C
     D    B C
     c    D B C
     a    D B C
     c    B C
     A    C
     b    A C
     a    A C
     D    A C
     c    D A C
     B    C
     D    B C
     b    D B C
     a    B C

  8. The XBW-Transform, Step 2. Stably sort according to S_π (the upward labeled paths).

     S_α  S_π
     C    ε
     b    A C
     a    A C
     D    A C
     D    B C
     c    B C
     D    B C
     a    B C
     B    C
     A    C
     B    C
     c    D A C
     c    D B C
     a    D B C
     b    D B C

  9. The XBW-Transform, Step 3. Add a binary array S_last marking the rows corresponding to last children.

     row  S_last  S_α  S_π
      1     1      C   ε
      2     0      b   A C
      3     0      a   A C
      4     1      D   A C
      5     0      D   B C
      6     1      c   B C
      7     0      D   B C
      8     1      a   B C
      9     0      B   C
     10     0      A   C
     11     1      B   C
     12     1      c   D A C
     13     0      c   D B C
     14     1      a   D B C
     15     1      b   D B C

     XBW = <S_last, S_α>.
     • Key fact: nodes correspond to items in <S_last, S_α>
     • XBW can be built and inverted in optimal O(t) time
     • XBW takes optimal t log |Σ| + 2t bits
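The three steps can be sketched directly in Python on the example tree of the slides. This is a naive illustration under stated assumptions, not the optimal O(t) construction the slide refers to: it materializes every upward path as a string and relies on Python's stable sort. The tuple encoding of the tree and the function name `xbw` are hypothetical, and labels are assumed to be single characters.

```python
# Example tree from the slides as (label, children) tuples.
TREE = ("C", [
    ("B", [("D", [("c", []), ("a", [])]), ("c", [])]),
    ("A", [("b", []), ("a", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])]), ("a", [])]),
])

def xbw(tree):
    """Return (S_last, S_alpha, S_pi) for a tree with single-character labels."""
    rows = []  # (upward path, is last child, label), collected in pre-order

    def visit(node, upward, is_last):
        label, children = node
        rows.append((upward, is_last, label))      # Step 1: pre-order visit
        for i, child in enumerate(children):
            # The child's upward path starts with this node's label.
            visit(child, label + upward, i == len(children) - 1)

    visit(tree, "", True)          # the root counts as a last child
    rows.sort(key=lambda r: r[0])  # Step 2: stable sort by upward path
    s_last = [int(r[1]) for r in rows]  # Step 3: mark last children
    s_alpha = [r[2] for r in rows]
    s_pi = [r[0] for r in rows]
    return s_last, s_alpha, s_pi

S_LAST, S_ALPHA, S_PI = xbw(TREE)
print("".join(S_ALPHA))            # CbaDDcDaBABccab
print("".join(map(str, S_LAST)))   # 100101010011011
```

Only `S_LAST` and `S_ALPHA` form the XBW; `S_PI` is returned here just so the sorted rows can be compared with the table on the slide.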

  10. XBzip – a simple XML compressor. XBW is compressible:
      • S_α and S_pcdata are locally homogeneous
      • S_last has some structure
      (Figure labels: “Tags, Attributes and symbol =” and “Pcdata”.)

  11. XBzip = XBW + PPMd. String compressors are not so bad: within 5%.
      (Bar chart: percentage axis from 0% to 25%; datasets DBLP, Pathways, News; compressors gzip, bzip2, ppmdi, xmill + ppmdi, scmppm, XBzip.)

  12. Some structural properties. Two useful properties:
      • Children are contiguous and delimited by 1s
      • Children reflect the order of their parents
      (Figure: the example tree next to its XBW table with columns S_last, S_α, S_π, as on slide 9.)

  13. XBW is navigational:
      • Rank-Select data structures on S_last and S_α
      • The array C of |Σ| integers
      In the example, C is: A → 2, B → 5, C → 9, D → 12.
      Get_children example: Rank(B, S_α) = 2, then select in S_last the 2nd item starting from row C[B] = 5.
      (Figure: the example tree next to its XBW table with columns S_last, S_α, S_π, as on slide 9.)
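The Get_children step can be traced with a small Python sketch over the example arrays. The arrays and the C values are copied from the slides (rows are 1-based, as in the figure); `rank`, `select`, `get_children` and `get_parent` are hypothetical helper names, and the linear-scan rank/select here only stand in for the succinct Rank-Select data structures the slide has in mind. The sketch also assumes that leaf labels never label an internal node, which holds in the example.

```python
S_LAST = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_ALPHA = list("CbaDDcDaBABccab")
C = {"A": 2, "B": 5, "C": 9, "D": 12}  # first row whose upward path starts with the label

def rank(seq, symbol, i):
    """Number of occurrences of `symbol` in rows 1..i (1-based, inclusive)."""
    return seq[:i].count(symbol)

def select(seq, symbol, k):
    """Row (1-based) of the k-th occurrence of `symbol`; 0 when k == 0."""
    if k == 0:
        return 0
    seen = 0
    for row, value in enumerate(seq, start=1):
        if value == symbol:
            seen += 1
            if seen == k:
                return row
    raise ValueError("fewer than k occurrences")

def get_children(i):
    """Return (first, last) rows of the children of row i, or None for a leaf."""
    label = S_ALPHA[i - 1]
    if label not in C:                           # assumption: leaf labels are not in C
        return None
    r = rank(S_ALPHA, label, i)                  # row i is the r-th node with this label
    ones_before = rank(S_LAST, 1, C[label] - 1)  # 1s before the label's block of rows
    first = select(S_LAST, 1, ones_before + r - 1) + 1
    last = select(S_LAST, 1, ones_before + r)
    return first, last

def get_parent(i):
    """Return the row of the parent of row i, or None for the root."""
    if i == 1:
        return None
    label = max((c for c in C if C[c] <= i), key=C.get)  # first symbol of S_pi at row i
    block = rank(S_LAST, 1, i - 1) - rank(S_LAST, 1, C[label] - 1) + 1
    return select(S_ALPHA, label, block)

print(get_children(11))  # (7, 8): the second B has children D and a
print(get_parent(7))     # 11
```

Both operations use only `S_LAST`, `S_ALPHA` and `C`; the upward paths S_π are never stored.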

  14. XBW is searchable (count subpaths):
      • Rank-Select data structures on S_last and S_α
      • Array C of |Σ| integers
      Example with Π = B D and C: A → 2, B → 5, C → 9, D → 12. Start from the rows [fr, lr] whose S_π starts with ‘B’.
      Inductive step:
      • Pick the next char in Π[i+1], i.e. ‘D’
      • Search for the first and last ‘D’ in S_α[fr, lr]
      • Jump to their children
      Their children have upward path = ‘D B’. Result: 2 occurrences of Π, because of the two 1s in S_last[fr, lr].
      (Figure: the XBW-index, i.e. the example tree next to its XBW table with columns S_last, S_α, S_π, as on slide 9.)
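The inductive step can be sketched as a short Python routine over the example arrays. As before, this is an illustration under assumptions rather than the authors' implementation: rows are 1-based, `rank` and `select` are naive linear scans, `block_end` and `count_subpath` are hypothetical names, and Π is restricted to labels of internal nodes (those listed in C), as in the slide's example.

```python
S_LAST = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_ALPHA = list("CbaDDcDaBABccab")
C = {"A": 2, "B": 5, "C": 9, "D": 12}  # first row whose upward path starts with the label

def rank(seq, symbol, i):
    """Number of occurrences of `symbol` in rows 1..i (1-based, inclusive)."""
    return seq[:i].count(symbol)

def select(seq, symbol, k):
    """Row (1-based) of the k-th occurrence of `symbol`; 0 when k == 0."""
    if k == 0:
        return 0
    seen = 0
    for row, value in enumerate(seq, start=1):
        if value == symbol:
            seen += 1
            if seen == k:
                return row
    raise ValueError("fewer than k occurrences")

def block_end(label):
    """Last row whose upward path starts with `label`."""
    later = [start for start in C.values() if start > C[label]]
    return min(later) - 1 if later else len(S_ALPHA)

def count_subpath(path):
    """Count occurrences of `path`, a string of internal-node labels."""
    if any(label not in C for label in path):
        raise ValueError("only labels listed in C are supported in this sketch")
    fr, lr = C[path[0]], block_end(path[0])   # rows whose S_pi starts with path[0]
    for label in path[1:]:
        k1 = rank(S_ALPHA, label, fr - 1)     # occurrences of label before the range
        k2 = rank(S_ALPHA, label, lr)         # occurrences up to the end of the range
        if k1 == k2:
            return 0                          # label does not occur in S_alpha[fr, lr]
        ones_before = rank(S_LAST, 1, C[label] - 1)
        fr = select(S_LAST, 1, ones_before + k1) + 1  # first child of the first match
        lr = select(S_LAST, 1, ones_before + k2)      # last child of the last match
    # Each 1 in S_last[fr, lr] closes the child list of one matching node.
    return rank(S_LAST, 1, lr) - rank(S_LAST, 1, fr - 1)

print(count_subpath("BD"))  # 2
```

For Π = B D the range moves from rows [5, 8] to rows [13, 15], and the two 1s in S_last over that range give the 2 occurrences reported on the slide.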

  15. XBzipIndex: XBW + FM-index
      • DBLP: 1.75 bytes/node, Pathways: 0.31 bytes/node, News: 3.91 bytes/node
      • Up to 36% improvement in compression ratio
      • Query (counting) time ≅ 8 ms, Navigation time ≅ 3 ms
      (Bar chart: percentage axis from 0% to 60%; datasets DBLP, Pathways, News; tools Huffword, XPress, XQzip, XBzipIndex, XBzip.)

  16. The overall picture on Compressed Indexing... This is a powerful paradigm to design compressed indexes:
      1. Transform the input into a few arrays (via BWT or XBW)
      2. Index (+ Compress) the arrays to support rank/select ops
      (Diagram labels, in order: “Data type”, “Indexing”, “[Kosaraju, Focs ’89]”, “Strong connection”, “Compressed Indexing”.)
      [IEEE Focs ’00] [IEEE Focs ’05] [J. ACM ’05] [WWW ’06]
      http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl
      Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)
      Experimental: Wea ’06 (2)
