Full Compressed Affix Tree Representations L.I.R.M.M. Université de Montpellier Institut Biologie Computationnelle
Introduction Basic Concepts A Classification Asynchronous Approaches Synchronous Approaches Results Conclusions & Future Work
Motivation Bidirectional Search Example: Hairpins
Suffix Tree
Suffix Tree Operations
Suffix Arrays and Suffix Tree
Burrows and Wheeler Transform (BWT)
BWT: backward search
backwardSearch(c, [i, j]):
    i′ ← C[c] + Occ(c, i − 1) + 1
    j′ ← C[c] + Occ(c, j)
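The backward search step can be tried out with a small Python sketch. It is not the implementation benchmarked in this talk: the index is built naively by sorting suffixes, Occ is a linear scan, and the helper names (build_index, occ, count) as well as the `$` sentinel are assumptions made for illustration. Intervals are 1-based and inclusive, matching the formula on the slide.

```python
def build_index(text):
    """Naive BWT construction (assumes '$' does not occur in text)."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda k: t[k:])
    # Character preceding each sorted suffix; t[-1] is '$' when k == 0.
    bwt = "".join(t[k - 1] for k in sa)
    # C[c]: number of characters in t that are strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(set(t)):
        C[ch] = total
        total += t.count(ch)
    return bwt, C


def occ(bwt, c, k):
    """Occurrences of c in bwt[1..k] (1-based, inclusive); linear scan."""
    return bwt[:k].count(c)


def backward_search(bwt, C, c, interval):
    """One step: from the interval of pattern P to the interval of cP."""
    i, j = interval
    if c not in C:
        return None
    i2 = C[c] + occ(bwt, c, i - 1) + 1
    j2 = C[c] + occ(bwt, c, j)
    return (i2, j2) if i2 <= j2 else None


def count(index, pattern):
    """Number of occurrences of a non-empty pattern in the indexed text."""
    bwt, C = index
    interval = (1, len(bwt))  # interval of the empty pattern
    for c in reversed(pattern):  # characters are consumed right to left
        interval = backward_search(bwt, C, c, interval)
        if interval is None:
            return 0
    return interval[1] - interval[0] + 1
```

A compressed index would answer Occ with a wavelet tree or a similar rank structure instead of scanning the BWT.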
Affix Tree
◮ Combines the suffix tree of T with the suffix tree of the reversed text T^r
◮ Introduced by Stoye (2000) and Maaß (2003)
◮ Problem: the structures are complex and use about 45n bytes
Asynchronous vs Synchronous
◮ Forward Structure (FOS) and Backward Structure (BAS)
Affix Array (AfA)
◮ Proposed by Strothmann (2007)
◮ The suffix trees are stored as suffix arrays together with extra data
◮ Connections between the trees are also stored (affix links)
◮ Does not support all tree operations
◮ Total: around 18–22n bytes
Compressed Affix Tree (ACAT)
◮ Built on the Compressed Suffix Tree data structure
◮ Supports all tree operations
◮ Connections between the trees are also stored (affix links)
Affix Link
ALink(v) = Child(ALink(SLink(v)), c)
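The recurrence can be checked on a toy model. The Python sketch below makes several assumptions that are not stated on the slide: it works on uncompacted suffix tries (every substring is a node, represented by its path string), it takes c to be the first character of the string spelled by v, and it uses ALink(root) = root as the base case. The function names are hypothetical.

```python
def suffix_trie(text):
    """All substrings of text; each one is a node of the uncompacted suffix trie."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def slink(node):
    """Suffix link: drop the first character of the node's string."""
    return node[1:]


def child(node, c, nodes):
    """Child of node reached by character c in the trie `nodes`, or None."""
    nxt = node + c
    return nxt if nxt in nodes else None


def alink(node, reverse_nodes):
    """Affix link: the node of the reverse trie that spells the reversed string.

    Follows ALink(v) = Child(ALink(SLink(v)), c), with c assumed to be
    the first character of v and the root mapped to the root.
    """
    if node == "":
        return ""
    c = node[0]
    return child(alink(slink(node), reverse_nodes), c, reverse_nodes)
```

In this model each call recurses once per character, which is one way to see why the approaches above either store the links, sample them, or compute them by other means.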
Sampled Affix Link
Compressed Affix Tree Sampled (ACATS)
◮ Built on the Compressed Suffix Tree data structure
◮ Sampled affix links
Compressed Affix Tree Non-Sampled
◮ Extreme case of ACATS
◮ Albrecht and Heun (2012): optimal computation of affix links using binary search
◮ Gog et al. (2014): faster solution (ACATN)
ACATN
RACATN
Bidirectional Wavelet Tree (BidWT)
◮ Proposed by Schnattinger et al. (2010–2012) and Lam et al. (2009)
◮ Uses a backward index for the input text T and another for T^r
◮ Easy transition between the two data structures
◮ Reduces space by a factor of 23 compared to the Affix Array
◮ Main operation: extend the pattern by one character
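The extension operation on two synchronized indexes can be sketched in Python. This is an illustration under stated assumptions, not the wavelet-tree implementation: the BWTs are built naively, counts are done by linear scans, and the names build_index, extend, extend_left and extend_right are hypothetical. Intervals are 1-based and inclusive. A left extension is a backward search step in the index of T; the interval in the index of T^r shrinks to a sub-interval whose offset is the number of characters smaller than c inside the current BWT window. A right extension swaps the roles of the two indexes.

```python
def build_index(text):
    """Naive BWT and C array (assumes '$' does not occur in text)."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda k: t[k:])
    bwt = "".join(t[k - 1] for k in sa)
    C, total = {}, 0
    for ch in sorted(set(t)):
        C[ch] = total
        total += t.count(ch)
    return bwt, C


def extend(c, index, primary, secondary):
    """Backward step in `index`; keeps the interval of the other index in sync."""
    bwt, C = index
    i, j = primary
    if c not in C:
        return None
    i2 = C[c] + bwt[:i - 1].count(c) + 1
    j2 = C[c] + bwt[:j].count(c)
    if i2 > j2:
        return None
    # Characters smaller than c in the window give the offset of the
    # new sub-interval inside the other index's interval.
    smaller = sum(1 for x in bwt[i - 1:j] if x < c)
    lo = secondary[0] + smaller
    return (i2, j2), (lo, lo + (j2 - i2))


def extend_left(c, fwd_index, rev_index, fwd, rev):
    """Pattern P becomes cP; returns (fwd, rev) intervals or None."""
    return extend(c, fwd_index, fwd, rev)


def extend_right(c, fwd_index, rev_index, fwd, rev):
    """Pattern P becomes Pc; returns (fwd, rev) intervals or None."""
    result = extend(c, rev_index, rev, fwd)
    if result is None:
        return None
    new_rev, new_fwd = result
    return new_fwd, new_rev


text = "banana"
F, R = build_index(text), build_index(text[::-1])
whole = (1, len(text) + 1)                        # empty pattern
fwd, rev = extend_left("a", F, R, whole, whole)   # pattern "a"
fwd, rev = extend_right("n", F, R, fwd, rev)      # pattern "an"
```

Because both intervals are updated in the same step, the search can switch direction at any point without recomputing anything.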
Bidirectional Wavelet Tree
SCAT
Summary

Approach  Category      Full tree operations  Space                                  Description
AfA       Asynchronous  No                    2 · (SA + LCP + child tables + ALink)  Strothmann’s Affix Array
ACAT      Asynchronous  Yes                   2 · (CST + ALink)                      Asynchronous Affix Tree implementation
ACATS     Asynchronous  Yes                   2 · (CST + sampled ALink)              Asynchronous Affix Tree implementation
ACATN     Asynchronous  Yes                   2 · (CST + rminq + rmaxq)              Gog et al. Affix Tree
RACATN    Asynchronous  Yes                   2 · (CST + rminq)                      Reduced version of ACATN
BidWT     Synchronous   No                    2 · (FM-Index)                         Bidirectional BWT
SCAT      Synchronous   Yes                   2 · (CST)                              Synchronous Affix Tree implementation

Table: Compressed Affix Tree approaches studied in this work.
Construction
[Figure: construction time in milliseconds versus number of bytes per character. Panels: DNA-50MB, ENGLISH-50MB. Series: AFA, ACAT, ACATS, ACATN, RACATN, BidWT, SCAT.]
Forward-Backward
[Figure: time in microseconds versus number of bytes per character. Panels: DNA-50MB, ENGLISH-50MB. Series: AFA, ACAT, ACATS, ACATN, RACATN, BidWT, SCAT.]
Suffix-Children
[Figure: time in microseconds versus number of bytes per character. Panels: DNA-50MB, ENGLISH-50MB. Series: AFA, ACAT, ACATS, ACATN, RACATN, BidWT, SCAT.]
Slink
[Figure: time in microseconds versus number of bytes per character. Panels: DNA-50MB, ENGLISH-50MB. Series: ACAT, ACATS, ACATN, RACATN, SCAT.]
Conclusions & Future Work
◮ Asynchronous and Synchronous classification
◮ Benchmark for the Compressed Affix Tree approaches
◮ Create a public library containing all the tools
◮ Still missing: pattern search with errors