Full Compressed Affix Tree Representations

  1. Full Compressed Affix Tree Representations. L.I.R.M.M., Université de Montpellier, Institut Biologie Computationnelle

  2. Introduction, Basic Concepts, A Classification, Asynchronous Approaches, Synchronous Approaches, Results, Conclusions & Future Work

  3. Motivation: Bidirectional Search. Example: Hairpins

  5. Suffix Tree

  6. Suffix Tree Operations

  7. Suffix Arrays and Suffix Tree

  9. Burrows-Wheeler Transform (BWT)

  10. BWT: backward search
      backwardSearch(c, [i, j]):
      $i' \leftarrow C[c] + \mathrm{Occ}(c, i - 1) + 1$
      $j' \leftarrow C[c] + \mathrm{Occ}(c, j)$
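
The backward search update on slide 10 can be sketched in Python as follows. This is a minimal illustration, not the implementation benchmarked in the talk. It assumes 1-based inclusive intervals, that C[c] is the number of characters in the text smaller than c, and that Occ(c, k) is the number of occurrences of c in BWT[1..k]. The names `build_index`, `occ`, `backward_step` and `count_occurrences` are hypothetical, and the suffix array is built by naive sorting, which is only reasonable for small examples.

```python
def build_index(text):
    """Return (bwt, C) for text + '$'.

    Assumption: the text does not contain '$' and all of its characters
    sort after '$', so '$' acts as the smallest sentinel.
    """
    text += "$"
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    # BWT[r] is the character preceding the r-th smallest suffix
    # (text[-1] is the sentinel when the suffix starts at position 0).
    bwt = "".join(text[k - 1] for k in sa)
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total
        total += text.count(ch)
    return bwt, C


def occ(bwt, c, k):
    """Occurrences of c in BWT[1..k] (1-based), computed by a linear scan."""
    return bwt[:k].count(c)


def backward_step(bwt, C, c, i, j):
    """One step of backward search: interval of P -> interval of cP."""
    if c not in C:
        return 1, 0  # empty interval
    new_i = C[c] + occ(bwt, c, i - 1) + 1
    new_j = C[c] + occ(bwt, c, j)
    return new_i, new_j


def count_occurrences(text, pattern):
    """Count occurrences of pattern by reading it right to left."""
    bwt, C = build_index(text)
    i, j = 1, len(bwt)  # interval of the empty pattern
    for c in reversed(pattern):
        i, j = backward_step(bwt, C, c, i, j)
        if i > j:
            return 0
    return j - i + 1
```

A real index would answer `occ` with a rank structure such as a wavelet tree rather than a linear scan; the scan is used here only to keep the sketch short.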

  11. Affix Tree
      - Combines the Suffix Tree of T with the Suffix Tree of T^r (the reverse of T)
      - Introduced by Stoye (2000) and Maaß (2003)
      - Problem: the structures involved are complex and use about 45n bytes

  13. Asynchronous vs Synchronous
      - Forward Structure (FOS) and Backward Structure (BAS)

  15. Affix Array (AfA)
      - Proposed by Strothmann (2007)
      - Suffix Trees are stored as Suffix Arrays together with additional tables
      - Connections between the trees are also stored (Affix links)
      - Does not support all tree operations
      - Total: around 18-22n bytes

  16. Compressed Affix Tree (ACAT)
      - Built on the Compressed Suffix Tree data structure
      - Supports all tree operations
      - Connections between the trees are also stored (Affix links)

  17. Affix Link
      $\mathrm{ALink}(v) = \mathrm{Child}(\mathrm{ALink}(\mathrm{SLink}(v)), c)$
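
The recurrence on slide 17 can be illustrated with a small Python sketch. It rests on several assumptions that the slides do not spell out: a node v is identified with its string w = c·u, SLink(v) corresponds to u, and ALink(v) is taken to be the locus of the reversed string in the structure for T^r. Nodes of the reverse tree are approximated by half-open intervals over the sorted suffixes of T^r, so the distinction between explicit and implicit nodes is ignored. The names `sorted_suffixes`, `child` and `alink` are hypothetical.

```python
def sorted_suffixes(text):
    """All suffixes of text in lexicographic order (naive construction)."""
    return sorted(text[k:] for k in range(len(text)))


def child(suffixes, interval, depth, c):
    """Narrow an interval of suffixes sharing a prefix of length `depth`
    to those whose next character is c. Returns a half-open interval."""
    lo, hi = interval
    matches = [k for k in range(lo, hi)
               if len(suffixes[k]) > depth and suffixes[k][depth] == c]
    if not matches:
        return lo, lo  # empty interval
    # The suffixes are sorted and agree on the first `depth` characters,
    # so the matching positions are contiguous.
    return matches[0], matches[-1] + 1


def alink(suffixes_rev, w):
    """Interval of reverse(w) among the sorted suffixes of T^r, computed
    as ALink(v) = Child(ALink(SLink(v)), c) with w = c + u."""
    if not w:
        return 0, len(suffixes_rev)  # root: every suffix
    c, u = w[0], w[1:]               # SLink drops the first character
    return child(suffixes_rev, alink(suffixes_rev, u), len(u), c)
```

For example, with T = "abracadabra" and `suffixes_rev = sorted_suffixes(T[::-1])`, calling `alink(suffixes_rev, "bra")` yields the interval of suffixes of T^r that start with "arb".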

  20. Sampled Affix Link

  21. Compressed Affix Tree Sampled (ACATS)
      - Built on the Compressed Suffix Tree data structure
      - Sampled Affix links

  22. Compressed Affix Tree Non-Sampled
      - Extreme case of ACATS
      - Albrecht and Heun (2012): optimal computation of Affix links using binary search
      - Gog et al. (2014): faster solution (ACATN)

  23. ACATN

  25. RACATN

  27. Bidirectional Wavelet Tree (BidWT)
      - Proposed by Schnattinger et al. (2010-2012) and Lam et al. (2009)
      - Uses a backward-search index for the input text T and another for T^r
      - Easy transition between the two data structures
      - Reduces space by a factor of 23 compared to the Affix Array
      - Main operation: extend the current pattern by one character
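
The "extend by one character" operation from slide 27 can be sketched in Python as below. This is a hedged illustration of one common way to keep the intervals in the index of T and the index of T^r synchronized: the index being searched takes an ordinary backward step, and the other interval is narrowed by counting the characters smaller than c inside the current BWT window. It is not the wavelet-tree implementation from the talk. All names (`fm_index`, `new_search`, `extend_left`, `extend_right`) are hypothetical, intervals are 1-based and inclusive, and counts are done by linear scans instead of wavelet-tree queries.

```python
def fm_index(text):
    """Return (bwt, C) for a text that already ends with the sentinel '$'.
    Assumption: '$' sorts before every other character of the text."""
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    bwt = "".join(text[k - 1] for k in sa)  # k == 0 wraps to the sentinel
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total          # characters smaller than ch
        total += text.count(ch)
    return bwt, C


def _extend(index, i, j, other_i, c):
    """Backward step with c on `index`, plus the matching update of the
    interval start in the other index. Returns None if there is no match."""
    bwt, C = index
    if c not in C:
        return None
    new_i = C[c] + bwt[:i - 1].count(c) + 1
    new_j = C[c] + bwt[:j].count(c)
    if new_i > new_j:
        return None
    # Characters smaller than c in BWT[i..j] precede c in the other index.
    smaller = sum(1 for ch in bwt[i - 1:j] if ch < c)
    new_other_i = other_i + smaller
    return new_i, new_j, new_other_i, new_other_i + (new_j - new_i)


def new_search(text):
    """Build both indexes and the state of the empty pattern.
    State layout: (i, j, ir, jr) = interval in T, interval in T^r."""
    n = len(text) + 1
    return fm_index(text + "$"), fm_index(text[::-1] + "$"), (1, n, 1, n)


def extend_left(fwd, state, c):
    """Pattern P becomes cP."""
    i, j, ir, _ = state
    return _extend(fwd, i, j, ir, c)


def extend_right(rev, state, c):
    """Pattern P becomes Pc."""
    i, _, ir, jr = state
    res = _extend(rev, ir, jr, i, c)
    if res is None:
        return None
    new_ir, new_jr, new_i, new_j = res
    return new_i, new_j, new_ir, new_jr
```

A usage example: after `fwd, rev, st = new_search("abracadabra")`, the calls `extend_right(rev, st, "r")`, then `extend_left(fwd, st, "b")`, then `extend_right(rev, st, "a")` (each applied to the previous result) track the pattern "bra" in both indexes; the number of occurrences is `j - i + 1`.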

  28. Bidirectional Wavelet Tree

  31. SCAT

  33. Summary

      | Description | Approach | Category | Full Tree Operations | Space |
      |---|---|---|---|---|
      | Strothmann's Affix Array | AfA | Asynchronous | No | 2 · (SA + LCP + child tables + ALink) |
      | Asynchronous Affix Tree implementation | ACAT | Asynchronous | Yes | 2 · (CST + ALink) |
      | Asynchronous Affix Tree implementation | ACATS | Asynchronous | Yes | 2 · (CST + ALink sampled) |
      | Gog et al. Affix Tree | ACATN | Asynchronous | Yes | 2 · (CST + rminq + rmaxq) |
      | Reduced version of ACATN | RACATN | Asynchronous | Yes | 2 · (CST + rminq) |
      | Bidirectional BWT | BidWT | Synchronous | No | 2 · (FM-Index) |
      | Synchronous Affix Tree implementation | SCAT | Synchronous | Yes | 2 · (CST) |

      Table: Compressed Affix Tree approaches studied in this work.

  35. Construction
      [Figure: two panels, DNA-50MB and ENGLISH-50MB. x-axis: number of bytes per character (log scale); y-axis: time in milliseconds (log scale). Series: AFA, ACAT, ACATS, ACATN, RACATN, BidWT, SCAT.]

  36. Forward-Backward
      [Figure: two panels, DNA-50MB and ENGLISH-50MB. x-axis: number of bytes per character (log scale); y-axis: time in microseconds (log scale). Series: AFA, ACAT, ACATS, ACATN, RACATN, BidWT, SCAT.]

  37. Suffix-Children
      [Figure: two panels, DNA-50MB and ENGLISH-50MB. x-axis: number of bytes per character (log scale); y-axis: time in microseconds (log scale). Series: AFA, ACAT, ACATS, ACATN, RACATN, BidWT, SCAT.]

  38. Slink
      [Figure: two panels, DNA-50MB and ENGLISH-50MB. x-axis: number of bytes per character (log scale); y-axis: time in microseconds (log scale). Series: ACAT, ACATS, ACATN, RACATN, SCAT.]

  40. Conclusions & Future Work
      - Asynchronous and Synchronous classification
      - Benchmark for the Compressed Affix Tree approaches
      - Create a public library containing all the tools
      - Still missing: pattern search with errors
