


  1. Space-efficient construction of succinct de Bruijn graphs
     Felipe A. Louza, University of São Paulo, Brazil
     Joint work with Lavinia Egidi and Giovanni Manzini.
     LSD/LAW, London, 6-7 Feb. 2019

  2. Outline
     1. Introduction
     2. BOSS construction
     3. Merging dBGs
     4. Space-efficient BOSS construction
     5. References
     Felipe A. Louza (USP) Space-efficient construction of dBGs 2 / 21

  3. de Bruijn graphs (dBGs)
     Definitions:
     ◮ Given a collection of strings S, a de Bruijn graph of order k is a directed graph containing:
       ◮ a node v for every unique k-mer v[1]...v[k] in S;
       ◮ an edge (u, v) with label v[k] if there is a (k+1)-mer u[1]...u[k]v[k] in S.
     Example:
     ◮ S = { TACACT, TACTCA, GACTCG }
     [Figure: de Bruijn graph of order k = 3 for S, with nodes TAC, ACA, CAC, ACT, CTC, TCA, GAC, TCG and edges labeled A, C, G or T]
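As a rough illustration of the definition above, the following Python sketch enumerates the nodes and edges of the order-3 graph for the example collection. The function name `build_dbg` and the representation of edges as (u, v, label) triples are choices made here for illustration; they do not come from the slides.

```python
def build_dbg(strings, k):
    """Return (nodes, edges) of the order-k de Bruijn graph of `strings`.

    nodes: set of distinct k-mers.
    edges: set of (u, v, label) triples, one per distinct (k+1)-mer,
           where label = v[k], the last symbol of v.
    """
    nodes, edges = set(), set()
    for s in strings:
        # Every length-k window is a node.
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
        # Every length-(k+1) window u[1..k] v[k] is an edge u -> v.
        for i in range(len(s) - k):
            kp1 = s[i:i + k + 1]
            u, v = kp1[:k], kp1[1:]
            edges.add((u, v, v[-1]))
    return nodes, edges


S = ["TACACT", "TACTCA", "GACTCG"]
nodes, edges = build_dbg(S, 3)
print(sorted(nodes))
for u, v, c in sorted(edges):
    print(f"{u} -{c}-> {v}")
```

Storing nodes and edges as explicit sets like this is the non-succinct baseline that the representation on the following slides is meant to avoid.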


  6. Succinct representation of dBGs: BOSS*
     ◮ [Bowe et al., WABI 2012] introduced a succinct representation for dBGs that uses O(|E| log σ) bits.
     ◮ BOSS representation:
       ◮ The outgoing edges of each node v_i are encoded into the substring W_i = v_j[k] ... v_k[k], that is, the last symbols of the nodes that v_i points to (in the example, W_i = AT and W_j = AG).
       ◮ The substrings W_i are concatenated in the order of the reversed labels ←v_i = v_i[k]...v_i[1] (in the example, ←TAC ≺ ←CTC).
     Example:
     ◮ S = { TACACT, TACTCA, GACTCG }
     [Figure: the de Bruijn graph of S extended with the dummy nodes $$$, $$T, $TA, $$G, $GA and with $-labeled edges]
     * for the authors' initials

  7. Succinct representation of dBGs: BOSS
     Example:
     ◮ S = { $$$TACACT, $$$TACTCA, $$$GACTCG }
     For convenience, we add k copies of a symbol $ at the beginning of each string s_i.
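The padding and the reversed-label order can be sketched as follows. This is a minimal illustration, assuming k = 3 and that `$` sorts before A, C, G and T (which holds for ASCII); the helper name `outgoing_labels` is hypothetical.

```python
def outgoing_labels(strings, k):
    """Map each k-mer of the $-padded strings to its outgoing edge labels W_i."""
    out = {}
    for s in strings:
        t = "$" * k + s                       # k copies of $ in front of s_i
        for i in range(len(t) - k + 1):
            node = t[i:i + k]
            out.setdefault(node, set())
            if i + k < len(t):
                out[node].add(t[i + k])       # label = last symbol of the successor
    return {v: "".join(sorted(cs)) for v, cs in out.items()}


S = ["TACACT", "TACTCA", "GACTCG"]
W_of = outgoing_labels(S, 3)

# Sort the nodes by their reversed labels v[k]...v[1].
order = sorted(W_of, key=lambda v: v[::-1])
for v in order:
    print(v, W_of[v] or "(no outgoing edge)")
```

With this order, TAC (reversed: CAT) comes before CTC (reversed: CTC), and their label strings are AT and AG respectively, as in the example on the slide.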

  8. Succinct representation of dBGs: BOSS
     The label of every node can be recovered.


  11. Succinct representation of dBGs: BOSS
      ◮ Nodes v_i = v_i[1]...v_i[k] are sorted by their reversed labels ←v_i = v_i[k]...v_i[1].
      ◮ We mark the position of the last outgoing edge of each node.
      ◮ We mark as negative (−) incoming edges with the same label (except the first).

      last  Nodes  W
      0     $$$    G
      1     $$$    T
      1     ACA    C
      1     TCA    $
      1     $GA    C
      1     $TA    C
      1     CAC    T
      1     GAC    T-
      0     TAC    A
      1     TAC    T-
      0     CTC    A
      1     CTC    G
      1     $$G    A
      1     TCG    $
      1     $$T    A
      1     ACT    C

      [Figure: the de Bruijn graph with the dummy nodes $$$, $$T, $TA, $$G, $GA]
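The table above can be reproduced with a short Python sketch. It rests on two assumptions that the slide does not spell out: a node without outgoing edges gets a single `$` entry in W (as for TCA and TCG in the table), and an edge is marked negative when an earlier row already points to the same target node. The function name `boss_table` is hypothetical, and the code builds everything in plain Python structures, so it is not space-efficient.

```python
def boss_table(strings, k):
    """Return the rows (last, node, w) of the BOSS table for the $-padded strings."""
    succ = {}
    for s in strings:
        t = "$" * k + s
        for i in range(len(t) - k + 1):
            node = t[i:i + k]
            succ.setdefault(node, set())
            if i + k < len(t):
                succ[node].add(t[i + k])

    rows, seen_targets = [], set()
    # Nodes in order of their reversed labels ('$' < 'A' < 'C' < 'G' < 'T').
    for node in sorted(succ, key=lambda v: v[::-1]):
        labels = sorted(succ[node]) or ["$"]   # assumption: sink nodes get one '$'
        for j, c in enumerate(labels):
            w = c
            if c != "$":
                target = node[1:] + c
                if target in seen_targets:     # not the first edge into this node
                    w = c + "-"
                seen_targets.add(target)
            last = 1 if j == len(labels) - 1 else 0
            rows.append((last, node, w))
    return rows


rows = boss_table(["TACACT", "TACTCA", "GACTCG"], 3)
for last, node, w in rows:
    print(last, node, w)
```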


  14. Succinct representation of dBGs: BOSS
      ◮ LF-mapping between the positive symbols in W and the Nodes[k] (with last = 1): the i-th positive occurrence of a symbol c in W corresponds to the i-th node whose last symbol Nodes[k] is c.
      ◮ Fast navigation operations: Outdegree, Outgoing, Indegree and Incoming.
      ◮ Small space: O(m log σ) + m + o(m) bits for rank and select operations.
      Similar to the BWT and XBW.
      [Table and figure: the last / Nodes / W table and the graph, repeated from the previous slide]
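The LF-mapping and two of the navigation operations can be sketched on the example table. This is only an illustration: rank and select are done here by linear scans rather than with succinct data structures, the arrays are copied from the slide's table, and the function names `edge_range`, `outdegree` and `outgoing` are hypothetical. The node labels are kept only to derive the counts `F` and to check results; a real BOSS index would store the counts instead of the labels.

```python
# Columns of the example table (16 rows, 13 nodes).
last = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
W = ["G", "T", "C", "$", "C", "C", "T", "T-",
     "A", "T-", "A", "G", "A", "$", "A", "C"]
nodes = ["$$$", "ACA", "TCA", "$GA", "$TA", "CAC", "GAC",
         "TAC", "CTC", "$$G", "TCG", "$$T", "ACT"]

# F[c] = number of nodes whose last symbol is smaller than c.
F = {c: sum(1 for v in nodes if v[-1] < c) for c in "$ACGT"}


def edge_range(i):
    """Rows of W that belong to node i (0-based), via a naive select on `last`."""
    ones = [p for p, b in enumerate(last) if b == 1]
    start = ones[i - 1] + 1 if i > 0 else 0
    return start, ones[i]


def outdegree(i):
    """Number of outgoing edges of node i (the '$' placeholder is not counted)."""
    start, end = edge_range(i)
    return sum(1 for p in range(start, end + 1) if W[p] != "$")


def outgoing(i, c):
    """Index of the node reached from node i by the edge labeled c, or None."""
    if c == "$":
        return None
    start, end = edge_range(i)
    for p in range(start, end + 1):
        if W[p].rstrip("-") == c:
            # LF-mapping: count the positive c's in W[0..p] (naive rank).
            rank = sum(1 for q in range(p + 1) if W[q] == c)
            return F[c] + rank - 1
    return None


tac = nodes.index("TAC")
print(outdegree(tac), nodes[outgoing(tac, "A")], nodes[outgoing(tac, "T")])
```

Note how the negative entry `T-` of TAC is resolved: the rank counts only the positive T's up to that row, so the edge lands on the same node (ACT) as the earlier positive T.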

  15. Succinct representation of dBGs: BOSS
      [Figure: the same last / Nodes / W table, annotated to illustrate the select operation]

