Computational Pan-Genomics with Elastic-Degenerate Strings (a case study of my research)
NADIA PISANTI (this Department)
5/12/2019 PhD Day
The pan-Genome
Some definitions of pan-genome:
- ... describes the full complement of genes [...] which can have large variation in gene content among closely related strains [Wikipedia]
- a collection of genomic sequences to be analyzed jointly or to be used as a reference [The Computational Pan-Genomics Consortium, 2016]
Traditionally, a reference genome is:
- a genome of a single selected individual, or
- a consensus drawn from a population, or
- a "functional" genome, or
- a maximal genome capturing all ever-detected sequences
- ...
ED-strings
An Elastic-Degenerate string is a natural representation of a pan-genome.
It corresponds to the Variant Call Format (.vcf) standard [e.g. data from the 1000 Genomes project].
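As a rough illustration of the representation, the sketch below models an ED-string as a list of segments, each holding alternative strings, and enumerates the plain strings it spells. The toy text `T`, the function name `language`, and the use of `""` for an alternative that contributes no letters are assumptions made for this example; they are not taken from the slides or from a real VCF file.

```python
from itertools import product

# Hypothetical toy ED-string: a list of segments, each a list of alternatives.
# "" stands for an alternative that contributes no letters (e.g. a deletion).
T = [["CG"], ["G", "GGT", ""], ["ATA", "TA"]]


def language(ed_string):
    """Return every plain string obtained by picking one alternative per segment."""
    return sorted({"".join(choice) for choice in product(*ed_string)})


if __name__ == "__main__":
    print(len(T), "segments")
    for s in language(T):
        print(s)
```

Enumerating the language is exponential in the number of segments in general, so this is only useful for tiny examples; the point of the ED-string is to avoid listing these strings explicitly.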
Reference Pan-Genome
- Cheaper sequencing: re-sequencing became a common task.
- In genome analysis workflows, downstream of re-sequencing there is the task of mapping reads (a string) on a reference genome (a longer string).
It's PATTERN MATCHING: the read is P, the reference genome is T.
EDSM problem
ELASTIC DEGENERATE STRING MATCHING (EDSM)
~ Input: a string P of length m, an ED string T of length n and total size N
~ Output: all positions in T where at least one occurrence of P ends
P = CGGGTATA
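The problem statement above can be made concrete with a naive matcher. This is a minimal sketch, not the algorithm of any of the cited papers: the function name `edsm_naive`, the toy text `T`, and the choice to report the index of the segment in which an occurrence ends (instead of a finer-grained position) are all assumptions of this example.

```python
def edsm_naive(segments, pattern):
    """Return sorted indices of segments in which some occurrence of pattern ends.

    segments: list of lists of strings ("" allowed as an alternative).
    pattern:  non-empty plain string.
    """
    m = len(pattern)
    # active holds lengths k (0 < k < m) such that pattern[:k] is spelled by
    # some choice of alternatives ending exactly at the previous segment boundary.
    active = set()
    ends = set()
    for i, segment in enumerate(segments):
        next_active = set()
        for s in segment:
            # Case 1: the whole pattern occurs inside one alternative.
            if pattern in s:
                ends.add(i)
            # Case 2: extend partial matches that started in earlier segments.
            for k in active:
                rest = m - k
                if len(s) >= rest:
                    if s.startswith(pattern[k:]):
                        ends.add(i)
                elif pattern[k:k + len(s)] == s:
                    next_active.add(k + len(s))
            # Case 3: start a new partial match on a suffix of this alternative.
            for j in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:j]):
                    next_active.add(j)
        active = next_active
    return sorted(ends)


if __name__ == "__main__":
    # Hypothetical toy ED-string; not data from the slides.
    T = [["CG"], ["G", "GGT", ""], ["ATA", "TA"]]
    print(edsm_naive(T, "CGGGTATA"))
    print(edsm_naive(T, "GT"))
```

With this toy text, P = CGGGTATA is spelled by the choice CG, GGT, ATA and so ends in the last segment. The sketch compares the pattern against every alternative for every active prefix length, so it makes no attempt to reach the time bounds quoted in the talk.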
Lower bounds & upper bounds [ICALP 2019]
In [CPM 2017] we solved EDSM in O(N + n*m^2) time
In [CPM 2018] they solve it in O(N + n*m^1.5 √(log m)) time
Can EDSM be improved further?
In [ICALP 2019] we solve EDSM in O(N + n*m^1.381) time ... with an algebraic method!
We show one can't do better with combinatorial methods
Pattern Matching on ED-string with errors [SPIRE 2017]
Reads carry sequencing errors: how can we represent them?
Hamming Distance: Given two strings X and Y on the same alphabet and having the same length, the Hamming Distance d_H(X,Y) between X and Y is the number of positions in which they differ.
X = CGGGTATA
Y = CAGGCATA
d_H(X,Y) = 2
Edit Distance: Given two strings X and Y on the same alphabet, the Edit Distance d_E(X,Y) is the number of substitutions, insertions, or deletions of a letter needed to transform X into Y (or vice versa, as d_E(X,Y) = d_E(Y,X)).
X = CGGGTAT-A
Y = CCGG-ATTA
d_E(X,Y) = 3
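Both distances can be checked on the slide's examples with a few lines of code. This is a sketch under stated assumptions: the function names are made up for this example, the edit distance uses the textbook dynamic-programming recurrence (not a method from the SPIRE 2017 paper), and the second pair of strings is the slide's alignment with the gap symbols removed.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Minimum number of substitutions, insertions and deletions turning x into y."""
    # prev[j] is the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        curr = [i]
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,               # delete a from x
                            curr[j - 1] + 1,           # insert b into x
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming_distance("CGGGTATA", "CAGGCATA"))  # slide example, d_H = 2
    print(edit_distance("CGGGTATA", "CCGGATTA"))     # slide example, d_E = 3
```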
Degenerate Strings Comparison
STRING COMPARISON among (E)D-strings is a basic tool for many other problems:
- Are two degenerate strings the same? Or similar?
- Do they share sub-(E)D-strings? Motifs?
- Is one (E)D-string a substring of another (E)D-string? A reverse? A palindrome?
Degenerate Strings Comparison: our result [WABI 2018]
- A definition of a match among D-strings (a step into formal languages and automata problems)
- A linear (O(N+M)) algorithm to tell whether two D-strings X (of size N) and Y (of size M) match ("accidentally" solving an open formal languages and automata problem)
- An application of this D-string comparison to the design of two algorithms that decompose a D-string into palindromes (a proof of concept on real RNA data)
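To make the notion of a match tangible, here is a brute-force sketch. It assumes, since the slide does not spell out the definition, that two D-strings match when at least one plain string is spelled by both. It is exponential and is not the linear O(N+M) algorithm of the WABI 2018 paper; the names `spelled_strings` and `d_strings_match` and the toy inputs are hypothetical.

```python
from itertools import product


def spelled_strings(d_string):
    """All plain strings obtained by picking one alternative per position."""
    return {"".join(choice) for choice in product(*d_string)}


def d_strings_match(x, y):
    """Assumed definition: x and y match if they spell a common plain string."""
    return not spelled_strings(x).isdisjoint(spelled_strings(y))


if __name__ == "__main__":
    # Hypothetical toy D-strings; each position lists its alternatives.
    X = [["A", "C"], ["G"], ["T", "A"]]
    Y = [["C"], ["G", "T"], ["A"]]
    Z = [["G"], ["G"], ["A"]]
    print(d_strings_match(X, Y))  # both spell CGA
    print(d_strings_match(X, Z))  # no string of X starts with G
```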
References
The Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics 19(1): 118-135 (2018)
R. Grossi, C.S. Iliopoulos, C. Liu, N. Pisanti, S.P. Pissis, A. Retha, G. Rosone, F. Vayani, L. Versari: On-Line Pattern Matching on Similar Texts. CPM 2017: 9:1-9:14
G. Bernardini, N. Pisanti, S.P. Pissis, G. Rosone: Pattern Matching on Elastic-Degenerate Text with Errors. SPIRE 2017: 74-90 [extended version in press in Theoretical Computer Science journal]
M. Alzamel, L.A.K. Ayad, G. Bernardini, R. Grossi, C.S. Iliopoulos, N. Pisanti, S.P. Pissis, G. Rosone: Degenerate String Comparison and Applications. WABI 2018: 21:1-21:14
G. Bernardini, P. Gawrychowski, N. Pisanti, S.P. Pissis, G. Rosone: Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication. ICALP 2019: 21:1-21:15