
Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents



  1. Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents. Agata SAVARY, Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique. IPI PAN Seminar, 24 April 2006

  2. String-to-string correction

  3. Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
     • CONTEXT:
       – Finite set of symbols (alphabet)
       – Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
       – Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (sums of the costs of the editing operations involved)
       – Measure of similarity between words A and B (edit distance or error distance): minimum cost over all edit sequences transforming A into B
     • INPUT:
       – Two words A and B
     • OUTPUT:
       – Distance between A and B

  4. Examples of elementary edit operations
     • Insertion of a letter: monter → montaer, monter → montrer
     • Deletion of a letter: monter → montr, monter → monte
     • Replacement of a letter by another: monter → ponter, monter → conter
     • Transposition of two adjacent letters: monter → mnoter, monter → montre
     Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
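
The four elementary operations listed above can be illustrated with a small sketch that enumerates every word reachable from a given word in exactly one operation. The function name `one_edit_neighbours` and the choice of the lowercase ASCII alphabet are assumptions made for this illustration; they do not come from the slides.

```python
import string


def one_edit_neighbours(word, alphabet=string.ascii_lowercase):
    """Return all strings obtained from `word` by one elementary operation."""
    results = set()
    for k in range(len(word) + 1):
        head, tail = word[:k], word[k:]
        # Insertion of a letter at position k.
        for c in alphabet:
            results.add(head + c + tail)
        if tail:
            # Deletion of the letter at position k.
            results.add(head + tail[1:])
            # Replacement of the letter at position k by another letter.
            for c in alphabet:
                if c != tail[0]:
                    results.add(head + c + tail[1:])
        # Transposition of the two adjacent letters at positions k and k+1.
        if len(tail) >= 2 and tail[0] != tail[1]:
            results.add(head + tail[1] + tail[0] + tail[2:])
    return results


neighbours = one_edit_neighbours("monter")
print("montrer" in neighbours, "montre" in neighbours)
```

With unit costs, every word in the returned set is at cost 1 from the input word.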

  5. Edit sequence
     • Edit sequence = sequence of elementary edit operations
     • For each pair of words X and Y, many edit sequences exist that transform X into Y.
     • Example 1: transforming sorting into string:
       – sorting → srting → sting → string (3 operations, linear sequence)
       – sorting → sotring → string (2 operations)
       – sorting → srting → string (2 operations, linear sequence)
       – sorting → strting → string (2 operations)
       – sorting → srting → sting → sing → sring → string (5 operations, linear sequence)
       – .................
     • Example 2: transforming abc into ca:
       – abc → ac → ca (2 operations)
       – abc → cabc → cac → ca (3 operations, linear sequence)
     • From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no later operation may alter the result of a previous operation.

  6. Edit (error) distance
     • Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
       – sorting → srting → sting → string (3 operations) → cost = 3
       – sorting → sotring → string (2 operations) → cost = 2
       – sorting → srting → sting → sing → sring → string (5 operations) → cost = 5
     • Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
       – ed(sorting, string) = 2
       – ed(abc, ca) = 2, if all edit sequences are taken into account
       – ed(abc, ca) = 3, if only the linear edit sequences are taken into account

  7. Calculating the edit distance (1/4)
     Notation: word $X = x_1 x_2 \dots x_i \dots x_n$; the prefix of length i of X is $X[i] = x_1 x_2 \dots x_i$.
     The distance between two prefixes X[i+1] and Y[j+1] can be calculated from the distances between shorter prefixes. There are 3 cases.
     Case 1: if $x_{i+1} = y_{j+1}$, then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]).

  8. Calculating the edit distance (2/4)
     Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be transposed), then 4 sub-cases are possible:
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of $x_i$ and $x_{i+1}$: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (the transposition's cost)
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of $x_{i+1}$ by $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (the replacement's cost)
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (the insertion's cost)
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of $x_{i+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (the deletion's cost)

  9. Calculating the edit distance (3/4)
     Case 3: otherwise (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)), 3 sub-cases are possible:
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of $x_{i+1}$ by $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (the replacement's cost)
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (the insertion's cost)
     • The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of $x_{i+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (the deletion's cost)

  10. Calculating the edit distance (4/4)
      Edit distance between X[i] and Y[j], recursive definition. For i = 0,...,n and j = 0,...,m:
      1° ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n)
      2° ed(X[0],Y[j]) = j and ed(X[i],Y[0]) = i
      3°
      $$
      ed(X[i+1],Y[j+1]) =
      \begin{cases}
      ed(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
      1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]),\ ed(X[i-1],Y[j-1]) \,\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
      1 + \min\{\, ed(X[i],Y[j]),\ ed(X[i+1],Y[j]),\ ed(X[i],Y[j+1]) \,\} & \text{otherwise}
      \end{cases}
      $$
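
The recursive definition can be transcribed almost literally into a memoized function. The following is a minimal sketch, not the author's implementation: the name `edit_distance` is hypothetical, indices are shifted so that `ed(i, j)` compares the prefixes of lengths i and j, and the max(m,n) sentinel of rule 1° is replaced by an explicit bounds check before the transposition term is considered.

```python
from functools import lru_cache


def edit_distance(x, y):
    """Edit distance with insertion, deletion, replacement and transposition
    of adjacent letters, each at cost 1, following the three-case recurrence."""

    @lru_cache(maxsize=None)
    def ed(i, j):
        # ed(i, j) is the distance between the prefixes x[:i] and y[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        # Case 1: the last letters are equal.
        if x[i - 1] == y[j - 1]:
            return ed(i - 1, j - 1)
        # Replacement, insertion, deletion (shared by cases 2 and 3).
        candidates = [ed(i - 1, j - 1), ed(i, j - 1), ed(i - 1, j)]
        # Case 2 only: the last two letters are swapped, so a transposition
        # is also possible (bounds check replaces the sentinel of rule 1).
        if i >= 2 and j >= 2 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
            candidates.append(ed(i - 2, j - 2))
        return 1 + min(candidates)

    return ed(len(x), len(y))


print(edit_distance("sorting", "string"))  # 2
print(edit_distance("abc", "ca"))          # 3
```

The second call returns 3 rather than 2 because this recurrence only accounts for linear edit sequences.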

  11. Calculating the edit distance: dynamic programming
      Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word. Cell [n,m] contains the edit distance between the 2 words. Rows are indexed by i = 0,...,n and columns by j = 0,...,m.

             ε  s  o  r  t  i  n  g
          ε  0  1  2  3  4  5  6  7
          s  1  0  1  2  3  4  5  6
          t  2  1  1  2  2  3  4  5
          r  3  2  2  1  2  3  4  5
          i  4  3  3  2  2  2  3  4
          n  5  4  4  3  3  3  2  3
          g  6  5  5  4  4  4  3  2
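
A bottom-up version of the same computation fills the table row by row, so that each cell only needs cells that are already known. This is a sketch under the unit-cost assumption used throughout; the function name `edit_matrix` is hypothetical.

```python
def edit_matrix(x, y):
    """Return the (len(x)+1) x (len(y)+1) table d where d[i][j] is the
    edit distance between the prefixes x[:i] and y[:j]."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions turn x[:i] into the empty word
    for j in range(m + 1):
        d[0][j] = j  # j insertions turn the empty word into y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            # Replacement, insertion, deletion.
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            # Transposition of the last two letters, when they are swapped.
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d


table = edit_matrix("string", "sorting")
for letter, row in zip(" string", table):
    print(letter, row)
```

The bottom-right cell `table[-1][-1]` holds the distance between the two complete words.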

  12. Dynamic programming: case 1, $x_{i+1} = y_{j+1}$ (row i+1 is t, column j+1 is t)

             ε  s  o  r  t  i  n  g
          ε  0  1  2  3  4  ?  ?  ?
          s  1  0  1  2  3  ?  ?  ?
          t  2  1  1  2  2  ?  ?  ?
          r  ?  ?  ?  ?  ?  ?  ?  ?
          i  ?  ?  ?  ?  ?  ?  ?  ?
          n  ?  ?  ?  ?  ?  ?  ?  ?
          g  ?  ?  ?  ?  ?  ?  ?  ?

  13. Dynamic programming: case 2, $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (row i+1 is r, column j+1 is t)

             ε  s  o  r  t  i  n  g
          ε  0  1  2  3  4  ?  ?  ?
          s  1  0  1  2  3  ?  ?  ?
          t  2  1  1  2  2  ?  ?  ?
          r  3  2  2  1  2  ?  ?  ?
          i  ?  ?  ?  ?  ?  ?  ?  ?
          n  ?  ?  ?  ?  ?  ?  ?  ?
          g  ?  ?  ?  ?  ?  ?  ?  ?

  14. Dynamic programming: case 3, $x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$) (row i+1 is i, column j+1 is t)

             ε  s  o  r  t  i  n  g
          ε  0  1  2  3  4  ?  ?  ?
          s  1  0  1  2  3  ?  ?  ?
          t  2  1  1  2  2  ?  ?  ?
          r  3  2  2  1  2  ?  ?  ?
          i  4  3  3  2  2  ?  ?  ?
          n  ?  ?  ?  ?  ?  ?  ?  ?
          g  ?  ?  ?  ?  ?  ?  ?  ?

  15. String-to-language correction

  16. String-to-language correction: problem definition
      • CONTEXT:
        – Finite set of symbols (alphabet)
        – Elementary edit operations on symbols (as before) with their costs (1 per operation)
        – Edit sequences (as before)
        – Edit distance (error distance) between words: as before
      • INPUT:
        – Regular grammar describing words (a finite set of words in particular)
        – Incorrect word A (unrecognizable by the grammar)
        – Threshold t
      • OUTPUT:
        – A set of correct words $B_1, B_2, \dots, B_n$ whose distance from A stays within t (the nearest neighbors of A)

  17. String-to-language correction: simplistic approach
      • METHOD:
        – For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
        – Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
      • FEASIBILITY:
        – Impossible in the case of infinite languages
      • COMPLEXITY: O(n * m * |D|)
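
For a finite word list, the simplistic approach amounts to one full distance computation per word. The sketch below assumes a hypothetical four-word lexicon and hypothetical function names; it only works when the language can be enumerated, which is the feasibility limit noted above.

```python
def edit_distance(x, y):
    """Unit-cost edit distance with adjacent transpositions (one table per pair)."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d[n][m]


def naive_candidates(word, lexicon, t):
    """Compare `word` with every lexicon entry and keep those within threshold t."""
    return sorted(b for b in lexicon if edit_distance(word, b) <= t)


# Hypothetical lexicon, chosen only for illustration.
lexicon = ["apple", "apply", "apples", "banana"]
print(naive_candidates("aply", lexicon, 2))  # ['apple', 'apply']
```

Every lexicon entry costs a complete table here, even when it shares a long prefix with an entry that was already processed.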

  18. String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)

  19. String correction with respect to a deterministic FSA (1/4)
      Word to be corrected: *aply, threshold 2
      [Figure: deterministic FSA with states 1 to 9 and letter-labelled transitions]
      • Each time a transition is followed, a new column is calculated in the edit distance matrix.
      • If we get to a final state and the edit distance remains within the threshold, a new candidate has been found.
      The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.

             ε  a  p  p  l  ...  e
          ε  0  1  2  3  4  ...  5
          a  1  0  1  2  3  ...  4
          p  2  1  0  1  2  ...  3
          l  3  2  1  1  1  ...  2
          y  4  3  2  2  2  ...  2

      Candidate: apple

  20. String correction with respect to a deterministic FSA (2/4)
      Word to be corrected: *aply, threshold 2
      [Figure: deterministic FSA with states 1 to 9 and letter-labelled transitions]
      • Each time a transition is followed, a new column is calculated in the edit distance matrix.
      • If we get to a final state and the edit distance remains within the threshold, a new candidate has been found.
      The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.

             ε  a  p  p  l  ...  e  s
          ε  0  1  2  3  4  ...  5  6
          a  1  0  1  2  3  ...  4  5
          p  2  1  0  1  2  ...  3  4
          l  3  2  1  1  1  ...  2  3
          y  4  3  2  2  2  ...  2  3

      Candidate: apple
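
The column-by-column exploration can be sketched on a trie, which is a simple deterministic acyclic FSA. This is an illustration under stated assumptions rather than the algorithm from the cited work: `build_trie` and `correct` are hypothetical names, the lexicon is invented, the marker `"$"` (assumed absent from all words) flags final states, and a branch is abandoned as soon as every value in its newest column exceeds the threshold.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks a final state."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def correct(word, trie, t):
    """Return (candidate, distance) pairs for lexicon words within distance t."""
    n = len(word)
    found = []

    def explore(node, path, prev_col, col):
        # col[i] is the distance between word[:i] and the path followed so far;
        # prev_col is the column computed one transition earlier.
        if node.get("$") and col[n] <= t:
            found.append((path, col[n]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new = [col[0] + 1]  # one new column per transition followed
            for i in range(1, n + 1):
                if word[i - 1] == ch:
                    v = col[i - 1]
                else:
                    # Replacement, insertion, deletion.
                    v = 1 + min(col[i - 1], col[i], new[i - 1])
                    # Transposition with the previous letter on the path.
                    if i > 1 and path and word[i - 2] == ch and word[i - 1] == path[-1]:
                        v = min(v, 1 + prev_col[i - 2])
                new.append(v)
            # Threshold control: descend only if some prefix is still within t.
            if min(new) <= t:
                explore(child, path + ch, col, new)

    explore(trie, "", None, list(range(n + 1)))
    return sorted(found)


# Hypothetical lexicon, chosen only for illustration.
trie = build_trie(["apple", "apply", "apples", "ape", "banana"])
print(correct("aply", trie, 2))  # [('ape', 2), ('apple', 2), ('apply', 1)]
```

Because the columns for a shared prefix such as appl live on the recursion stack, they are computed once and reused by every word that continues that prefix.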
