A hash algorithm for N3 graphs in CWM
Work in progress

Jesús Arias Fisteus
jfisteus@csail.mit.edu, jaf@it.uc3m.es
Universidad Carlos III de Madrid
Visiting scientist at the Decentralized Information Group at CSAIL–MIT

This presentation: http://www.it.uc3m.es/jaf/mit/20060914/presentation.pdf
Implementation: http://www.it.uc3m.es/jaf/mit/20060914/hash-n3.tar.gz

Goal

Design a hash algorithm for N3 graphs such that:
- Equivalent graphs have the same hash value.
- Non-equivalent graphs have, with high probability, different hash values.

For this work, two graphs are considered equivalent if they have:
- The same statements, in the same or a different order.
- The same variables / blank nodes, with the same or different names.

Operators

XOR ($\otimes$):
- Commutative and associative.
- Problem: $a \otimes a = 0$.

Product (modulo $N$):
- Commutative and associative.
- If $N$ is prime, $\nexists\, a, b \neq 0$ such that $ab = 0$.
- $N = 2^{32} - 5$ is the largest 32-bit prime.

Product and XOR combined do not, in general, distribute over each other:
- $(ab) \otimes c \neq (a \otimes c)(b \otimes c)$
- $(a \otimes b)c \neq (ac) \otimes (bc)$

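The properties listed on this slide can be checked numerically. The following is a minimal sketch, assuming the helper names `xor` and `mul` (they are not taken from the implementation) and small arbitrary values for `a`, `b` and `c`:

```python
N = 2**32 - 5  # modulus from the slide: the largest prime below 2**32


def xor(a: int, b: int) -> int:
    """XOR operator, written as a circled times on the slides."""
    return a ^ b


def mul(a: int, b: int) -> int:
    """Product modulo N."""
    return (a * b) % N


a, b, c = 1, 2, 3  # arbitrary small values chosen for illustration

xor_self = xor(a, a)                      # XOR of a value with itself
left_1 = xor(mul(a, b), c)                # (ab) XOR c
right_1 = mul(xor(a, c), xor(b, c))       # (a XOR c)(b XOR c)
left_2 = mul(xor(a, b), c)                # (a XOR b)c
right_2 = xor(mul(a, c), mul(b, c))       # (ac) XOR (bc)

print(xor_self, left_1, right_1, left_2, right_2)  # prints: 0 1 2 9 5
```

The two pairs differ for these values, which is what lets the algorithm use one operator to "seal" a group of terms before combining it with the other.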
Why two different operators

Associativity and commutativity are sometimes undesirable.

Example: $\{f_1\} \Rightarrow \{f_2\}$
- $\mathrm{hash}(f_1) = a \otimes b$
- $\mathrm{hash}(f_2) = d \otimes e$
- $\mathrm{hash}(\Rightarrow) = c$
- $\mathrm{hash}(\{f_1\} \Rightarrow \{f_2\}) = (a \otimes b) \otimes c \otimes (d \otimes e)$

With XOR alone, exchanging $b$ and $e$ between the two formulae gives the same hash:
$(a \otimes b) \otimes c \otimes (d \otimes e) = (a \otimes e) \otimes c \otimes (d \otimes b)$

Using the product inside each formula breaks this symmetry:
$(ab) \otimes c \otimes (de) \neq (ae) \otimes c \otimes (db)$

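A small numeric check of this example, as a sketch only: the values of `a` to `e` are arbitrary stand-ins for the partial hashes, and `mul` is an assumed helper name for the product modulo $N$.

```python
N = 2**32 - 5  # prime modulus used for the product


def mul(a: int, b: int) -> int:
    """Product modulo N."""
    return (a * b) % N


# Arbitrary stand-ins for the partial hashes a, b (in f1), c (implies), d, e (in f2).
a, b, c, d, e = 2, 3, 11, 5, 7

# XOR everywhere: moving b and e across the implication is invisible.
xor_only = (a ^ b) ^ c ^ (d ^ e)
xor_only_swapped = (a ^ e) ^ c ^ (d ^ b)

# Product inside each formula, XOR between them: the swap changes the hash.
mixed = mul(a, b) ^ c ^ mul(d, e)
mixed_swapped = mul(a, e) ^ c ^ mul(d, b)

print(xor_only, xor_only_swapped)  # prints: 8 8
print(mixed, mixed_swapped)        # prints: 46 10
```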
Overview of the algorithm

- Recursive (when entering subformulae).
- Combines partial hashes of: formulae, statements (triples), variables, lists, labelled nodes, literals.
- Every statement / formula affects the hash value of the variables that appear in it, and vice versa.

Hashing a formula

1. Hash every statement in the formula ($h_{s_1}, h_{s_2}, \ldots, h_{s_n}$).
2. Take the hash of every variable declared in the formula ($h_{v_1}, h_{v_2}, \ldots, h_{v_m}$).
3. Combine them: $h = h_{s_1} h_{s_2} \cdots h_{s_n} h_{v_1} h_{v_2} \cdots h_{v_m}$.

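The combination step can be sketched as a product modulo $N$ over all partial hashes. The function name `hash_formula` and its argument layout are assumptions made for illustration, not the interface of the actual implementation.

```python
from functools import reduce

N = 2**32 - 5  # prime modulus used for the product


def hash_formula(statement_hashes, variable_hashes):
    """Multiply all statement and variable hashes together, modulo N."""
    factors = [*statement_hashes, *variable_hashes]
    return reduce(lambda acc, h: (acc * h) % N, factors, 1)


# Because the product is commutative, statement order does not matter.
h_a = hash_formula([11, 22, 33], [44])
h_b = hash_formula([33, 11, 22], [44])
print(h_a, h_b)  # prints: 351384 351384
```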
Hashing a statement (triple)

1. The constants $k_s$, $k_p$, $k_o$ are pre-defined.
2. Hash the terms in its subject, predicate and object ($h_s$, $h_p$, $h_o$).
3. Combine them: $h = (h_s k_s) \otimes (h_p k_p) \otimes (h_o k_o)$.

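A minimal sketch of the statement hash follows. The slides do not give the values of $k_s$, $k_p$ and $k_o$, so the constants below are hypothetical and kept small only so the arithmetic is easy to follow; `hash_statement` is likewise an assumed name.

```python
N = 2**32 - 5  # prime modulus used for the product

# Hypothetical position constants; the real values are not given in the slides.
K_S, K_P, K_O = 3, 5, 7


def hash_statement(h_s: int, h_p: int, h_o: int) -> int:
    """(h_s * k_s) XOR (h_p * k_p) XOR (h_o * k_o), products taken modulo N."""
    return ((h_s * K_S) % N) ^ ((h_p * K_P) % N) ^ ((h_o * K_O) % N)


# The per-position constants make the hash sensitive to term positions:
# swapping subject and object yields a different value here.
forward = hash_statement(10, 30, 20)
backward = hash_statement(20, 30, 10)
print(forward != backward)  # prints: True
```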
Hashing a term

- Labelled nodes: hash their URI (Python's hash function).
- Literals: hash them as strings (Python's hash function).
- Formulae: recursive.
- List: hash its member terms (recursion again): $h = (h_1 \otimes 1)(h_2 \otimes 2) \cdots (h_n \otimes n)$
- Anonymous variables: take their hash from the previous round (initially a constant, see later).

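The list rule and the string rule can be sketched as follows. Two assumptions are made here: the names `hash_string` and `hash_list` are illustrative, and `zlib.crc32` is swapped in for Python's built-in `hash`, because in current Python 3 the built-in string hash is randomized per process and would not give repeatable values.

```python
import zlib

N = 2**32 - 5  # prime modulus used for the product


def hash_string(text: str) -> int:
    """Stand-in for Python's hash(): a CRC32 mapped into the range 1..N-1."""
    return zlib.crc32(text.encode("utf-8")) % (N - 1) + 1


def hash_list(member_hashes) -> int:
    """(h_1 XOR 1)(h_2 XOR 2)...(h_n XOR n), products taken modulo N."""
    result = 1
    for position, h in enumerate(member_hashes, start=1):
        result = (result * (h ^ position)) % N
    return result


# XOR-ing each member with its index makes the list hash order-sensitive.
print(hash_list([5, 9]), hash_list([9, 5]))  # prints: 44 56
```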
Hashing anonymous variables

For each variable:
1. Initialize its hash with a constant: $h = k_{vu}$ for universal variables or $h = k_{ve}$ for existential ones.
2. Each time it appears in position $p$ (subject, predicate or object) of a statement with hash $h_t$, recalculate a new hash $h'$ from its previous hash $h$: $h' = h \otimes (h_t k_p)$, where $k_p$ is the constant of the position in which the variable appears.
3. When the processing of a formula (hash $h_f$) finishes, if the variable has been used in it or in any inner formula and is also declared for the next upper formula, mix their hashes in the upper level: $h'' = h'(h \otimes h_f)$.

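Steps 2 and 3 can be sketched as two small functions. This is one possible reading, not a transcription of the implementation: the function names are hypothetical, and in `lift_variable` the argument `h_upper` is taken to be the variable's hash in the upper formula ($h'$) while `h_inner` is the hash accumulated inside the formula that just finished ($h$), which is how the worked example in these slides appears to apply the rule.

```python
N = 2**32 - 5  # prime modulus used for the product


def update_variable(h: int, h_t: int, k_pos: int) -> int:
    """Step 2: h' = h XOR (h_t * k_pos), with the product taken modulo N."""
    return h ^ ((h_t * k_pos) % N)


def lift_variable(h_upper: int, h_inner: int, h_f: int) -> int:
    """Step 3 (assumed reading): h'' = h_upper * (h_inner XOR h_f) modulo N."""
    return (h_upper * (h_inner ^ h_f)) % N


# Small arbitrary numbers, only to show the two updates in sequence.
inner = update_variable(0, 10, 3)   # variable seen as subject of a statement
lifted = lift_variable(7, inner, 6)  # formula with hash 6 has been processed
print(inner, lifted)                 # prints: 30 168
```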
Example on hashing

{?x test:partOf ?y. ?z test:includes ?y} => {?x test:partOf ?z}

Statements:
- ?x test:partOf ?y: $h_1 = (k_{vu} k_s) \otimes (h_{partof} k_p) \otimes (k_{vu} k_o)$
- ?z test:includes ?y: $h_2 = (k_{vu} k_s) \otimes (h_{includes} k_p) \otimes (k_{vu} k_o)$
- ?x test:partOf ?z: $h_3 = (k_{vu} k_s) \otimes (h_{partof} k_p) \otimes (k_{vu} k_o)$

Formulae:
- {?x test:partOf ?y...}: $h_{f_1} = h_1 h_2$
- {?x test:partOf ?z}: $h_{f_2} = h_3$

Variables:
- ?x: $h_x = k_{vu}((h_1 k_s) \otimes h_{f_1})((h_3 k_s) \otimes h_{f_2})$
- ?y: $h_y = k_{vu}((h_1 k_o) \otimes (h_2 k_o) \otimes h_{f_1})$
- ?z: $h_z = k_{vu}((h_2 k_s) \otimes h_{f_1})((h_3 k_o) \otimes h_{f_2})$

Example on hashing (cntd.)

{?x test:partOf ?y. ?z test:includes ?y} => {?x test:partOf ?z}

$h = ((h_{f_1} k_s) \otimes (h_{implies} k_p) \otimes (h_{f_2} k_o))\, h_x h_y h_z$

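The formulas of this example can be evaluated directly. The sketch below transcribes them one by one; everything numeric is an assumption: the constants `K_S`, `K_P`, `K_O`, `K_VU` are arbitrary, `zlib.crc32` stands in for Python's `hash`, and the `test:` namespace is expanded to a made-up `http://example.org/test#` URI because the slides do not give it.

```python
import zlib

N = 2**32 - 5  # prime modulus used for the product

# Hypothetical constants; the slides name them but do not give their values.
K_S, K_P, K_O, K_VU = 1000003, 2000003, 3000017, 4000037


def mul(*factors: int) -> int:
    """Product of all factors modulo N."""
    result = 1
    for f in factors:
        result = (result * f) % N
    return result


def hash_uri(uri: str) -> int:
    """Stand-in for Python's hash(): a CRC32 mapped into the range 1..N-1."""
    return zlib.crc32(uri.encode("utf-8")) % (N - 1) + 1


def hash_rule(h_partof: int, h_includes: int, h_implies: int) -> dict:
    """Evaluate the example's formulas for the partOf/includes rule."""
    h1 = mul(K_VU, K_S) ^ mul(h_partof, K_P) ^ mul(K_VU, K_O)
    h2 = mul(K_VU, K_S) ^ mul(h_includes, K_P) ^ mul(K_VU, K_O)
    h3 = mul(K_VU, K_S) ^ mul(h_partof, K_P) ^ mul(K_VU, K_O)
    hf1, hf2 = mul(h1, h2), h3
    hx = mul(K_VU, mul(h1, K_S) ^ hf1, mul(h3, K_S) ^ hf2)
    hy = mul(K_VU, mul(h1, K_O) ^ mul(h2, K_O) ^ hf1)
    hz = mul(K_VU, mul(h2, K_S) ^ hf1, mul(h3, K_O) ^ hf2)
    top = mul(hf1, K_S) ^ mul(h_implies, K_P) ^ mul(hf2, K_O)
    return {"h1": h1, "h2": h2, "h3": h3, "hf1": hf1, "hf2": hf2,
            "hx": hx, "hy": hy, "hz": hz, "h": mul(top, hx, hy, hz)}


values = hash_rule(
    hash_uri("http://example.org/test#partOf"),      # assumed namespace
    hash_uri("http://example.org/test#includes"),    # assumed namespace
    hash_uri("http://www.w3.org/2000/10/swap/log#implies"),
)
print(values["h1"] == values["h3"])  # prints: True
```

Note that $h_1$ and $h_3$ coincide here, since every variable still carries the same initial constant.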
Conclusions on hashing

- Efficient algorithm.
- Seems to work well for comparing / indexing N3 formulae:
  - Independent of the ordering of statements.
  - Independent of the names of variables.
  - Low probability of collision at formula level.

Canonicalization

The canonicalization system has to decide:
- A canonical ordering for statements in the same formula.
- A canonical ordering for variables in the same formula.
- A canonical name for variables.

Solution using the hash algorithm:
- The hash of statements defines their ordering.
- The hash of variables defines their ordering.
- The ordering of variables defines their name.

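As a rough sketch of this idea, sorting by hash and numbering the variables in that order could look like the code below. The function name, the dictionary inputs and the `v0, v1, ...` naming scheme are all assumptions for illustration; the slides do not say how canonical names are spelled.

```python
def canonicalize(statement_hashes: dict, variable_hashes: dict):
    """Order statements and variables by hash; name variables by their rank.

    Both arguments map an identifier (statement text, variable name) to a hash.
    """
    ordered_statements = sorted(statement_hashes, key=statement_hashes.get)
    ordered_variables = sorted(variable_hashes, key=variable_hashes.get)
    # Hypothetical naming scheme: v0, v1, ... in hash order.
    names = {old: f"v{rank}" for rank, old in enumerate(ordered_variables)}
    return ordered_statements, names


statements, names = canonicalize(
    {"s_a": 900, "s_b": 100, "s_c": 500},
    {"x": 30, "foo": 10, "y": 20},
)
print(statements)  # prints: ['s_b', 's_c', 's_a']
print(names)       # prints: {'foo': 'v0', 'y': 'v1', 'x': 'v2'}
```

The result depends only on the hash values, not on the original variable names or statement order, provided no two hashes in the same formula are equal.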
Drawbacks

The canonical order is based on the hash value of statements / variables:
- If two statements in the same formula have the same hash, two different orderings are possible.
- If two variables have the same hash, two different naming relations are possible.

Conclusion: collisions at statement / variable level can cause canonicalization to fail.

Solution

Run the hash algorithm three times:
- In the first step, the hash of every variable starts as a constant.
- In every step:
  - The hash of statements is computed from the hash of variables in the previous step.
  - The hash of variables is computed from the hash of statements in the same step.

$V_0 \to \overbrace{S_1 \to V_1}^{\text{step 1}} \to \overbrace{S_2 \to V_2}^{\text{step 2}} \to \overbrace{S_3 \to V_3}^{\text{step 3}}$

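The effect of repeating the computation can be seen on a flat graph with no nested formulae. The sketch below is a simplification under stated assumptions: the constants are small hypothetical values, variables are strings, other terms are given directly as integer hashes, and `run_rounds` is an illustrative name.

```python
N = 2**32 - 5          # prime modulus used for the product
K_POS = (3, 5, 7)      # hypothetical constants for subject, predicate, object
K_VAR = 11             # hypothetical initial hash of every variable


def run_rounds(statements, variables, rounds=3):
    """Alternate statement and variable hashing: V0 -> S1 -> V1 -> S2 -> ..."""
    var_hash = {name: K_VAR for name in variables}  # V_0
    stmt_hashes = []
    for _ in range(rounds):
        # S_i: statement hashes from the variable hashes of the previous step.
        stmt_hashes = []
        for triple in statements:
            h = 0
            for term, k in zip(triple, K_POS):
                h ^= (var_hash.get(term, term) * k) % N
            stmt_hashes.append(h)
        # V_i: variable hashes from the statement hashes of the same step.
        new_hash = dict(var_hash)
        for triple, h_t in zip(statements, stmt_hashes):
            for term, k in zip(triple, K_POS):
                if term in var_hash:
                    new_hash[term] ^= (h_t * k) % N
        var_hash = new_hash
    return stmt_hashes, var_hash


# Chain "x p y. y p z", with 2 standing in for the hash of the predicate p.
graph = [("x", 2, "y"), ("y", 2, "z")]
s1, v1 = run_rounds(graph, ["x", "y", "z"], rounds=1)
s2, v2 = run_rounds(graph, ["x", "y", "z"], rounds=2)
print(s1[0] == s1[1], s2[0] == s2[1])  # prints: True False
```

After one step the two statements collide because all variables still look alike; from the second step on, each variable's hash reflects where it occurs, so the statements separate.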
Other problems and fixes

- Variables defined locally in two or more formulae that are exactly equal will collide.
  Solution: combine the hash of every variable with the hash of every parent formula of the formula in which the variable is declared: $h'_v = h_v \otimes (h_{f_1} h_{f_2} \cdots h_{f_n})$
- Variables declared but not used have a fixed hash value and therefore all of them collide.
  Solution: remove such variables from the canonicalized formula.

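The first fix can be sketched as a single function. The name `mix_with_parents` is an assumption, and the numbers are arbitrary; the point is only that two variables with the same local hash end up different once their parent formulae differ.

```python
N = 2**32 - 5  # prime modulus used for the product


def mix_with_parents(h_v: int, parent_formula_hashes) -> int:
    """h'_v = h_v XOR (h_f1 * h_f2 * ... * h_fn), product taken modulo N."""
    product = 1
    for h_f in parent_formula_hashes:
        product = (product * h_f) % N
    return h_v ^ product


# Same local hash (5), different chains of parent formulae.
in_first = mix_with_parents(5, [2, 3])
in_second = mix_with_parents(5, [2, 7])
print(in_first, in_second)  # prints: 3 11
```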
Implementation

Features:
- Loads documents using the CWM parser.
- Calculates the hash value of the loaded formula.
- Canonicalizes the loaded formula.
- Writes the canonicalized formula.