General information measures | SALZA similarity as information | Applications of SALZA | Parallel implementation

SALZA: Algorithmic Information Theory and Universal Classification for Sequences
SeqBio 2018, Rouen, France
François Cayre, Nicolas Le Bihan and Marion Revolle
GIPSA-Lab | DIS | CICS
November 19th, 2018

General information measures: Definitions | Examples

Axioms for measuring information

Definition (General information measure [Steudel et al., 2010]). Let $X$ be a set of discrete-valued random variables, $\Omega = 2^X$ be the set of subsets, $(\Omega, \wedge, \vee)$ be a finite lattice and $0$ be the meet of all elements. $R : \Omega \to \mathbb{R}$ is an information measure if it satisfies:
1. Normalization: $R(0) = 0$;
2. Monotonicity: $\forall s, t \in \Omega$, $s \le t \implies R(s) \le R(t)$;
3. Submodularity: $\forall s, t \in \Omega$, $R(s) + R(t) \ge R(s \vee t) + R(s \wedge t)$.

Definition (Conditional mutual information [Steudel et al., 2010]). $\forall s, t, u \in \Omega$, $I(s : t \mid u) = R(s \vee u) + R(t \vee u) - R(s \vee t \vee u) - R(u)$. $s$ and $t$ are said to be independent given $u$ if $I(s : t \mid u) = 0$.
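
To make the three axioms concrete, here is a minimal Python sketch that checks them by brute force on a toy lattice. It assumes the lattice is the power set of three hypothetical "texts" (with union as join and intersection as meet) and takes $R$ to be the vocabulary size of a set of texts; the data and all function names are illustrative and do not come from the slides.

```python
from itertools import combinations

# Hypothetical toy data: three "texts", each given as a set of words.
TEXTS = {
    "a": {"the", "cat", "sat"},
    "b": {"the", "dog", "sat"},
    "c": {"a", "bird", "flew"},
}

def R(subset):
    """Vocabulary size of a set of texts (number of distinct words)."""
    words = set()
    for name in subset:
        words |= TEXTS[name]
    return len(words)

def powerset(items):
    """All subsets of items, as frozensets (the lattice Omega)."""
    items = list(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def satisfies_axioms(R, omega):
    """Brute-force check of normalization, monotonicity and submodularity."""
    normalized = R(frozenset()) == 0
    monotone = all(R(s) <= R(t) for s in omega for t in omega if s <= t)
    submodular = all(R(s) + R(t) >= R(s | t) + R(s & t)
                     for s in omega for t in omega)
    return normalized and monotone and submodular

def cond_mutual_info(R, s, t, u=frozenset()):
    """I(s : t | u) = R(s v u) + R(t v u) - R(s v t v u) - R(u)."""
    return R(s | u) + R(t | u) - R(s | t | u) - R(u)

omega = powerset(TEXTS)
a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
print(satisfies_axioms(R, omega))   # True
print(cond_mutual_info(R, a, b))    # 2: texts a and b share two words
print(cond_mutual_info(R, a, c))    # 0: texts a and c share no word
```

With this choice of $R$, two texts are independent exactly when their vocabularies are disjoint, which gives some intuition for the definition of independence above.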

Deriving information theory

Lemma (Non-negativity of mutual information and conditioning [Steudel et al., 2010]). $\forall s, t, u \in \Omega$, the following hold:
1. $0 \le I(s : t \mid u)$;
2. $0 \le I(s \mid t, u) \le I(s \mid t)$.

Lemma (Chain rule [Steudel et al., 2010]). $\forall s, t, u, x \in \Omega$, $I(s : t \vee u \mid x) = I(s : t \mid x) + I(s : u \mid t, x)$.

Lemma (Data processing inequality [Steudel et al., 2010]). $\forall s, t, x \in \Omega$, $R(s \mid t) = 0 \implies I(s : x \mid t) = 0 \implies I(s : x) \le I(t : x)$.
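
As a sanity check, the chain rule can be verified directly from the definition of conditional mutual information. The short derivation below assumes that conditioning on $t, x$ means conditioning on the join $t \vee x$ (the slides do not spell this out), and uses only the associativity and commutativity of $\vee$.

```latex
\begin{align*}
I(s : t \mid x) &= R(s \vee x) + R(t \vee x) - R(s \vee t \vee x) - R(x), \\
I(s : u \mid t, x) &= R(s \vee t \vee x) + R(t \vee u \vee x)
                     - R(s \vee t \vee u \vee x) - R(t \vee x), \\
I(s : t \mid x) + I(s : u \mid t, x)
  &= R(s \vee x) + R(t \vee u \vee x) - R(s \vee t \vee u \vee x) - R(x) \\
  &= I(s : t \vee u \mid x).
\end{align*}
```

The terms $R(t \vee x)$ and $R(s \vee t \vee x)$ cancel in the sum, which is all the chain rule requires.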

Examples of information measures [Steudel et al., 2010]

Common examples:
- Shannon entropy of random variables;
- Kolmogorov complexity of binary strings;
- Period length of time series;
- Size of vocabulary in a text.

Complexity/Compression-based:
- Lempel-Ziv complexity (LZ76);
- Grammar-based compression;
- LZ77? Ziv-Merhav? (Now, that's a cliffhanger!)

SALZA similarity as information: SALZA factorizations of sequences | Taking similarity into account | SALZA measures of similarity | S ✶ (LZ77) and S + ✶ (Ziv-Merhav) complexities as information measures | Symmetry and positivity of S f in practice

Sequences

Definition (Sequences). A sequence $x$ is defined as a finite succession of symbols drawn from a countable alphabet $A$. Let $|x|$ be the length of the sequence $x$; the empty sequence is $\emptyset$. $A^{+}$ is the set of all non-empty sequences and $A^{\star} = \emptyset \cup A^{+}$. In a set of $n$ sequences $x_1, \dots, x_n$, the first $k$ sequences are denoted by $x_{\le k}$, and $x_{\le 0} = \emptyset$.

SALZA factorizations

Definition (Prior knowledge $R$ and factorizations). Given sequences $y, x_1, \dots, x_n \in A^{\star}$, the notation $y \wr x_1, \dots, x_n$ stands for the generic case and denotes any of the following canonical factorizations:
1. $y \mid x_1, \dots, x_n$: $R$ is the past of $y$ and the entirety of $x_1, \dots, x_n$ → LZ77-based factorization;
2. $y \mid^{+} x_1, \dots, x_n$: $R$ is the entirety of $x_1, \dots, x_n$ → Ziv-Merhav-based factorization.

SALZA factorizations in picture

[Figure: the sequence $y$, split into its past (already factorized) and the part still to be factorized, shown together with the sequences $x_1, x_2, x_3, \dots, x_n$. Two regions are marked: $R$ for $y \mid x_1, \dots, x_n$ (LZ77) and $R$ for $y \mid^{+} x_1, \dots, x_n$ (Ziv-Merhav).]

SALZA symbols (and lengths)

Definition (SALZA symbols $(s, l, z)$ and their lengths $L_{y \wr x_1, \dots, x_n}$). By always finding the next longest subsequence in $R$, SALZA computes a factorization of $y$ into $m$ symbols $(s_i, l_i, z_i)_{1 \le i \le m}$:

$y \wr x_1, \dots, x_n = (s_1, l_1, z_1) \dots (s_m, l_m, z_m)$.

- Literals: $s = y$, $l = 1$ and $z$ is the symbol in $A$ that should be copied to the output buffer;
- References: $l > 1$ is the length of a subsequence in $R$.

SALZA symbol lengths are collected into:

$L_{y \wr x_1, \dots, x_n} = \{ l_i \}_{1 \le i \le m}$.
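
The greedy "next longest subsequence" rule can be illustrated with a short Python sketch. This is not the SALZA implementation: it uses a naive substring search, records only the symbol lengths $l_i$, and assumes that in the LZ77-based case a match must lie entirely within the already-factorized past of $y$ (no overlap with the part still to be factorized). The function names `longest_match` and `factorize_lengths` are hypothetical.

```python
def longest_match(target, pos, reference):
    """Length of the longest prefix of target[pos:] that occurs in reference."""
    best = 0
    # If a prefix of length k+1 occurs in reference, so does the prefix of
    # length k, so the length can be grown one symbol at a time.
    while pos + best < len(target) and target[pos:pos + best + 1] in reference:
        best += 1
    return best

def factorize_lengths(y, refs, use_past):
    """Greedy factorization of y; returns the list of symbol lengths.

    use_past=True  : R is the past of y plus all of refs (LZ77-based).
    use_past=False : R is refs only (Ziv-Merhav-based).
    """
    lengths = []
    pos = 0
    while pos < len(y):
        sources = list(refs)
        if use_past:
            sources.append(y[:pos])  # assumption: no self-overlapping matches
        best = max((longest_match(y, pos, r) for r in sources), default=0)
        # References have length > 1; anything shorter is emitted as a literal.
        step = best if best > 1 else 1
        lengths.append(step)
        pos += step
    return lengths

print(factorize_lengths("abcabc", [], use_past=True))        # [1, 1, 1, 3]
print(factorize_lengths("abcabc", ["abc"], use_past=False))  # [3, 3]
print(factorize_lengths("abcabc", [], use_past=False))       # [1, 1, 1, 1, 1, 1]
```

With no reference sequences, the LZ77-based variant can still reuse the past of `y`, whereas the Ziv-Merhav-based variant falls back to one literal per symbol.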

Product of SALZA factorizations

Definition (Product of SALZA factorizations). Let $y_1$ and $y_2$ be two sequences being factorized, with respective prior knowledge sequences $x_{1,1}, \dots, x_{1,n_1}$ and $x_{2,1}, \dots, x_{2,n_2}$. Let also:

$y_1 \wr x_{1,1}, \dots, x_{1,n_1} = (s_{1,1}, l_{1,1}, z_{1,1}) \dots (s_{1,m_1}, l_{1,m_1}, z_{1,m_1})$, and
$y_2 \wr x_{2,1}, \dots, x_{2,n_2} = (s_{2,1}, l_{2,1}, z_{2,1}) \dots (s_{2,m_2}, l_{2,m_2}, z_{2,m_2})$.

We define their factorization product as the concatenation of their SALZA symbols:

$y_1 \wr x_{1,1}, \dots, x_{1,n_1} \times y_2 \wr x_{2,1}, \dots, x_{2,n_2} = (s_{1,1}, l_{1,1}, z_{1,1}) \dots (s_{1,m_1}, l_{1,m_1}, z_{1,m_1}) \, (s_{2,1}, l_{2,1}, z_{2,1}) \dots (s_{2,m_2}, l_{2,m_2}, z_{2,m_2})$.

SALZA joint and LZ77 factorizations

Definition (SALZA joint and LZ77 factorizations). By convention, set $x_{\le 0} = \emptyset$. The joint factorization of $x_1, \dots, x_n \in A^{\star}$ is defined as the following product of factorizations:

$x_1 \cdot \ldots \cdot x_n = \prod_{i=1}^{n} x_i \wr x_{\le i-1}$.

Hence, $x_1 \mid \emptyset$ denotes the usual LZ77 factorization of $x_1$. Moreover, $x_1 \mid^{+} \emptyset$ denotes the succession of symbols forming $x_1$.

On asymmetry: note that in general, $x \cdot y \ne y \cdot x$. On sequences, we are limited (?) to asymmetric relationships, see [Steudel et al., 2010].
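
The joint factorization and its asymmetry can be illustrated on a small example. The Python sketch below is self-contained and makes the same simplifying assumptions as any toy factorizer: naive substring search, only symbol lengths are kept, and LZ77-style matches must lie entirely in the already-factorized past. The names `factorize` and `joint_factorization` are hypothetical, not part of SALZA.

```python
def factorize(y, refs, use_past=True):
    """Greedy factorization of y against refs; returns the symbol lengths."""
    lengths, pos = [], 0
    while pos < len(y):
        # R: the reference sequences, plus the past of y in the LZ77 case.
        sources = list(refs) + ([y[:pos]] if use_past else [])
        best = 0
        for r in sources:
            k = 0
            while pos + k < len(y) and y[pos:pos + k + 1] in r:
                k += 1
            best = max(best, k)
        step = best if best > 1 else 1  # literal unless a reference has l > 1
        lengths.append(step)
        pos += step
    return lengths

def joint_factorization(seqs, use_past=True):
    """Concatenate the factorizations of each x_i given x_1, ..., x_{i-1}."""
    out = []
    for i, x in enumerate(seqs):
        out += factorize(x, seqs[:i], use_past)
    return out

print(joint_factorization(["aaaa", "ab"]))  # [1, 1, 2, 1, 1]
print(joint_factorization(["ab", "aaaa"]))  # [1, 1, 1, 1, 2]
```

Swapping the two sequences changes the resulting list of symbol lengths, which is the asymmetry $x \cdot y \ne y \cdot x$ noted above.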

Rationale

Noisy-stemming hypothesis [Cancedda et al., 2003]: "Multiple word matching really does occur and is beneficial in forming discriminant, high weight features."

Sequence compressibility [Raskhodnikova et al., 2013]: Compressibility of a sequence using LZ77 is an inverse function of its $\ell$-th subword complexity, for small $\ell$. The higher the number of small subsequences to be compressed (noise), the lower the discriminative power of compressors.

Morphological normalization in SALZA: We shall penalize small subsequence lengths in the factorizations.