Detection of unusual words



  1. Monotony of Surprise and Large-Scale Quest for Unusual Words
     Stefano Lonardi, University of California, Riverside
     Joint work with A. Apostolico, M. E. Bock, F. Gong
     March 2000, Data Compression Conference 2000

     Detection of unusual words
     - GIVEN
       - a text x
       - a probabilistic model of the source which has generated x
     - FIND all the substrings of x which are significantly more frequent or more rare than the model-based expectation

  2. Example
     Input: the text x (labelled >MEK1 on the slide) and the MODEL parameters.

     …ATGACAAGTCCTAAAAAGAGCGAAAACACAGGGTTGTTTGATTGTAGAAAATCACAGCG
     CCACCCTTTTGTGGGGCTTCTATTTCAAGGACCTTCATTATGGAAACAGGGCGAGGTTGT
     TTGTTCTTCCTGCATGTTGCGCGCAGTGCGTAAGAAAGCGGGACGTAAGCAGTTTAGCCA
     TTCTAAAAGGGGCATTATCAGAATAAGAAGGCCCTATGAGGTATGATTGTAAAGCAAGTG
     GTGTAAAATTGTGTGCTACCTACCGTATTAGTAGGAACAATTATGCAAGAGGGGTCCTGT
     GCAAATAAAAAATATATATCTAGAAAAAGAGTAGGTAGGTCCTTCACAATATTGACTGAT
     AGCGATCTCCTCACTATTTTTCACTTATATGCAGTATATTTGTCTGCTTATCTTTCATTA
     AGTGGAATCATTTGTAGTTTATTCCTACTTTATGGGTATTTTCCAATCATAAAGCATACC
     GTGGTAATTTAGCCGGGGAAAAGAAGAATGATGGCGGCTAAATTTCGGCGGC…

     Sought (?): transcription factor binding sites

  3. Transcription factor binding sites
     - Co-expressed genes … putative binding sites
     - Pattern discovery

     General framework
     - Which patterns do we count?
     - What do we expect, under the given model?
     - What is unusual?
     - How do we count efficiently?
     - How many patterns can be unusual?
     - How do we compute statistical parameters efficiently?

  4. Notations
     - $x$: sequence, $|x| = n$
     - $y$: substring of $x$, $|y| = m$
     - $f(y)$: number of occurrences of $y$ in $x$

     Bernoulli model
     Let $Z_y$ be a r.v. for the number of occurrences of $y$, let $p_a$ be the probability of $a \in \Sigma$, and let $m \le (n+1)/2$. Then
     $$E(Z_y) = (n-m+1)\prod_{i=1}^{m} p_{y[i]} = (n-m+1)\,\hat{p}$$
     $$\mathrm{Var}(Z_y) = E(Z_y)(1-\hat{p}) - \hat{p}^2\,(2n-3m+2)(m-1) + 2\hat{p}\,B(y)$$
     where
     $$B(y) = \sum_{d \in P(y)} (n-m+1-d) \prod_{i=m-d+1}^{m} p_{y[i]}$$
     and $P(y)$ is the set of period lengths of $y$.
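
The two moments above can be evaluated directly from the definitions. The following is a minimal sketch, not the authors' implementation: the function names and the dictionary `p` of symbol probabilities are assumptions made for illustration, and a period length is taken to be any shift `d` with `1 <= d < m` under which `y` overlaps itself. The overlap term uses the count of position pairs at distance less than `m`, which is `(2n - 3m + 2)(m - 1) / 2` when `m <= (n + 1) / 2`.

```python
from math import prod


def periods(y):
    """Period lengths of y: shifts d (1 <= d < m) with y[:m-d] == y[d:]."""
    m = len(y)
    return [d for d in range(1, m) if y[:m - d] == y[d:]]


def expectation(y, n, p):
    """E(Z_y) = (n - m + 1) * p_hat for a text of length n."""
    p_hat = prod(p[c] for c in y)
    return (n - len(y) + 1) * p_hat


def variance(y, n, p):
    """Var(Z_y) under the Bernoulli model, valid for m <= (n + 1) / 2."""
    m = len(y)
    if 2 * m > n + 1:
        raise ValueError("this formula assumes m <= (n + 1) / 2")
    p_hat = prod(p[c] for c in y)
    # B(y): one term per period d, weighted by the last d symbols of y
    b = sum((n - m + 1 - d) * prod(p[c] for c in y[m - d:])
            for d in periods(y))
    e = (n - m + 1) * p_hat
    return (e * (1 - p_hat)
            - p_hat ** 2 * (2 * n - 3 * m + 2) * (m - 1)
            + 2 * p_hat * b)
```

For example, `variance("aba", 21, {"a": 0.6, "b": 0.4})` uses the single period `d = 2` of `aba`. For small `n` the result can be cross-checked by enumerating every text of length `n` and weighting its count by its probability.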

  5. Scores
     $$z_1(y) = f(y) - E(Z_y)$$
     $$z_2(y) = \frac{f(y) - E(Z_y)}{\sqrt{E(Z_y)}}$$
     $$z_3(y) = \frac{f(y) - E(Z_y)}{\sqrt{E(Z_y)(1-\hat{p})}}$$
     $$z_4(y) = \frac{f(y) - E(Z_y)}{\sqrt{\mathrm{Var}(Z_y)}}$$
     where $Z_y$ is a r.v. for the number of occurrences of $y$.

     What is "unusual"?
     Definition. Let $y$ be a substring of $x$ and $T \in \mathbb{R}^+$.
     - if $z(y) > T$, then $y$ is over-represented
     - if $z(y) < -T$, then $y$ is under-represented
     - if $|z(y)| > T$, then $y$ is unusual
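
As a rough illustration of the scores and the threshold test, here is a small sketch. It assumes the normalizations are the square roots of $E(Z_y)$, $E(Z_y)(1-\hat{p})$ and $\mathrm{Var}(Z_y)$, and that the expectation and variance have already been computed; `z_scores` and `classify` are hypothetical names, not taken from the slides.

```python
from math import sqrt


def z_scores(f, e, var, p_hat):
    """Four scores for an observed count f, with e = E(Z_y), var = Var(Z_y)."""
    d = f - e
    return {
        "z1": d,
        "z2": d / sqrt(e),
        "z3": d / sqrt(e * (1 - p_hat)),
        "z4": d / sqrt(var),
    }


def classify(z, t):
    """Apply the definition with threshold t > 0."""
    if z > t:
        return "over-represented"
    if z < -t:
        return "under-represented"
    return "not unusual"
```

A word is unusual exactly when `classify` returns one of the first two labels, since $|z(y)| > T$ covers both cases.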

  6. Problem definition
     Given
     - Sequence x
     - Model M
     - Type of count (f, …)
     - Score function z
     - Threshold T
     Find
     - The set of all unusual words in x w.r.t. (f/…, z, M, T)

     Computational problems
     - Counting "events" in strings (occurrences, …)
     - Computing expectations, variances, and scores (under the given model)
     - Detecting and visualizing unusual words
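
To make the problem statement concrete, the sketch below solves it by brute force for one particular choice of parameters: f counts (possibly overlapping) occurrences, M is a Bernoulli model given as a dictionary `p`, and z is $(f - E)/\sqrt{E}$. It is a naive quadratic baseline written for illustration only, and `unusual_words` and `max_len` are hypothetical names.

```python
from collections import Counter
from math import prod, sqrt


def unusual_words(x, p, t, max_len=None):
    """Return {word: score} for every substring y of x with |z(y)| > t."""
    n = len(x)
    max_len = n if max_len is None else max_len
    # Count every substring occurrence, one per start position and length
    counts = Counter(
        x[i:i + m] for m in range(1, max_len + 1) for i in range(n - m + 1)
    )
    result = {}
    for y, f in counts.items():
        e = (n - len(y) + 1) * prod(p[c] for c in y)  # E(Z_y)
        z = (f - e) / sqrt(e)
        if abs(z) > t:
            result[y] = z
    return result
```

Calling `unusual_words("aaaaaaaa", {"a": 0.5, "b": 0.5}, 3)` flags `aa` but not `a`. Because every substring is scored separately, the cost grows quadratically with the text length.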

  7. Combinatorial problem
     - A sequence of size $n$ could have $O(n^2)$ unusual words
     - How to limit the set of unusual words?

     Monotony of surprise
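
The quadratic bound is easy to observe on a small input. The sketch below counts the distinct substrings of a text; the test string $a^k b^k$ is chosen here only as an illustration, because its distinct substrings can be counted by hand ($k^2 + 2k$ of them).

```python
def distinct_substrings(x):
    """Set of all distinct non-empty substrings of x (naive enumeration)."""
    n = len(x)
    return {x[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def count_for_akbk(k):
    """Number of distinct substrings of the text a^k b^k."""
    return len(distinct_substrings("a" * k + "b" * k))
```

Doubling `k` roughly quadruples `count_for_akbk(k)`, so any method that scores each candidate word separately is quadratic in the worst case.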

  8. Theorem. Let $C$ be a subset of words from text $x$. If $f(y)$ remains constant for all $y$ in $C$, then any score of the type
     $$z(y) = \frac{f(y) - E(y)}{N(y)}$$
     is monotonically increasing with $|y|$ provided that
     - $N(y)$ is monotonically decreasing with $|y|$
     - $E(y)/N(y)$ is monotonically decreasing with $|y|$

     Theorem. The score functions
     $$z(y) = f(y) - E(Z_y)$$
     $$z(y) = \frac{f(y) - E(Z_y)}{\sqrt{E(Z_y)}}$$
     $$z(y) = \frac{f(y) - E(Z_y)}{\sqrt{E(Z_y)(1-\hat{p})}}$$
     are monotonically increasing with $|y|$, for all $y$ in class $C$.
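
A numerical check can make the theorem tangible. The sketch below scores a chain of words, each extending the previous one by a single symbol, under a Bernoulli model; when all words in the chain have the same count, the three scores should increase along the chain. The helper names and the probabilities are assumptions for illustration, and the square-root normalizations are assumed as well.

```python
from math import prod, sqrt


def count(x, y):
    """Number of (possibly overlapping) occurrences of y in x."""
    return sum(x.startswith(y, i) for i in range(len(x) - len(y) + 1))


def score_chain(x, words, p):
    """Rows (word, count, z1, z2, z3) for each word under the model p."""
    n = len(x)
    rows = []
    for y in words:
        p_hat = prod(p[c] for c in y)
        e = (n - len(y) + 1) * p_hat      # E(Z_y)
        f = count(x, y)
        d = f - e
        rows.append((y, f, d, d / sqrt(e), d / sqrt(e * (1 - p_hat))))
    return rows
```

With `x = "abaababaabaababaababa"` and the chain `aa`, `baa`, `abaa`, `abaab`, `abaaba`, every word occurs four times, so only the expectation changes from one row to the next.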

  9. Theorem. If $p_{\max} < \min\{1/\sqrt[|y|]{4|y|},\ \sqrt{2}-1\}$, then
     $$z(y) = \frac{f(y) - E(Z_y)}{\sqrt{\mathrm{Var}(Z_y)}}$$
     is monotonically increasing with $|y|$, for all $y$ in class $C$.

     Building the partition

  10. [Figure: the text x = abaababaabaababaababa with the four occurrences of aa highlighted]

  11. [Figure: the same text with the occurrences of baa and abaa highlighted]

  12. [Figure: the same text with the occurrences of abaab and abaaba highlighted]

  13. [Figure: the text x = abaababaabaababaababa with the occurrences of abaaba highlighted, and the words of the resulting class C]
      - C = {abaaba, abaab, abaa, baa, baab, baaba, aa, aab, aaba}
      - max(C) = abaaba: candidate over-represented word
      - min(C) = aa: candidate under-represented word
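
The practical consequence is that only the two extreme words of a class need to be scored. The sketch below checks this on the class from this slide, using the score $(f - E)/\sqrt{E}$ and an arbitrary Bernoulli model; the function names and probabilities are assumptions for illustration.

```python
from math import prod, sqrt

X = "abaababaabaababaababa"
C = ["abaaba", "abaab", "abaa", "baa", "baab", "baaba", "aa", "aab", "aaba"]


def z2(x, y, p):
    """Score (f - E) / sqrt(E) of y in x under the Bernoulli model p."""
    n, m = len(x), len(y)
    f = sum(x.startswith(y, i) for i in range(n - m + 1))
    e = (n - m + 1) * prod(p[c] for c in y)
    return (f - e) / sqrt(e)


def extremes(x, cls, p):
    """Words of cls with the lowest and the highest score."""
    scored = {y: z2(x, y, p) for y in cls}
    return min(scored, key=scored.get), max(scored, key=scored.get)
```

Here `extremes(X, C, {"a": 0.6, "b": 0.4})` returns the shortest word `aa` and the longest word `abaaba`, so comparing those two scores against the threshold is enough to decide whether the class contains an unusual word.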

  14. [Figure: the substrings of x = abaababaabaababaababa grouped into classes, labelled with the counts (13), (8), (4), (3), (2), (1); a second panel shows the words of x = a^k b^k]

  15. The partition $\{C_1, C_2, \ldots, C_l\}$ of the set of all substrings of $x$ has to satisfy the following properties, for all $1 \le i \le l$:
      - $\min(C_i)$ and $\max(C_i)$ are unique
      - all $w$ in $C_i$ belong to some $(\min(C_i), \max(C_i))$-path
      - all $w$ in $C_i$ have the same count

      Suffix trees
      - Suffix trees can be built in $O(n)$ time and space [W73, M76, U95, F97]
      - Number of occurrences can be computed in $O(n)$ time
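
The three partition properties can be checked mechanically for a single candidate class. The sketch below is a naive checker, not the suffix-tree construction named on the slide. It assumes that belonging to a (min, max)-path means the word contains min(C) and is itself contained in max(C); `check_class` and `occurrences` are hypothetical names.

```python
def occurrences(x, y):
    """Number of (possibly overlapping) occurrences of y in x."""
    return sum(x.startswith(y, i) for i in range(len(x) - len(y) + 1))


def check_class(x, cls):
    """True if cls satisfies the three stated properties with respect to x."""
    lengths = sorted(len(w) for w in cls)
    # Property 1: a unique shortest word and a unique longest word
    unique_ends = len(cls) == 1 or (
        lengths[0] < lengths[1] and lengths[-1] > lengths[-2]
    )
    lo, hi = min(cls, key=len), max(cls, key=len)
    # Property 2 (assumed reading): min(C) inside w, and w inside max(C)
    on_path = all(lo in w and w in hi for w in cls)
    # Property 3: every word has the same count in x
    same_count = len({occurrences(x, w) for w in cls}) == 1
    return unique_ends and on_path and same_count
```

For the text `abaababaabaababaababa`, the nine-word class between `aa` and `abaaba` passes the check, while `["aa", "ab"]` fails because the two words have equal length and different counts.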

  16. Finding equivalence classes
      [Figure: suffix-tree diagram for a word w with right extensions a_1 a_2 … a_l and left extensions c_1, c_2, c_3 across subtrees T_1, T_2, …, T_h, annotated with comparisons of the counts f and g; legend: proper loci, improper loci, edges, suffix links]
      [Figure: the words obtained by extension, from c_3 c_2 c_1 w a_1, c_3 c_2 c_1 w a_1 a_2, …, c_3 c_2 c_1 w a_1 a_2 … a_l down to w a_1, w a_1 a_2, …, w a_1 a_2 … a_l]
