Interactive HMM construction based on interesting sequences
Szymon Jaroszewicz
National Institute of Telecommunications, Warsaw, Poland
LeGo 2008

Overview
- Building models interactively based on interesting patterns
- Hidden Markov Models
- Interesting patterns w.r.t. Hidden Markov Models
- Experimental evaluation: web server log
- Conclusions and future research

Typical approach: Automatic model construction
Or:

Here: Interactive model construction
+ Understandable models
+ Learn while building models
– Have to do 'manual' work :(

Previous related work
- Scalable pattern mining with Bayesian networks as background knowledge
  S. Jaroszewicz, T. Scheffer, D. Simovici; KDD'04, KDD'05, DMKD (to appear)
- Bayesian networks used as background model
- Exact and approximate algorithms given
- Models much closer to real relationships than automatically built models

Hidden Markov Models (HMMs)
User gives the structure of the HMM:
- internal states
- which transitions are possible (not probabilities)
- which emission symbols are possible for each state (not probabilities)

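A structure of this kind can be written down as plain sets of allowed transitions and emissions. The sketch below is only an illustration: the state and symbol names are hypothetical, and the uniform starting probabilities are an assumption (the slides do not say how parameters are initialized). Transitions and emissions that are absent from the structure have probability zero, and Baum-Welch re-estimation leaves zero probabilities at zero, so training fills in the numbers without changing the structure.

```python
def uniform_init(allowed):
    """Spread probability mass evenly over the allowed targets of each state."""
    return {state: {t: 1.0 / len(targets) for t in sorted(targets)}
            for state, targets in allowed.items()}

# Hypothetical structure, loosely modeled on the web-log example.
transitions = {
    "main": {"main", "journal", "quit"},
    "journal": {"journal", "quit"},
    "quit": {"quit"},
}
emissions = {
    "main": {"index.html", "img/", "css/"},
    "journal": {"journal/"},
    "quit": {"END"},
}

P = uniform_init(transitions)  # starting transition probabilities (assumed uniform)
E = uniform_init(emissions)    # starting emission probabilities (assumed uniform)
```
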
Interestingness of sequences w.r.t. an HMM
$$\mathrm{Inter}(seq) = \left| \mathrm{Prob}_{HMM}\{seq\} - \mathrm{Prob}_{Data}\{seq\} \right|$$

Algorithm for finding all ε-interesting sequences
1. Train HMM parameters based on Data (Baum-Welch)
2. Find all seq such that Prob_Data{seq} > ε
3. Find all seq such that Prob_HMM{seq} > ε
4. Compute Prob_Data for seq frequent in HMM but not in Data
5. Compute Prob_HMM for seq frequent in Data but not in HMM
6. Compute Inter(seq) for all sequences
7. Output ε-interesting sequences

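Steps 2 and 4 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: Prob_Data{seq} is taken to be the fraction of sessions that start with seq, "ε-interesting" is read as Inter(seq) > ε, and the function names `prefix_probs` and `interesting_sequences` are hypothetical. The trained HMM (step 1) enters only through `hmm_frequent`, a dictionary of the sequences found in step 3, and `prob_hmm`, a callable that returns Prob_HMM for any sequence.

```python
from collections import Counter

def prefix_probs(sessions):
    """Assumed Prob_Data: fraction of sessions that begin with each prefix."""
    counts = Counter()
    for session in sessions:
        for k in range(1, len(session) + 1):
            counts[tuple(session[:k])] += 1
    return {seq: c / len(sessions) for seq, c in counts.items()}

def interesting_sequences(data_probs, hmm_frequent, prob_hmm, eps):
    """Return {seq: Inter(seq)} for all sequences with Inter(seq) > eps."""
    # Step 2: sequences frequent in the data.
    frequent_in_data = {s for s, p in data_probs.items() if p > eps}
    # Step 3 is assumed done already: hmm_frequent maps seq -> Prob_HMM.
    result = {}
    for seq in frequent_in_data | set(hmm_frequent):
        # Step 4: a sequence never seen as a prefix has Prob_Data = 0.
        p_data = data_probs.get(seq, 0.0)
        # Step 5: query the model for sequences frequent only in the data.
        p_hmm = hmm_frequent[seq] if seq in hmm_frequent else prob_hmm(seq)
        # Steps 6 and 7: compute and filter by interestingness.
        inter = abs(p_hmm - p_data)
        if inter > eps:
            result[seq] = inter
    return result
```

Only sequences that are frequent in the data or in the model need to be examined: both probabilities are non-negative, so a difference larger than ε forces at least one of them to exceed ε.
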
Inference in Hidden Markov Models
- α(seq, s_i): probability that sequence seq (starting at t = 0) is emitted and the HMM ends in state s_i
- Efficient recursive updating:
$$\alpha(seq + o_{n+1}, s_i) = \sum_j \alpha(seq, s_j)\, P_{ji}\, E_{i o_{n+1}}$$
$$\mathrm{Prob}_{HMM}\{seq\} = \sum_i \alpha(seq, s_i)$$

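The recursion can be written in a few lines of plain Python. In this sketch `P[j][i]` is the transition probability from state j to state i and `E[i]` maps each symbol to its emission probability in state i. The base case is not given on the slide; the code assumes the standard convention α(o_1, s_i) = π_i E_{i o_1} with an initial state distribution `pi`, and the function names are hypothetical.

```python
def forward_step(alpha, symbol, P, E):
    """alpha(seq + o, s_i) = sum_j alpha(seq, s_j) * P[j][i] * E[i][o]."""
    n = len(P)
    return [sum(alpha[j] * P[j][i] for j in range(n)) * E[i].get(symbol, 0.0)
            for i in range(n)]

def prob_hmm(seq, pi, P, E):
    """Probability that the HMM emits seq starting at t = 0."""
    if not seq:
        return 1.0
    # Assumed base case: the first symbol is emitted from the initial state.
    alpha = [pi[i] * E[i].get(seq[0], 0.0) for i in range(len(pi))]
    for symbol in seq[1:]:
        alpha = forward_step(alpha, symbol, P, E)
    return sum(alpha)

# Toy two-state model (hypothetical numbers).
pi = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
E = [{"a": 1.0}, {"a": 0.5, "b": 0.5}]
print(prob_hmm(("a", "a"), pi, P, E))
```

Each update costs time quadratic in the number of states and does not revisit earlier symbols, which is what makes extending a sequence by one symbol cheap.
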
Finding frequent sequences in Hidden Markov Models
- Monotonicity property holds: Prob_HMM{seq + o} ≤ Prob_HMM{seq}
- Standard depth-first frequent pattern mining works
- Alpha probabilities used instead of support counting
- Very efficient: probability updating is fast

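A depth-first search of this kind might look as follows. It is a sketch under assumptions rather than the algorithm used in the talk: the base case π_i E_{i o} for the first symbol is assumed, the function name `frequent_sequences` is hypothetical, and `max_len` is an added safety bound, since a model with a self-loop that always emits the same symbol would otherwise never drop below ε.

```python
def frequent_sequences(pi, P, E, symbols, eps, max_len=10):
    """Return {seq: Prob_HMM{seq}} for all seq with probability > eps."""
    n = len(pi)
    found = {}

    def extend(seq, alpha):
        if len(seq) >= max_len:
            return
        for o in symbols:
            if seq:
                # Recursive alpha update for the extended sequence.
                new_alpha = [sum(alpha[j] * P[j][i] for j in range(n))
                             * E[i].get(o, 0.0) for i in range(n)]
            else:
                # Assumed base case for the first symbol.
                new_alpha = [pi[i] * E[i].get(o, 0.0) for i in range(n)]
            p = sum(new_alpha)
            # Monotonicity: if seq + o is infrequent, so is every extension,
            # so the whole branch can be pruned.
            if p > eps:
                found[seq + (o,)] = p
                extend(seq + (o,), new_alpha)

    extend((), None)
    return found
```

The alpha vector of the current sequence is passed down the recursion, so each candidate costs one update instead of a pass over the data.
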
Weblog of the National Institute of Telecommunications
Web log format:
  195.205.118.10 [01/Jan/2007:00:04:33 +0100] "GET /journal/paper1.pdf" 200 8833 "http://www.google.pl/"
  65.55.208.68 [01/Jan/2007:00:04:45] "GET /robots.txt" 200 51 "-" "msnbot/1.0"
Preprocessing:
- keep only top level directory
- sessionizing
Result: sessions:
  journal/, journal/, END
  robots.txt, index.html, journal/, ..., END
  exchweb/, exchange/, exchange/, ..., END
  ...

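The two preprocessing steps can be sketched as below. Several details are assumptions, not taken from the slides: requests are grouped by IP address, a session is closed after 30 minutes of inactivity, a request for "/" is mapped to index.html, and the names `top_level` and `sessionize` are hypothetical. The regular expression only covers the fields visible in the sample lines.

```python
import re
from datetime import datetime, timedelta

# IP, timestamp (an optional time zone is skipped), requested path.
LINE_RE = re.compile(r'^(\S+) \[([^\] ]+)[^\]]*\] "GET ([^\s"]+)')

def top_level(path):
    """Keep only the top-level directory (or the file name in the root)."""
    parts = path.lstrip("/").split("/")
    if len(parts) > 1:
        return parts[0] + "/"
    return parts[0] or "index.html"  # assumption: "/" serves index.html

def sessionize(lines, gap=timedelta(minutes=30)):
    """Group requests per IP; a pause longer than `gap` starts a new session."""
    sessions = []   # sessions in order of their first request
    latest = {}     # ip -> (time of last request, its session)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, stamp, path = m.groups()
        t = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        last = latest.get(ip)
        if last is None or t - last[0] > gap:
            session = []
            sessions.append(session)
        else:
            session = last[1]
        session.append(top_level(path))
        latest[ip] = (t, session)
    return [s + ["END"] for s in sessions]
```

The explicit END symbol lets the model distinguish a session that stops from one that continues, which matters for patterns such as "accessed exactly twice".
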
Initial HMM

The Sophos antivirus
Top sequences:
- sophos/, sophos/: Prob_HMM = 1.17%, Prob_Data = 11.48%
- sophos/, sophos/, sophos/, sophos/: Prob_HMM = 0.013%, Prob_Data = 9.29%
Update of the Sophos antivirus:
- Always accessed 2, 4 or more times

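Applying the interestingness measure defined earlier to these two sequences gives the following values (plain arithmetic on the numbers above):

```latex
\begin{align*}
\mathrm{Inter}(\texttt{sophos/},\texttt{sophos/})
  &= \left| 1.17\% - 11.48\% \right| = 10.31\% \\
\mathrm{Inter}(\texttt{sophos/},\texttt{sophos/},\texttt{sophos/},\texttt{sophos/})
  &= \left| 0.013\% - 9.29\% \right| = 9.277\%
\end{align*}
```

In both cases the initial model assigns the sequence far less probability than the data does.
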
The Sophos antivirus: update to the model
The new model is:
- Each soph state emits only the sophos/ symbol
- The sophos/ symbol is removed from all other states

Journal PDF files + icon
Sequence: journals/, journals/, favicon.ico
  Prob_HMM ≈ 0, Prob_Data ≈ 2%
- favicon.ico: small icon next to the web address
- Default location: main directory
- At the Institute: img/ directory
- HTML header contains the other location; a PDF can't
- Browser tries the default location and fails
- Fixed: the icon appears now

Journal PDF files + icon
Added the following segment to the model:
The same PDF file is often accessed twice; unable to explain:
- accelerators?
- browser errors?
- server errors?

Other patterns
- Exchange mail web reader
- robots: Google / MSN / Yahoo
- RSS readers
- ...

Final model
- Quickly built a model of high-level user behavior
- Accuracy: probability of all sequences modeled with error < 0.01
- Every sequence is either:
  - uninteresting (modeled well)
  - infrequent
- Understandability: the model is easily understandable
- Learnt a lot about the data while modeling

Final model
[Figure: state diagram of the final HMM with transition probabilities. States: sophos2a, sophos2b, sophos4a, sophos4b, sophos_more, czasopisma_1, czasopisma_2, czasopisma_3, czasopisma_4, favico, confer_1, confer_2, proxy_wpad_1, proxy_wpad_2, main, main_css, main_js, main_img, robot_enter, robot_all_, coop, mail, structure, ogloszenia, RSS_1, RSS_2, _all_, _all_image, _all_sink, quit.]

Comparison with automatically learned models
20 hidden states + Baum-Welch algorithm, shown in two views:
- only transitions with prob. > 0.01
- all transitions with prob. > 0.001

Only transitions with prob. > 0.01
[Figure: state diagram of the automatically learned HMM, restricted to transitions with probability > 0.01. Each state is labeled with its most probable emission symbols, with labels such as exchweb:0.96, sophos:1.00, img:0.98 and __END__:0.78.]
