Sequence Data Mining: Techniques and Applications
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita

What is a sequence?
• Ordered set of elements: s = a_1, a_2, ..., a_n
• Each a_i could be
– Categorical: domain is a finite set of symbols Σ, |Σ| = m
– Numerical
– Multiple attributes
• The length n of a sequence is not fixed
• Order is determined by time or position and could be regular or irregular

Motivation
• Several real-life mining applications involve sequence data
• Classical applications
– Speech, language, and handwriting are all complex sequences
• Newer applications
– Bio-informatics: DNA and proteins
– Telecommunication: network alarms, network packet data
– Retail data mining: customer behavior

Outline
• Three case studies
– Intrusion detection
– Information extraction
– Bio-informatics: protein classification
• Sequence mining operators
• Approaches to sequence mining
• Conclusions and future work

Case study: intrusion detection
• Intrusions could be detected at
– Host-level: attacks on privileged programs like lpr, sendmail
– Network-level: denial-of-service attacks, port-scans, etc.
– TCP dumps
• Method
– Signature-based: match signature of previous attacks
– Cannot detect new intrusions
– Anomaly-based: model normal usage and detect deviation
• Automatic vs. manual:
– Manual
– Might miss patterns; may not evolve as the normal usage pattern slowly drifts
– Automated:
– Use historical audit trails and a learning algorithm
– May not provide full coverage

Host-level attacks on privileged programs
• Attacks exploit a loophole in the program to do illegal actions
– Example: exploit buffer overflows to run user code
• What should be monitored in an executing privileged program to detect attacks?
• Sequence of system calls
– Σ = set of all possible system calls (|Σ| ≈ 100)
[Figure: example trace of system calls (open, lseek, lstat, execve, ioctl)]
• Mining problem: given traces of previous normal execution, monitor a new execution and flag it as attack or normal
• Challenge: is it possible to do this given widely varying normal conditions?
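
The mining problem above (learn from traces of normal execution, then flag a new trace as attack or normal) can be sketched in a few lines. This is only an illustration under stated assumptions, not the method these slides go on to present. It assumes a simple sliding-window profile: every length-k window of system calls seen in normal traces is remembered, and a new trace is scored by the fraction of its windows that were never seen. The window length `k`, the `threshold`, the function names, and the toy traces are all hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]


def windows(trace: List[str], k: int) -> List[Window]:
    """Return all contiguous length-k windows of a system-call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def train_normal_profile(normal_traces: Iterable[List[str]], k: int) -> Set[Window]:
    """Collect every window that occurs in any normal trace."""
    profile: Set[Window] = set()
    for trace in normal_traces:
        profile.update(windows(trace, k))
    return profile


def anomaly_score(trace: List[str], profile: Set[Window], k: int) -> float:
    """Fraction of the trace's windows that never occurred in normal traces."""
    ws = windows(trace, k)
    if not ws:  # trace shorter than k: nothing to compare
        return 0.0
    unseen = sum(1 for w in ws if w not in profile)
    return unseen / len(ws)


def flag(trace: List[str], profile: Set[Window], k: int = 3,
         threshold: float = 0.2) -> str:
    """Label a trace 'attack' if too many of its windows are unseen (assumed threshold)."""
    return "attack" if anomaly_score(trace, profile, k) > threshold else "normal"


# Toy traces of normal execution (made up for illustration).
normal_traces = [
    ["open", "lseek", "lstat", "ioctl", "open", "lseek"],
    ["open", "lstat", "ioctl", "open", "lseek", "lstat"],
]
profile = train_normal_profile(normal_traces, k=3)

print(flag(["open", "lseek", "lstat", "ioctl"], profile))    # normal
print(flag(["open", "execve", "execve", "ioctl"], profile))  # attack
```

A fixed profile like this illustrates the challenge raised on the slide: if normal behavior varies widely, many legitimate windows will be missing from the profile, so the threshold trades missed attacks against false alarms.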

Bio-informatics
• Many recent advances in sequence analysis are due to bio-informatics
• Two main kinds of sequences:
– Genes: sequence of 4 possible nucleotides, |Σ| = 4, e.g. AACTGACCTGGGCCCAATCC
– Proteins: sequence of 20 possible amino acids, |Σ| = 20; length of sequence varies from 100s to 10,000
• Sequence analysis in bio-informatics is rich and varied; we will concentrate on one problem
– Protein family classification

Protein family classification
• Protein families are characterized by the common occurrence of a few scattered amino acids in a background of other unrelated symbols
• Example: three aligned sequences of a family

Information extraction
Sequence: text string with elements as words
Examples: addresses, bib records
[Figure: example address with each part labeled by field (House number, Building, Road, ...)]
[Figure: example citation with each part labeled by field]
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
Mining problem: given a set of tags (labels), e.g. address fields, classify parts of the sequence to different labels

Outline
• Three case studies
• Sequence mining operators
– Whole sequence classification
– Partial sequence classification (tagging)
– Predicting next symbol of a sequence
– Clustering sequences
– Finding repeated patterns in a sequence
• Approaches to sequence mining
• Conclusion and future work

Classification of whole sequences
Given:
– a set of classes C and
– a number of example instances in each class c,
train a model so that for an unseen sequence we can say to which class it belongs
Example:
– Given a set of protein families, find the family of a new protein
– Given a sequence of packets, predict whether the session is an intrusion or not
– Given several utterances of a set of words, classify a new utterance to the right word

Existing methods of classification
• Generative classifiers
• Discriminative classifiers
• Distance-based classifiers (nearest neighbor)
• Kernel-based classifiers
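
As one concrete instance of the distance-based (nearest neighbor) family listed above, the sketch below classifies a whole sequence by finding the closest labeled training sequence. It assumes edit (Levenshtein) distance as the sequence distance, which is a common choice but is not specified on the slide. The function names, the class labels `family_a` and `family_b`, and the toy sequences are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_neighbor_class(query: Sequence[str],
                           training: List[Tuple[Sequence[str], str]]) -> str:
    """Return the label of the training sequence closest to the query."""
    if not training:
        raise ValueError("training set must not be empty")
    best_seq, best_label = min(training, key=lambda pair: edit_distance(query, pair[0]))
    return best_label


# Toy labeled sequences (made up for illustration).
training = [
    ("AACTG", "family_a"),
    ("AACTT", "family_a"),
    ("GGGCC", "family_b"),
    ("GGCCC", "family_b"),
]

print(nearest_neighbor_class("AACTA", training))  # family_a
print(nearest_neighbor_class("GGGCA", training))  # family_b
```

Because the distance handles sequences of different lengths, no fixed-length feature vector is needed, which matches the earlier point that the length n of a sequence is not fixed. The cost is one distance computation per training sequence for every query.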