Iterative Data Mining
  Jilles Vreeken, 26 June 2014 (TADA)
Service Announcement #1
  Evaluation Forms
    1. Hand forms out (me)
    2. Fill forms out (you)
    3. Collect forms (you)
    4. Put forms in envelope (you)
    5. Bring envelope back to Evelyn (one 'volunteer' and me)
Service Announcement #2
  The Exam
    type: oral
    when: September 11th
    time: individual
    where: E1.3, room 0.16
    what: all material discussed in the lectures, plus one assignment (your choice) per topic
  The Re-Exam
    type: oral
    when: October 1st
    time: individual
    where: E1.3, room 001
Service Announcement #3
  Master thesis projects
    in principle: yes!
    in practice: depends on background, motivation, interests, and grades, plus on whether I have time
    interested? mail me and/or Pauli
  Student Research Assistant (HiWi) positions
    in principle: maybe!
    in practice: depends on background, grades, and in particular your motivation and interests
    interested? mail me and/or Pauli, include CV and grades
Service Announcement #4
  Introduction
    - Is DM science?
    - DM in action
  Tensors
    - Introduction to tensors
    - Tensors in DM
    - Special topics in tensors
  Information Theory
    - MDL + patterns
    - Entropy + correlation
    - MaxEnt + iterative DM
  Mixed Grill
    - Influence Propagation
    - Redescription Mining
    - <special request>
Service Announcement #4: <special request>?
  Let us know (asap, by mail) what topic you would like to see discussed
Service Announcement #5
  Introduction
  Tensors
  Information Theory
  Mixed Grill
  Wrap-up + <ask-us-anything>
Service Announcement #5: <ask-us-anything>?
  Yes! Prepare questions on anything* you've always wanted to ask Pauli and/or me. We'll answer on the spot.
  * preferably related to TADA, data mining, machine learning, science, the world, etc.
Good Reads
  Data Analysis: a Bayesian Tutorial, D.S. Sivia & J. Skilling (very good, but skip the MaxEnt stuff)
  Elements of Information Theory, Thomas Cover & Joy Thomas (very good textbook)
  The Information, James Gleick (great light reading)
Question of the day
  How can we find things that are interesting with regard to what we already know?
  How can we measure subjective interestingness?
What is interesting?
  something that increases our knowledge about the data
What is a good result?
  something that reduces our uncertainty about the data (i.e. increases the likelihood of the data)
What is really good?
  something that, in simple terms, strongly reduces our uncertainty about the data (maximise likelihood, but avoid overfitting)
Let's make this visual
  [Figure: the universe of possible datasets, with our dataset D inside it]
Given what we know
  known: dimensions, margins
  [Figure: within all possible datasets, the region of possible datasets given current knowledge, containing our dataset D]
More knowledge...
  known: dimensions, margins, pattern P1
  [Figure: the region of possible datasets around our dataset D shrinks]
Fewer possibilities...
  known: dimensions, margins, patterns P1 and P2
  [Figure: the region of possible datasets around our dataset D shrinks further]
Less uncertainty.
  known: dimensions, margins, the key structure
  [Figure: a small region of possible datasets around our dataset D]
Maximising certainty
  known: dimensions, margins, patterns P1 and P2
  [Figure: the part of the region removed by P2 is labelled 'knowledge added by P2']
How can we define 'uncertainty' and 'simplicity'?
  interpretability and informativeness are intrinsically subjective
Measuring Uncertainty
  We need access to the likelihood of data D given background knowledge B, such that we can calculate the gain for X
  …which distribution should we use?
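One way to make the gain concrete, given here as an assumption and not as the definition used in the lecture, is the increase in log-likelihood of the data once result X is added to the background knowledge:

```latex
\mathrm{gain}(X) = \log p(D \mid B \cup \{X\}) - \log p(D \mid B)
```

Under this reading, a larger gain means X removes more of our uncertainty about D. Which distribution p to use remains the open question.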
Measuring Surprise
  We need access to the likelihood of result X given background knowledge B, such that we can mine the data for X that have a low likelihood, that are surprising
  …which distribution should we use?
Measuring Surprise: the p-value
  The likelihood of result X given background knowledge B is called the p-value of result X
Approach 1: Randomization
  1. Mine original data
  2. Mine random data
  3. Determine probability
  [Figure: score(X | D) on the original data compared with the scores on random data #1, random data #2, ..., random data #N]
Approach 1: Randomization, the empirical p-value
  The fraction of better 'randoms' is the empirical p-value of result X
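The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names, the toy data, the toy score, and the shuffle-based null model are all hypothetical, and 'better' is read here as 'scores at least as high as the original'.

```python
import random


def empirical_p_value(data, score, randomize, n_samples=1000, seed=0):
    """Estimate how often randomized data scores at least as high as the original."""
    rng = random.Random(seed)
    original_score = score(data)            # step 1: mine the original data
    at_least_as_good = 0
    for _ in range(n_samples):              # step 2: mine random data
        if score(randomize(data, rng)) >= original_score:
            at_least_as_good += 1
    return at_least_as_good / n_samples     # step 3: fraction of 'better' randoms


def shuffled(values, rng):
    """Hypothetical null model: keep the values, randomize their order."""
    copy = list(values)
    rng.shuffle(copy)
    return copy


def ones_in_first_half(values):
    """Hypothetical score: number of ones among the first half of the values."""
    return sum(values[: len(values) // 2])


toy = [1, 1, 1, 0, 0, 0]
p = empirical_p_value(toy, ones_in_first_half, shuffled)
print(p)
```

For this toy example only 1 in 20 orderings puts all three ones in the first half, so the estimate should be small. A score that the randomization cannot change, such as the total number of ones, always yields a p-value of 1.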
Random Data
  So, we need data that maintains our background knowledge, and is otherwise completely random
  How can we get our hands on that?
Swap Randomization
  Let there be data
    1 1 1 0 1 1 1
    0 1 1 0 1 0 1
    1 1 1 1 0 0 0
    1 1 1 1 0 0 1
    0 1 1 1 0 0 0
    0 1 1 1 0 1 0
    0 0 0 0 1 0 0
  (swap randomization, Gionis et al. 2005)
Swap Randomization
  Say we only know overall density. How to sample random data?
  [Figure: the 7 x 7 binary example matrix, which contains 27 ones]
  (swap randomization, Gionis et al. 2005)
Swap Randomization
  Didactically, let us instead consider a Markov chain Monte Carlo (MCMC) approach
  Very simple scheme
    1. select two cells at random,
    2. swap values,
    3. repeat until convergence.
  [Figure: the 7 x 7 binary example matrix, which contains 27 ones]
  (swap randomization, Gionis et al. 2005)
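The simple scheme above can be sketched as follows. This is an illustration under stated assumptions: the function name `shuffle_cells` is hypothetical, and 'repeat until convergence' is replaced by a fixed number of swaps, since the slides do not give a convergence criterion.

```python
import random

# The 7 x 7 example matrix from the slides (27 ones).
DATA = [
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
]


def shuffle_cells(matrix, n_swaps, seed=0):
    """Return a copy of matrix after n_swaps swaps of two randomly chosen cells."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    result = [row[:] for row in matrix]     # leave the input untouched
    for _ in range(n_swaps):
        # 1. select two cells at random
        r1, c1 = rng.randrange(rows), rng.randrange(cols)
        r2, c2 = rng.randrange(rows), rng.randrange(cols)
        # 2. swap their values; the number of ones cannot change
        result[r1][c1], result[r2][c2] = result[r2][c2], result[r1][c1]
    return result


random_data = shuffle_cells(DATA, n_swaps=1000)
print(sum(map(sum, random_data)))           # overall count of ones is preserved
```

Only the overall density survives this procedure. Row and column margins are generally destroyed.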
Swap Randomization
  Margins are easily understood for binary data. How can we sample data with the same margins?
    1 1 1 0 1 1 1 | 6
    0 1 1 0 1 0 1 | 4
    1 1 1 1 0 0 0 | 4
    1 1 1 1 0 0 1 | 5
    0 1 1 1 0 0 0 | 3
    0 1 1 1 0 1 0 | 4
    0 0 0 0 1 0 0 | 1
    --------------+---
    3 6 6 4 3 2 3 | 27
  (swap randomization, Gionis et al. 2005)
Swap Randomization
  By MCMC!
    1. randomly find submatrix
    2. swap values
    3. repeat until convergence
  [Figure: the example matrix before and after swapping; the row margins (6 4 4 5 3 4 1), the column margins (3 6 6 4 3 2 3) and the total (27) stay the same]
  (swap randomization, Gionis et al. 2005)
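A margin-preserving swap can be sketched as below. The assumptions are labelled: the function name `swap_randomize` is hypothetical, the 'submatrix' is taken to be a 2 x 2 submatrix with equal values on each diagonal and different values across diagonals (the usual swap in swap randomization), and a fixed number of attempts stands in for 'repeat until convergence'.

```python
import random

# The 7 x 7 example matrix from the slides.
DATA = [
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
]


def swap_randomize(matrix, n_attempts, seed=0):
    """Return a copy of a binary matrix randomized by margin-preserving swaps."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    result = [row[:] for row in matrix]     # leave the input untouched
    for _ in range(n_attempts):
        # 1. randomly pick two distinct rows and two distinct columns
        r1, r2 = rng.sample(range(rows), 2)
        c1, c2 = rng.sample(range(cols), 2)
        a, b = result[r1][c1], result[r1][c2]
        # Swappable only if the 2 x 2 submatrix looks like [[a, b], [b, a]], a != b.
        if a != b and result[r2][c1] == b and result[r2][c2] == a:
            # 2. swap values: [[a, b], [b, a]] becomes [[b, a], [a, b]]
            result[r1][c1], result[r1][c2] = b, a
            result[r2][c1], result[r2][c2] = a, b
    return result


def margins(matrix):
    """Row sums and column sums of a matrix."""
    return [sum(row) for row in matrix], [sum(col) for col in zip(*matrix)]


random_data = swap_randomize(DATA, n_attempts=1000)
print(margins(random_data) == margins(DATA))
```

Each accepted swap moves a one within a row and within a column at the same time, so every row sum and column sum stays exactly as in the original data.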
Static Models
  Many ways to test a static null hypothesis
    assuming a distribution, swap randomization, MaxEnt
  What can we use this for?
    ranking based on static significance
    mining the top-k most significant patterns
    but not suited for iterative mining
Dynamic Models
  For iterative data mining, we need models that can maintain the type of information (e.g. patterns) that we mine
  Randomization is powerful
    variations exist for many data types (Ojala '09, Henelius et al. '13)
    can be pushed beyond margins (see Hanhijärvi et al. 2009)
  but… it has key disadvantages