  1. Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 11 Jan-Willem van de Meent (credit: Yijun Zhao, Dave Blei)

  2. PROJECT GUIDELINES ( updated )

  3. Project Goals • Select a dataset / prediction problem • Perform exploratory analysis and preprocessing • Apply one or more algorithms • Critically evaluate results • Submit a report and present project

  4. Proposals • Due: 28 October • Presentation: 10+5 mins • Proposal: 1-2 pages • Describe • Dataset • Prediction task • Proposed methods

  5. Presentation and Report • Due: 2 December • Presentation • 20 mins + 10 discussion • Report • 8-10 pages, 11 pts • Code


  7. Grading • Proposal: 15% • Problem and Results: 20% • Data and Code: 15% • Report: 35% • Presentation: 15%

  8. Grading • Problem and Results: 20% • Novelty of task • Own dataset vs UCI dataset • Number of algorithms tested • Novelty of algorithms

  9. Grading • Data and Code: 15% • Documentation and Readability • TAs should be able to run code • Reproducibility (can figures and tables be generated by running code?)

  10. Grading • Report: 35% • Exploratory analysis of data • Explain how properties of data relate to choice of algorithm • Description of algorithms and methodology • Discussion of results • Which methods work well, which do not, and why? • Comparison to the state of the art?

  11. Example: Minimum Viable Project • Get 2-3 datasets from the UCI repository • Figure out what pre-processing (if any) is needed • Run every applicable algorithm in scikit-learn • Explain which algorithms work well on which datasets and why (see the sketch below)
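As a rough illustration of such a pipeline, here is a minimal sketch, assuming a generic UCI-style CSV with a "label" column; the file name, column name, and the particular classifiers are placeholders, not project requirements:

```python
# Hypothetical minimal-viable-project pipeline: try several scikit-learn
# classifiers on one dataset and compare cross-validated accuracy.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Placeholder dataset: any UCI-style CSV with a "label" column.
data = pd.read_csv("dataset.csv")
X, y = data.drop(columns=["label"]), data["label"]

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "gaussian naive bayes": GaussianNB(),
    "svm (rbf)": SVC(),
}

for name, clf in classifiers.items():
    # Standardize features, then evaluate with 5-fold cross-validation.
    scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Any estimator with the standard fit/predict interface can be dropped into the dictionary, which is what makes the "run every applicable algorithm" step cheap in scikit-learn.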

  12. Example: More Ambitious Projects • Find a new dataset or define a novel task (i.e. not classification or clustering) • Attack a problem from a Kaggle competition • Implement a recently published method (talk to me for suggestions)

  13. Homework Updates • HW3 now due on 2 November (after midterm and proposals) • Removed HW5 to give more time to work on projects


  14. MIDTERM REVIEW

  15. List of Topics for Midterm http://www.ccs.neu.edu/course/cs6220f16/sec3/midterm-topics.html • Everything up until last Friday (expect the final to emphasize later topics) • Open book, focus on understanding

  16. BINOMIAL MIXTURES

  17. Mixture of Binomials Suppose we have two weighted coins A and B, and we want to estimate the bias of each coin, i.e. p_A(head) = µ_A and p_B(head) = µ_B. Pick a coin at random (a simplified version: an equal mixture), flip it 10 times, and record 'H' and 'T'. Repeat the process until we have a good amount of training data.

  18. Mixture of Binomials
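For reference, the coin experiment above corresponds to a standard binomial mixture (written here in generic notation, not copied from the slide):

$$z_i \sim \mathrm{Categorical}(\pi), \qquad x_i \mid z_i = k \sim \mathrm{Binomial}(n, \mu_k),$$

where $x_i$ is the number of heads in the $n = 10$ flips of trial $i$, $z_i \in \{A, B\}$ is the unobserved identity of the coin (with $\pi_A = \pi_B = 1/2$ in the equal-mixture case), and $\mu_A, \mu_B$ are the biases to estimate. The marginal likelihood of one trial is $p(x_i) = \sum_k \pi_k \binom{n}{x_i} \mu_k^{x_i} (1 - \mu_k)^{n - x_i}$, which is what EM maximizes.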

  19. Gaussian Mixture Model Generative Model Expectation Maximization Initialize θ Repeat until convergence 1. Expectation Step 2. Maximization Step
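The E and M steps on this slide appear as figures; in their standard form, for a mixture of $K$ Gaussians with parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}$, they are:

E-step (responsibilities):
$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

M-step (with $N_k = \sum_n \gamma_{nk}$):
$$\mu_k = \frac{1}{N_k} \sum_n \gamma_{nk} x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}.$$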

  20. Binomial Mixture Model Generative Model Expectation Maximization Initialize θ Repeat until convergence 1. Expectation Step 2. Maximization Step
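A minimal NumPy/SciPy sketch of EM for this binomial mixture (the function name, variable names, and the toy flip counts at the end are illustrative rather than taken from the lecture):

```python
import numpy as np
from scipy.stats import binom

def em_binomial_mixture(heads, n_flips, K=2, n_iter=100, seed=0):
    """EM for a K-component binomial mixture over head counts."""
    rng = np.random.default_rng(seed)
    heads = np.asarray(heads)
    pi = np.full(K, 1.0 / K)          # mixture weights
    mu = rng.uniform(0.2, 0.8, K)     # per-coin head probabilities
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i)
        log_r = np.log(pi) + binom.logpmf(heads[:, None], n_flips, mu)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and biases from expected counts
        Nk = r.sum(axis=0)
        pi = Nk / len(heads)
        mu = (r * heads[:, None]).sum(axis=0) / (Nk * n_flips)
    return pi, mu

# Toy example: 10 flips per trial, two coins with different biases
pi, mu = em_binomial_mixture([9, 8, 4, 5, 8, 3, 9, 4], n_flips=10)
print(pi, mu)
```

With enough trials the recovered biases should approach the true coin probabilities, although EM can converge to a local optimum depending on the initialization.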


  22. TOPIC MODELS Borrowing from: David Blei (Columbia)

  23. Review: Naive Bayes • Features: words in an e-mail, encoded as a binary vector over the vocabulary, x = (1, 0, 0, ..., 1, ..., 0) indexed by (a, aardvark, aardwolf, ..., buy, ..., zygmurgy) • Labels: spam or not spam • Generative model • Maximum likelihood estimates for the parameters

  24. Review: Naive Bayes • Features: binary word vector over the vocabulary (as on the previous slide) • Labels: spam or not spam • Generative model (with prior) • Posterior mean for parameters
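For reference, this smoothed Naive Bayes corresponds closely to scikit-learn's BernoulliNB, whose alpha parameter performs the additive smoothing; the toy feature matrix below is made up purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy bag-of-words features (rows = e-mails, columns = vocabulary words,
# entries = word present/absent) and spam labels; all values are made up.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# alpha=1.0 is Laplace (add-one) smoothing, i.e. the posterior mean
# of each word probability under a symmetric Beta(1, 1) prior.
clf = BernoulliNB(alpha=1.0)
clf.fit(X, y)
print(clf.predict_proba(np.array([[1, 1, 0, 0]])))
```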

  25. Mixtures of Documents • Observations: bag-of-words count vector over the vocabulary, e.g. x = (24, 1, 0, ..., 4, ..., 0), counts for (a, aardvark, aardwolf, ..., buy, ..., zygmurgy) • Clusters: types of documents

  26. Mixtures of Documents • Observations: bag-of-words count vector (as above) • Clusters: types of documents • Generative model (with prior), maximum likelihood • How should we modify the generative model?

  27. Mixtures of Documents • Observations: bag-of-words count vector (as above) • Clusters: types of documents • Generative model (with prior)
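One standard way to write this modified generative model (a mixture of multinomials, stated in generic notation rather than copied from the slide): each document first draws a single cluster, and every word in the document is then drawn from that cluster's word distribution,

$$z_d \sim \mathrm{Categorical}(\pi), \qquad w_{d,n} \mid z_d = k \sim \mathrm{Categorical}(\beta_k), \quad n = 1, \dots, N_d.$$

Unlike Naive Bayes, the cluster label $z_d$ is unobserved, so the parameters are fit with EM rather than from labeled counts.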

  28. Topic Modeling [Figure: documents, topics, and per-document topic proportions and assignments. Example topics: gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...] • Naive Bayes: documents belong to a class • Topic Models: words belong to a class

  29. Latent Dirichlet Allocation Graphical model: α (proportions parameter), θ_d (per-document topic proportions), Z_d,n (per-word topic assignment), W_d,n (observed word), β_k (topics), η (topic parameter); plates over N words, D documents, K topics.
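Written out, the generative process encoded by this graphical model is the standard LDA process:

$$\beta_k \sim \mathrm{Dirichlet}(\eta), \qquad \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \qquad w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Categorical}(\beta_{z_{d,n}}).$$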

  30. PLSI/PLSA: EM for LDA Generative Model (no priors) Expectation Step Maximization Step
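The E and M steps on this slide appear as figures; a standard form for the no-prior model (maximum likelihood in θ and β) is:

E-step: $$q(z_{d,n} = k) \propto \theta_{d,k} \, \beta_{k, w_{d,n}}$$

M-step: $$\theta_{d,k} \propto \sum_{n=1}^{N_d} q(z_{d,n} = k), \qquad \beta_{k,v} \propto \sum_d \sum_{n=1}^{N_d} \mathbb{1}(w_{d,n} = v) \, q(z_{d,n} = k).$$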

  31. Variational Inference for LDA (sketch) Generative model (LDA): α, θ_d, Z_d,n, W_d,n, β_k, η, with plates over N words, D documents, K topics. Variational approximation: fully factorized $q(\theta, z, \beta) = \prod_d q(\theta_d \mid \gamma_d) \prod_{d,n} q(z_{d,n} \mid \phi_{d,n}) \prod_k q(\beta_k \mid \lambda_k)$, with per-word parameters $\phi_{d,n}$, per-document parameters $\gamma_d$, and per-topic parameters $\lambda_k$.


  33. Variational Inference for LDA (sketch) One iteration of mean field variational inference for LDA:
(1) For each topic $k$ and term $v$:
$$\lambda_{k,v}^{(t+1)} = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N} \mathbb{1}(w_{d,n} = v)\, \phi_{d,n,k}^{(t)}$$
(2) For each document $d$:
(a) Update $\gamma_d$:
$$\gamma_{d,k}^{(t+1)} = \alpha_k + \sum_{n=1}^{N} \phi_{d,n,k}^{(t)}$$
(b) For each word $n$, update $\phi_{d,n}$:
$$\phi_{d,n,k}^{(t+1)} \propto \exp\left( \Psi\big(\gamma_{d,k}^{(t+1)}\big) + \Psi\big(\lambda_{k,w_n}^{(t+1)}\big) - \Psi\Big(\textstyle\sum_{v=1}^{V} \lambda_{k,v}^{(t+1)}\Big) \right)$$
where $\Psi$ is the digamma function, the first derivative of the $\log \Gamma$ function.
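A minimal NumPy/SciPy sketch of one such iteration (array shapes, variable names, and the padding of all documents to a common length N are my own simplifying assumptions):

```python
import numpy as np
from scipy.special import digamma

def mean_field_iteration(W, phi, alpha, eta, V):
    """One mean-field iteration for LDA, following the updates above.

    W:     (D, N) array of word indices (documents padded to a common length N)
    phi:   (D, N, K) per-word topic responsibilities from the previous iteration
    alpha: scalar (or length-K) Dirichlet parameter for topic proportions
    eta:   scalar Dirichlet parameter for topics
    V:     vocabulary size
    """
    D, N, K = phi.shape
    # (1) Topic-term update: lambda[k, v] = eta + sum_{d,n} 1(w_{d,n} = v) * phi[d, n, k]
    lam = np.full((K, V), float(eta))
    for d in range(D):
        for n in range(N):
            lam[:, W[d, n]] += phi[d, n]
    # (2a) Document update: gamma[d, k] = alpha_k + sum_n phi[d, n, k]
    gam = alpha + phi.sum(axis=1)
    # (2b) Word update: phi[d, n, k] proportional to exp(digamma terms)
    E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))   # (K, V)
    E_log_theta = digamma(gam)                                            # (D, K)
    new_phi = np.empty_like(phi)
    for d in range(D):
        log_phi = E_log_theta[d][None, :] + E_log_beta[:, W[d]].T         # (N, K)
        log_phi -= log_phi.max(axis=1, keepdims=True)                     # stabilize
        new_phi[d] = np.exp(log_phi)
        new_phi[d] /= new_phi[d].sum(axis=1, keepdims=True)               # normalize over k
    return new_phi, gam, lam
```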

  34. Example Inference [Figure: inferred topic proportions for an example document; probability (0.0-0.4) vs. topic index (1-100).]

  35. Example Inference Top words from four inferred topics:
• Topic 1: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
• Topic 2: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
• Topic 3: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
• Topic 4: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

  36. Example Inference

  37. Example Inference Top words from four more inferred topics:
• Topic 1: problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
• Topic 2: model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
• Topic 3: selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
• Topic 4: species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem

  38. Performance Metric: Perplexity [Figure: perplexity vs. number of topics on two corpora, Nematode abstracts and Associated Press, comparing Smoothed Unigram, Smoothed Mixture of Unigrams, LDA, and Fold-in pLSI.]
$$\text{perplexity} = \exp\left( \frac{-\sum_d \log p(w_d)}{\sum_d N_d} \right)$$
Marginal likelihood (evidence) of held-out documents.
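Computed from held-out per-document log likelihoods and token counts, this is a one-liner; the arrays below are placeholder values for illustration:

```python
import numpy as np

# log_p_w[d]: log marginal likelihood of held-out document d (placeholder numbers)
log_p_w = np.array([-1523.4, -980.1, -2210.7])
# N_d[d]: number of word tokens in held-out document d (placeholder numbers)
N_d = np.array([312, 198, 451])

perplexity = np.exp(-log_p_w.sum() / N_d.sum())
print(perplexity)  # lower is better
```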

  39. Extensions of LDA • EM inference (PLSA/PLSI) yields results similar to variational inference (LDA) on most data • Reason for the popularity of LDA: it can be embedded in more complicated models

  40. Extensions: Correlated Topic Model Graphical model: µ, Σ, η_d (per-document topic proportions, drawn from a logistic normal), Z_d,n, W_d,n, β_k; plates over N, D, K. • Non-conjugate prior on topic proportions • Estimate a covariance matrix Σ that parameterizes correlations between topics in a document

  41. Extensions: Dynamic Topic Models Inaugural addresses, 1789 vs. 2009. 1789: "AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order..." 2009: "My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors..." Track changes in word distributions associated with a topic over time.
