The Art of Predictive Analytics: More Data, Same Models [STUDY SLIDES] Joseph Turian joseph@metaoptimize.com @turian 2012.02.02 MetaOptimize
NOTE: These are the STUDY slides from my talk at the Predictive Analytics meetup (http://bit.ly/xVLBuS). I have removed some graphics and added some text. Please email me any questions.
Who am I? Engineer with 20 yrs coding exp. PhD. 10 yrs exp in large-scale ML + NLP. Founded MetaOptimize.
What is MetaOptimize? Consultancy + community on: large-scale ML + NLP, well-engineered solutions.
“Both NLP and ML have a lot of folk wisdom about what works and what doesn't. [This site] is crucial for sharing this collective knowledge.” - @aria42 http://metaoptimize.com/qa/
“A lot of expertise in machine learning is simply developing effective biases .” -Dan Melamed (quoted from memory)
What's a good choice of learning rate for the second layer of this neural net on image patches? [intuition] 0.02! (Yoshua Bengio)
Occam's Razor is a great example of ML intuition
Without the aid of prejudice and custom I should not be able to find my way across the room. - William Hazlitt
It's fun to be a geek
Be an artist
How to build the world's biggest langid (langcat) model?
+ Vowpal Wabbit = Win
How to build the world's biggest langid (langcat) model? SOLVED.
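A minimal sketch of how one might feed a langid problem to Vowpal Wabbit: write character n-gram features in VW's input format, then train with VW's one-against-all reduction. The label mapping, n-gram order, and file names are illustrative assumptions, not the actual langcat setup.

```python
# Minimal sketch: emit Vowpal Wabbit one-against-all training lines for
# language identification from (language_id, text) pairs.
# Labels must be integers 1..K for --oaa; the label map here is hypothetical.

def char_ngrams(text, n=3):
    """Character n-grams are a cheap, language-agnostic feature for langid."""
    text = text.replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def to_vw_line(label, text):
    # VW format: "<label> | <feature> <feature> ...".
    # Colons and pipes have special meaning in VW, so strip them from features.
    feats = " ".join(g.replace(":", "").replace("|", "") for g in char_ngrams(text))
    return f"{label} | {feats}"

examples = [(1, "the quick brown fox"),    # 1 = English (illustrative)
            (2, "le renard brun rapide")]  # 2 = French  (illustrative)

with open("langid.vw", "w") as f:
    for label, text in examples:
        f.write(to_vw_line(label, text) + "\n")

# Then train with VW's one-against-all reduction, e.g.:
#   vw --oaa 2 langid.vw -f langid.model
#   vw -t -i langid.model test.vw -p predictions.txt
```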
The art of predictive analytics: 1) Know the data out there 2) Know the code out there 3) Intuition (bias)
A lot of data with one feature correlated with the label
Twitter sentiment analysis?
“Distant supervision” (Go et al., 09): use emoticons as labels. Example tweet: “Awesome! RT @rupertgrintnet Harry Potter Marks Place in Film History http://bit.ly/Eusxi :)”
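A minimal sketch of the emoticon-as-label trick, assuming a stream of raw tweets; the emoticon sets and example below are illustrative, not Go et al.'s exact filtering rules.

```python
# Distant supervision sketch: emoticons act as noisy sentiment labels,
# then get stripped so the classifier cannot simply memorize the label.

POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-("}

def distant_label(tweet):
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:          # ambiguous or no emoticon: skip this tweet
        return None
    label = 1 if has_pos else 0
    text = tweet
    for e in POSITIVE | NEGATIVE:   # remove the emoticons before training
        text = text.replace(e, "")
    return label, text.strip()

print(distant_label("Awesome! RT @rupertgrintnet Harry Potter Marks Place in Film History :)"))
# -> (1, 'Awesome! RT @rupertgrintnet Harry Potter Marks Place in Film History')
```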
Recipe: You know a lot about the problem Smart Priors
You know a lot about the problem: Smart priors. Yarowsky (1995), WSD (word sense disambiguation): 1) One sense per collocation. 2) One sense per discourse.
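A minimal sketch of the "one sense per discourse" prior as a post-hoc smoothing step: force every occurrence of a word within a document to take that document's majority-predicted sense. The prediction tuples are an illustrative structure, not Yarowsky's actual algorithm.

```python
from collections import Counter, defaultdict

def one_sense_per_discourse(predictions):
    # predictions: iterable of (doc_id, word, predicted_sense) triples
    votes = defaultdict(Counter)
    for doc_id, word, sense in predictions:
        votes[(doc_id, word)][sense] += 1
    majority = {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
    return [(doc_id, word, majority[(doc_id, word)]) for doc_id, word, _ in predictions]

preds = [("d1", "bank", "river"), ("d1", "bank", "finance"), ("d1", "bank", "river")]
print(one_sense_per_discourse(preds))
# -> all three "bank" tokens in d1 get the majority sense, "river"
```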
Recipe: You know a lot about the problem Create new features
You know a lot about the problem: Create new features Error-analysis
What errors is your model making? DO SOME EXPLORATORY DATA ANALYSIS (EDA)
Andrew Ng: “Advice for applying ML” Where do the errors come from?
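A minimal error-analysis sketch in that spirit, assuming you have gold labels, predictions, and the raw examples side by side: bucket mistakes by (gold, predicted) pair and read a concrete example from each of the worst buckets.

```python
from collections import Counter

def error_buckets(gold, pred, examples):
    # Count which (gold, predicted) confusions dominate, keep one sample each.
    mistakes = Counter()
    samples = {}
    for g, p, x in zip(gold, pred, examples):
        if g != p:
            mistakes[(g, p)] += 1
            samples.setdefault((g, p), []).append(x)
    for (g, p), count in mistakes.most_common(5):
        print(f"gold={g} predicted={p}: {count} errors, e.g. {samples[(g, p)][0]!r}")

gold = ["pos", "neg", "neg", "pos"]           # illustrative toy data
pred = ["pos", "pos", "pos", "neg"]
docs = ["great film", "not great at all", "meh, boring", "loved it!!"]
error_buckets(gold, pred, docs)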
Recipe: You know a little about the problem Semi-supervised learning
You know a little about the problem: Semi-supervised learning. JOINT semi-supervised learning: Ando and Zhang (2005), Suzuki and Isozaki (2008), Suzuki et al. (2009), etc. => effective but task-specific
You know a little about the problem: Semi-supervised learning Unsupervised learning, followed by Supervised learning
How can Bob improve his model? [diagram: sup data → supervised training → sup model]
Semi-sup training? [diagram: sup data → supervised training → sup model]
Semi-sup training? [diagram: more feats + sup data → supervised training → sup model]
More features can be used on different tasks [diagram: more feats + sup task 1 data → sup model 1; more feats + sup task 2 data → sup model 2]
Joint semi-sup [diagram: unsup data + sup data → semi-sup model] (standard semi-sup setup)
Unsupervised, then supervised [diagram: unsup data → unsup pretraining → unsup model; sup data → semi-sup fine-tuning → semi-sup model]
Use unsupervised learning to create new features [diagram: unsup data → unsup training → unsup feats]
These features can then be shared with other people [diagram: unsup feats + sup data → sup training → semi-sup model]
[diagram: unsup feats feed sup task 1, sup task 2, sup task 3] (a sketch of this recipe follows below)
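A minimal sketch of the "unsup training → unsup feats → sup training" recipe above, using k-means cluster ids over unlabelled text as extra features for a supervised classifier. The data, cluster count, and one-hot encoding are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Unsupervised step: cluster a (normally much larger) pool of unlabelled docs.
unlabelled = ["cheap pills online", "meeting at noon", "viagra discount",
              "lunch tomorrow?", "limited time discount offer"]
labelled   = ["discount pills now", "see you at the meeting"]
labels     = [1, 0]   # 1 = spam, 0 = ham (illustrative)

vec = TfidfVectorizer().fit(unlabelled + labelled)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vec.transform(unlabelled))

def with_cluster_features(texts):
    # Append each document's cluster id (one-hot) to its base TF-IDF features.
    X = vec.transform(texts)
    one_hot = np.eye(km.n_clusters)[km.predict(X)]
    return sp.hstack([X, sp.csr_matrix(one_hot)])

# Supervised step: train on the small labelled set, now with unsup features.
clf = LogisticRegression().fit(with_cluster_features(labelled), labels)
print(clf.predict(with_cluster_features(["discount meeting pills"])))
```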
Recipe: You know almost nothing about the problem Build cool generic features
Know almost nothing about problem: Build cool generic features Word features (Turian et al., 2010) http://metaoptimize.com/projects/wordreprs/
Brown clustering (Brown et al. 92): cluster(chairman) = ‘0010’, 2-prefix(cluster(chairman)) = ‘00’ (image from Terry Koo)
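A minimal sketch of turning Brown-cluster bit-strings into word features via prefixes, as in the word-representation experiments of Turian et al. (2010); the tiny cluster table and prefix lengths are illustrative stand-ins for a real clustering.

```python
# Each word's Brown cluster is a bit-string path in the hierarchy; prefixes of
# that bit-string give features at several granularities (coarse to fine).
clusters = {"chairman": "0010", "president": "0011", "walked": "1101"}  # illustrative

def brown_features(word, prefixes=(2, 4)):
    bits = clusters.get(word)
    if bits is None:
        return ["brown=OOV"]
    return [f"brown{p}={bits[:p]}" for p in prefixes]

print(brown_features("chairman"))   # ['brown2=00', 'brown4=0010']
```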
50-dim embeddings: Collobert + Weston (2008); t-SNE vis by van der Maaten + Hinton (2008)
Know almost nothing about problem: Build cool generic features. Document features: document clustering, LSA/LDA, deep models
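A minimal sketch of generic document features via LSA: TF-IDF followed by a truncated SVD gives a dense low-dimensional representation for any downstream model. The documents and dimensionality are illustrative; LDA or a deep model could be swapped in the same place.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat", "dogs and cats", "stock prices fell sharply",
        "the market rallied today"]

# LSA pipeline: sparse TF-IDF counts -> dense low-rank document features.
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
doc_features = lsa.fit_transform(docs)
print(doc_features.shape)   # (n_docs, 2): generic features for any classifier
```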
Document features Salakhutdinov + Hinton 06
Document features example Domain adaptation for sentiment analysis (Glorot et al. 11)
Recipe: You know a little about the problem Make more REAL training examples
Make more real training examples ('cuz you have some time or a small budget): Amazon Mechanical Turk
Snow et al. 08, “Cheap and Fast – But is it Good?”: 1K Turk labels per dollar. Average over (5) Turkers to reduce noise. => http://crowdflower.com/
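A minimal sketch of averaging over several Turkers by majority vote; the vote table is illustrative, and real pipelines often also weight annotators by estimated reliability.

```python
from collections import Counter

# Each item gets several noisy Turker labels; keep the majority label per item.
votes = {"item_1": ["pos", "pos", "neg", "pos", "pos"],
         "item_2": ["neg", "neg", "pos", "neg", "neg"]}   # illustrative

labels = {item: Counter(v).most_common(1)[0][0] for item, v in votes.items()}
print(labels)   # {'item_1': 'pos', 'item_2': 'neg'}
```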
Soylent (Bernstein et al. 10), Find-Fix-Verify: a crowd-control design pattern. [diagram: find a problem → fix each problem → verify the quality of each fix]
Make more real training examples Active learning
Dualist (Settles 11) http://code.google.com/p/dualist/
Dualist (Settles 11) http://code.google.com/p/dualist/ Applications: Document categorization WSD Information Extraction Twitter sentiment analysis
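A minimal sketch of the pool-based uncertainty-sampling loop behind active-learning tools like Dualist (this is not Dualist's actual code, which also mixes in semi-supervised learning); the pool, seed labels, and the dictionary standing in for the human oracle are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pool = ["awful movie", "brilliant acting", "total waste of time",
        "wonderful and moving", "it was fine I guess"]      # unlabelled pool
labelled, labels = ["terrible", "amazing"], [0, 1]           # tiny seed set
oracle = {"awful movie": 0, "brilliant acting": 1, "total waste of time": 0,
          "wonderful and moving": 1, "it was fine I guess": 1}  # stands in for a human

vec = CountVectorizer().fit(pool + labelled)

for _ in range(2):                                   # two simulated queries
    clf = MultinomialNB().fit(vec.transform(labelled), labels)
    probs = clf.predict_proba(vec.transform(pool))
    margins = np.abs(probs[:, 1] - probs[:, 0])      # small margin = uncertain
    query = pool.pop(int(np.argmin(margins)))        # ask about the least certain doc
    labelled.append(query)
    labels.append(oracle[query])

print(f"{len(labelled)} labelled examples after active learning")
```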
You know a little about the problem: Make more training examples FAKE training examples
NOISE
FAKE training examples: denoising autoencoders (AAs), RBMs
MNIST distortions (LeCun et al. 98)
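A minimal sketch of manufacturing FAKE training examples by corrupting real ones, in the spirit of denoising autoencoders and MNIST-style distortions; word dropout is just one illustrative corruption.

```python
import random

def corrupt(tokens, drop_prob=0.2, rng=random.Random(0)):
    # Randomly drop tokens; every noisy copy keeps the clean example's label.
    kept = [t for t in tokens if rng.random() > drop_prob]
    return kept if kept else tokens     # never return an empty example

clean = ("this", "movie", "was", "really", "great")
fakes = [corrupt(list(clean)) for _ in range(3)]
print(fakes)    # three noisy variants, all still labelled "positive"
```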
No negative examples?
FAKE training examples Multi-view / multi-modal
Multi-view / multi-modal How do you evaluate an IR system, if you have no labels? See how good the title is at retrieving the body text.
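A minimal sketch of that label-free evaluation: treat each title as a query whose single relevant document is its own body, and report precision@1. The TF-IDF retriever and articles are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Stock market rallies", "New cat cafe opens", "Rain expected this weekend"]
bodies = ["Shares rose sharply as investors bought heavily",
          "A cafe where customers pet cats opened downtown",
          "Forecasters predict heavy rain across the region"]

vec = TfidfVectorizer().fit(titles + bodies)
scores = cosine_similarity(vec.transform(titles), vec.transform(bodies))

# The "gold" answer for title i is body i, so no human labels are needed.
top1 = (scores.argmax(axis=1) == np.arange(len(titles))).mean()
print(f"precision@1 using titles as pseudo-queries: {top1:.2f}")
```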
2) KNOW THE DATA
Know the data Labelled/structured data: ODP, Freebase, Wikipedia, DBpedia, etc.
Know the data Unlabelled data: WaCky, ClueWeb09, Common Crawl, Ngram corpora
Ngrams: Google, Bing, Google Books. Roll your own: Common Crawl
Know the data Do something stupid on a lot of data
Do something stupid on a lot of data: Ngrams. Spell-checking, phrase segmentation, word breaking, synonyms, language models. See “An Overview of Microsoft Web N-gram Corpus and Applications” (Wang et al. 10)
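A minimal sketch of "stupid on a lot of data" word breaking: maximum-probability segmentation using nothing but unigram counts. The tiny count table is an illustrative stand-in for a web-scale n-gram corpus.

```python
import math
from functools import lru_cache

# Word counts would normally come from a web-scale corpus; this table is toy data.
counts = {"choose": 100, "spain": 40, "chooses": 30, "pain": 60, "cho": 1, "ose": 1}
total = sum(counts.values())

def logprob(word):
    return math.log(counts.get(word, 0.01) / total)   # crude smoothing for unknowns

@lru_cache(maxsize=None)
def segment(text):
    # Return (score, words) for the best split of `text` into known-ish words.
    if not text:
        return 0.0, []
    best = (float("-inf"), [])
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        score = logprob(head) + tail_score
        if score > best[0]:
            best = (score, [head] + tail_words)
    return best

print(segment("choosespain")[1])   # ['choose', 'spain'] rather than ['chooses', 'pain']
```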
Do something stupid on a lot of data Web-scale k-means for NER (Lin and Wu 09)
Do something stupid on a lot of data Web-scale clustering
Know the data Multi-modal learning
Multi-modal learning: images + captions (e.g. “facepalm”)
Multi-modal learning: titles + article bodies
Multi-modal learning: audio + tags (e.g. “upbeat”, “hip hop”)
3) IT'S MODELS ALL THE WAY DOWN
Break down a pipeline: 1-best (greedy), k-best, Finkel et al. 06
Good code to build on Stanford NLP tools, clustering algorithms, Terry Koo's parser, etc.
Good code to build on YOUR MODEL
Eat your own dogfood: bootstrapping (Yarowsky 95), co-training (Blum + Mitchell 98), EM (Nigam et al. 00), self-training (McClosky et al. 06)
Dualist (Settles 11): active learning + semi-sup learning
Eat your own dogfood. Cheap bootstrapping: one step of EM (Settles 11). “Awesome! What a great movie!”
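A minimal sketch of cheap bootstrapping with a single self-training step (roughly in the spirit of one EM step, not Settles' exact procedure); the seed data, pool, and 0.9 confidence threshold are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

seed_texts = ["awesome! what a great movie!", "terrible, a total waste"]
seed_labels = [1, 0]
pool = ["a great great movie", "a total waste of time", "it had a plot"]  # unlabelled

vec = CountVectorizer().fit(seed_texts + pool)
clf = MultinomialNB().fit(vec.transform(seed_texts), seed_labels)

# Label the pool with the model's own confident predictions, then retrain once.
probs = clf.predict_proba(vec.transform(pool))
confident = np.max(probs, axis=1) >= 0.9
pseudo_texts = [t for t, keep in zip(pool, confident) if keep]
pseudo_labels = list(np.argmax(probs[confident], axis=1))

clf = MultinomialNB().fit(vec.transform(seed_texts + pseudo_texts),
                          seed_labels + pseudo_labels)
print(f"added {len(pseudo_texts)} self-labelled examples")
```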
It's models all the way down Use models to annotate Low recall + high precision + lots of data = win
Use models to annotate Face modeling
Pose-invariant face features
It's models all the way down THE FUTURE? Joins on large noisy data sets
Joins on large noisy data sets ReVerb (Fader et al., 11) http://reverb.cs.washington.edu Extractions over entire ClueWeb09 (826 MB compressed)
ReVerb (Fader et al., 11)
Joins on noisy data sets (can we clean up the data?) [diagram: ??? ⋈ ???]
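A minimal sketch of what a join over noisy open-IE extractions could look like, with crude key normalization; the (arg1, relation, arg2) tuples are illustrative, not real ReVerb output.

```python
from collections import defaultdict

def norm(s):
    # Crude normalization of the join key; real systems need far more cleanup.
    return " ".join(s.lower().replace(".", "").split())

born_in = [("Barack Obama", "was born in", "Honolulu")]        # illustrative extractions
located_in = [("honolulu", "is located in", "Hawaii")]

# Index one extraction set by its normalized first argument, then join.
index = defaultdict(list)
for a1, rel, a2 in located_in:
    index[norm(a1)].append((rel, a2))

for a1, rel, a2 in born_in:
    for rel2, a3 in index.get(norm(a2), []):
        print(f"{a1} -> {a3}   (via {rel} / {rel2})")
# -> Barack Obama -> Hawaii
```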
The art of predictive analytics: 1) Know the data out there 2) Know the code out there 3) Intuition (bias)
Summary of recipes: Know your problem. Throw in good features. Use others' good models in your pipeline. Make more training examples. Use a lot of data.
"It especially annoys me when racists are accused of 'discrimination.' The ability to discriminate is a precious facility; by judging all members of one 'race' to be the same, the racist precisely shows himself incapable of discrimination." - Christopher Hitchens (RIP)
Other cool research to look at: * Frustratingly easy domain adaptation (Daume 07) * The Unreasonable Effectiveness of Data (Halevy et al. 09) * Web-scale algorithms (search on http://metaoptimize.com/qa/) * Self-taught learning (Raina et al. 07)
Please email me any questions Joseph Turian joseph@metaoptimize.com @turian http://metaoptimize.com/qa/ 2012.02.02