  1. A tour of machine learning … guided by a complete amateur Thomas Dullien, Google

  2. Topics to cover 1. Logistic regression 2. Word embeddings 3. t-SNE 4. Deep Networks (and some transfer learning) 5. Hidden Markov Models for sequence tagging 6. Conditional Random Fields for sequence tagging 7. Reinforcement learning 8. Approximate NN and k-NN methods 9. Tree ensemble methods

  3. Logistic Regression ● Also known as “maximum entropy modelling” ● Mathematically simple, easy to diagnose / inspect ● Idea: Approximate a conditional probability distribution from (labeled) training data ● Consider k output classes and n features

  4. Logistic Regression ● Parameters that are learnt are a k x n matrix of weights ● Easily diagnosable: for each decision, the contribution of each feature can be read off directly ● Features need to be provided / engineered ● Various subtleties need to be observed: ○ Lots of correlated features can make training convergence arbitrarily slow ○ Features with arbitrary values can be permitted ○ Various optimization algorithms: “Iterative Scaling”, L-BFGS, SGD

  5. Logistic Regression Example implementations: Maxent Toolkit: https://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html Tensorflow Tutorial: https://www.tensorflow.org/get_started/mnist/beginners
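For illustration, a minimal sketch using scikit-learn's LogisticRegression (an assumption; the deck links the Maxent Toolkit and the TensorFlow tutorial instead), trained on the small Iris dataset:

```python
# Minimal logistic regression sketch (scikit-learn assumed; the deck links
# the Maxent Toolkit and TensorFlow instead).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)           # n features per sample, k classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)     # optimized with L-BFGS by default
clf.fit(X_tr, y_tr)                         # learn the weights from labeled data

print("accuracy:", clf.score(X_te, y_te))
# The learned weights form a k x n matrix: one row of per-feature weights per
# class, so each decision's feature contributions can be read off directly.
print("weight matrix shape:", clf.coef_.shape)
```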

  6. Word embeddings ● Extracting “meaning” from a word is difficult ● Words in a language are often related, but this relationship is not easily inferred from the written form of the word ● Letter-by-letter similarity does not imply any semantic similarity ● Is it possible to build a dictionary that maps words into a space where some semantic relationships are represented? ● Yes - word2vec et al.

  7. Word embeddings ● Idea: Try to train a model that predicts contexts for a given word ● Train in a way that produces a vector representation of the word ● Vector representations are then used as stand-in for the written word in further applications

  8. Word embeddings: Word2Vec Example sentence: “The quick brown fox jumped over the lazy dog”. Each word in turn is the target word, and the surrounding words within a window form its context.

  9. Word embeddings: Word2Vec Let the training examples be (target word, context word) pairs taken from a sliding window over the corpus. Then maximize the average log-probability of the context given the target: (1/T) Σ_t Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t), where p(o | w) is a softmax over the inner products of the “out” vector of o and the “in” vector of w.

  10. Word embeddings: Word2Vec “For each word, find two vectors v_in and v_out so that the prediction of the words surrounding it is as good as possible.” Words used in similar contexts end up “close” in the embedding. Strange results of the embedding: the vectors were successfully used for solving analogies (e.g. king - man + woman ≈ queen). Some controversy exists about how much semantics is actually extracted, and whether the strange linear relationships are better explained by “noise”.

  11. Word embeddings: Word2Vec Example implementation: https://github.com/dav/word2vec
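A hedged training sketch using the gensim package (gensim >= 4.0 assumed; the deck links the original C implementation), on a toy two-sentence corpus:

```python
# Skip-gram word2vec sketch with gensim (an assumption; not the linked C code).
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the lazy dog slept while the quick fox ran".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embedding
    window=2,         # context words considered on each side of the target
    min_count=1,
    sg=1,             # 1 = skip-gram: predict the context from the target word
)

vec = model.wv["fox"]                         # the learned vector for "fox"
print(model.wv.most_similar("fox", topn=3))   # nearby words in embedding space
```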

  12. t-SNE ● Common problem in ML: understanding relationships between high-dimensional vectors ● Difficult to plot :-) ● t-SNE: commonly used algorithm to visualize high-dimensional data in 2D or 3D ● Attempts to optimize a mapping so that nearby points stay close in the projection, and far-apart points stay at a distance in the projection

  13. t-SNE Example implementation: https://github.com/lvdmaaten/bhtsne/
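A minimal sketch using scikit-learn's TSNE (an assumption; the deck links the Barnes-Hut C++ implementation), projecting the 64-dimensional digits dataset down to 2D:

```python
# t-SNE sketch with scikit-learn (assumed here instead of the linked bhtsne).
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 64-dimensional digit images

# Map the high-dimensional vectors to 2D so nearby points stay nearby.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE projection of the digits dataset")
plt.show()
```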

  14. Deep Neural Networks ● Big hype since Hinton’s 2006 breakthrough results ● Didn’t work for decades, started working in 2006 ● Reasons why they started working are still poorly understood

  15. Deep Neural Networks (same slide, annotated) The last layer is just logistic regression

  16. Deep Neural Networks Lower layers can be viewed as feature extractors for the last-layer logistic regression.

  17. Deep Neural Networks ● Mathematically, essentially iterated matrix multiplication with an interleaved non-linear function ● Each layer is of the form x ↦ σ(Wx + b): multiply by a weight matrix W, add a bias b, apply an elementwise non-linearity σ
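To make the "iterated matrix multiplication with an interleaved non-linearity" view concrete, a plain NumPy sketch with random, untrained weights; the layer sizes are made up for illustration:

```python
# Forward pass of a tiny feed-forward network in plain NumPy. Weights are
# random (untrained); this only illustrates the structure.
import numpy as np

def relu(z):                                 # a common modern non-linearity
    return np.maximum(z, 0.0)

def softmax(z):                              # turns scores into probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=10)                      # input feature vector

# Each layer is x -> sigma(W x + b).
W1, b1 = rng.normal(size=(32, 10)), np.zeros(32)
W2, b2 = rng.normal(size=(16, 32)), np.zeros(16)
W3, b3 = rng.normal(size=(3, 16)), np.zeros(3)

h1 = relu(W1 @ x + b1)                       # feature extraction ...
h2 = relu(W2 @ h1 + b2)                      # ... more feature extraction ...
probs = softmax(W3 @ h2 + b3)                # ... last layer: logistic regression
print(probs)
```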

  18. Deep Neural Networks ● Structure of the DNN is encoded in restrictions on the shape of the matrices ● Convolutional NNs also force many weights in the lower layers to be the same (translation invariance, locality) ● Modern DNNs often use ReLU or other non-linearities instead of the sigmoid

  19. Deep Neural Networks ● Huge success in areas where feature engineering was traditionally very hard ○ Image processing tasks ○ Speech recognition tasks ○ ... ● Data-hungry: Many parameters to estimate, clearly one needs a fair amount of data to estimate them well ● Good way to think about non-recurrent DNNs: Sophisticated feature extractors for logistic regression.

  20. Deep Neural Networks Lots of competing implementations now. Simply google “deep learning framework”: TensorFlow, Keras, Torch, Caffe, etc.

  21. Transfer learning ● Lower layers of DNN extract structure from input ● Image processing example: Edge detection, shapes etc. ● Low-level features for task A may be useful features for task B, too ● Transfer learning: Take DNN trained on task A, then try to re-train it to perform task B ● Example: Google inception NN, Hotdog / not Hotdog app, other example later
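A hedged transfer-learning sketch with Keras: take a network pre-trained on ImageNet, freeze its lower layers, and retrain only a new last layer for task B. MobileNetV2 and the two-class "hotdog / not hotdog" head are assumptions for illustration (the deck mentions Google's Inception network):

```python
# Transfer learning sketch (TensorFlow/Keras assumed; MobileNetV2 stands in
# for the Inception network mentioned in the deck).
import tensorflow as tf

# Pre-trained on ImageNet (task A); drop the classification head.
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3),
)
base.trainable = False                       # freeze the low-level feature layers

# New last layer for task B, e.g. a hypothetical hotdog / not-hotdog classifier.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(task_b_images, task_b_labels, epochs=5)   # hypothetical small dataset
```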

  22. HMMs for sequence tagging ● Consider the problem of assigning a sequence of syllables to an audio sample ● The space of possible label sequences grows exponentially with the sequence length ● Think of a person’s voice as a state machine

  23. HMMs for sequence tagging ● Depending on what syllable is currently pronounced, the audio spectrum changes ● Voice probabilistically transitions between states ● Training an HMM: ○ Specify the structure of the state machine ○ Provide labeled data to infer … ■ Transition probabilities between states ■ Distribution of data emitted at each state ● Inference in HMMs: ○ Provide a data sequence to infer … ■ Most likely sequence through the state machine that would have produced the data sequence

  24. HMMs for sequence tagging ● Limitation: Independence assumption: ○ only the current state determines data distribution ○ only the current state determines transition probabilities to the next state ● Generative model: ○ Easy to “sample” from the distribution the model learnt ○ Everybody has seen Markov Twitter Bots?

  25. HMMs for sequence tagging Example implementation: http://ghmm.sourceforge.net/ghmm-python-tutorial.html Rabiner’s very accessible HMM tutorial: https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
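A self-contained NumPy sketch of HMM inference (the Viterbi algorithm): given an observation sequence, recover the most likely path through the state machine. The states, transition probabilities and emission probabilities are made-up toy values:

```python
# Viterbi decoding for a toy HMM in plain NumPy: find the most likely
# state sequence given an observation sequence. All probabilities below
# are illustrative, not learned from data.
import numpy as np

states = ["A", "B"]                          # hypothetical states (e.g. syllables)
start = np.array([0.6, 0.4])                 # P(initial state)
trans = np.array([[0.7, 0.3],                # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],            # P(observed symbol | state)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                           # observed symbol indices

n_states, T = len(states), len(obs)
log_delta = np.zeros((T, n_states))          # best log-probability ending in each state
backptr = np.zeros((T, n_states), dtype=int)

log_delta[0] = np.log(start) + np.log(emit[:, obs[0]])
for t in range(1, T):
    scores = log_delta[t - 1][:, None] + np.log(trans)   # (prev state, next state)
    backptr[t] = scores.argmax(axis=0)
    log_delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])

# Backtrack the most likely path through the state machine.
path = [int(log_delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t][path[-1]]))
print([states[s] for s in reversed(path)])
```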

  26. CRFs for sequence tagging ● The HMM independence assumption for state transitions is often not true in practice ● Example: part-of-speech tagging ○ The probability of a word being of a particular type depends on the type assigned to the previous word ● HMMs model the joint distribution, but we normally want the conditional distribution ● CRFs are the sequence form of logistic regression: p(y | x) = exp( Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, x, t) ) / Z(x) ● Linear-chain CRFs are computationally tractable ● More complex dependencies can make them intractable

  27. CRFs for sequence tagging Pretty high-performance example implementation: https://wapiti.limsi.fr/ Corresponding paper: “Practical very large scale CRFs” http://www.aclweb.org/anthology/P10-1052
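A small NumPy sketch of the linear-chain CRF conditional probability p(y | x): score a candidate label sequence and normalise by the partition function Z(x), computed with the forward algorithm. The scores are random and nothing is trained; this only illustrates the model form (it is not the Wapiti implementation linked above):

```python
# Linear-chain CRF conditional probability p(y | x) in plain NumPy:
# unnormalised sequence score divided by the partition function Z(x).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
T, K = 4, 3                                  # sequence length, number of labels
unary = rng.normal(size=(T, K))              # per-position score of each label given x
pairwise = rng.normal(size=(K, K))           # transition score between adjacent labels

def sequence_score(y):
    """Unnormalised log-score of a full label sequence."""
    s = unary[np.arange(T), y].sum()
    s += sum(pairwise[y[t - 1], y[t]] for t in range(1, T))
    return s

# Forward algorithm in log-space: Z(x) sums over all K**T label sequences.
alpha = unary[0].copy()
for t in range(1, T):
    alpha = unary[t] + logsumexp(alpha[:, None] + pairwise, axis=0)
log_Z = logsumexp(alpha)

y = np.array([0, 2, 1, 1])                   # some candidate label sequence
print("p(y|x) =", np.exp(sequence_score(y) - log_Z))
```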

  28. Approximate Nearest Neighbor Search A family of hash functions (from the domain you wish to search to some hash domain) is locality-sensitive if there are distances r1 < r2 and probabilities p1 > p2 such that: points at distance at most r1 hash to the same value with probability at least p1, and points at distance at least r2 hash to the same value with probability at most p2.

  29. What does this mean? “For similar objects, the odds of a randomly drawn hash function to evaluate to the same value should be higher than for dissimilar objects.”

  30. LSH for similarity search ● Often a matter of designing a good hash function family for your domain ● Rest of the implementation is mostly “pluggable” ● For Euclidean and angular distance, several good, public, FOSS libraries exist that can be used off-the-shelf
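As a concrete example of such a family for angular distance, a NumPy sketch of the random-hyperplane ("SimHash") hash: each hash function is the sign of a dot product with a random direction, and similar vectors agree on that sign more often than dissimilar ones:

```python
# Random-hyperplane ("SimHash") family for angular distance, in NumPy.
import numpy as np

rng = np.random.default_rng(0)
dim, n_hashes = 50, 10_000
planes = rng.normal(size=(n_hashes, dim))    # one random hyperplane per hash function

def hashes(v):
    return np.sign(planes @ v)               # +1 / -1 bucket per hash function

a = rng.normal(size=dim)
b_close = a + 0.1 * rng.normal(size=dim)     # almost the same direction as a
b_far = rng.normal(size=dim)                 # an unrelated direction

# Similar vectors collide (agree) far more often than dissimilar ones.
print("collision rate, similar:   ", (hashes(a) == hashes(b_close)).mean())
print("collision rate, dissimilar:", (hashes(a) == hashes(b_far)).mean())
```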

  31. ANNoy and FalcoNN ● ANNoy: partition the space into halves by random sampling & centroids; build a tree structure out of these halves; build N such trees ● FalcoNN: use a particular polytope hash Both work pretty well -- FOSS C++ libraries, easy-to-use Python bindings.

  32. Geometric intuition behind ANNoy

  33. Pick two random points to start

  34. Pick a new random point

  35. Measure distance to initial points

  36. Pick closer element

  37. Calculate average

  38. Repeat with new point

  39. Result: Two “centroids”

  40. Split space in the middle between

  41. Repeat on both sides until buckets small

  42. Repeat on both sides until buckets small

  43. Result: Tree tiling of our space

  44. Each color: Tree-leaf / hash bucket

  45. ANNoy intuition ● Each tree is a “hash function” (maps a point to a bucket) ● Easy to generate a new tree (sample random points, two centroids etc) ● Nearby points have higher probability to end up in same bucket than far-away points ● ⇒ A family of locality-sensitive hashes
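A hedged usage sketch of the Annoy library itself (pip install annoy; the vectors and parameters are made up): build a forest of trees over some vectors and query for approximate nearest neighbours:

```python
# Annoy usage sketch; vectors and parameters are illustrative only.
import numpy as np
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, "angular")           # angular distance, as in the slides

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, dim))
for i, v in enumerate(vectors):
    index.add_item(i, v.tolist())            # item id -> vector

index.build(10)                              # 10 trees; each tree is one "hash function"

# 5 approximate nearest neighbours of item 0's vector (item 0 itself included).
print(index.get_nns_by_vector(vectors[0].tolist(), 5, include_distances=True))
```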

  46. Example: Image similarity search … in < 100 lines of Python. ● How to best turn pictures into vectors of reals? ● Image-classification Deep Neural Networks do this - if you just cut off the last layer ● Step 1: Convert image files to real vectors by using a pre-trained image classification CNN and “cutting off” the last layer
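A hedged sketch of step 1 with Keras: load a pre-trained image-classification CNN with the last layer cut off and turn an image file into a feature vector. MobileNetV2 and the file name are assumptions; the deck does not specify which network is used:

```python
# Step 1 sketch (TensorFlow/Keras assumed): run a pre-trained classification
# CNN with its head removed and keep the activations as the image's vector.
import numpy as np
import tensorflow as tf

cnn = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    include_top=False,                       # "cut off" the classification layer
    pooling="avg",                           # one fixed-size vector per image
)

def image_to_vector(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    return cnn.predict(x[np.newaxis])[0]     # a 1280-dimensional real vector

# vec = image_to_vector("some_image.jpg")    # hypothetical file name
# These vectors can then be indexed with Annoy as in the earlier sketch.
```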
