Feature extraction from deep models Olgert Denas
Synopsis
Intro to deep models ● Neurons & Nets ● Learning & Depth
Feature extraction ● Theory ● 1 Layer ● Nets
Applications ● dimer ● G1E model
Neural computation Inspired by organic neural systems: a system of simple computing units with learnable parameters. Originally intended for conventional computing (efficient arithmetic and calculus), but von Neumann’s architecture “won”
Neural computation Today mainly used in machine learning. Declarative problems can be stated unambiguously: sort an array of integers. Procedural problems can only be stated by examples: find fraud in network logs
Artificial Neural Nets
Neurons
Neurons The artificial neuron is very different from the biological one; after all, it is a model
Neurons
Natural (organic): complicated transfer function; mixed communication (continuous/impulse); state (chemical, physical changes); synaptic delays, long axons
Artificial: parametric transfer function; discrete or continuous communication; no state, output is f(x; θ); fixed connections, computational delays
Nets of neurons
Computers and brains
Speed: ms / operation (brain) vs. ns / operation (computer)
Size: Tera nodes, Peta connections vs. Giga nodes
Memory: content addressable, in connections vs. contiguous, random access
Computing: distributed / fault tolerant vs. centralized / non-fault-tolerant
Power: 10 W vs. ~300 W (GPU)
Organic vs. artificial computer
ANN architectures ● Feed-forward NNs (and CNNs) ● Recurrent NNs ● RBMs
Feed forward A directed acyclic graph with input (first), hidden, and output (last) layers; connections go from one layer to the next; transfer functions are nonlinearities
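A minimal sketch of this forward pass, assuming sigmoid transfer functions and an arbitrary 2-4-1 layout with random parameters (illustrative choices, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    """Propagate input x through a stack of layers: a = f(W a + b) at each layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # linear map, then nonlinearity
    return a

# Illustrative 2-4-1 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
biases = [rng.normal(size=4), rng.normal(size=1)]
print(feed_forward(np.array([0.5, -1.0]), weights, biases))
```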
Recurrent A directed graph with cycles, possibly with hidden layers; more complicated, realistic, and powerful; well-suited to sequential input; unroll the hidden state, just like DBNs
Restricted Boltzmann Machines A probabilistic model (energy function); a bipartite graph (visible <-> hidden); efficient inference
ANN: Learning
Learning: perceptron
Loop through the labeled examples; on an incorrect output:
* output 0 but the label is 1: w <- w + x
* output 1 but the label is 0: w <- w - x
Guaranteed to find a separating hyperplane (if one exists)
[Diagram: input units X1, X2 connected through weights W1, W2 to an output unit]
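A sketch of this training loop in Python, using the standard perceptron convention that a misclassified positive example adds x to w and a misclassified negative one subtracts it; the toy data is made up:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Perceptron rule: on a wrong prediction, move w toward (label 1) or away from (label 0) x."""
    X = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias as an extra input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = int(x @ w > 0)
            if pred != label:
                w += x if label == 1 else -x
    return w

# Linearly separable toy data: class 1 lies above the line x1 + x2 = 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
w = train_perceptron(X, y)
print(w)  # weights of a separating hyperplane for this data
```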
Learning: perceptron The parity, or counting, problem: recognize binary strings of length 2 with exactly one 1 (red class: 01, 10; green class: 00, 11). A single perceptron cannot solve this, nor many other problems (Minsky & Papert 1969)
Learning: features
Adding a hidden unit between the inputs and the output solves the parity problem:
* 00: no unit is activated, so the output is 0
* 11: the hidden unit cancels the inputs, so the output is 0
* 01, 10: the inputs connect directly to the output, so the output is 1
[Diagram: two input units, one hidden unit, and an output unit; successive slides step through the activations for each input pattern]
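The solution sketched above can be written down directly: the inputs reach the output with weight +1, and a hidden unit that fires only on 11 cancels them with a large negative weight. The particular weights and thresholds below are my own illustrative choice:

```python
def step(z):
    """Threshold unit: 1 if the weighted input exceeds 0, else 0."""
    return 1.0 if z > 0 else 0.0

def parity_net(x1, x2):
    hidden = step(x1 + x2 - 1.5)                # fires only on input 11
    return step(x1 + x2 - 2.0 * hidden - 0.5)   # inputs pass through; the hidden unit cancels them

for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(bits, "->", int(parity_net(*bits)))   # 00 -> 0, 01 -> 1, 10 -> 1, 11 -> 0
```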
Learning: perceptron The perceptron guarantees a separating hyperplane if one exists, but learning from input features requires a lot of “(big) data science”. Have the NN do the “(big) data science”!
Deep supervised learning paradigm Map the “raw” input into intermediate hidden layers. Deep means more layers: more efficient representations, but harder to train. Classify the hidden representation of the data. Learn the weights for both steps using backprop or pre-training
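A compact sketch of this paradigm on a toy task (XOR): a hidden layer maps the raw input to an intermediate representation, a logistic output classifies it, and both sets of weights are learned jointly by backprop. The architecture, learning rate, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # raw input -> hidden representation
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden representation -> class probability
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)          # hidden representation of the raw input
    p = sigmoid(h @ W2 + b2)          # classify the hidden representation
    # backward pass: gradient of the cross-entropy w.r.t. the pre-activations
    d_out = p - y
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(0)
    W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(0)

print(np.round(p.ravel(), 2))  # typically approaches [0, 1, 1, 0]
```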
Feature extraction
Feature extraction Trained NNs can be used to predict, but they are black boxes: it is hard to relate large weights to input features. How do we map features from hidden layers back to the input space?
Learning W, b Batch SGD with early stopping, regularization, and a lot of tricks. Maximize the average of P(Y|X;θ) over the training data, i.e., find a θ with low cross-entropy
Feature extraction: 1 layer P(Y | X; θ) = f(W Xᵀ + b)
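A sketch covering this slide and the previous one: the one-layer model P(Y|X;θ) = f(WXᵀ + b) with a softmax f, trained by minibatch SGD on the cross-entropy. The data, sizes, and hyperparameters below are placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_one_layer(X, y, n_classes, lr=0.1, epochs=50, batch=32, seed=0):
    """Fit theta = {W, b} by minibatch SGD, minimizing the cross-entropy
    (i.e. maximizing the average log P(Y|X; theta) over the training data)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W, b = np.zeros((n_classes, d)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]               # one-hot labels
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            P = softmax(X[idx] @ W.T + b)  # P(Y | X; theta) = f(W X^T + b)
            grad = P - Y[idx]              # gradient of the cross-entropy w.r.t. W X^T + b
            W -= lr * grad.T @ X[idx] / len(idx)
            b -= lr * grad.mean(axis=0)
    return W, b

# Toy data: two Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
W, b = train_one_layer(X, y, n_classes=2)
print((softmax(X @ W.T + b).argmax(axis=1) == y).mean())  # training accuracy
```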
Feature extraction: 1 layer
Given a trained model θ = {W, b} and a label, find an input:
* with that label
* that minimizes the gray area
c_0 = f_θ(E[X_0]), where E[X_0] is the average input with label 0
[Plot: P(Y | E[X_0]) over Y ∈ {0, 1}, with class probabilities 2/3 and 1/3]
Feature extraction: 1 layer
l: label
X_l: input features with label l
E[X_l]: average input for that label
f_θ(E[X]): decision boundary
c_l = f_θ(E[X_l]): constraint boundary
ε: slack (see below)
This is an LP!
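To make the LP concrete, here is one possible reading, not necessarily the one implemented in dimer: assume a one-layer sigmoid model, an objective that prefers small (L1-minimal, non-negative) inputs, which would match the later |X_l| plot, and a single constraint that the extracted input achieves cross-entropy at most ε for the target label. A sketch with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def logit(p):
    return np.log(p / (1.0 - p))

def extract_features(w, b, ce_target, lo=0.0, hi=1.0):
    """Smallest (L1-minimal, non-negative) input whose cross-entropy for the
    target label is at most ce_target, under a one-layer sigmoid model.

    -log sigmoid(w.x + b) <= ce_target  <=>  w.x + b >= logit(exp(-ce_target)),
    a single linear constraint, so the whole problem is an LP.
    """
    d = len(w)
    c = np.ones(d)                                    # minimize sum(x) = |x|_1 for x >= 0
    A_ub = -w.reshape(1, d)                           # -w.x <= b - logit(exp(-ce_target))
    b_ub = np.array([b - logit(np.exp(-ce_target))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(lo, hi)] * d, method="highs")
    return res.x

# Hand-set toy model over 8 "binarized pixels"
w = np.array([2.0, 1.0, 0.5, 0.5, -1.0, -1.0, 0.0, 0.0])
b = 0.0
for eps in (0.05, 0.2, 0.5):                          # looser slack -> sparser extracted input
    x = extract_features(w, b, ce_target=eps)
    print(eps, np.round(x, 2), "|x| =", round(float(x.sum()), 2))
```

The constraint stays linear because the nonlinearity f is monotone, so a bound on f(w·x + b) can be rewritten as a bound on the pre-activation.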
Feature extraction on a stack
Feature extraction: ε The slack variable controls the cross-entropy (CE) achieved by the extracted features. Useful if, say, the average input achieves 0.01 CE but you are happy with 0.2
Linear programming (in 1 page) Optimization problems that minimize a linear cost function subject to linear constraints; very efficient for continuous variables (simplex)
Feature extraction: implementation
MNIST digits 28x28-pixel binarized handwritten digit images; pick pairs of digits and extract the features that differentiate them
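For the digit-pair setting, one convenient fact (assuming a one-layer softmax model, which is my assumption rather than something stated on the slide) is that the log-odds between two digits is again linear in the pixels, so the same LP machinery applies:

```python
import numpy as np

# Placeholder one-layer softmax parameters: 10 digit classes over 28x28 = 784 pixels
rng = np.random.default_rng(0)
W, bias = rng.normal(size=(10, 784)), np.zeros(10)

a, b = 3, 5                       # the digit pair to differentiate
w_pair = W[a] - W[b]              # log P(a|x) - log P(b|x) = w_pair . x + b_pair
b_pair = bias[a] - bias[b]
# w_pair and b_pair can be handed to an LP like the sketch above, asking for a small
# input whose pairwise log-odds exceed a chosen margin.
print(w_pair.shape, b_pair)
```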
Effect of ε on |X_l|
Effect of optimization
Features
Feature extraction: applications
Hematopoiesis & erythroid differentiation (Genes Dev. 8(10):1184-97, 1994; Genome Res. 21(10):1659-71, 2011)
Application: G1E model
dimer
dimer is at http://bitbucket.org/gertidenas/dimer. PULL IT!