

  1. Autoregressive Models. Stefano Ermon, Aditya Grover. Stanford University. Deep Generative Models, Lecture 3.

  2. Learning a generative model.
     We are given a training set of examples, e.g., images of dogs. We want to learn a probability distribution p(x) over images x such that:
     1. Generation: if we sample x_new ∼ p(x), then x_new should look like a dog (sampling).
     2. Density estimation: p(x) should be high if x looks like a dog, and low otherwise (anomaly detection).
     3. Unsupervised representation learning: we should be able to learn what these images have in common, e.g., ears, tail, etc. (features).
     First question: how to represent p(x). Second question: how to learn it.

  3. Recap: Bayesian networks vs. neural models.
     Chain rule (fully general, no assumptions needed; exponential size, no free lunch):
     p(x_1, x_2, x_3, x_4) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_2, x_3)
     Bayes net (assumes conditional independencies; the crossed-out parents on the slide are dropped, e.g., x_1 from the third factor and x_2, x_3 from the fourth):
     p(x_1, x_2, x_3, x_4) ≈ p_CPT(x_1) p_CPT(x_2 | x_1) p_CPT(x_3 | x_2) p_CPT(x_4 | x_1)
     Tabular representations via conditional probability tables (CPTs).
     Neural models (assume a specific functional form for the conditionals; a sufficiently deep neural net can approximate any function):
     p(x_1, x_2, x_3, x_4) ≈ p(x_1) p(x_2 | x_1) p_Neural(x_3 | x_1, x_2) p_Neural(x_4 | x_1, x_2, x_3)
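
     To make the size gap concrete, here is a small Python illustration (my own numbers and sketch, not part of the slides) of how many parameters the fully general tabular factorization needs compared with logistic conditionals:

         # A rough illustration: parameter counts for n binary variables.
         n = 784  # e.g., one binarized MNIST image

         # Tabular conditionals: p(x_i | x_1, ..., x_{i-1}) needs 2^(i-1) numbers,
         # so the full joint needs 2^n - 1 parameters in total.
         tabular = 2**n - 1

         # Logistic (autoregressive) conditionals: about i parameters for the i-th
         # factor, so roughly n^2 / 2 overall (see the FVSBN slides below).
         logistic = n * (n + 1) // 2

         print(f"tabular:  {float(tabular):.2e}")   # ~1.0e+236
         print(f"logistic: {logistic}")             # 307720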

  4. Neural models for classification.
     Setting: binary classification of Y ∈ {0, 1} given inputs X ∈ {0, 1}^n. For classification we care about p(Y | x), and assume p(Y = 1 | x; α) = f(x, α).
     Logistic regression: let z(α, x) = α_0 + Σ_{i=1}^n α_i x_i. Then p_logit(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1 / (1 + e^{−z}).
     Non-linear dependence: let h(A, b, x) = f(Ax + b) be a non-linear transformation of the inputs (features). Then p_Neural(Y = 1 | x; α, A, b) = σ(α_0 + Σ_{i=1}^h α_i h_i).
     More flexible, but more parameters: A, b, α. Repeat multiple times to get a multilayer perceptron (a neural network).
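
     A minimal NumPy sketch of the two classifiers (the tanh non-linearity, toy sizes, and random parameters are assumptions for illustration, not from the slides):

         import numpy as np

         def sigmoid(z):
             return 1.0 / (1.0 + np.exp(-z))

         # Logistic regression: p(Y = 1 | x; alpha) = sigma(alpha_0 + sum_i alpha_i x_i)
         def p_logit(x, alpha0, alpha):
             return sigmoid(alpha0 + alpha @ x)

         # One hidden layer: h = f(Ax + b), then p(Y = 1 | x) = sigma(alpha_0 + alpha . h)
         def p_neural(x, A, b, alpha0, alpha, f=np.tanh):
             h = f(A @ x + b)                      # non-linear features of the input
             return sigmoid(alpha0 + alpha @ h)

         # Toy usage with random parameters (n inputs, H hidden units).
         rng = np.random.default_rng(0)
         n, H = 10, 5
         x = rng.integers(0, 2, size=n)            # a binary input vector
         print(p_logit(x, 0.0, rng.normal(size=n)))
         print(p_neural(x, rng.normal(size=(H, n)), rng.normal(size=H),
                        0.0, rng.normal(size=H)))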

  5. Motivating example: MNIST.
     Suppose we have a dataset D of handwritten digits (binarized MNIST). Each image has n = 28 × 28 = 784 pixels, and each pixel is either black (0) or white (1). We want to learn a probability distribution p(v) = p(v_1, ..., v_784) over v ∈ {0, 1}^784 such that when v ∼ p(v), v looks like a digit.
     Idea: define a model family {p_θ(v), θ ∈ Θ}, then pick a good member based on the training data D (more on that later). How do we parameterize p_θ(v)?
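
     For concreteness, a small sketch (with stand-in random data and a hypothetical fixed threshold, since the slides do not specify the binarization) of turning grayscale digits into binary vectors v ∈ {0, 1}^784:

         import numpy as np

         # Stand-in grayscale images in [0, 255]; real MNIST data would replace this.
         rng = np.random.default_rng(0)
         images = rng.integers(0, 256, size=(100, 28, 28))

         # Threshold each pixel (128 here; stochastic binarization is a common
         # alternative) and flatten each image to a vector v in {0, 1}^784.
         V = (images >= 128).astype(np.int64).reshape(len(images), 28 * 28)
         print(V.shape)   # (100, 784)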

  6. Fully Visible Sigmoid Belief Network (FVSBN).
     Pick an ordering, i.e., order the variables (pixels) from the top-left (V_1) to the bottom-right (V_{n=784}), and use the chain-rule factorization, without loss of generality:
     p(v_1, ..., v_784) = p(v_1) p(v_2 | v_1) p(v_3 | v_1, v_2) ··· p(v_n | v_1, ..., v_{n−1})
     Some conditionals are too complex to be stored in tabular form, so we assume
     p(v_1, ..., v_784) = p_CPT(v_1; α^1) p_logit(v_2 | v_1; α^2) p_logit(v_3 | v_1, v_2; α^3) ··· p_logit(v_n | v_1, ..., v_{n−1}; α^n)
     More explicitly:
     p_CPT(V_1 = 1; α^1) = α^1, p(V_1 = 0) = 1 − α^1
     p_logit(V_2 = 1 | v_1; α^2) = σ(α^2_0 + α^2_1 v_1)
     p_logit(V_3 = 1 | v_1, v_2; α^3) = σ(α^3_0 + α^3_1 v_1 + α^3_2 v_2)
     Note: this is a modeling assumption. We are using logistic regression to predict the next pixel from the previous ones. This is called autoregressive.
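
     As a quick worked example (with made-up parameters, not from the slides): if α^3 = (α^3_0, α^3_1, α^3_2) = (0.5, −1, 2) and the previous pixels are v_1 = 0, v_2 = 1, then p_logit(V_3 = 1 | v_1, v_2; α^3) = σ(0.5 − 1·0 + 2·1) = σ(2.5) ≈ 0.92.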

  7. Fully Visible Sigmoid Belief Network.
     The conditional variables V_i | V_1, ..., V_{i−1} are Bernoulli with parameters
     v̂_i = p(V_i = 1 | v_1, ..., v_{i−1}; α^i) = p(V_i = 1 | v_{<i}; α^i) = σ(α^i_0 + Σ_{j=1}^{i−1} α^i_j v_j)
     How to evaluate p(v_1, ..., v_784)? Multiply all the conditionals (factors). In the example above:
     p(V_1 = 0, V_2 = 1, V_3 = 1, V_4 = 0) = (1 − v̂_1) × v̂_2 × v̂_3 × (1 − v̂_4)
       = (1 − v̂_1) × v̂_2(V_1 = 0) × v̂_3(V_1 = 0, V_2 = 1) × (1 − v̂_4(V_1 = 0, V_2 = 1, V_3 = 1))
     How to sample from p(v_1, ..., v_784)?
     1. Sample v_1 ∼ p(v_1) (np.random.choice([1, 0], p=[v̂_1, 1 − v̂_1]))
     2. Sample v_2 ∼ p(v_2 | V_1 = v_1), conditioning on the value sampled in step 1
     3. Sample v_3 ∼ p(v_3 | V_1 = v_1, V_2 = v_2), and so on
     How many parameters? 1 + 2 + 3 + ··· + n ≈ n^2 / 2.
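
     A minimal NumPy sketch of FVSBN evaluation and ancestral sampling (the parameter layout and the toy size n = 4 are assumptions for illustration):

         import numpy as np

         def sigmoid(z):
             return 1.0 / (1.0 + np.exp(-z))

         # alphas[i] holds (alpha^i_0, alpha^i_1, ..., alpha^i_{i-1}): i + 1 numbers
         # for the conditional at position i (0-indexed), ~n^2/2 parameters overall.
         def fvsbn_sample(alphas, rng):
             n = len(alphas)
             v = np.zeros(n, dtype=np.int64)
             for i in range(n):
                 v_hat = sigmoid(alphas[i][0] + alphas[i][1:] @ v[:i])  # p(V_i = 1 | v_<i)
                 v[i] = rng.random() < v_hat                            # ancestral sampling
             return v

         def fvsbn_log_prob(v, alphas):
             log_p = 0.0
             for i in range(len(v)):
                 v_hat = sigmoid(alphas[i][0] + alphas[i][1:] @ v[:i])
                 log_p += np.log(v_hat if v[i] == 1 else 1.0 - v_hat)
             return log_p

         # Toy usage on n = 4 variables with random parameters.
         rng = np.random.default_rng(0)
         alphas = [rng.normal(size=i + 1) for i in range(4)]
         v = fvsbn_sample(alphas, rng)
         print(v, fvsbn_log_prob(v, alphas))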

  8. FVSBN results.
     Figure from "Learning Deep Sigmoid Belief Networks with Data Augmentation" (2015): training data on the left (Caltech 101 Silhouettes), samples from the model on the right. It was the best-performing model they tested on this dataset in 2015 (more on evaluation later).

  9. NADE: Neural Autoregressive Density Estimation.
     To improve the model, use a one-layer neural network instead of logistic regression:
     h_i = σ(A_i v_{<i} + c_i)
     v̂_i = p(v_i | v_1, ..., v_{i−1}; A_i, c_i, α_i, b_i) = σ(α_i · h_i + b_i)
     where A_i, c_i, α_i, b_i are the parameters. For example, h_2 = σ(A_2 v_1 + c_2) and h_3 = σ(A_3 [v_1; v_2] + c_3), with a separate weight matrix A_i ∈ R^{H×(i−1)} and bias c_i for each position.

  10. NADE: Neural Autoregressive Density Estimation.
     Tie the weights to reduce the number of parameters and speed up computation:
     h_i = σ(W_{·,<i} v_{<i} + c)
     v̂_i = p(v_i | v_1, ..., v_{i−1}) = σ(α_i · h_i + b_i)
     For example, writing w_j for the j-th column of W:
     h_2 = σ(w_1 v_1 + c), h_3 = σ(w_1 v_1 + w_2 v_2 + c), h_4 = σ(w_1 v_1 + w_2 v_2 + w_3 v_3 + c)
     so the same columns w_1, w_2, ... are reused across positions.
     How many parameters? Linear in n: W ∈ R^{H×n}, plus n logistic-regression coefficient vectors (α_i, b_i) ∈ R^{H+1}. The probability p(v) is evaluated in O(nH).
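
     A minimal sketch of the tied-weight forward pass; updating a running pre-activation incrementally is what gives the O(nH) evaluation cost (the shapes and toy initialization are assumptions for illustration):

         import numpy as np

         def sigmoid(z):
             return 1.0 / (1.0 + np.exp(-z))

         # Tied-weight NADE sketch: W is H x n, c has length H, and each position i
         # has its own output weights alpha[i] (length H) and bias b[i].
         def nade_conditionals(v, W, c, alpha, b):
             H, n = W.shape
             a = c.copy()                  # running pre-activation: W_{.,<i} v_<i + c
             v_hat = np.zeros(n)
             for i in range(n):
                 h = sigmoid(a)                            # h_i = sigma(W_{.,<i} v_<i + c)
                 v_hat[i] = sigmoid(alpha[i] @ h + b[i])   # p(v_i = 1 | v_<i)
                 a += W[:, i] * v[i]       # O(H) update per position -> O(nH) overall
             return v_hat

         # Toy usage: evaluate log p(v) for one binary vector.
         rng = np.random.default_rng(0)
         H, n = 16, 784
         W, c = 0.01 * rng.normal(size=(H, n)), np.zeros(H)
         alpha, b = 0.01 * rng.normal(size=(n, H)), np.zeros(n)
         v = rng.integers(0, 2, size=n)
         v_hat = nade_conditionals(v, W, c, alpha, b)
         print(np.sum(v * np.log(v_hat) + (1 - v) * np.log(1 - v_hat)))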

  11. NADE results.
     Figure from "The Neural Autoregressive Distribution Estimator" (2011): samples on the left, conditional probabilities v̂_i on the right.

  12. General discrete distributions.
     How do we model non-binary discrete random variables V_i ∈ {1, ..., K} (e.g., color images)? Solution: let v̂_i parameterize a categorical distribution:
     h_i = σ(W_{·,<i} v_{<i} + c)
     p(v_i | v_1, ..., v_{i−1}) = Cat(p_i^1, ..., p_i^K)
     v̂_i = (p_i^1, ..., p_i^K) = softmax(V_i h_i + b_i), where V_i and b_i are the output weights and bias for position i.
     Softmax generalizes the sigmoid/logistic function σ(·): it transforms a vector of K numbers into a vector of K probabilities (non-negative, summing to 1):
     softmax(a) = softmax(a_1, ..., a_K) = (exp(a_1) / Σ_i exp(a_i), ..., exp(a_K) / Σ_i exp(a_i))
     In NumPy: np.exp(a) / np.sum(np.exp(a)).
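
     A small sketch of one categorical output: a numerically stabilized softmax over K logits, followed by a draw from the resulting distribution (the toy sizes and random parameters are assumptions):

         import numpy as np

         def softmax(a):
             a = a - np.max(a)             # subtract the max for numerical stability
             e = np.exp(a)
             return e / np.sum(e)

         # Hypothetical output layer for one position: hidden state h_i (length H),
         # output weights V_i (K x H) and bias b_i (length K).
         rng = np.random.default_rng(0)
         H, K = 16, 256                    # e.g., K = 256 intensity levels
         h_i = rng.normal(size=H)
         V_i, b_i = 0.01 * rng.normal(size=(K, H)), np.zeros(K)

         p_i = softmax(V_i @ h_i + b_i)    # v_hat_i = (p_i^1, ..., p_i^K)
         v_i = rng.choice(K, p=p_i)        # sample v_i ~ Cat(p_i^1, ..., p_i^K)
         print(p_i.sum(), v_i)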

  13. RNADE.
     How do we model continuous random variables V_i ∈ R (e.g., speech signals)? Solution: let v̂_i parameterize a continuous distribution, e.g., a mixture of K Gaussians, so v̂_i needs to specify the mean and variance of each Gaussian.

  14. RNADE.
     How do we model continuous random variables V_i ∈ R (e.g., speech signals)? Solution: let v̂_i parameterize a continuous distribution, e.g., an equal-weight mixture of K Gaussians:
     p(v_i | v_1, ..., v_{i−1}) = (1/K) Σ_{j=1}^K N(v_i; μ_i^j, σ_i^j)
     h_i = σ(W_{·,<i} v_{<i} + c)
     v̂_i = (μ_i^1, ..., μ_i^K, σ_i^1, ..., σ_i^K) = f(h_i)
     v̂_i defines the mean and variance of each Gaussian component (μ_i^j, σ_i^j). We can use the exponential exp(·) to ensure the variances are positive.
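
     A minimal sketch of one RNADE conditional as an equal-weight mixture of K Gaussians (the hypothetical output matrices U_mu, U_s and parameterizing the scale through its log are assumptions, not the paper's exact design):

         import numpy as np

         def sigmoid(z):
             return 1.0 / (1.0 + np.exp(-z))

         # Log-density of an equal-weight mixture of K Gaussians for one conditional.
         def mixture_log_pdf(v_i, mu, log_sigma):
             sigma = np.exp(log_sigma)     # exp(.) keeps the scales positive
             log_comp = (-0.5 * ((v_i - mu) / sigma) ** 2
                         - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
             return np.log(np.mean(np.exp(log_comp)))   # (1/K) sum_j N(v_i; mu_j, sigma_j)

         # Hypothetical mapping f: h_i -> (mu_1, ..., mu_K, log_sigma_1, ..., log_sigma_K).
         rng = np.random.default_rng(0)
         H, K = 16, 4
         h_i = sigmoid(rng.normal(size=H))
         U_mu, U_s = 0.1 * rng.normal(size=(K, H)), 0.1 * rng.normal(size=(K, H))
         mu, log_sigma = U_mu @ h_i, U_s @ h_i

         print(mixture_log_pdf(0.3, mu, log_sigma))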
