

1. Normalizing Flow Models
Stefano Ermon, Aditya Grover
Stanford University
Deep Generative Models, Lecture 7

2. Recap of likelihood-based learning so far
Model families:
Autoregressive models: $p_\theta(x) = \prod_{i=1}^n p_\theta(x_i \mid x_{<i})$
Variational autoencoders: $p_\theta(x) = \int p_\theta(x, z) \, dz$
Autoregressive models provide tractable likelihoods but no direct mechanism for learning features.
Variational autoencoders can learn feature representations (via latent variables $z$) but have intractable marginal likelihoods.
Key question: Can we design a latent variable model with tractable likelihoods? Yes!

3. Simple Prior to Complex Data Distributions
Desirable properties of any model distribution:
Analytic density
Easy to sample
Many simple distributions satisfy the above properties, e.g., Gaussian and uniform distributions.
Unfortunately, data distributions can be much more complex (multi-modal).
Key idea: Map simple distributions (easy to sample and evaluate densities) to complex distributions (learned via data) using the change of variables formula.

4. Change of Variables Formula
Let $Z$ be a uniform random variable $\mathcal{U}[0, 2]$ with density $p_Z$. What is $p_Z(1)$? It is $1/2$.
Let $X = 4Z$, and let $p_X$ be its density. What is $p_X(4)$?
Tempting (wrong) answer: $p_X(4) = p(X = 4) = p(4Z = 4) = p(Z = 1) = p_Z(1) = 1/2$. No!
Clearly, $X$ is uniform on $[0, 8]$, so $p_X(4) = 1/8$.

5. Change of Variables Formula
Change of variables (1D case): If $X = f(Z)$ and $f(\cdot)$ is monotone with inverse $Z = f^{-1}(X) = h(X)$, then
$p_X(x) = p_Z(h(x)) \, |h'(x)|$
Previous example: If $X = 4Z$ and $Z \sim \mathcal{U}[0, 2]$, what is $p_X(4)$? Note that $h(X) = X/4$, so
$p_X(4) = p_Z(1) \, |h'(4)| = 1/2 \times 1/4 = 1/8$
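
A quick numerical sanity check of this example (my own sketch, not part of the lecture): it estimates $p_X(4)$ from samples of $X = 4Z$ and compares it with the change-of-variables formula.

```python
# Monte Carlo check: Z ~ U[0, 2], X = 4Z, so X ~ U[0, 8] and p_X(4) should be 1/8, not 1/2.
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 2.0, size=1_000_000)
x = 4.0 * z

# Estimate p_X(4) as the fraction of samples in a small window around 4.
eps = 0.05
p_x_4_mc = np.mean(np.abs(x - 4.0) < eps) / (2 * eps)

# Change of variables: h(x) = x / 4, |h'(x)| = 1/4, p_Z(h(4)) = p_Z(1) = 1/2.
p_x_4_formula = 0.5 * 0.25

print(p_x_4_mc, p_x_4_formula)  # both close to 0.125
```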

6. Geometry: Determinants and Volumes
Let $Z$ be a uniform random vector in $[0, 1]^n$.
Let $X = AZ$ for a square invertible matrix $A$, with inverse $W = A^{-1}$. How is $X$ distributed?
Geometrically, the matrix $A$ maps the unit hypercube $[0, 1]^n$ to a parallelotope.
Hypercube and parallelotope are generalizations of square/cube and parallelogram/parallelepiped to higher dimensions.
Figure: The matrix $A = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$ maps a unit square to a parallelogram.

7. Geometry: Determinants and Volumes
The volume of the parallelotope is equal to the absolute value of the determinant of the transformation $A$:
$\det(A) = \det \begin{pmatrix} a & c \\ b & d \end{pmatrix} = ad - bc$
$X$ is uniformly distributed over the parallelotope. Hence, we have
$p_X(x) = p_Z(Wx) \, |\det(W)| = p_Z(Wx) / |\det(A)|$
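
The slide's formula can be checked directly in code. The matrix $A$ below is an illustrative choice of mine, not one from the lecture.

```python
# Sketch: density of X = A Z for Z uniform on the unit square,
# using p_X(x) = p_Z(W x) / |det(A)| with W = A^{-1}.
import numpy as np

A = np.array([[2.0, 1.0],
              [0.5, 3.0]])   # example invertible matrix (a c; b d)
W = np.linalg.inv(A)

def p_Z(z):
    # Uniform density on the unit square [0, 1]^2.
    inside = np.all((z >= 0.0) & (z <= 1.0))
    return 1.0 if inside else 0.0

def p_X(x):
    return p_Z(W @ x) / abs(np.linalg.det(A))

x = A @ np.array([0.25, 0.75])        # a point inside the parallelotope
print(p_X(x))                         # 1 / |det(A)| = 1 / 5.5
print(p_X(np.array([100.0, 100.0])))  # 0.0, outside the parallelotope
```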

8. Generalized Change of Variables
For linear transformations specified via $A$, the change in volume is given by the determinant of $A$.
For non-linear transformations $f(\cdot)$, the linearized change in volume is given by the determinant of the Jacobian of $f(\cdot)$.
Change of variables (general case): If the mapping between $Z$ and $X$, given by $f: \mathbb{R}^n \mapsto \mathbb{R}^n$, is invertible such that $X = f(Z)$ and $Z = f^{-1}(X)$, then
$p_X(x) = p_Z\left(f^{-1}(x)\right) \left| \det\left( \frac{\partial f^{-1}(x)}{\partial x} \right) \right|$
Note 1: $x$, $z$ need to be continuous and have the same dimension. For example, if $x \in \mathbb{R}^n$ then $z \in \mathbb{R}^n$.
Note 2: For any invertible matrix $A$, $\det(A^{-1}) = \det(A)^{-1}$. Hence, equivalently, with $z = f^{-1}(x)$:
$p_X(x) = p_Z(z) \left| \det\left( \frac{\partial f(z)}{\partial z} \right) \right|^{-1}$
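
For a non-linear example (my own illustration, not from the slides), take $Z \sim \mathcal{N}(0, 1)$ and $X = \exp(Z)$, so $f^{-1}(x) = \log x$ and $|\mathrm{d} f^{-1}/\mathrm{d}x| = 1/x$; the resulting density is the standard log-normal.

```python
# Sketch: general change of variables for a simple invertible non-linear map.
import numpy as np

def p_Z(z):
    # Standard Gaussian density.
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def p_X(x):
    # p_Z(f^{-1}(x)) |det(∂f^{-1}/∂x)| with f(z) = exp(z).
    return p_Z(np.log(x)) * (1.0 / x)

# Monte Carlo check at x = 2.0.
rng = np.random.default_rng(0)
x_samples = np.exp(rng.standard_normal(2_000_000))
eps = 0.01
print(np.mean(np.abs(x_samples - 2.0) < eps) / (2 * eps))  # ≈ p_X(2.0)
print(p_X(2.0))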

9. Two-Dimensional Example
Let $Z_1$ and $Z_2$ be continuous random variables with joint density $p_{Z_1, Z_2}$.
Let $u = (u_1, u_2)$ be a transformation and $v = (v_1, v_2)$ the inverse transformation.
Let $X_1 = u_1(Z_1, Z_2)$ and $X_2 = u_2(Z_1, Z_2)$. Then $Z_1 = v_1(X_1, X_2)$ and $Z_2 = v_2(X_1, X_2)$.
$p_{X_1, X_2}(x_1, x_2) = p_{Z_1, Z_2}\big(v_1(x_1, x_2), v_2(x_1, x_2)\big) \left| \det \begin{pmatrix} \frac{\partial v_1(x_1, x_2)}{\partial x_1} & \frac{\partial v_1(x_1, x_2)}{\partial x_2} \\ \frac{\partial v_2(x_1, x_2)}{\partial x_1} & \frac{\partial v_2(x_1, x_2)}{\partial x_2} \end{pmatrix} \right|$  (inverse)
$= p_{Z_1, Z_2}(z_1, z_2) \left| \det \begin{pmatrix} \frac{\partial u_1(z_1, z_2)}{\partial z_1} & \frac{\partial u_1(z_1, z_2)}{\partial z_2} \\ \frac{\partial u_2(z_1, z_2)}{\partial z_1} & \frac{\partial u_2(z_1, z_2)}{\partial z_2} \end{pmatrix} \right|^{-1}$  (forward)
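
A concrete instance of the 2D formula (the particular map is my own illustrative choice): forward $u(z_1, z_2) = (z_1, z_2 e^{z_1})$, inverse $v(x_1, x_2) = (x_1, x_2 e^{-x_1})$, whose Jacobian is lower triangular with $|\det| = e^{-x_1}$.

```python
# Sketch: apply the 2D change-of-variables formula and verify it by Monte Carlo.
import numpy as np

def p_Z(z1, z2):
    # Standard 2D Gaussian prior.
    return np.exp(-0.5 * (z1**2 + z2**2)) / (2 * np.pi)

def p_X(x1, x2):
    z1, z2 = x1, x2 * np.exp(-x1)        # inverse transformation v(x1, x2)
    return p_Z(z1, z2) * np.exp(-x1)     # times |det Jacobian of the inverse|

# Monte Carlo check at (x1, x2) = (0.5, 1.0).
rng = np.random.default_rng(0)
z = rng.standard_normal((2_000_000, 2))
x1s, x2s = z[:, 0], z[:, 1] * np.exp(z[:, 0])
eps = 0.05
hits = (np.abs(x1s - 0.5) < eps) & (np.abs(x2s - 1.0) < eps)
print(hits.mean() / (2 * eps) ** 2)  # ≈ p_X(0.5, 1.0)
print(p_X(0.5, 1.0))
```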

10. Normalizing Flow Models
Consider a directed, latent-variable model over observed variables $X$ and latent variables $Z$.
In a normalizing flow model, the mapping between $Z$ and $X$, given by $f_\theta: \mathbb{R}^n \mapsto \mathbb{R}^n$, is deterministic and invertible such that $X = f_\theta(Z)$ and $Z = f_\theta^{-1}(X)$.
Using change of variables, the marginal likelihood $p(x)$ is given by
$p_X(x; \theta) = p_Z\left(f_\theta^{-1}(x)\right) \left| \det\left( \frac{\partial f_\theta^{-1}(x)}{\partial x} \right) \right|$
Note: $x$, $z$ need to be continuous and have the same dimension.
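
A minimal sketch of this likelihood for one concrete parametric flow; the elementwise affine map $f_\theta(z) = e^{s} \odot z + t$ with $\theta = (s, t)$ is a toy choice of mine, not a model from the lecture.

```python
# Exact log-likelihood of a one-layer elementwise affine flow with a Gaussian prior.
import numpy as np

def log_p_Z(z):
    # Standard Gaussian prior, factorized over dimensions.
    return np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=-1)

def log_p_X(x, s, t):
    z = (x - t) * np.exp(-s)     # z = f_θ^{-1}(x)
    log_det_inv = np.sum(-s)     # log |det ∂f_θ^{-1}(x)/∂x| = -sum(s)
    return log_p_Z(z) + log_det_inv

x = np.array([[1.0, -2.0, 0.5]])
s = np.array([0.3, -0.1, 0.0])
t = np.array([0.5, 0.0, -1.0])
print(log_p_X(x, s, t))
```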

11. A Flow of Transformations
Normalizing: the change of variables formula gives a normalized density after applying an invertible transformation.
Flow: invertible transformations can be composed with each other:
$x \triangleq z_M = f_\theta^M \circ \cdots \circ f_\theta^1(z_0) = f_\theta^M\big(\cdots\big(f_\theta^1(z_0)\big)\big) \triangleq f_\theta(z_0)$
Start with a simple distribution for $z_0$ (e.g., Gaussian).
Apply a sequence of $M$ invertible transformations:
$p_X(x; \theta) = p_Z\left(f_\theta^{-1}(x)\right) \prod_{m=1}^{M} \left| \det\left( \frac{\partial (f_\theta^m)^{-1}(z_m)}{\partial z_m} \right) \right|$
(the determinant of a product equals the product of the determinants)
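
Continuing the same toy affine layers from the previous sketch (an assumption of mine, not the lecture's architecture), this shows how the log-determinants of the individual inverse Jacobians simply accumulate across the composition.

```python
# Log-likelihood under a composition of M invertible layers: invert layer by layer
# and sum the per-layer log |det Jacobian| terms.
import numpy as np

def log_p_Z(z):
    return np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=-1)

# Each layer m is x = exp(s_m) * z + t_m; its inverse contributes -sum(s_m) to the log-det.
layers = [(np.array([0.2, -0.3]), np.array([0.1, 0.0])),
          (np.array([-0.1, 0.4]), np.array([-0.5, 0.2]))]

def log_p_X(x):
    z, log_det = x, 0.0
    for s, t in reversed(layers):   # invert the flow layer by layer, last layer first
        z = (z - t) * np.exp(-s)
        log_det += np.sum(-s)       # log-dets of the inverse Jacobians accumulate
    return log_p_Z(z) + log_det

print(log_p_X(np.array([[0.7, -1.2]])))
```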

12. Planar Flows
Planar flow (Rezende & Mohamed, 2015): invertible transformation
$x = f_\theta(z) = z + u \, h(w^T z + b)$
parameterized by $\theta = (w, u, b)$, where $h(\cdot)$ is a non-linearity.
The absolute value of the determinant of the Jacobian is given by
$\left| \det \frac{\partial f_\theta(z)}{\partial z} \right| = \left| \det\left( I + h'(w^T z + b) \, u w^T \right) \right| = \left| 1 + h'(w^T z + b) \, u^T w \right|$  (matrix determinant lemma)
The parameters and the non-linearity need to be restricted for the mapping to be invertible. For example, $h = \tanh(\cdot)$ and $h'(w^T z + b) \, u^T w \geq -1$.
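
A short sketch of a planar flow layer following the formula on this slide; the specific parameter values below are illustrative, and no invertibility-enforcing reparameterization is included.

```python
# Planar flow forward pass and log |det Jacobian| via the matrix determinant lemma.
import numpy as np

def planar_forward_and_logdet(z, w, u, b):
    # z: (batch, n); w, u: (n,); b: scalar; h = tanh.
    a = z @ w + b                                  # w^T z + b, shape (batch,)
    x = z + np.outer(np.tanh(a), u)                # z + u h(w^T z + b)
    h_prime = 1.0 - np.tanh(a) ** 2                # tanh'(a)
    det = 1.0 + h_prime * (u @ w)                  # 1 + h'(w^T z + b) u^T w
    return x, np.log(np.abs(det))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3))
w, u, b = np.array([0.5, -1.0, 0.2]), np.array([0.3, 0.1, -0.4]), 0.1
x, log_det = planar_forward_and_logdet(z, w, u, b)
print(x.shape, log_det)  # invertibility requires h'(w^T z + b) u^T w >= -1 for all z
```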

13. Planar Flows
Figure (panels labeled "Base distribution: Gaussian" and "Base distribution: Uniform"): 10 planar transformations can transform simple distributions into a more complex one.

14. Learning and Inference
Learning via maximum likelihood over the dataset $\mathcal{D}$:
$\max_\theta \log p_X(\mathcal{D}; \theta) = \sum_{x \in \mathcal{D}} \log p_Z\left(f_\theta^{-1}(x)\right) + \log \left| \det\left( \frac{\partial f_\theta^{-1}(x)}{\partial x} \right) \right|$
Exact likelihood evaluation via the inverse transformation $x \mapsto z$ and the change of variables formula.
Sampling via the forward transformation $z \mapsto x$:
$z \sim p_Z(z), \quad x = f_\theta(z)$
Latent representations inferred via the inverse transformation (no inference network required!):
$z = f_\theta^{-1}(x)$
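
A minimal end-to-end training sketch of these three operations. The flow (a single elementwise affine layer), the synthetic data, and the hyperparameters are all illustrative assumptions of mine, not the lecture's setup.

```python
# Maximum-likelihood training of a toy affine flow, then sampling via the forward map.
import torch

torch.manual_seed(0)
data = 2.0 * torch.randn(1000, 2) + torch.tensor([3.0, -1.0])   # synthetic "data"

s = torch.zeros(2, requires_grad=True)   # log-scale parameters
t = torch.zeros(2, requires_grad=True)   # shift parameters
opt = torch.optim.Adam([s, t], lr=0.05)
prior = torch.distributions.Normal(0.0, 1.0)

for step in range(500):
    z = (data - t) * torch.exp(-s)                       # z = f_θ^{-1}(x)
    log_det = -s.sum()                                   # log |det ∂f_θ^{-1}(x)/∂x|
    loglik = (prior.log_prob(z).sum(dim=1) + log_det).mean()
    opt.zero_grad()
    (-loglik).backward()                                 # maximize the likelihood
    opt.step()

# Sampling: z ~ p_Z, x = f_θ(z).
with torch.no_grad():
    samples = torch.exp(s) * torch.randn(500, 2) + t
    print(samples.mean(dim=0), samples.std(dim=0))       # ≈ [3, -1] and ≈ [2, 2]
```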

15. Desiderata for Flow Models
Simple prior $p_Z(z)$ that allows for efficient sampling and tractable likelihood evaluation, e.g., an isotropic Gaussian.
Invertible transformations with tractable evaluation:
Likelihood evaluation requires efficient evaluation of the $x \mapsto z$ mapping.
Sampling requires efficient evaluation of the $z \mapsto x$ mapping.
Computing likelihoods also requires evaluating determinants of $n \times n$ Jacobian matrices, where $n$ is the data dimensionality.
Computing the determinant of a general $n \times n$ matrix is $O(n^3)$: prohibitively expensive within a learning loop!
Key idea: choose transformations so that the resulting Jacobian matrix has special structure. For example, the determinant of a triangular matrix is the product of the diagonal entries, i.e., an $O(n)$ operation.
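
A small sketch (not from the slides) comparing the $O(n)$ triangular shortcut with a general-purpose determinant routine.

```python
# For a triangular matrix, log |det| is the sum of log |diagonal entries|: an O(n) operation.
import numpy as np

rng = np.random.default_rng(0)
n = 500
J = np.tril(rng.standard_normal((n, n)))           # lower triangular "Jacobian"
J[np.diag_indices(n)] = rng.uniform(0.5, 2.0, n)   # keep the diagonal well away from 0

log_det_fast = np.sum(np.log(np.abs(np.diag(J))))  # O(n)
sign, log_det_general = np.linalg.slogdet(J)       # O(n^3) general-purpose routine
print(np.allclose(log_det_fast, log_det_general))  # True
```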

16. Triangular Jacobian
$x = (x_1, \cdots, x_n) = f(z) = (f_1(z), \cdots, f_n(z))$
$J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_n} \\ \cdots & \cdots & \cdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$
Suppose $x_i = f_i(z)$ only depends on $z_{\leq i}$. Then
$J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & 0 \\ \cdots & \cdots & \cdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$
has lower triangular structure, so its determinant can be computed in linear time.
Similarly, the Jacobian is upper triangular if $x_i$ only depends on $z_{\geq i}$.
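
An illustrative transform with exactly this structure (my own example; the functions `s_fn` and `t_fn` are arbitrary stand-ins for learned networks): $x_i = z_i \, e^{s_i(z_{<i})} + t_i(z_{<i})$, so $x_i$ depends only on $z_{\leq i}$ and the Jacobian is lower triangular with diagonal $e^{s_i}$.

```python
# Autoregressive-style affine transform whose log |det Jacobian| is just sum_i s_i(z_{<i}).
import numpy as np

def s_fn(z_prev):
    # Toy "network": any function of the preceding coordinates works here.
    return 0.5 * np.tanh(np.sum(z_prev))

def t_fn(z_prev):
    return 0.1 * np.sum(z_prev)

def forward(z):
    x, log_det = np.zeros_like(z), 0.0
    for i in range(len(z)):
        s_i, t_i = s_fn(z[:i]), t_fn(z[:i])
        x[i] = z[i] * np.exp(s_i) + t_i
        log_det += s_i                    # i-th diagonal entry of the Jacobian is exp(s_i)
    return x, log_det

z = np.array([0.3, -1.2, 0.7])
x, log_det = forward(z)
print(x, log_det)
```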

17. Designing Invertible Transformations
NICE, or Nonlinear Independent Components Estimation (Dinh et al., 2014), composes two kinds of invertible transformations: additive coupling layers and rescaling layers.
Real-NVP (Dinh et al., 2017)
Inverse Autoregressive Flow (Kingma et al., 2016)
Masked Autoregressive Flow (Papamakarios et al., 2017)
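
As a flavor of these designs, here is a sketch of a NICE-style additive coupling layer; the coupling function `m` and the fixed half-half split are illustrative simplifications, not the paper's architecture. The layer is trivially invertible and its Jacobian is lower triangular with unit diagonal, so $\log |\det| = 0$.

```python
# Additive coupling: keep z1 fixed, shift z2 by an arbitrary function of z1.
import numpy as np

def m(z1):
    # Any (possibly non-invertible) function of z1 is allowed here.
    return np.tanh(z1 @ np.array([[0.5, -1.0], [0.3, 0.2]]))

def coupling_forward(z):
    z1, z2 = z[:, :2], z[:, 2:]
    return np.concatenate([z1, z2 + m(z1)], axis=1)   # log |det Jacobian| = 0

def coupling_inverse(x):
    x1, x2 = x[:, :2], x[:, 2:]
    return np.concatenate([x1, x2 - m(x1)], axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 4))
print(np.allclose(coupling_inverse(coupling_forward(z)), z))  # True
```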
