CS11-747 Neural Networks for NLP Models w/ Latent Random Variables Chunting Zhou Site https://phontron.com/class/nn4nlp2019/ With Slides from Graham Neubig
Discriminative vs. Generative Models • Discriminative model: calculate the probability of output given input P(Y|X) • Generative model: calculate the probability of a variable P(X), or multiple variables P(X,Y) • Which of the following models are discriminative vs. generative? • Standard BiLSTM POS tagger • Globally normalized CRF POS tagger • Language model
Types of Variables • Observed vs. Latent: • Observed: something that we can see from our data, e.g. X or Y • Latent: a variable that we assume exists, but we aren’t given the value • Deterministic vs. Random: • Deterministic: variables that are calculated directly according to some deterministic function • Random (stochastic): variables that obey a probability distribution, and may take any of several (or infinite) values
Quiz: What Types of Variables? • In an attentional sequence-to-sequence model trained with MLE/teacher forcing, are the following variables observed or latent? Deterministic or random? • The input word ids f • The encoder hidden states h • The attention values a • The output word ids e
Latent Variable Models • A latent variable model (LVM) is a probability distribution p(x, z; θ) over two sets of variables x and z, where the x variables are observed at learning time in a dataset and the z variables are latent.
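To connect this definition to learning, the quantity we usually care about for the observed data is the marginal likelihood, obtained by integrating (or summing, for discrete z) the joint over the latent variables. This is a standard identity rather than something specific to these slides:

```latex
% Marginal likelihood of an observation x under an LVM p(x, z; \theta):
% the latent variable z is integrated out (summed out if z is discrete).
\[
  p(x; \theta) \;=\; \int p(x, z; \theta)\, dz
\]
```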
What is a Latent Random Variable Model? • Older latent variable models: • Topic models (unsupervised) • Hidden Markov Models (unsupervised tagging) • Some tree-structured models (unsupervised parsing)
Why Latent Variable Models? • Some variables are not naturally observed, and we want to model / infer these hidden variables: e.g. the topics of an article • Specify structural relationships among the unknown variables, to learn interpretable structure: - Inject inductive bias / prior knowledge
Deep Structured Latent Variable Models • Specify structure, but interpretable structure is often discrete: e.g. POS tags, dependency parse trees • There is always a tradeoff between interpretability and flexibility: model constraints vs. model capacity
Examples of Deep Latent Variable Models • Deep latent variable models • Variational Autoencoders (VAEs) • Generative Adversarial Networks (GANs) • Flow-based generative models
Variational Auto-encoders (Kingma and Welling 2014)
A Latent Variable Model • We observe an output x (assume a continuous vector for now) • We have a latent variable z generated from a Gaussian • We have a function f, parameterized by Θ, that maps from z to x; this function is usually a neural net • Generative story (shown as a plate diagram on the slide): z ∼ N(0, I), x = f(z; Θ)
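A minimal sketch of this generative story in PyTorch (not from the slides; the layer sizes and the choice of a two-layer feed-forward net for f are arbitrary assumptions):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 32

# f(z; Θ): a deterministic neural network mapping the latent z into data space
f = nn.Sequential(
    nn.Linear(latent_dim, 64),
    nn.Tanh(),
    nn.Linear(64, data_dim),
)

z = torch.randn(latent_dim)  # z ~ N(0, I): sample the latent from a standard Gaussian
x = f(z)                     # x = f(z; Θ): deterministically decode the latent into an observation
```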
An Example (Doersch 2016) [Figure: a latent z is mapped through f to a data point x]
A Probabilistic Perspective on Variational Auto-Encoders • For each datapoint i: • Draw latent variables z_i ∼ p(z) (prior) • Draw the data point x_i ∼ p_θ(x | z) • Joint probability distribution over data and latent variables: p(x, z) = p(z) p_θ(x | z)
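A minimal sketch of this factorization, under the extra assumptions (not stated on the slide) that p(z) = N(0, I) and p_θ(x | z) = N(f(z), I), with f the decoder network from the sketch above:

```python
import torch
from torch.distributions import Normal

def log_joint(x, z, f):
    """log p(x, z; θ) = log p(z) + log p_θ(x | z) under the Gaussian assumptions above."""
    log_p_z = Normal(0.0, 1.0).log_prob(z).sum()           # standard normal prior over z
    log_p_x_given_z = Normal(f(z), 1.0).log_prob(x).sum()  # unit-variance Gaussian likelihood around f(z)
    return log_p_z + log_p_x_given_z
```

For example, log_joint(x, z, f) with the x, z, and f from the earlier sketch evaluates the log of the joint p(x, z; θ) for that single sampled pair.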
What is Our Loss Function? • We would like to maximize the corpus log likelihood: log P(X) = Σ_{x ∈ X} log P(x; θ) • For a single example, the marginal likelihood is P(x; θ) = ∫ P(x | z; θ) P(z) dz • We can approximate this by sampling z's from the prior and averaging: P(x; θ) ≈ (1/|S(x)|) Σ_{z ∈ S(x)} P(x | z; θ), where S(x) := {z′ ; z′ ∼ P(z)}
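A minimal sketch of this naive Monte Carlo approximation, reusing the decoder f and the Gaussian assumptions from the sketches above, and working in log space for numerical stability (the log-mean-exp trick is an implementation choice, not something from the slides):

```python
import math
import torch
from torch.distributions import Normal

def log_marginal_estimate(x, f, num_samples=100, latent_dim=8):
    """Estimate log P(x; θ) ≈ log [ (1/S) Σ_s P(x | z_s; θ) ] with z_s ~ P(z) = N(0, I)."""
    zs = torch.randn(num_samples, latent_dim)                 # draw S samples z_s from the prior
    log_p_x_given_z = Normal(f(zs), 1.0).log_prob(x).sum(-1)  # log P(x | z_s; θ) for each sample
    # log-mean-exp over the samples
    return torch.logsumexp(log_p_x_given_z, dim=0) - math.log(num_samples)
```

In practice, sampling z from the prior gives a high-variance estimate, which is one motivation for the variational (encoder-based) approach that VAEs introduce.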