
12. Unsupervised Deep Learning
CS 535 Deep Learning, Winter 2018
Fuxin Li
With materials from Wanli Ouyang, Zsolt Kira, Lawrence Neal, Raymond Yeh, Junting Lou and Teck-Yian Lim

Unsupervised Learning in General
Unsupervised learning is


  1. Sliced Wasserstein Distance
• For each image, build a Laplacian pyramid
• Sample many patches from these pyramids
• Normalize them by their mean/variance
• Yields R/G/B histograms at each scale
• Measures difference in distributions
• Used as SWD(real_images, generated_images)
• Lower score (more similarity) is better
• Measures realism
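To make the recipe above concrete, here is a minimal NumPy sketch of a sliced Wasserstein distance between two equal-sized sets of patch descriptors (e.g. flattened, normalized Laplacian-pyramid patches). The function name and the specific choices (random projections, sorted 1-D comparisons) are mine, not the course's code; published evaluation implementations differ in details such as patch counts and how scales are averaged.

```python
import numpy as np

def sliced_wasserstein(patches_a, patches_b, n_projections=512, seed=0):
    """Approximate sliced Wasserstein distance between two equal-sized patch sets.

    patches_a, patches_b: (n_patches, descriptor_dim) arrays of flattened,
    mean/variance-normalized patches drawn from the Laplacian pyramids.
    """
    rng = np.random.default_rng(seed)
    dim = patches_a.shape[1]
    # Random unit directions; each one gives a 1-D "slice" of the distributions.
    dirs = rng.normal(size=(dim, n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj_a = np.sort(patches_a @ dirs, axis=0)   # sorted projections = empirical quantiles
    proj_b = np.sort(patches_b @ dirs, axis=0)
    # 1-D Wasserstein distance per slice, averaged over slices.
    return np.mean(np.abs(proj_a - proj_b))
```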

  2. Experiments

  3. Nearest Neighbors Comparison with training set images
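The slide only names the check; as an illustration of what such a comparison typically involves, here is a hypothetical sketch (not from the course) that retrieves, for each generated image, its closest training images by pixel-space L2 distance, the simplest way to see whether samples are merely memorized training examples. Feature-space distances are also commonly used.

```python
import numpy as np

def nearest_training_neighbors(generated, training, k=1):
    """Indices of the k nearest training images (pixel-space L2) for each generated image.

    generated: (n_gen, d) flattened generated images
    training:  (n_train, d) flattened training images
    """
    # ||g - t||^2 = ||g||^2 - 2 g.t + ||t||^2, computed for all pairs at once.
    g2 = np.sum(generated ** 2, axis=1, keepdims=True)   # (n_gen, 1)
    t2 = np.sum(training ** 2, axis=1)[None, :]          # (1, n_train)
    d2 = g2 - 2.0 * (generated @ training.T) + t2
    return np.argsort(d2, axis=1)[:, :k]
```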

  4. Results

  5. Boltzmann Machine (Fully-connected MRF/CRF)
• Undirected graphical model
• Binary values on each variable
• Consider only binary (pairwise) interactions:
  $E(\mathbf{x}; \mathbf{w}, \boldsymbol{\theta}) = -\sum_{i<j} w_{ij} x_i x_j - \sum_i \theta_i x_i$
• General MRF form with features $f_m(\mathbf{x})$ and parameters $\theta_m$:
  $E(\mathbf{x}; \boldsymbol{\theta}) = -\sum_m \theta_m f_m(\mathbf{x}), \qquad P(\mathbf{x}; \boldsymbol{\theta}) = \frac{e^{-E(\mathbf{x}; \boldsymbol{\theta})}}{Z(\boldsymbol{\theta})} = \frac{\exp\!\big(\sum_m \theta_m f_m(\mathbf{x})\big)}{\sum_{\mathbf{x}'} \exp\!\big(\sum_m \theta_m f_m(\mathbf{x}')\big)}$
• Boltzmann machine: $\boldsymbol{\theta} = \{w_{ij}, \theta_i\}$
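As a small sketch of the definitions above (assuming binary states in {0, 1} and a symmetric weight matrix with zero diagonal), the energy and the unnormalized probability can be computed as follows; the function names are mine:

```python
import numpy as np

def boltzmann_energy(x, W, theta):
    """E(x) = -sum_{i<j} w_ij x_i x_j - sum_i theta_i x_i for a binary state x in {0,1}^n.

    W is symmetric with a zero diagonal; the 0.5 factor corrects for counting each pair twice.
    """
    return -(0.5 * x @ W @ x + theta @ x)

def unnormalized_prob(x, W, theta):
    # P(x) is proportional to exp(-E(x)); the partition function Z(theta)
    # would require summing exp(-E(x')) over all 2^n binary configurations x'.
    return np.exp(-boltzmann_energy(x, W, theta))
```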

  6. Restricted Boltzmann Machines
• We restrict the connectivity to make inference and learning easier (hidden units $j$, visible units $i$)
• Only one layer of hidden units
• No connections between hidden units
• In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped:
  $p(h_j = 1) = \dfrac{1}{1 + e^{-\big(b_j + \sum_{i \in \mathrm{vis}} v_i w_{ij}\big)}}$
• So we can quickly get the exact value of $\langle v_i h_j \rangle_v$
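A minimal sketch of these conditionals, assuming a weight matrix W of shape (n_visible, n_hidden) and the {0, 1} binary units above; the function names are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probs(v, W, b_hid):
    """p(h_j = 1 | v) for all hidden units, with the visible units clamped to v.

    v: (n_vis,) binary visible vector, W: (n_vis, n_hid) weights w_ij, b_hid: (n_hid,) biases b_j.
    """
    return sigmoid(b_hid + v @ W)

def visible_probs(h, W, b_vis):
    """p(v_i = 1 | h); the same form holds in the other direction because the model is bipartite."""
    return sigmoid(b_vis + W @ h)
```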

  7. What you gain

  8. Example: ShapeBM (Eslami et al. 2012)
• Generating shapes
• 2-layer RBM with local connections
• Learned from many horse silhouettes

  9. Training: Contrastive divergence
• Start with a training vector on the visible units ($t = 0$, the "data")
• Update all the hidden units in parallel
• Update all the visible units in parallel to get a "reconstruction" ($t = 1$)
• Update the hidden units again
• Weight update: $\Delta w_{ij} = \epsilon \big( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{1} \big)$
• This is not following the gradient of the log likelihood, but it works well
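Putting the four steps together, here is a sketch of one CD-1 update for a binary RBM (my own illustration, not the course's code; following common practice it also updates the biases, which the slide does not show, and uses probabilities rather than samples in the final statistics):

```python
import numpy as np

def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: (n_vis,) training vector clamped on the visible units at t = 0.
    W:  (n_vis, n_hid) weights; b_vis, b_hid: visible/hidden biases.
    Returns updated copies of (W, b_vis, b_hid).
    """
    rng = np.random.default_rng(rng)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # t = 0: hidden units driven by the data.
    p_h0 = sigmoid(b_hid + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # t = 1: reconstruct the visible units, then update the hidden units again.
    p_v1 = sigmoid(b_vis + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b_hid + v1 @ W)

    # Delta w_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1).
    W = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_vis = b_vis + lr * (v0 - v1)
    b_hid = b_hid + lr * (p_h0 - p_h1)
    return W, b_vis, b_hid
```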

  10. Layerwise Pretraining (Hinton & Salakhutdinov, 2006)
• Deep autoencoders always looked like a really nice way to do non-linear dimensionality reduction
• But it is very difficult to optimize deep autoencoders using backpropagation
• We now have a much better way to optimize them:
• First train a stack of 4 RBMs
• Then "unroll" them
• Then fine-tune with backprop
• Encoder: 28x28 input → 1000 neurons ($W_1$) → 500 neurons ($W_2$) → 250 neurons ($W_3$) → 30 linear units ($W_4$)
• Decoder: mirror of the encoder using the transposed weights $W_4^T, W_3^T, W_2^T, W_1^T$ back to 28x28
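For reference, a sketch of the unrolled 784-1000-500-250-30 architecture in PyTorch (my assumption; the slide does not specify a framework). It omits the RBM pretraining and does not initialize the decoder with the transposed encoder weights, both of which the original recipe does before backprop fine-tuning:

```python
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """The unrolled 784-1000-500-250-30 autoencoder from the slide (a sketch)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 1000), nn.Sigmoid(),
            nn.Linear(1000, 500), nn.Sigmoid(),
            nn.Linear(500, 250), nn.Sigmoid(),
            nn.Linear(250, 30),                   # 30 linear code units
        )
        self.decoder = nn.Sequential(
            nn.Linear(30, 250), nn.Sigmoid(),
            nn.Linear(250, 500), nn.Sigmoid(),
            nn.Linear(500, 1000), nn.Sigmoid(),
            nn.Linear(1000, 784), nn.Sigmoid(),   # reconstruct the 28x28 image
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```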
