
Unsupervised Learning
Shan-Hung Wu (shwu@cs.nthu.edu.tw), Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


  1. Convolutional Autoencoders: convolution + deconvolution layers. The decoder is a simplified DeconvNet [28] trained from scratch: unpooling → upsampling (no need to remember max positions), and deconvolution → convolution.
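
A minimal sketch of such a decoder, assuming PyTorch; the layer sizes and the 32×32 single-channel output are illustrative assumptions, not taken from the slides. Upsampling replaces unpooling (no max positions to remember) and plain convolutions replace deconvolutions:

```python
import torch.nn as nn

# Hypothetical decoder: maps an 8x8x8 code volume back to a 1x32x32 image.
decoder = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),   # 8x8 -> 16x16 (upsampling instead of unpooling)
    nn.Conv2d(8, 16, kernel_size=3, padding=1),    # convolution instead of deconvolution
    nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'),   # 16x16 -> 32x32
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),                                  # pixel intensities in [0, 1]
)
```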

  2. Codes & Reconstructed x: a 32-bit code can roughly represent a 32 × 32 (1024-dimensional) MNIST image.

  3. Manifolds I: in many applications, data concentrate around one or more low-dimensional manifolds. A manifold is a topological space that is locally linear.

  4. Manifolds II: for each point x on a manifold, we have its tangent space spanned by tangent vectors. These local directions specify how one can change x infinitesimally while staying on the manifold.

  5. Learning Manifolds I: how can we make the code c produced by an autoencoder denote a coordinate on a low-dimensional manifold? Contractive autoencoder [20]: regularize the code c so that it is invariant to local changes of x, using the penalty Ω(c) = ∑_n ‖∂c^(n)/∂x^(n)‖²_F, where ∂c^(n)/∂x^(n) is a Jacobian matrix. Hence c represents only the variations needed to reconstruct x, i.e., c changes most along the tangent vectors.
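
A minimal sketch of the contractive penalty, assuming PyTorch; the one-layer encoder and the code size are hypothetical, and the Jacobian is assembled row by row with autograd, which is fine for small code dimensions but not optimized:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid())   # hypothetical encoder c(x)

def contractive_penalty(x):
    """Omega(c) = sum_n || dc^(n)/dx^(n) ||_F^2, computed with autograd."""
    x = x.clone().requires_grad_(True)
    c = encoder(x)                                  # shape (N, 32)
    penalty = 0.0
    for j in range(c.shape[1]):                     # one Jacobian row per code unit
        grads = torch.autograd.grad(c[:, j].sum(), x, create_graph=True)[0]
        penalty = penalty + (grads ** 2).sum()      # accumulate the squared Frobenius norm
    return penalty

# total loss = reconstruction_loss + lambda * contractive_penalty(x)
```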

  6. Learning Manifolds II: in practice, it is easier to train a denoising autoencoder [26]. Encoder: encodes x corrupted by random noise. Decoder: reconstructs x without the noise.
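
A minimal denoising-autoencoder training step, assuming PyTorch; the network shapes and the Gaussian noise scale are illustrative assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x):                                  # x: (N, 784) clean inputs
    x_noisy = x + 0.3 * torch.randn_like(x)         # corrupt the input ...
    x_hat = decoder(encoder(x_noisy))               # ... but reconstruct the clean target
    loss = ((x_hat - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```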

  7. Getting Tangent Vectors I: the code c represents a coordinate on a low-dimensional manifold. How do we get the tangent vectors at a given c?

  8. Getting Tangent Vectors II: recall that the directions in the input space that change c most should be tangent vectors. Given a point x, let c be the code of x and J(x) = ∂c/∂x be the Jacobian matrix of c at x; J(x) summarizes how c changes in terms of x. Decompose J(x) with the SVD J(x) = UDV⊤, and take as tangent vectors the right singular vectors (rows of V⊤) corresponding to the largest singular values in D.
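
A minimal sketch, assuming PyTorch and a hypothetical encoder, of extracting tangent directions from the SVD of the code's Jacobian at a single point:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid())     # hypothetical encoder c(x)

def tangent_vectors(x, k=2):
    """Top-k tangent directions at x, from J(x) = dc/dx = U D V^T."""
    J = torch.autograd.functional.jacobian(encoder, x)        # (code_dim, input_dim) Jacobian
    U, S, Vt = torch.linalg.svd(J, full_matrices=False)       # singular values come sorted descending
    return Vt[:k]                                              # rows of V^T for the largest singular values

x = torch.rand(784)
v = tangent_vectors(x, k=2)   # two approximate tangent vectors (input-space directions) at x
```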

  9. Getting Tangent Vectors III: in practice, J(x) usually has only a few large singular values. The tangent vectors found by contractive/denoising autoencoders can be used by Tangent Prop [23]: let {v^(i,j)}_j be the tangent vectors of each example x^(i), and train an NN classifier f with the cost penalty Ω[f] = ∑_{i,j} (∇_x f(x^(i))⊤ v^(i,j))², under the assumption that points on the same manifold share the same label.
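
A minimal sketch of this penalty, assuming PyTorch, a classifier f that outputs one scalar score per example, and precomputed per-example tangent vectors; all names here are illustrative:

```python
import torch

def tangent_prop_penalty(f, x, vs):
    """Omega[f] = sum_{i,j} (grad_x f(x^(i))^T v^(i,j))^2.

    x: (N, d) inputs; vs: (N, J, d) tangent vectors for each example."""
    x = x.clone().requires_grad_(True)
    out = f(x).sum()                                           # per-example scores, summed to a scalar
    g = torch.autograd.grad(out, x, create_graph=True)[0]      # (N, d) per-example input gradients
    proj = torch.einsum('nd,njd->nj', g, vs)                   # directional derivative along each tangent
    return (proj ** 2).sum()

# total loss = classification_loss + lambda * tangent_prop_penalty(f, x, vs)
```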

  10. Outline: 1 Unsupervised Learning, 2 Self-Supervised Learning, 3 Autoencoders & Manifold Learning, 4 Generative Adversarial Networks (The Basics, Challenges, More GANs).

  11. Decoder as Data Generator: the decoder of an autoencoder can be used to generate data points, even from synthetic codes. Problems: same c, same output → dropout layers, variational autoencoders [9]; blurry images.

  12. Why Blurry Images? The cost function is argmin_Θ −log P(X | Θ) = argmin_Θ −∑_n log P(x^(n) | Θ). For image generation with linear output units, a^(L) = z^(L) = μ̂ for x ∼ N(μ, Σ), so −log P(x^(n) | Θ) = ‖x^(n) − a^(L)‖². Would a better assumed distribution for x help? P(x) may be very complex. A better "goodness" measure? Why not use an NN to tell whether a generated image is of good quality?
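
A quick numerical check, assuming NumPy/SciPy, that the Gaussian negative log-likelihood with identity covariance is the (halved) squared reconstruction error plus a constant, which is why this cost averages plausible outputs into blurry images:

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 4
x, mu = np.random.rand(d), np.random.rand(d)
nll = -multivariate_normal(mean=mu, cov=np.eye(d)).logpdf(x)   # -log N(x; mu, I)
print(np.isclose(nll, 0.5 * np.sum((x - mu) ** 2) + 0.5 * d * np.log(2 * np.pi)))   # True
```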

  13. Outline: 1 Unsupervised Learning, 2 Self-Supervised Learning, 3 Autoencoders & Manifold Learning, 4 Generative Adversarial Networks (The Basics, Challenges, More GANs).

  14. Generative Adversarial Networks (GANs) [4]. Generator g: generates data points from random codes; no "encoder" is needed, since the task is data synthesis. Discriminator f: separates generated points from real ones; the weights applied to x and x̂ are tied, and f is a binary classifier with sigmoid output unit a^(L) = ρ̂ for P(y = true point | x) ∼ Bernoulli(ρ). Goal: train a g that tricks f into believing g(c) is real.

  15. Cost Function: given N real training points and N generated points,
        argmin_Θg max_Θf log P(X | Θg, Θf) = argmin_Θg max_Θf ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))
                                           = argmin_Θg max_Θf ∑_{n=1}^{N} log ρ̂^(n) + ∑_{m=1}^{N} log(1 − ρ̂^(m)).
      Recall that f maximizes the log likelihood log P(X | Θ) ∝ ∑_n log P(y^(n) | x^(n), Θ) = ∑_n log[(ρ̂^(n))^{y^(n)} (1 − ρ̂^(n))^{1 − y^(n)}].
      Inner max first, then outer min: ρ̂^(n) depends on Θf only, while ρ̂^(m) depends on both Θf and Θg.
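
A minimal sketch, assuming NumPy, of evaluating the two players' objectives given the discriminator's sigmoid outputs; `d_real` and `d_fake` stand for f(x^(n)) and f(g(c^(m))) and are illustrative placeholders:

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """What f maximizes: the Bernoulli log-likelihood with y=1 on real points, y=0 on fakes."""
    return np.sum(np.log(d_real)) + np.sum(np.log(1.0 - d_fake))

def generator_objective(d_fake):
    """What g minimizes: only this second term depends on Theta_g."""
    return np.sum(np.log(1.0 - d_fake))

d_real = np.array([0.9, 0.8, 0.95])   # f's outputs on real points
d_fake = np.array([0.2, 0.3, 0.1])    # f's outputs on generated points
print(discriminator_objective(d_real, d_fake), generator_objective(d_fake))
```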

  16. Training: Alternating SGD for argmin_Θg max_Θf ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m)))).
      Initialize Θg for g and Θf for f. At each SGD step/iteration:
        1. Repeat K times (with Θg fixed):
           a. Sample N real points {x^(n)}_n from X and N codes c^(m) ∼ N(0, I)
           b. Θf ← Θf + η ∇_Θf [∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))]
        2. Execute once (with Θf fixed):
           a. Sample N codes c^(m) ∼ N(0, I)
           b. Θg ← Θg − η ∇_Θg [∑_m log(1 − f(g(c^(m))))]
      Why limit the number of steps K when updating Θf? f may overfit the data and give very different values once g is updated; limiting K prevents g from being updated toward a "wrong" target.
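
A minimal sketch of the alternating updates, assuming PyTorch; the MLP generator/discriminator shapes, K, and the learning rate are illustrative assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
f = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_f = torch.optim.SGD(f.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(g.parameters(), lr=1e-3)

def gan_step(real_x, K=1, N=64):
    # 1. K ascent steps on Theta_f (Theta_g fixed): maximize log f(x) + log(1 - f(g(c)))
    for _ in range(K):
        c = torch.randn(N, 32)
        loss_f = -(torch.log(f(real_x)).sum() + torch.log(1 - f(g(c).detach())).sum())
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    # 2. One descent step on Theta_g: minimize log(1 - f(g(c))) (only opt_g steps, so f stays fixed)
    c = torch.randn(N, 32)
    loss_g = torch.log(1 - f(g(c))).sum()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```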

  17. Results: with a domain-specific architecture, e.g., DC-GAN [18].

  18. GANs Are Hard to Train! See, e.g., "Tips for Training Stable GANs", "Keep Calm and train a GAN. Pitfalls and Tips...", "10 Lessons I Learned Training GANs for one Year", and the "GAN hacks" repository on GitHub.

  19. Outline: 1 Unsupervised Learning, 2 Self-Supervised Learning, 3 Autoencoders & Manifold Learning, 4 Generative Adversarial Networks (The Basics, Challenges, More GANs).

  20. Challenge: Non-Convergence. GAN training may not converge: the goal of a GAN is to find a saddle point of argmin_Θg max_Θf ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m)))), and the updates to Θf and Θg may cancel each other's progress, so training requires human monitoring and termination.

  21. Mode Collapsing. Even worse: mode collapse, where g oscillates from generating one kind of point to generating another. When K is small, alternating SGD does not distinguish between min_Θg max_Θf and max_Θf min_Θg of ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m)))). In the max_Θf min_Θg case, g is encouraged to map every code to the single "mode" that f currently believes is most likely to be real.

  22. Solutions. Minibatch discrimination [22]: in the max_Θf min_Θg case, g collapses because ∇_Θf C is computed independently for each point, so why not augment each x^(n) / x̂^(n) with batch features? If g collapses, f can tell this from the batch features and reject the fake points; now g needs to generate dissimilar points to fool f (see the sketch below). Unrolled GANs [15]: back-propagate through several max steps when computing ∇_Θg C.
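
A minimal sketch of the batch-feature idea, assuming PyTorch; it appends a simple statistic (mean pairwise L1 distance within the batch) rather than the learned tensor features of [22], so it illustrates batch-feature augmentation, not a faithful reimplementation of minibatch discrimination:

```python
import torch

def append_batch_feature(x):
    """x: (N, d). Append, per point, its mean L1 distance to the other points in the batch.

    A collapsed batch (near-identical fakes) yields near-zero features,
    which the discriminator can learn to reject."""
    dists = torch.cdist(x, x, p=1)                   # (N, N) pairwise L1 distances
    n = x.shape[0]
    mean_dist = dists.sum(dim=1, keepdim=True) / (n - 1)
    return torch.cat([x, mean_dist], dim=1)          # (N, d + 1)

# The discriminator then takes inputs of size d + 1 instead of d.
```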

  23. Challenge: Balance between g and f. For argmin_Θg max_Θf ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m)))), alternating SGD performs
        Θf ← Θf + η ∇_Θf [∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))], repeated K times,
        Θg ← Θg − η ∇_Θg [∑_m log(1 − f(g(c^(m))))].
      Why limit K when updating Θf? Too large a K: f may overfit the data, making g update toward a "wrong" target f, and the gradients vanish: ∇_Θg [∑_m log(1 − f(g(c^(m))))] becomes too small to learn from (see the illustration below). Too small a K: g is updated toward a "meaningless" f.
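
A tiny numerical illustration, assuming PyTorch, of the vanishing-gradient problem when f is too strong: once f confidently scores a generated point near 0, the gradient of log(1 − f(g(c))) with respect to that point's logit is approximately −f(g(c)), which is almost zero:

```python
import torch

a = torch.tensor(-8.0, requires_grad=True)     # logit of a generated sample under a confident f
loss_g = torch.log(1 - torch.sigmoid(a))       # the term g descends on
loss_g.backward()
print(torch.sigmoid(a).item(), a.grad.item())  # f(g(c)) ~ 3e-4 and gradient ~ -3e-4: almost no learning signal
```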

  24. Solution: Wasserstein GAN [1]. Let f be a regressor (a critic) without the sigmoid output layer, with cost function argmin_Θg max_Θf ∑_n f(x^(n)) − ∑_m f(g(c^(m))). Initialize Θg for g and Θf for f, then at each SGD step/iteration: …
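
A minimal sketch of one critic/generator update under this objective, assuming PyTorch; the weight-clipping constant and RMSprop learning rate follow the recipe in the WGAN paper [1], while the network shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
f = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))   # no sigmoid: a critic/regressor
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(g.parameters(), lr=5e-5)

def wgan_step(real_x, N=64, clip=0.01):
    # Critic ascent on sum f(x) - sum f(g(c)), implemented as descent on the negation
    c = torch.randn(N, 32)
    loss_f = -(f(real_x).sum() - f(g(c).detach()).sum())
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    for p in f.parameters():                      # weight clipping keeps f roughly Lipschitz [1]
        p.data.clamp_(-clip, clip)
    # Generator descent: minimize -sum f(g(c)), i.e., make the critic score fakes higher
    c = torch.randn(N, 32)
    loss_g = -f(g(c)).sum()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```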
