Understanding Noise in Machine Learning Runtian Zhai School of Electronics Engineering and Computer Science School of Mathematical Science (double major) Peking University zhairuntian@pku.edu.cn July 16, 2019 A talk at UCLA. Runtian Zhai (PKU) Understanding Noise July 16, 2019 1 / 45
Introduction: Noise is Everywhere
Noise is everywhere. Since as early as the 1920s, statisticians have been searching for ways to combat noise in collected data. In machine learning the problem is more serious. Training sets are labeled by humans, and humans make mistakes. In computer vision, many images are corrupted, blurred, or compressed. An even more dangerous kind of noise is the adversarial example: crafted noise that aims to fool a particular classifier.
Learning with Noise
People have proposed various ways to learn with noise:
- Some propose detection methods that identify noisy samples in a dataset so that they can be removed.
- Others suggest that even noisy samples can be useful. For instance, co-teaching was proposed to help networks learn on noisy datasets.
- Many defense methods have been proposed to fight adversarial examples. The most successful one so far is adversarial training, which trains on adversarial examples generated on the fly.
Outline
1 How Noise Affects Training
    Critical Learning Periods
    Different Kinds of Noise are Different
    Two Phases of Learning
2 Noise Fits More Slowly Than Clean Data
    Zhang's Experiment and Its Explanation
    Measuring Dataset Complexity
3 More Theoretical Approaches
    Influence Function
    Neural Tangent Kernel
    Fourier Analysis
    Detecting Noise
How Noise Affects Training
Critical Learning Periods
Critical Learning Periods in Deep Networks, Achille et al. (UCLA) [7], ICLR 2019.
In biology, the first several weeks after the birth of a baby animal, known as the critical learning period, are critical for its intellectual development. The same holds in deep learning: if a network is trained on noisy images during the first several epochs, it can never reach high performance, even if it is trained on clean images later on.
Experiment I
To show that this biological behavior also exists in deep learning, the authors ran the following experiment:
- They trained an All-CNN on CIFAR-10.
- During the first N epochs the network was trained on noisy images.
- After that the network was trained on clean images for another 160 epochs.
- The noisy images were blurred images: the 32 × 32 images were first downsampled to 8 × 8 and then upsampled back to 32 × 32 with bilinear interpolation.
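The blur used as the deficit can be sketched in a few lines. This is only an illustration under stated assumptions: the slide specifies bilinear interpolation for the upsampling step, while the downsampling filter, the grayscale list-of-rows image format, and the function names `resize_bilinear` and `blur` are choices made here, not details taken from the paper.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the centre of output row i back to input coordinates, then clamp.
        y = (i + 0.5) * in_h / out_h - 0.5
        y = min(max(y, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = (j + 0.5) * in_w / out_w - 0.5
            x = min(max(x, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
            bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def blur(img, small=8):
    """Downsample to small x small, then upsample back to the original size."""
    h, w = len(img), len(img[0])
    return resize_bilinear(resize_bilinear(img, small, small), h, w)


# Hypothetical 32 x 32 test image: a horizontal intensity ramp.
ramp = [[float(j) for j in range(32)] for _ in range(32)]
blurred = blur(ramp)
```

For a color image the same function would be applied to each channel separately.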
Result: Early Deficit Has Irremediable Negative Effect
Experiment II
- They trained the network on noisy images for 40 epochs starting from epoch N, and on clean images in all other epochs. These 40 epochs are called the deficit window.
- They measured how much the test accuracy decreased for different choices of the window onset N.
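The deficit-window schedule can be written down as a small helper. This is a sketch, not the authors' training code: the names `use_noisy_data` and `schedule` and the total of 200 epochs in the usage lines are hypothetical, and only the 40-epoch window length comes from the slide.

```python
def use_noisy_data(epoch, onset, window=40):
    """True if `epoch` falls inside the deficit window [onset, onset + window)."""
    return onset <= epoch < onset + window


def schedule(total_epochs, onset, window=40):
    """List which kind of data each epoch trains on."""
    return [
        "noisy" if use_noisy_data(epoch, onset, window) else "clean"
        for epoch in range(total_epochs)
    ]


# Hypothetical run: 200 epochs in total, deficit window starting at epoch 20.
plan = schedule(total_epochs=200, onset=20)
```

Sweeping `onset` over a range of values and recording the final test accuracy for each run reproduces the shape of the experiment.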
Result: Early Epochs are More Critical
How Noise Affects Training
Different Kinds of Noise
The authors repeated the first experiment with different kinds of noise:
- Blur: 32 × 32 images are downsampled to 8 × 8 and then upsampled to 32 × 32 with bilinear interpolation.
- Vertical flip: the images are flipped vertically.
- Label permutation: random labels are used.
- Noise: all images are replaced by random noise.
They also tested networks of different depths.
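The three deficits other than blur are simple enough to sketch directly. The function names, the fixed seeds, the 10-class default, and the choice of uniform pixel noise in [0, 1) are assumptions made for illustration; the slide does not specify the noise distribution.

```python
import random


def vertical_flip(img):
    """Flip a 2-D image (list of rows) upside down."""
    return [list(row) for row in reversed(img)]


def random_labels(labels, num_classes=10, seed=0):
    """Replace every label by a label drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in labels]


def random_noise_like(img, seed=0):
    """Replace an image by uniform random pixels of the same shape (assumed distribution)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in row] for row in img]


# Hypothetical toy inputs.
image = [[1, 2], [3, 4]]
flipped = vertical_flip(image)
noisy_labels = random_labels([0, 1, 2, 3])
noise = random_noise_like(image)
```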
Different Kinds of Noise are Different
- For Noise, the effect is not as strong as for Blur.
- For Vertical flip and Label permutation, the effect is very weak.
- The deeper the network, the stronger the effect.
How Noise Affects Training
Two Phases of Learning
The authors carried out a Fisher information analysis of the training process:
- They used the trace of the Fisher Information Matrix (FIM) to measure how much information the network had learned.
- Training has two phases. In Phase I, the trace of the FIM rises quickly, showing that the network is learning. In Phase II, it drops dramatically (while performance is still improving), showing that the network starts to forget.
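To make the quantity being tracked concrete, here is a sketch of the FIM trace for a much simpler model than the one in the paper. It assumes a logistic regression p(y = 1 | x) = sigmoid(w·x) as a stand-in for the deep network; for this model the gradient of log p(y | x) with respect to w is (y − p)x, so averaging its squared norm over y drawn from the model gives p(1 − p)‖x‖². The function name `fisher_trace_logistic` is hypothetical.

```python
import math


def fisher_trace_logistic(w, xs):
    """Trace of the Fisher Information Matrix of a logistic regression model.

    For p = sigmoid(w . x), the expected squared norm of grad_w log p(y | x)
    over y ~ p(. | x) is p * (1 - p) * ||x||^2. The trace is its mean over xs.
    """
    total = 0.0
    for x in xs:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        total += p * (1.0 - p) * sum(xi * xi for xi in x)
    return total / len(xs)


# Hypothetical inputs: zero weights and two small input vectors.
trace = fisher_trace_logistic([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]])
```

Evaluating such a trace after every epoch gives a curve of the kind described above; for a deep network the expectation would be estimated from sampled gradients instead of a closed form.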
Two Phases of Learning (cont.)
- Many other papers [10, 11] also found, from the optimization perspective, that training has two phases.
- It is well known that during training, noise is fit more slowly than clean data.
- Many recent papers argue that in Phase I the network is fitting clean data and in Phase II it is fitting noise, which makes it look as if the network is forgetting useful information.
- This also explains why early stopping works.
Noise Fits More Slowly Than Clean Data
Zhang's Experiment and Its Explanation
Deep Networks Can Fit Random Labels
Understanding Deep Learning Requires Rethinking Generalization, Zhang et al. [5], ICLR 2017.
In this paper, the authors ran the following experiment: they added many kinds of noise to CIFAR-10 (random labels, random pixels, Gaussian noise, etc.) and then trained an Inception model on it. They found that:
- Deep networks fit noisy data easily.
- However, fitting noisy data takes much longer than fitting clean data.
Explaining the Results
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, Arora et al. [4], ICML 2019.
In this paper, the authors prove, for an overparameterized two-layer fully-connected network, that:
- GD (gradient descent) can converge (achieve zero training loss) on datasets with random labels.
- GD converges more slowly on random labels than on clean labels.
- Label noise can harm generalization.
Basic Setting
A two-layer ReLU network with $m$ neurons is
$$f_{W,a}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \, \sigma(w_r^\top x),$$
where $x \in \mathbb{R}^d$ is the input, $W = (w_1, \dots, w_m) \in \mathbb{R}^{d \times m}$ is the weight of the first layer, and $a = (a_1, \dots, a_m)^\top \in \mathbb{R}^m$ is the weight of the second layer.
Assume $\|x\|_2 = 1$ and $|y| \le 1$.
At initialization, $w_r(0) \sim \mathcal{N}(0, \kappa^2 I)$ and $a_r \sim \mathrm{unif}(\{-1, 1\})$.
Fix the second layer $a$ and train only the first layer $W$. Denote by $W(k)$ the value of $W$ at step $k$.
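The setting above translates into a short forward pass. This is a minimal sketch: the names `init_network` and `forward`, the seed, and the sizes in the usage lines are hypothetical, and the first-layer weights are stored as a list of the m vectors w_r (the columns of W).

```python
import math
import random


def init_network(d, m, kappa, seed=0):
    """Sample w_r(0) ~ N(0, kappa^2 I) and a_r ~ unif({-1, +1})."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, kappa) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, a


def forward(W, a, x):
    """f_{W,a}(x) = (1 / sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    m = len(W)
    total = 0.0
    for w_r, a_r in zip(W, a):
        pre_activation = sum(wi * xi for wi, xi in zip(w_r, x))
        total += a_r * max(pre_activation, 0.0)
    return total / math.sqrt(m)


# Hypothetical sizes: input dimension 3, width 1000, unit-norm input.
W, a = init_network(d=3, m=1000, kappa=0.1)
output = forward(W, a, [1.0, 0.0, 0.0])
```

In the analysis only `W` would be updated by gradient descent, while `a` stays at its initial value.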
Trajectory Based Analysis
Use MSE (Mean Square Error) as the loss function:
$$\Phi(W) = \frac{1}{2} \sum_{i=1}^{n} \bigl(y_i - f_{W,a}(x_i)\bigr)^2.$$
Let the trajectory of the network be $u = (u_1, \dots, u_n)^\top$, where $u_i = f_{W,a}(x_i)$. Then the loss function is
$$\Phi(W) = \frac{1}{2} \|y - u\|_2^2, \quad \text{where } y = (y_1, \dots, y_n)^\top.$$
Train with GD with learning rate $\eta$.
Define $H^\infty$ as a Gram matrix:
$$H^\infty_{ij} = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\Bigl[x_i^\top x_j \, \mathbb{I}\{w^\top x_i \ge 0,\ w^\top x_j \ge 0\}\Bigr] = \frac{x_i^\top x_j \bigl(\pi - \arccos(x_i^\top x_j)\bigr)}{2\pi}, \quad \forall i, j \in [n].$$
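Both the loss and the closed form of the Gram matrix are easy to evaluate numerically. The sketch below assumes unit-norm inputs, as in the basic setting; the names `gram_matrix` and `mse_loss` and the two toy inputs are hypothetical.

```python
import math


def gram_matrix(X):
    """H_ij = x_i . x_j * (pi - arccos(x_i . x_j)) / (2 * pi) for unit-norm rows of X."""
    n = len(X)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(p * q for p, q in zip(X[i], X[j]))
            # Guard against rounding pushing the cosine outside [-1, 1].
            dot = max(-1.0, min(1.0, dot))
            H[i][j] = dot * (math.pi - math.acos(dot)) / (2.0 * math.pi)
    return H


def mse_loss(y, u):
    """Phi = 0.5 * ||y - u||_2^2 for labels y and network outputs u."""
    return 0.5 * sum((yi - ui) ** 2 for yi, ui in zip(y, u))


# Hypothetical data: two orthogonal unit vectors.
H = gram_matrix([[1.0, 0.0], [0.0, 1.0]])
loss = mse_loss([1.0, 0.0], [0.0, 0.0])
```

For unit-norm inputs every diagonal entry equals 1/2, since the arccosine of 1 is zero.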
Main Theorem
Assumptions: the initial variance $\kappa^2$ and the learning rate $\eta$ are small enough, and the width $m$ is large enough.
Lemma: Under the above assumptions, during training the real trajectory $\{u(k)\}_{k=0}^{\infty}$ stays close to another sequence $\{\tilde{u}(k)\}_{k=0}^{\infty}$ which has a linear update rule:
$$\tilde{u}(k+1) = \tilde{u}(k) - \eta H^\infty \bigl(\tilde{u}(k) - y\bigr).$$
Subtracting $y$ from both sides gives $\tilde{u}(k+1) - y = (I - \eta H^\infty)(\tilde{u}(k) - y)$, so with $\tilde{u}(0) = 0$ the residual is $\tilde{u}(k) - y = -(I - \eta H^\infty)^k y$. By analyzing the dynamics of $\tilde{u}(k)$ in this way we can prove that
$$\Phi(W(k)) \approx \frac{1}{2} \bigl\| (I - \eta H^\infty)^k y \bigr\|_2^2$$
uniformly for all $k \ge 0$ with high probability.
If $H^\infty$ is positive definite, we can be sure that $\Phi(W(k)) \to 0$ as $k \to \infty$, which implies that GD always converges even if $y$ is random.
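The linear update rule can be simulated directly to see how the label vector affects the speed of convergence. This is a sketch under assumptions: the 2 × 2 matrix `H` is a hypothetical positive definite stand-in for the Gram matrix (eigenvalues 0.75 and 0.25), the sequence starts at zero, and the names `matvec` and `linear_dynamics_losses` are made up for the illustration.

```python
def matvec(H, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def linear_dynamics_losses(H, y, eta, steps):
    """Iterate u(k+1) = u(k) - eta * H (u(k) - y) from u(0) = 0.

    Returns the losses 0.5 * ||y - u(k)||^2 for k = 0, ..., steps.
    """
    u = [0.0] * len(y)
    losses = []
    for _ in range(steps + 1):
        losses.append(0.5 * sum((yi - ui) ** 2 for yi, ui in zip(y, u)))
        gradient = matvec(H, [ui - yi for ui, yi in zip(u, y)])
        u = [ui - eta * g for ui, g in zip(u, gradient)]
    return losses


# Hypothetical Gram matrix with eigenvectors (1, 1) and (1, -1).
H = [[0.5, 0.25], [0.25, 0.5]]
fast = linear_dynamics_losses(H, [1.0, 1.0], eta=1.0, steps=3)   # top eigenvector
slow = linear_dynamics_losses(H, [1.0, -1.0], eta=1.0, steps=3)  # bottom eigenvector
```

Labels along the top eigenvector shrink the loss by a factor of (1 − 0.75)² per step, while labels along the bottom eigenvector shrink it only by (1 − 0.25)². Both losses still tend to zero, which matches the claim that GD converges for any labels when the Gram matrix is positive definite, only at different speeds.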