When do neural networks outperform kernel methods?
Song Mei, Stanford University
June 29, 2020
Joint work with Behrooz Ghorbani, Theodor Misiakiewicz, and Andrea Montanari
Song Mei (Stanford University) Neural Networks and Kernel Methods June 29, 2020 1 / 15

Neural tangent model
◮ Multi-layers NN: $f_N(x; \theta)$, $x \in \mathbb{R}^d$, $\theta \in \mathbb{R}^N$.
◮ Expanding around $\theta_0$:
$$f_N(x; \theta) = f_N(x; \theta_0) + \langle \theta - \theta_0, \nabla_\theta f_N(x; \theta_0) \rangle + o(\|\theta - \theta_0\|_2).$$
◮ Neural tangent model:
$$f_{NT,N}(x; \beta, \theta_0) = \langle \beta, \nabla_\theta f_N(x; \theta_0) \rangle.$$
◮ Coupled gradient flow:
$$\frac{d}{dt} \theta_t = -\nabla_\theta \hat{\mathbb{E}}\big[(y - f_N(x; \theta_t))^2\big], \quad \theta_{t=0} = \theta_0,$$
$$\frac{d}{dt} \beta_t = -\nabla_\beta \hat{\mathbb{E}}\big[(y - f_{NT,N}(x; \beta_t, \theta_0))^2\big], \quad \beta_{t=0} = 0.$$
◮ Under proper initialization and over-parameterization:
$$\lim_{N \to \infty} \big| f_N(x; \theta_t) - f_{NT,N}(x; \beta_t) \big| = 0.$$
[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Chizat, Bach, 2018b], ....
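
The first-order expansion on the neural tangent slide can be checked numerically. The sketch below is illustrative only and is not taken from the talk: it assumes a toy two-layer tanh network, toy sizes `D` and `N`, and a finite-difference gradient (the names `network`, `grad_theta`, `neural_tangent`, and `linearization_error` are hypothetical). It evaluates the remainder $|f_N(x;\theta) - f_N(x;\theta_0) - f_{NT,N}(x;\theta-\theta_0,\theta_0)|$ at two step sizes, which should shrink faster than the step itself.

```python
import math
import random

D, N = 3, 4  # assumed toy sizes: input dimension and number of hidden units


def network(x, theta):
    """Toy f_N(x; theta) = sum_i a_i * tanh(<w_i, x>), theta = (a, w_1, ..., w_N) flattened."""
    a = theta[:N]
    out = 0.0
    for i in range(N):
        w_i = theta[N + i * D: N + (i + 1) * D]
        out += a[i] * math.tanh(sum(wj * xj for wj, xj in zip(w_i, x)))
    return out


def grad_theta(x, theta, eps=1e-6):
    """Central finite-difference approximation of grad_theta f_N(x; theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((network(x, up) - network(x, down)) / (2 * eps))
    return grad


def neural_tangent(x, beta, theta0):
    """f_NT(x; beta, theta0) = <beta, grad_theta f_N(x; theta0)>."""
    return sum(b * g for b, g in zip(beta, grad_theta(x, theta0)))


def linearization_error(x, theta0, direction, step):
    """Remainder of the first-order expansion at theta = theta0 + step * direction."""
    delta = [step * v for v in direction]
    theta = [t0 + dl for t0, dl in zip(theta0, delta)]
    linear_part = network(x, theta0) + neural_tangent(x, delta, theta0)
    return abs(network(x, theta) - linear_part)


rng = random.Random(0)
theta0 = [rng.gauss(0.0, 1.0) for _ in range(N + N * D)]
direction = [rng.gauss(0.0, 1.0) for _ in range(N + N * D)]
x = [0.5, -1.0, 0.25]

# The remainder is o(||theta - theta0||), so it should drop much faster than the step.
err_large = linearization_error(x, theta0, direction, step=1e-2)
err_small = linearization_error(x, theta0, direction, step=1e-3)
print(err_large, err_small)
```

With $\beta = 0$, the initialization used in the coupled gradient flow, `neural_tangent` returns zero, so the tangent model starts from the zero function rather than from $f_N(x;\theta_0)$.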

How about generalization?
◮ [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019]: CIFAR-10 experiments. NT: 23% test error. NN: less than 5% test error.
◮ [Arora, Du, Li, Salakhutdinov, Wang, Yu, 2019]: On small datasets, NT sometimes generalizes better than NN.
◮ [Shankar, Fang, Guo, Fridovich-Keil, Schmidt, Ragan-Kelley, Recht, 2020], [Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora, 2019]: Smaller gap between NT and NN on CIFAR-10 (10% for NT).
Sometimes the gap is large, and sometimes it is small.

Focus of this talk
When is there a large performance gap between NN and NT?

Two-layers neural networks
Neural networks:
$$\mathcal{F}_{NN,N} = \Big\{ f_N(x; \Theta) = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle) : a_i \in \mathbb{R},\ w_i \in \mathbb{R}^d \Big\}.$$
Linearization:
$$f_N(x; \Theta) = f_N(x; \Theta_0) + \underbrace{\sum_{i=1}^{N} \Delta a_i\, \sigma(\langle w_i^0, x \rangle)}_{\text{Top layer linearization}} + \underbrace{\sum_{i=1}^{N} a_i^0\, \sigma'(\langle w_i^0, x \rangle) \langle \Delta w_i, x \rangle}_{\text{Bottom layer linearization}} + o(\Delta).$$
Linearized neural networks ($W = (w_i)_{i \in [N]} \sim_{iid} \mathrm{Unif}(\mathbb{S}^{d-1})$):
$$\mathcal{F}_{RF,N}(W) = \Big\{ f = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle) : a_i \in \mathbb{R},\ i \in [N] \Big\},$$
$$\mathcal{F}_{NT,N}(W) = \Big\{ f = \sum_{i=1}^{N} \sigma'(\langle w_i, x \rangle) \langle b_i, x \rangle : b_i \in \mathbb{R}^d,\ i \in [N] \Big\}.$$
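
The two linearized classes on the two-layers slide are linear models over fixed random features, which a short sketch may make concrete. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the activation is taken to be tanh, the sizes `N` and `d` are toy values, and the helper names (`sample_sphere`, `rf_features`, `nt_features`, `rf_predict`, `nt_predict`) are hypothetical. RF uses the $N$ features $\sigma(\langle w_i, x \rangle)$ with coefficients $a_i$; NT uses the $Nd$ features $\sigma'(\langle w_i, x \rangle)\, x$ with coefficients $b_i \in \mathbb{R}^d$.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sample_sphere(d, rng):
    """Draw w ~ Unif(S^{d-1}) by normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [v / norm for v in g]


def sigma(z):
    return math.tanh(z)  # assumed activation; the slides leave sigma generic


def sigma_prime(z):
    return 1.0 - math.tanh(z) ** 2


def rf_features(x, W):
    """Random-features map: (sigma(<w_i, x>))_{i in [N]}, length N."""
    return [sigma(dot(w, x)) for w in W]


def nt_features(x, W):
    """Neural-tangent map: (sigma'(<w_i, x>) * x)_{i in [N]}, flattened to length N * d."""
    return [sigma_prime(dot(w, x)) * xj for w in W for xj in x]


def rf_predict(x, W, a):
    """f(x) = sum_i a_i * sigma(<w_i, x>), linear in a."""
    return dot(a, rf_features(x, W))


def nt_predict(x, W, B):
    """f(x) = sum_i sigma'(<w_i, x>) * <b_i, x>, linear in B = (b_1, ..., b_N)."""
    flat_b = [bj for b in B for bj in b]
    return dot(flat_b, nt_features(x, W))


rng = random.Random(0)
N, d = 5, 3  # toy sizes
W = [sample_sphere(d, rng) for _ in range(N)]  # frozen first-layer weights
a = [rng.gauss(0.0, 1.0) for _ in range(N)]
B = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
x = sample_sphere(d, rng)

print(len(rf_features(x, W)), len(nt_features(x, W)))
print(rf_predict(x, W, a), nt_predict(x, W, B))
```

Because both predictors are linear in their trainable coefficients, fitting either one with squared loss reduces to a least-squares problem over `rf_features` or `nt_features`; only the full class $\mathcal{F}_{NN,N}$ also moves the weights $w_i$.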