The landscape of empirical risk for non-convex losses
Song Mei, ICME, Stanford
December 3, 2016
Joint work with Yu Bai and Andrea Montanari
Binary linear classification
The model: $Z_i = (X_i, Y_i)$, with $X_i \in \mathbb{R}^d$, $Y_i \in \{0, 1\}$, $i = 1, \dots, n$.
Non-convex formulation of binary classification
The model: $Z_i = (X_i, Y_i)$, with $X_i \in \mathbb{R}^d$, $Y_i \in \{0, 1\}$, $i = 1, \dots, n$.
◮ Convex logit loss ($\ell_c$ is convex in $\theta$):
$$\ell_c(\theta; Z) = Y \langle X, \theta \rangle - \log\big(1 + \exp(\langle X, \theta \rangle)\big).$$
◮ Non-convex loss ($\ell$ is not convex in $\theta$):
$$\ell(\theta; Z) = \big(Y - \sigma(\langle X, \theta \rangle)\big)^2, \quad \text{where } \sigma(t) = 1 / (1 + \exp(t)).$$
◮ Empirical risk:
$$\widehat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; Z_i).$$
◮ Empirical risk minimizer:
$$\hat{\theta}_n = \arg\min_{\theta \in B^d(R)} \widehat{R}_n(\theta).$$
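The non-convex loss and the empirical risk defined on this slide can be written out in a few lines. The following is a minimal NumPy sketch, not code from the talk: the function names and the three-point toy dataset are assumptions made for illustration, and `sigma` follows the formula as it appears on the slide.

```python
import numpy as np


def sigma(t):
    # Link function as written on the slide: sigma(t) = 1 / (1 + exp(t)).
    return 1.0 / (1.0 + np.exp(t))


def nonconvex_loss(theta, X, Y):
    # Per-sample squared loss (Y_i - sigma(<X_i, theta>))^2, returned as shape (n,).
    return (Y - sigma(X @ theta)) ** 2


def empirical_risk(theta, X, Y):
    # Empirical risk: the average of the per-sample losses.
    return float(np.mean(nonconvex_loss(theta, X, Y)))


# Hypothetical toy dataset with n = 3 samples in d = 2 dimensions.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 0.0, 1.0])

# At theta = 0 every prediction is sigma(0) = 0.5, so each loss is 0.25.
risk_at_zero = empirical_risk(np.zeros(2), X, Y)
print(risk_at_zero)
```

Because each per-sample loss lies in [0, 1], the empirical risk is bounded as well, which is one informal way to see why a single mislabeled point cannot dominate it.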
Why use non-convex loss?
◮ Compared with logistic regression, the non-convex formulation is robust to outliers.
◮ This model is the same as a neural network with a single layer and a single node.
A negative theoretical result
Theorem (Auer et al., 1996 [AHW+96]). For the non-convex binary classification problem, for any $n$ and $d$, there exists a dataset $(x_i, y_i)_{i=1}^{n}$ such that the empirical risk $\widehat{R}_n(\theta)$ has $\lfloor n/d \rfloor^{d}$ distinct local minima.
This seems to imply that the landscape of the non-convex empirical risk $\widehat{R}_n(\theta)$ is very rough.
Is this the end of the world for non-convex binary classification?
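For a sense of scale, the count of local minima in the theorem can be evaluated at concrete sizes. The values $n = 100$ and $d = 5$ below are illustrative assumptions, not numbers from the talk.

```latex
\left\lfloor \frac{n}{d} \right\rfloor^{d}
  = \left\lfloor \frac{100}{5} \right\rfloor^{5}
  = 20^{5}
  = 3{,}200{,}000 .
```

Even for a small problem, a worst-case dataset can therefore produce millions of distinct local minima.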
Non-convex formulation of binary classification
On real data, we "always" observe a unique minimum!
Why? Data generated by nature is not against us!
A negative theoretical result
Theorem (Auer et al., 1996 [AHW+96]). For the non-convex binary classification problem, for all $n > 0$, there exists a dataset $(x_i, y_i)_{i=1}^{n}$ such that the empirical risk $\widehat{R}_n(\theta)$ has $\lfloor n/d \rfloor^{d}$ distinct local minima.
This seems to imply that the landscape of the non-convex empirical risk $\widehat{R}_n(\theta)$ is very rough.
Our main positive result
Theorem (Mei, Bai, Montanari, 2016 [MBM16]). Assume the $X_i$ are i.i.d. sub-Gaussian random vectors, and the $Y_i$ are generated via $\mathbb{P}(Y_i = 1 \mid X_i) = \sigma(\langle X_i, \theta_0 \rangle)$. Then there exists a constant $C$ depending on $\delta$ such that, as long as $n \ge C d \log d$, the following happens with probability at least $1 - \delta$:
(a) $\widehat{R}_n(\theta)$ has a unique local minimizer $\hat{\theta}_n$ in $B^d(0, R)$.
(b) $\hat{\theta}_n$ satisfies $\|\hat{\theta}_n - \theta_0\|_2 \le C \sqrt{(d \log n)/n}$.
(c) Gradient descent converges exponentially fast to $\hat{\theta}_n$.
The landscape of the non-convex empirical risk $\widehat{R}_n(\theta)$ is actually smooth!
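Claims (a) to (c) can be probed numerically on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' experiment: it draws standard Gaussian features (one sub-Gaussian choice), uses plain unconstrained gradient descent rather than any projection onto the ball, and the sample size, dimension, step size, iteration count and number of starting points are all arbitrary choices. `sigma` follows the formula as it appears in the talk.

```python
import numpy as np


def sigma(t):
    # Link function as written in the talk: sigma(t) = 1 / (1 + exp(t)).
    return 1.0 / (1.0 + np.exp(t))


def risk_gradient(theta, X, Y):
    # Gradient of (1/n) * sum_i (Y_i - sigma(<X_i, theta>))^2.
    # With s = sigma(u), sigma'(u) = -s * (1 - s), so
    # d/dtheta (Y - s)^2 = -2 (Y - s) sigma'(u) X = 2 (Y - s) s (1 - s) X.
    s = sigma(X @ theta)
    weights = 2.0 * (Y - s) * s * (1.0 - s)
    return X.T @ weights / len(Y)


def gradient_descent(theta_init, X, Y, step=1.0, iters=3000):
    # Plain fixed-step gradient descent; step and iters are assumed values.
    theta = theta_init.copy()
    for _ in range(iters):
        theta = theta - step * risk_gradient(theta, X, Y)
    return theta


rng = np.random.default_rng(0)
n, d, n_starts = 5000, 5, 5

# Ground-truth parameter theta_0 with unit norm.
theta0 = rng.standard_normal(d)
theta0 /= np.linalg.norm(theta0)

# Data from the model P(Y = 1 | X) = sigma(<X, theta_0>).
X = rng.standard_normal((n, d))
Y = (rng.random(n) < sigma(X @ theta0)).astype(float)

# Random starting points on the sphere of radius 2.
starts = rng.standard_normal((n_starts, d))
starts *= 2.0 / np.linalg.norm(starts, axis=1, keepdims=True)

ends = np.array([gradient_descent(s, X, Y) for s in starts])

# Largest distance between end points, and distance of the first one to theta_0.
spread = max(np.linalg.norm(e - ends[0]) for e in ends)
error = np.linalg.norm(ends[0] - theta0)
print("spread between runs:", spread)
print("distance to theta_0:", error)
```

If all runs end at essentially the same point, and that point is close to $\theta_0$, the behaviour is consistent with a unique local minimizer near the truth. A handful of random starts is of course only a sanity check, not evidence of uniqueness.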
Why does assuming a statistical model make the landscape of the empirical risk smooth?
1. Assuming a statistical model $Z_i \overset{\text{i.i.d.}}{\sim} P_Z$, $i = 1, \dots, n$, we can define the population risk
$$R(\theta) = \mathbb{E}_Z\big[\widehat{R}_n(\theta)\big] = \mathbb{E}_Z\Big[\frac{1}{n} \sum_{i=1}^{n} \ell(\theta; Z_i)\Big].$$
The population risk is usually very smooth.
2. We can transfer the good properties of the population risk to the empirical risk using a uniform convergence argument. So the empirical risk will also be smooth.
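The second step can be illustrated by comparing the empirical risk with an approximation of the population risk as $n$ grows. This is a rough sketch under several assumptions: the population risk is approximated by a large Monte Carlo sample instead of being computed exactly, the features are standard Gaussian, and the gap is measured only on 50 randomly chosen values of $\theta$, which is much weaker than a true supremum over the ball. All sizes and names are illustrative, and `sigma` follows the formula as it appears in the talk.

```python
import numpy as np


def sigma(t):
    # Link function as written in the talk: sigma(t) = 1 / (1 + exp(t)).
    return 1.0 / (1.0 + np.exp(t))


def risk(theta, X, Y):
    # Average squared loss (Y_i - sigma(<X_i, theta>))^2 over the given sample.
    return float(np.mean((Y - sigma(X @ theta)) ** 2))


def sample(rng, n, theta0):
    # Draw n samples from the model P(Y = 1 | X) = sigma(<X, theta_0>).
    X = rng.standard_normal((n, len(theta0)))
    Y = (rng.random(n) < sigma(X @ theta0)).astype(float)
    return X, Y


def max_gap(n, thetas, theta0, X_pop, Y_pop, rng):
    # Largest |empirical risk - approximate population risk| over the test points.
    X, Y = sample(rng, n, theta0)
    return max(abs(risk(t, X, Y) - risk(t, X_pop, Y_pop)) for t in thetas)


rng = np.random.default_rng(0)
d = 5
theta0 = np.ones(d) / np.sqrt(d)

# Large sample standing in for the population expectation (an approximation).
X_pop, Y_pop = sample(rng, 200_000, theta0)

# Finite set of test parameters, drawn uniformly from the cube [-1, 1]^d.
thetas = rng.uniform(-1.0, 1.0, size=(50, d))

gaps = {n: max_gap(n, thetas, theta0, X_pop, Y_pop, rng) for n in (50, 500, 5000)}
for n, gap in gaps.items():
    print(n, gap)
```

The printed gaps should shrink as $n$ increases, which is the qualitative behaviour a uniform convergence argument makes precise.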