
NeuroComp Machine Learning and Validation, Michèle Sebag (PowerPoint presentation)



  1. NeuroComp Machine Learning and Validation
     Michèle Sebag
     http://tao.lri.fr/tiki-index.php
     Nov. 16th, 2011

  2. Validation, the questions
     1. What is the result?
     2. My results look good. Are they?
     3. Does my system outperform yours?
     4. How to set up my system?

  3. Contents
     Position of the problem
       Background, notations
       Difficulties
       The learning process
       The villain
     Validation
       Performance indicators
       Estimating an indicator
       Testing a hypothesis
       Comparing hypotheses
     Validation Campaign
       The point of parameter setting
       Racing
       Expected Global Improvement

  4. Contents (same outline as slide 3)

  5. Supervised Machine Learning
     Context: Oracle / World → instance x_i → label y_i
     Input: training set E = {(x_i, y_i), i = 1…n, x_i ∈ X, y_i ∈ Y}
     Output: hypothesis h : X → Y
     Criterion: few mistakes (details later)

  6. Definitions
     Example:
     ◮ row: example / case
     ◮ column: feature / variable / attribute
     ◮ one distinguished attribute: the class / label
     Instance space X:
     ◮ Propositional: X ≡ ℝ^d
     ◮ Relational: e.g. chemistry; molecule: alanine
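A minimal sketch of the propositional representation above: each row is an example, each column a feature, and one distinguished column holds the class label. The feature values and labels below are invented purely for illustration.

```python
import numpy as np

# Hypothetical propositional dataset: n rows (examples), d columns (features),
# plus one class/label vector. All values are made up for illustration.
X = np.array([[5.1, 3.5, 1.4],    # each row: one example / case
              [6.2, 2.9, 4.3],
              [5.9, 3.0, 5.1]])   # each column: one feature / attribute
y = np.array([0, 1, 1])           # distinguished attribute: the class / label

n, d = X.shape                    # instance space X ≡ R^d
print(f"{n} examples in R^{d}, labels {set(y.tolist())}")
```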

  7. Contents (same outline as slide 3)

  8. Difficulty factors
     Quality of examples / of the representation
     + Relevant features (feature extraction)
     − Not enough data
     − Noise; missing data
     − Structured data: spatio-temporal, relational, textual, videos...
     Distribution of examples
     + Independent, identically distributed examples
     − Other: robotics; data streams; heterogeneous data
     Prior knowledge
     + Constraints on the sought solution
     + Criteria; loss function

  9. Difficulty factors, 2
     Learning criterion
     + Convex function: a single optimum
       Complexity: n, n log n, n² (scalability)
     − Combinatorial optimization
     What is your agenda?
     ◮ Prediction performance
     ◮ Causality
     ◮ INTELLIGIBILITY
     ◮ Simplicity
     ◮ Stability
     ◮ Interactivity, visualisation

  10. Difficulty factors, 3
      Crossing the chasm
      ◮ There exists no killer algorithm
      ◮ Few general recommendations about algorithm selection
      Performance criteria
      ◮ Consistency: when the number n of examples goes to ∞ and the target concept h* is in H,
        the algorithm finds ĥ_n with lim_{n→∞} ĥ_n = h*
      ◮ Convergence speed: ||h* − ĥ_n|| = O(1/n), O(1/√n), O(1/ln n)

  11. Contents (same outline as slide 3)

  12. Context
      Related approaches and their criteria
      ◮ Data Mining, KDD: scalability
      ◮ Statistics and data analysis: model selection and fitting; hypothesis testing
      ◮ Machine Learning: prior knowledge; representations; distributions
      ◮ Optimisation: well-posed / ill-posed problems
      ◮ Computer-Human Interface: no ultimate solution, a dialog
      ◮ High-performance computing: distributed data; privacy

  13. Methodology
      Phases
      1. Collect data (expert, DB)
      2. Clean data (stat, expert)
      3. Select data (stat, expert)
      4. Data Mining / Machine Learning
         ◮ Description: what is in the data?
         ◮ Prediction: decide for one example
         ◮ Aggregation: take a global decision
      5. Visualisation (HCI)
      6. Evaluation (stat, HCI)
      7. Collect new data (expert, stat)
      An iterative process depending on expectations, data, prior knowledge, and current results

  14. Contents (same outline as slide 3)

  15. Supervised Machine Learning
      Context: Oracle / World → instance x_i → label y_i
      Input: training set E = {(x_i, y_i), i = 1…n, x_i ∈ X, y_i ∈ Y}
      Tasks
      ◮ Select the hypothesis space H
      ◮ Assess a hypothesis h ∈ H: score(h)
      ◮ Find the best hypothesis h*
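A toy sketch of the three tasks on this slide, assuming a one-dimensional instance space and a hypothesis space of threshold classifiers; both of those choices, and the synthetic data, are mine and not the deck's. The score used here is simply the (negated) empirical error, so the best hypothesis is the one with fewest training mistakes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set E = {(x_i, y_i)}: x in R, y in {0, 1} (invented data, 10% label noise).
x = rng.uniform(0, 1, size=100)
y = (x > 0.6).astype(int) ^ (rng.uniform(size=100) < 0.1)

# Hypothesis space H: threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0, 1, 101)

def empirical_error(t):
    """Fraction of training mistakes of h_t (0/1 loss); score(h_t) is its negation."""
    return np.mean((x > t).astype(int) != y)

# Best hypothesis h*: the threshold with the smallest empirical error over H.
best_t = min(thresholds, key=empirical_error)
print(f"best threshold t = {best_t:.2f}, training error = {empirical_error(best_t):.2f}")
```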

  16. What is the point?
      Underfitting vs. overfitting
      The point is not to be perfect on the training set

  17. What is the point?
      Underfitting vs. overfitting: the point is not to be perfect on the training set
      The villain: overfitting
      [Figure: training error and test error as a function of the complexity of hypotheses]
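A small numerical sketch of the curve on this slide, using polynomial regression degree as the complexity axis; the data-generating process and model family are my choices, not the deck's. Training error keeps shrinking as complexity grows, while test error eventually rises: that rise is the overfitting villain.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Noisy sine data standing in for 'the world' (invented for illustration)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)

x_train, y_train = sample(20)
x_test, y_test = sample(200)

for degree in (1, 3, 9, 15):                       # increasing hypothesis complexity
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```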

  18. What is the point?
      Prediction must be good on future instances
      Necessary condition: future instances must be similar to training instances ("identically distributed")
      Minimize the (cost of) errors: ℓ(y, h(x)) ≥ 0; not all mistakes are equal
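A sketch of a cost-sensitive loss ℓ(y, h(x)) ≥ 0 in which the two kinds of mistakes carry different costs, illustrating "not all mistakes are equal"; the particular cost values are assumptions of mine, not from the slides.

```python
def loss(y_true, y_pred, cost_false_negative=5.0, cost_false_positive=1.0):
    """Asymmetric 0/1-style loss: missing a positive is assumed 5x worse
    than a false alarm (illustrative costs only)."""
    if y_true == y_pred:
        return 0.0
    return cost_false_negative if y_true == 1 else cost_false_positive

print(loss(1, 0))  # costly miss      -> 5.0
print(loss(0, 1))  # cheaper false alarm -> 1.0
```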

  19. Error: theoretical approach
      Minimize the expectation of the error cost:
      Minimize E[ℓ(y, h(x))] = ∫_{X×Y} ℓ(y, h(x)) p(x, y) dx dy

  20. Error: theoretical approach
      Minimize the expectation of the error cost:
      Minimize E[ℓ(y, h(x))] = ∫_{X×Y} ℓ(y, h(x)) p(x, y) dx dy
      Principle: if h "is well-behaved" on E, and h is "sufficiently regular",
      then h will be well-behaved in expectation:
      E[F] ≤ (1/n) Σ_{i=1}^n F(x_i) + c(F, n)
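A hedged numerical illustration of the inequality above. For a single fixed F bounded in [0, 1], one simple instance of such a bound is Hoeffding's inequality, where sqrt(ln(1/δ)/(2n)) plays the role of c(F, n); the choice of F, the sampling distribution, and δ are all mine, and the deck's c(F, n) may refer to a richer capacity term.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def F(x):
    """A function bounded in [0, 1] on the chosen domain (assumption of this sketch)."""
    return x ** 2

# i.i.d. samples x_i (uniform on [0, 1], as an assumption).
x = rng.uniform(0, 1, size=1000)

empirical_mean = F(x).mean()                        # (1/n) sum F(x_i)
delta = 0.05                                        # confidence level 1 - delta
c = math.sqrt(math.log(1 / delta) / (2 * len(x)))   # Hoeffding-style c(F, n)

print(f"E[F] = 1/3 (exact), empirical mean = {empirical_mean:.3f}, "
      f"one-sided upper bound = {empirical_mean + c:.3f}")
```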

  21. Classification, problem posed
      INPUT: E = {(x_i, y_i), x_i ∈ X, y_i ∈ {0, 1}, i = 1…n}, drawn ∼ P(x, y)
      HYPOTHESIS / SEARCH SPACE: H, with h : X → {0, 1}
      LOSS FUNCTION: ℓ : Y × Y → ℝ
      OUTPUT: h* = arg max {score(h), h ∈ H}

  22. Classification, criteria
      Generalisation error: Err(h) = E[ℓ(y, h(x))] = ∫ ℓ(y, h(x)) dP(x, y)
      Empirical error: Err_e(h) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i))
      Bound (risk minimization): Err(h) < Err_e(h) + F(n, d(H)), where d(H) is the VC-dimension of H

  23. Vapnik-Chervonenkis dimension
      Principle: given a hypothesis space H of functions X → {0, 1} and n points x_1, …, x_n in X,
      if ∀ (y_i)_{i=1}^n ∈ {0, 1}^n, ∃ h ∈ H such that h(x_i) = y_i, then H shatters {x_1, …, x_n}.
      Example: X = ℝ^p, d(hyperplanes in ℝ^p) = p + 1
      Why it matters: if H shatters E, then E doesn't tell anything.
      [Figure: 3 points shattered by a line; 4 points not shattered]
      Definition: d(H) = max {n : ∃ {x_1, …, x_n} shattered by H}
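A brute-force sketch of the shattering test in the definition above, for the hypothesis space of one-sided 1-D threshold classifiers h_t(x) = 1[x ≥ t]; this toy class is my choice (the slide's example uses hyperplanes). A point set is shattered iff every labeling in {0, 1}^n is realized by some hypothesis.

```python
from itertools import product

def realizable(points, labels, thresholds):
    """Is there a threshold t with 1[x >= t] == label for every point?"""
    return any(all((x >= t) == bool(y) for x, y in zip(points, labels))
               for t in thresholds)

def shattered(points):
    """H = {x -> 1[x >= t]} shatters the points iff all 2^n labelings are realizable."""
    xs = sorted(points)
    # Candidate thresholds: below, between, and above the points.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return all(realizable(points, labels, thresholds)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([0.3]))       # True: any single point is shattered
print(shattered([0.3, 0.7]))  # False: labeling (1, 0) is impossible for this class
# So d(H) = 1 for one-sided thresholds, versus p + 1 for hyperplanes in R^p.
```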

  24. Classification: ingredients of the error
      Bias: Bias(H) is the error of the best hypothesis h* in H
      Variance: variance of ĥ_n, depending on the training set E
      [Figure: hypothesis space H with the best hypothesis h*, the learned ĥ, the bias and the variance]
      Optimization: negligible at small scale, takes over at large scale (Google)
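A rough simulation of the variance ingredient: retrain the same toy learner on freshly drawn training sets and measure how much the learned hypotheses ĥ_n move around. The data-generating process and the threshold learner are assumptions of mine, used only to make the notion of "variance across training sets" concrete.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_training_set(n=30):
    """Invented world: y = 1[x > 0.5] with 15% label noise."""
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int) ^ (rng.uniform(size=n) < 0.15)
    return x, y

def learn_threshold(x, y):
    """ĥ_n: the threshold with the smallest empirical error on this training set."""
    candidates = np.linspace(0, 1, 101)
    return min(candidates, key=lambda t: np.mean((x > t).astype(int) != y))

# Variance: how much ĥ_n changes when the training set E changes.
learned = [learn_threshold(*draw_training_set()) for _ in range(200)]
print(f"mean learned threshold {np.mean(learned):.3f}, "
      f"std across training sets {np.std(learned):.3f}")
```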

  25. Contents (same outline as slide 3)

  26. Validation: three questions
      Define a good indicator of quality
      ◮ Misclassification cost
      ◮ Area under the ROC curve
      Compute an estimate thereof
      ◮ Validation set
      ◮ Cross-validation
      ◮ Leave-one-out
      ◮ Bootstrap
      Compare estimates: tests and confidence levels
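A sketch of one of the estimators listed above, k-fold cross-validation, written out by hand; the synthetic data, the deliberately small nearest-centroid learner, and k = 5 are placeholders of mine. Leave-one-out is the special case k = n.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented dataset: 100 examples in R^3, label driven by the first feature.
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

def fit_nearest_centroid(X_train, y_train):
    """Very small learner: one centroid per class."""
    return X_train[y_train == 0].mean(axis=0), X_train[y_train == 1].mean(axis=0)

def predict(model, X_test):
    """Predict the class of the closer centroid."""
    c0, c1 = model
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def cross_validation_error(X, y, k=5):
    """k-fold CV estimate of the misclassification rate."""
    indices = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        model = fit_nearest_centroid(X[train], y[train])   # learn on k-1 folds
        errors.append(np.mean(predict(model, X[fold]) != y[fold]))  # test on held-out fold
    return float(np.mean(errors))

print(f"5-fold CV error estimate: {cross_validation_error(X, y):.3f}")
```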

  27. Which indicator, which estimate? It depends.
      Setting
      ◮ Large / small amount of data
      Data distribution
      ◮ Dependent / independent examples
      ◮ Balanced / imbalanced classes

  28. Contents (same outline as slide 3)

  29. Performance indicators
      Binary classification
      ◮ h*: the truth
      ◮ ĥ: the learned hypothesis
      Confusion matrix (rows: ĥ, columns: h*)
                   h* = 1   h* = 0
      ĥ = 1           a        b       a + b
      ĥ = 0           c        d       c + d
                    a + c    b + d     a + b + c + d

  30. Performance indicators, 2
      Confusion matrix (rows: ĥ, columns: h*)
                   h* = 1   h* = 0
      ĥ = 1           a        b       a + b
      ĥ = 0           c        d       c + d
                    a + c    b + d     a + b + c + d
      ◮ Misclassification rate: (b + c) / (a + b + c + d)
      ◮ Sensitivity, true positive rate (TPR): a / (a + c)
      ◮ Specificity, true negative rate (TNR): d / (b + d)
      ◮ False positive rate (FPR): b / (b + d)
      ◮ Recall: a / (a + c)
      ◮ Precision: a / (a + b)
      Note: always compare to random guessing / a baseline algorithm.
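A small sketch that computes the confusion-matrix counts a, b, c, d and the rates listed above from a vector of predictions and a vector of true labels; both label vectors are invented for illustration.

```python
import numpy as np

# Invented predictions ĥ(x_i) and ground truth h*(x_i), both in {0, 1}.
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0])

a = np.sum((y_pred == 1) & (y_true == 1))  # true positives
b = np.sum((y_pred == 1) & (y_true == 0))  # false positives
c = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
d = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

print("misclassification rate:", (b + c) / (a + b + c + d))
print("sensitivity / TPR / recall:", a / (a + c))
print("specificity / TNR:", d / (b + d))
print("false positive rate:", b / (b + d))
print("precision:", a / (a + b))
```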

  31. Performance indicators, 3
      The area under the ROC curve
      ◮ ROC: Receiver Operating Characteristics
      ◮ Origin: signal processing, medicine
      Principle: h : X → ℝ, where h(x) measures the risk of patient x;
      h orders the examples:
      + + + − + − + + + + − − − + − − − + − − − − − − − − − − − −

  32. Performance indicators, 3 (continued)
      h orders the examples; given a threshold θ, h yields a classifier: positive iff h(x) > θ:
      + + + − + − + + + + | − − − + − − − + − − − − − − − − − − − −
      Here TPR(θ) = .8 and FPR(θ) = .1

  33. ROC
      [Figure]

  34. The ROC curve
      θ ↦ M(θ) = (1 − TNR(θ), TPR(θ)) = (FPR(θ), TPR(θ)) ∈ ℝ²
      Ideal classifier: the point (0, 1), i.e. no false positives and all true positives
      Diagonal (TPR = FPR) ≡ nothing learned
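A sketch of the ROC construction described on the last slides: sort the examples by the score h(x), sweep the threshold θ, collect the points (FPR(θ), TPR(θ)), and integrate with the trapezoidal rule to get the area under the curve. The scores and labels below are invented.

```python
import numpy as np

# Invented risk scores h(x_i) and true labels.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,    1,   0,    1,   0,   0,   0,   0  ])

def roc_points(scores, labels):
    """Sweep the threshold over all observed scores; return (FPR, TPR) pairs."""
    P, N = labels.sum(), (1 - labels).sum()
    points = [(0.0, 0.0)]
    for theta in sorted(scores, reverse=True):
        pred = (scores >= theta).astype(int)
        tpr = np.sum((pred == 1) & (labels == 1)) / P
        fpr = np.sum((pred == 1) & (labels == 0)) / N
        points.append((fpr, tpr))
    return points

pts = roc_points(scores, labels)
# Area under the ROC curve by the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print("ROC points:", pts)
print(f"AUC = {auc:.2f}")
```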
