Advanced Machine Learning


  1. Advanced Machine Learning Emilie Chouzenoux (1) , L. Omar Chehab (2) and Frédéric Pascal (3) (1) Center for Computer Vision (CVN), CentraleSupélec / Opis Team, Inria (2) Parietal Team, Inria (3) Laboratory of Signals and Systems (L2S), CentraleSupélec, University Paris-Saclay {emilie.chouzenoux, frederic.pascal}@centralesupelec.fr, l-emir-omar.chehab@inria.fr http://www-syscom.univ-mlv.fr/~chouzeno/ http://fredericpascal.blogspot.fr MDS Sept. - Dec., 2020

  2. Contents
  1 Introduction - Reminders of probability theory and mathematical statistics (Bayes, estimation, tests) - FP
  2 Robust regression approaches - EC / OC
  3 Hierarchical clustering - FP / OC
  4 Stochastic approximation algorithms - EC / OC
  5 Nonnegative matrix factorization (NMF) - EC / OC
  6 Mixture models fitting / Model Order Selection - FP / OC
  7 Inference on graphical models - EC / VR
  8 Exam

  3. Key references for this course Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer, 2009. James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning, with Applications in R. Springer, 2013 + many many references... F. Pascal 3 / 85

  4. Course 1 Introduction - Reminders of probability theory and mathematical statistics F. Pascal 4 / 85

  5. I. Introduction in stat. signal processing II. Random Variables / Vectors / CV III. Essential theorems IV. Statistical modelling V. Theory of Point Estimation VI. Hypothesis testing - Decision theory

  6. What is Machine Learning? Statistical machine learning is concerned with the development of algorithms and techniques that learn from observed data by constructing stochastic models that can be used for making predictions and decisions. Topics covered include Bayesian inference and maximum likelihood modeling; regression, classification, density estimation, clustering, principal component analysis; parametric, semi-parametric, and non-parametric models; basis functions, neural networks, kernel methods, and graphical models; deterministic and stochastic optimization; overfitting, regularization, and validation. Introduction in stat. signal processing F. Pascal 5 / 85

  7. From data to processing - robustness, dimension... Big Picture. [Diagram: data-driven vs. model-driven processing; classical setting (n > p), regularization (n < p), low-rank structure (R < n, p); processing combined with a priori information.] Introduction in stat. signal processing F. Pascal 6 / 85

  8. General context: Statistical Signal Processing. Signals z: multivariate random complex observations (vectors). Example: z ∈ ℂ^p. Signal corrupted by an additive noise: z = β d(θ) + n with n ∼ CN(0, Σ), θ and β unknown. Several processes: PCA and dimension reduction; Parameter estimation; Detection / Filtering; Clustering / Classification; ... Introduction in stat. signal processing F. Pascal 7 / 85
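As a concrete illustration of this observation model, here is a minimal NumPy sketch that simulates z = β d(θ) + n with circular complex Gaussian noise; the steering vector d(θ), the dimension p and every numerical value are illustrative assumptions, not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_samples = 8, 500            # dimension and number of observations (arbitrary)
theta, beta_amp = 0.3, 2.0       # hypothetical signal parameter and amplitude

def steering_vector(theta, p):
    """Toy choice for d(theta): a normalised complex exponential."""
    return np.exp(2j * np.pi * theta * np.arange(p)) / np.sqrt(p)

d = steering_vector(theta, p)
Sigma = np.eye(p)                # noise covariance (identity, for simplicity)

# Circular complex Gaussian noise n ~ CN(0, Sigma)
L = np.linalg.cholesky(Sigma)
noise = (rng.standard_normal((n_samples, p))
         + 1j * rng.standard_normal((n_samples, p))) / np.sqrt(2)
noise = noise @ L.conj().T

Z = beta_amp * d + noise         # observations z = beta * d(theta) + n, one per row
```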

  9. Covariance & Subspace. Two quantities common to all these processes: “optimal” processes rely on the second-order statistics of z, notably on: The covariance matrix (assuming circularity): Σ = E[z z^H], which carries information on the variance and correlations between elements of z. The principal subspace (of rank R): Π_R = P_R(E[z z^H]), the rank-R orthogonal subspace where most of the information lies. Introduction in stat. signal processing F. Pascal 8 / 85
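A possible NumPy sketch of these two quantities, using the standard sample covariance matrix as an estimate of E[z z^H] and an eigendecomposition for the rank-R principal subspace projector; the function names and the choice R = 2 are ours, for illustration only.

```python
import numpy as np

def sample_covariance(Z):
    """Sample covariance (1/n) sum_i z_i z_i^H, with observations z_i as rows of Z."""
    n = Z.shape[0]
    return Z.conj().T @ Z / n

def principal_subspace_projector(Sigma, R):
    """Orthogonal projector Pi_R onto the span of the R leading eigenvectors of Sigma."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    U_R = eigvecs[:, -R:]                      # R principal eigenvectors
    return U_R @ U_R.conj().T

# With Z from the previous sketch:
# Sigma_hat = sample_covariance(Z)
# Pi_R = principal_subspace_projector(Sigma_hat, R=2)
# Pi_perp = np.eye(Sigma_hat.shape[0]) - Pi_R   # orthogonal complement, used below
```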

  10. Examples. Estimation (MLE, GMM...): the parameter θ of the signal d(θ) is to be estimated from observations. Example: Maximum Likelihood Estimator (MLE): min_θ (d(θ) − z)^H Σ^{-1} (d(θ) − z). Low-rank version (e.g. MUSIC): replace Σ^{-1} by Π^⊥. Applications: DoA, inverse problems, source separation... Detection (ACE, GLRT, ANMF, MSD...): binary hypothesis test: is d(θ_0) present? Example: Adaptive Cosine Estimator (ACE, or ANMF): Λ_ACE = |d(θ_0)^H Σ^{-1} z|^2 / [(d(θ_0)^H Σ^{-1} d(θ_0)) (z^H Σ^{-1} z)] ≷_{H_0}^{H_1} η. Low-rank version: replace Σ^{-1} by Π^⊥. Applications: RADAR, imaging, audio... Introduction in stat. signal processing F. Pascal 9 / 85
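A minimal sketch of the ACE/ANMF statistic above, assuming the covariance Σ and the steering vector d(θ_0) are known (in practice Σ would be estimated); the threshold η is left unspecified here.

```python
import numpy as np

def ace_statistic(z, d0, Sigma):
    """ACE / ANMF statistic: |d0^H S^-1 z|^2 / ((d0^H S^-1 d0)(z^H S^-1 z))."""
    Sinv = np.linalg.inv(Sigma)
    num = np.abs(d0.conj() @ Sinv @ z) ** 2
    den = np.real(d0.conj() @ Sinv @ d0) * np.real(z.conj() @ Sinv @ z)
    return num / den

# Decision rule: declare H1 when ace_statistic(z, d0, Sigma) > eta, where the
# threshold eta would be chosen to guarantee a desired false-alarm probability.
```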

  11. Filtering (MF, AMF, Projection...): maximizing the output signal-to-noise ratio (SNR). Example: Adaptive Matched Filter: y = |d^H(θ) Σ^{-1} z|^2 / (d(θ_0)^H Σ^{-1} d(θ_0)). Low-rank version: replace Σ^{-1} by Π^⊥. Applications: de-noising, interference cancellation (telecom)... Classification (SVM, K-means, KL divergence...): select a class for the observations; covariance and subspace are descriptors. Example: KL divergence between two distributions (or other divergences: Wasserstein, Riemannian...): KL(Z_1, Z_2) = (1/2) [Tr(Σ_2^{-1} Σ_1) + Tr(Σ_1^{-1} Σ_2) − 2k], W_2^2(Z_1, Z_2) = Tr(Σ_1) + Tr(Σ_2) − 2 Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}). Applications: machine learning, segmentation, profile determination... Introduction in stat. signal processing F. Pascal 10 / 85
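For the classification example, here is a hedged NumPy/SciPy sketch of the two covariance-based dissimilarities on this slide, written for zero-mean Gaussian models; note that the first formula is the symmetrised (Jeffreys) form of the KL divergence.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(Sigma1, Sigma2):
    """Squared Wasserstein-2 distance between zero-mean Gaussians N(0, Sigma1) and N(0, Sigma2)."""
    S1_half = sqrtm(Sigma1)
    cross = sqrtm(S1_half @ Sigma2 @ S1_half)
    return np.real(np.trace(Sigma1) + np.trace(Sigma2) - 2 * np.trace(cross))

def gaussian_sym_kl(Sigma1, Sigma2):
    """Symmetrised KL (Jeffreys) divergence between zero-mean k-variate Gaussians."""
    k = Sigma1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(Sigma2, Sigma1))
                  + np.trace(np.linalg.solve(Sigma1, Sigma2)) - 2 * k)
```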

  12. Example of non-Gaussianity (1/3): High Resolution SAR images. HR SAR images, SMDS data. Introduction in stat. signal processing F. Pascal 11 / 85

  13. Example of non-Gaussianity (2/3): Hyperspectral data. NASA Hyperion sensor. Introduction in stat. signal processing F. Pascal 12 / 85

  14. Example of non-Gaussianity (3/3): Financial data. Nasdaq-100, S&P 500. Courtesy of E. Ollila [Ollila18]. Introduction in stat. signal processing F. Pascal 13 / 85

  15. I. Introduction in stat. signal processing II. Random Variables / Vectors / CV III. Essential theorems IV. Statistical modelling V. Theory of Point Estimation VI. Hypothesis testing - Decision theory

  16. Menu - Probabilities and statistics basics. Example: Fair Six-Sided Die. Sample space: Ω = {1, 2, 3, 4, 5, 6}. Events: Even = {2, 4, 6}, Odd = {1, 3, 5} ⊆ Ω. Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2. Outcome: 6 ∈ Even. Conditional probability: P(6 | Even) = P(6 ∩ Even) / P(Even) = (1/6) / (1/2) = 1/3. General axioms: P(∅) = 0 ≤ P(A) ≤ 1 = P(Ω), P(A ∪ B) + P(A ∩ B) = P(A) + P(B), P(A ∩ B) = P(A | B) P(B). Random Variables / Vectors / CV F. Pascal 14 / 85
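A quick Python check of the die computations above, using exact fractions (the variable names are ours, chosen for illustration):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # sample space of the fair die
even, odd = {2, 4, 6}, {1, 3, 5}

P = lambda A: Fraction(len(A & omega), len(omega))   # uniform probability

assert P({6}) == Fraction(1, 6)
assert P(even) == P(odd) == Fraction(1, 2)
assert P({6} & even) / P(even) == Fraction(1, 3)     # P(6 | Even) = 1/3
```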

  17. Menu - Probabilities and statistics basics. Example: (Un)fair coin: Ω = {Tail, Head} ≃ {0, 1} with P(1) = θ ∈ [0, 1]. Likelihood: P(1101 | θ) = θ × θ × (1 − θ) × θ = θ^3 (1 − θ). Maximum Likelihood (ML) estimate: θ̂ = argmax_θ P(1101 | θ) = 3/4. Prior: if we are indifferent, then P(θ) = const. Evidence: P(1101) = ∫_θ P(1101 | θ) P(θ) dθ = 1/20. Posterior: P(θ | 1101) = P(1101 | θ) P(θ) / P(1101) ∝ θ^3 (1 − θ) (Bayes rule). Maximum a Posteriori (MAP) estimate: θ̂ = argmax_θ P(θ | 1101) = 3/4. Predictive distribution: P(1 | 1101) = P(11011) / P(1101) = 2/3. Expectation: E[f | ...] = ∫_θ f(θ) P(θ | ...) dθ, e.g. E[θ | 1101] = 2/3. Variance: V(θ | 1101) = E[(θ − E[θ])^2 | 1101] = 2/63. Probability density: P(θ) = (1/ε) P([θ, θ + ε]) for ε → 0. Random Variables / Vectors / CV F. Pascal 15 / 85
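These numbers can be checked with a short SciPy sketch: under a uniform prior the posterior is Beta(4, 2), whose mode, mean and variance reproduce the values on the slide (the use of scipy.stats here is our own illustration, not part of the course):

```python
from scipy.integrate import quad
from scipy.stats import beta

# Tosses 1101: three heads and one tail, uniform prior on theta.
likelihood = lambda t: t**3 * (1 - t)

evidence, _ = quad(likelihood, 0, 1)   # P(1101) = 1/20 = 0.05
posterior = beta(4, 2)                 # posterior Beta(3 + 1, 1 + 1) under a uniform prior

theta_map = 3 / 4                      # mode of Beta(4, 2), i.e. the MAP estimate
print(evidence, posterior.mean(), posterior.var())   # 0.05, 2/3 ~ 0.667, 2/63 ~ 0.0317
```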

  18. Random Variables (r.v.) / Vectors (r.V.): Notations. Let X (resp. x) be a random variable (resp. random vector). Denote by P or P_θ its probability: P(X = x) or P_θ(X = x) for the discrete case; f(x) or f_θ(x) for the continuous case (with PDF). Some other notations: E[·] or E_θ[·] (resp. V[·] / V_θ[·]) stands for the statistical expectation (resp. the variance). i.i.d. → Independent (denoted ⊥) and Identically Distributed, i.e. same distribution and X ⊥ Y ⟺ for any measurable functions h and g, E[g(X) h(Y)] = E[g(X)] E[h(Y)]. n-sample (X_1, ..., X_n) ⟺ X_1, ..., X_n are i.i.d. PDF, CDF and iff resp. mean Probability Density Function, Cumulative Distribution Function and “if and only if”. Random Variables / Vectors / CV F. Pascal 16 / 85

  19. Convergences: multivariate case. Let (x_n)_{n∈ℕ} be a sequence of r.V. in ℝ^d and x ∈ ℝ^d, defined on the same probability space (Ω, A, P). Then: Almost sure CV: x_n →^{a.s.} x ⟺ ∃ N ∈ A such that P(N) = 0 and ∀ ω ∈ N^c, lim_{n→∞} x_n(ω) = x(ω). CV in probability: x_n →^{P} x ⟺ ∀ ε > 0, lim_{n→∞} P(‖x_n − x‖ ≥ ε) = 0, where ‖x‖ = (Σ_{i=1}^d x_i^2)^{1/2} for x ∈ ℝ^d; moreover, x_n →^{P} x ⟺ each component converges in probability. CV in L^p: let p ∈ ℕ*; x_n →^{L^p} x ⟺ (x_n)_{n∈ℕ}, x ∈ L^p and E[‖x_n − x‖^p] → 0 as n → ∞. Random Variables / Vectors / CV F. Pascal 17 / 85
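As an informal illustration of convergence in probability (not a proof), the following sketch estimates P(|x_n − x| ≥ ε) for the sample mean of i.i.d. Uniform(0, 1) variables, which converges in probability to 1/2; all numerical choices are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_rep = 0.1, 5000

# Sample mean of n i.i.d. Uniform(0, 1) variables: x_n -> 1/2 in probability.
for n in (10, 100, 1000):
    means = rng.uniform(size=(n_rep, n)).mean(axis=1)    # n_rep independent realisations of x_n
    print(n, np.mean(np.abs(means - 0.5) >= eps))        # estimate of P(|x_n - 1/2| >= eps)
```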

  20. Convergence in distribution. CV in distribution: x_n →^{dist.} x if, for any continuous and bounded function g, one has lim_{n→∞} E[g(x_n)] = E[g(x)]. Warning: the CV in distribution of a sequence of r.V. is stronger than the CV in distribution of each component! How to characterise the CV in distribution? Theorem (Lévy continuity theorem): let φ_n(u) = E[exp(i u^t x_n)] and φ(u) = E[exp(i u^t x)] be the characteristic functions of x_n and x. Then x_n →^{dist.} x ⟺ ∀ u ∈ ℝ^d, φ_n(u) → φ(u) as n → ∞. Proposition (a.s., P, dist. convergences): x_n → x ⟹ h(x_n) → h(x), if h is a continuous function. Discussion on the cv hierarchy... Random Variables / Vectors / CV F. Pascal 18 / 85
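Lévy's criterion can be illustrated empirically on a central limit theorem example: the sketch below compares the empirical characteristic function of x_n = √n (x̄_n − 1/2), for i.i.d. Uniform(0, 1) variables, with the characteristic function of its N(0, 1/12) limit (all numerical choices are ours).

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 500, 10000

# x_n = sqrt(n) * (sample mean of n i.i.d. Uniform(0,1) - 1/2) -> N(0, 1/12) in distribution.
x_n = np.sqrt(n) * (rng.uniform(size=(n_rep, n)).mean(axis=1) - 0.5)

for u in (0.5, 1.0, 2.0):
    phi_n = np.mean(np.exp(1j * u * x_n))          # empirical characteristic function of x_n
    phi_limit = np.exp(-0.5 * (1 / 12) * u**2)     # characteristic function of N(0, 1/12)
    print(u, phi_n.real, phi_limit)
```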
