  1. Outils Statistiques pour Data Science Part I : Supervised Learning Massih-Reza Amini Université Grenoble Alpes Laboratoire d’Informatique de Grenoble Massih-Reza.Amini@imag.fr

  2. Organization ❑ Automatic Classification (Massih R Amini, Georgios Balikas) ❑ Clustering (Massih R Amini, Georgios Balikas) ❑ Representation and indexing of a document (Massih R Amini, Georgios Balikas) ❑ Latent topic discovery (Marianne Clausel, Georgios Balikas) ❑ Visualisation (Marianne Clausel, Georgios Balikas)

  3. Learning and Inference The process of inference is done in three steps: 1. Observe a phenomenon, 2. Construct a model of the phenomenon, 3. Make predictions. “All that is necessary to reduce the whole of nature to laws similar to those which Newton discovered with the aid of calculus is to have a sufficient number of observations and a mathematics that is complex enough.” (Marquis de Condorcet, 1785) ❑ These steps are involved in more or less all natural sciences! ❑ The aim of learning is to automate this process, ❑ The aim of learning theory is to formalize it.

  4. Induction vs. deduction ❑ Induction is the process of deriving general principles from particular facts or instances. ❑ Deduction is, on the other hand, the process of reasoning in which a conclusion follows necessarily from the stated premises; it is an inference by reasoning from the general to the specific. This is how mathematicians prove theorems from axioms.

  5. Pattern recognition If we consider the context of supervised learning for pattern recognition: ❑ The data consist of pairs of examples (vector representation of an observation, class label), ❑ Class labels are often Y = {1, ..., K} with K large (but in the theory of ML we consider the binary classification case Y = {−1, +1}), ❑ The learning algorithm constructs an association between the vector representation of an observation → class label, ❑ Aim: make few errors on unseen examples.

  6. Pattern recognition (Example) IRIS classification, Ronald Fisher (1936): Iris Setosa, Iris Versicolor, Iris Virginica.
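As an aside, Fisher's Iris data ships with scikit-learn; here is a minimal sketch of what the raw observations and class labels look like (the library and its bundled dataset are an assumption, not part of the course material):

    # Peek at Fisher's Iris data: 150 observations, 4 features, 3 classes.
    from sklearn.datasets import load_iris

    iris = load_iris()
    print(iris.feature_names)            # sepal/petal length and width, in cm
    print(iris.target_names)             # ['setosa' 'versicolor' 'virginica']
    print(iris.data.shape)               # (150, 4): m = 150, d = 4
    print(iris.data[0], iris.target[0])  # first vectorised observation, its label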

  7. Pattern recognition (Example) ❑ The first step is to formalize the perception of the flowers with relevant common characteristics, which constitute the features of their vector representations. ❑ This usually requires expert knowledge.

  8. Pattern recognition (Example) ❑ If observations are from a field of irises, then they become vectorised observations (one feature vector per flower). ❑ The constitution of vectorised observations and their associated labels is generally time consuming. ❑ Many studies are now focused on representation learning using deep neural networks. ❑ Second step: learning then translates into the search for a function that maps vectorised observations (inputs) to their associated outputs, as the sketch below shows.
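To make the two steps concrete, a hedged sketch continuing the Iris example (scikit-learn assumed available; LogisticRegression is just one convenient choice of function class, not the course's prescribed method):

    # Learn a function from Iris feature vectors to labels, then measure
    # errors on unseen examples held out from training.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    f = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("error rate on unseen examples:", (f.predict(X_test) != y_test).mean())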

  9. Pattern recognition (pipeline diagram) 0. Training set, 1. Representation vectors, 2. Find the separators, 3. New examples, 5. Predict the labels of the new examples.

  10. Approximation - Interpolation It is always possible to construct a function that exactly fits the data. Is it reasonable?
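A minimal numpy sketch on hypothetical noisy data: with m = 10 points, a degree-9 polynomial fits them exactly, while a lower-degree one does not:

    # A degree m-1 polynomial always interpolates m points exactly,
    # but is it a reasonable model of the phenomenon?
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)                                  # m = 10 inputs
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)  # noisy outputs

    exact = np.polynomial.Polynomial.fit(x, y, deg=9)   # interpolates the data
    simple = np.polynomial.Polynomial.fit(x, y, deg=3)  # smoother approximation

    print("max residual, degree 9:", np.abs(exact(x) - y).max())   # ~0
    print("max residual, degree 3:", np.abs(simple(x) - y).max())  # > 0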

  11. Occam's razor Idea: search for regularities (or repetitions) in the observed phenomenon; generalization is done from the past observations to the new future ones ⇒ take the simplest model. But how to measure simplicity? 1. Number of constants, 2. Number of parameters, 3. ...

  12. Basic Hypotheses Two types of hypotheses: ❑ Past observations are related to the future ones → the phenomenon is stationary, ❑ Observations are independently generated from a source → notion of independence.

  13. Aims → How can one make predictions from past data? What are the hypotheses? ❑ Give a formal definition of learning, generalization, overfitting, ❑ Characterize the performance of learning algorithms, ❑ Construct better algorithms.

  14. Probabilistic model Relations between the past and future observations: ❑ Independence: each new observation provides a maximum of individual information, ❑ Identically distributed: observations provide information on the phenomenon which generates the observations.

  15. Formally We consider an input space X ⊆ R^d and an output space Y. Assumption: example pairs (x, y) ∈ X × Y are identically and independently distributed (i.i.d.) with respect to an unknown but fixed probability distribution D. Samples: we observe a sequence of m pairs of examples (x_i, y_i) generated i.i.d. from D. Aim: construct a prediction function f : X → Y which predicts an output y for a given new x with a minimum probability of error.
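A hedged sketch of this setting with a hypothetical, made-up distribution D over X × Y = R^2 × {−1, +1}; in practice D is unknown and only the sample S is observed:

    # Simulate the i.i.d. assumption: a fixed D we can sample from.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_D(m):
        # Draw m i.i.d. pairs (x, y): y is a fair coin flip over {-1, +1},
        # x = y + Gaussian noise in R^2 (a toy stand-in for the unknown D).
        y = rng.choice([-1, +1], size=m)
        x = y[:, None] + rng.standard_normal((m, 2))
        return x, y

    x_train, y_train = sample_D(100)   # the observed training sample, m = 100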

  16. Supervised Learning ❑ Discriminant models directly find a classification function f : X → Y from a given class of functions F; ❑ The function found should be the one having the lowest probability of error. The risk function considered in classification is usually the misclassification error: R(f) = E_{(x,y)∼D} L(f(x), y) = ∫_{X×Y} L(f(x), y) dD(x, y), where L : Y × Y → R_+ is the loss defined by ∀(x, y); L(f(x), y) = [[f(x) ≠ y]], and [[π]] is equal to 1 if the predicate π is true and 0 otherwise.
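A minimal sketch of the misclassification loss and a Monte Carlo view of R(f), using the same hypothetical toy D as above (estimating the risk this way is only possible because our toy D is known):

    # Zero-one loss and an empirical estimate of the true risk R(f).
    import numpy as np

    rng = np.random.default_rng(0)
    m = 100_000
    y = rng.choice([-1, +1], size=m)              # i.i.d. labels from toy D
    x = y[:, None] + rng.standard_normal((m, 2))  # class-conditional inputs

    def zero_one_loss(y_pred, y_true):
        # [[f(x) != y]]: 1 on a misclassification, 0 otherwise.
        return (y_pred != y_true).astype(float)

    f = lambda z: np.where(z.sum(axis=1) >= 0.0, +1, -1)  # a hypothetical classifier
    print("Monte Carlo estimate of R(f):", zero_one_loss(f(x), y).mean())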

  17. Empirical risk minimization (ERM) principle ❑ As the probability distribution D is unknown, the analytic form of the true risk cannot be derived, so the prediction function cannot be found directly by minimizing R(f). ❑ Empirical risk minimization (ERM) principle: find f by minimizing the unbiased estimator of R on a given training set S = (x_i, y_i)_{i=1}^{m}: R̂_m(f, S) = (1/m) ∑_{i=1}^{m} L(f(x_i), y_i). ❑ However, without restricting the class of functions, this is not the right way of proceeding (Occam's razor), as the next slide illustrates.
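A hedged sketch of ERM over a hypothetical finite class F of one-dimensional threshold classifiers (data generated from a toy distribution chosen for the example):

    # ERM: pick the function in F with the smallest empirical risk on S.
    import numpy as np

    rng = np.random.default_rng(0)
    m = 100
    y_train = rng.choice([-1, +1], size=m)      # i.i.d. labels
    x_train = y_train + rng.standard_normal(m)  # 1-d inputs, class means -1 / +1

    def empirical_risk(f, x, y):
        # R_hat_m(f, S) = (1/m) * sum_i [[f(x_i) != y_i]]
        return (f(x) != y).mean()

    # F: threshold classifiers f_t(x) = +1 if x >= t, -1 otherwise.
    F = [lambda x, t=t: np.where(x >= t, +1, -1) for t in np.linspace(-3, 3, 61)]
    f_erm = min(F, key=lambda f: empirical_risk(f, x_train, y_train))
    print("empirical risk of the ERM solution:",
          empirical_risk(f_erm, x_train, y_train))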

  18. ERM principle: a problem Suppose that the input dimension is d = 1, let the input space X be the interval [a, b] ⊂ R where a and b are real values such that a < b, and suppose that the output space is {−1, +1}. Moreover, suppose that the distribution D generating the examples (x, y) is the uniform distribution over [a, b] × {−1}. Consider now a learning algorithm which minimizes the empirical risk by choosing a function in the class F = {f : [a, b] → {−1, +1}} (also denoted F = {−1, +1}^{[a, b]}) in the following way: after reviewing a training set S = {(x_1, y_1), ..., (x_m, y_m)}, the algorithm outputs the prediction function f_S such that f_S(x) = −1 if x ∈ {x_1, ..., x_m}, and +1 otherwise. Its empirical risk is zero, yet under D every example has label −1 while f_S predicts +1 everywhere outside the finite training set, so its true risk is R(f_S) = 1: the function memorizes the sample and does not generalize.
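A minimal numerical illustration of this failure, assuming for concreteness a = 0 and b = 1:

    # Memorization: zero empirical risk, true risk ~1 under D.
    import numpy as np

    rng = np.random.default_rng(0)
    a, b, m = 0.0, 1.0, 50
    x_S = rng.uniform(a, b, size=m)   # training inputs; every label is -1 under D
    y_S = -np.ones(m)

    def f_S(x):
        # Predict -1 exactly on the memorized training points, +1 elsewhere.
        return np.where(np.isin(x, x_S), -1.0, +1.0)

    print("empirical risk:", (f_S(x_S) != y_S).mean())          # 0.0
    x_new = rng.uniform(a, b, size=100_000)                     # fresh draws from D
    print("estimated true risk:", (f_S(x_new) != -1.0).mean())  # ~1.0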
