MIT 9.520/6.860: Statistical Learning Theory and Applications
Class 0: Mathcamp
Lorenzo Rosasco
Outline:
◮ Vector Spaces
◮ Hilbert Spaces
◮ Functionals and Operators (Matrices)
◮ Linear Operators
◮ Probability Theory
$\mathbb{R}^D$

We like $\mathbb{R}^D$ because we can
◮ add elements: $v + w$
◮ multiply by numbers: $3v$
◮ take scalar products: $v^\top w = \sum_{j=1}^D v_j w_j$
◮ ... and norms: $\|v\| = \sqrt{v^\top v} = \sqrt{\sum_{j=1}^D (v_j)^2}$
◮ ... and distances: $d(v, w) = \|v - w\| = \sqrt{\sum_{j=1}^D (v_j - w_j)^2}$.

We want to do the same thing with $D = \infty$...
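As a concrete illustration (added here, not part of the original slides), the same operations in NumPy for $D = 3$:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

s = v + w                        # add elements
u = 3 * v                        # multiply by numbers
ip = v @ w                       # scalar product: sum_j v_j * w_j
norm_v = np.sqrt(v @ v)          # norm ||v|| (same as np.linalg.norm(v))
dist = np.linalg.norm(v - w)     # distance d(v, w) = ||v - w||
print(s, u, ip, norm_v, dist)
```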
Vector Space

◮ A vector space is a set $V$ with binary operations
  $+ : V \times V \to V$ and $\cdot : \mathbb{R} \times V \to V$
  such that for all $a, b \in \mathbb{R}$ and $v, w, x \in V$:
  1. $v + w = w + v$
  2. $(v + w) + x = v + (w + x)$
  3. there exists $0 \in V$ such that $v + 0 = v$ for all $v \in V$
  4. for every $v \in V$ there exists $-v \in V$ such that $v + (-v) = 0$
  5. $a(bv) = (ab)v$
  6. $1v = v$
  7. $(a + b)v = av + bv$
  8. $a(v + w) = av + aw$
◮ Examples: $\mathbb{R}^n$, the space of polynomials, spaces of functions.
Inner Product

◮ An inner product is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ such that for all $a, b \in \mathbb{R}$ and $v, w, x \in V$:
  1. $\langle v, w \rangle = \langle w, v \rangle$
  2. $\langle av + bw, x \rangle = a \langle v, x \rangle + b \langle w, x \rangle$
  3. $\langle v, v \rangle \ge 0$, and $\langle v, v \rangle = 0$ if and only if $v = 0$.
◮ $v, w \in V$ are orthogonal if $\langle v, w \rangle = 0$.
◮ Given a subspace $W \subseteq V$ (closed, in the infinite-dimensional case), we have $V = W \oplus W^\perp$, where
  $W^\perp = \{ v \in V \mid \langle v, w \rangle = 0 \text{ for all } w \in W \}$.
◮ Cauchy-Schwarz inequality: $|\langle v, w \rangle| \le \langle v, v \rangle^{1/2} \langle w, w \rangle^{1/2}$.
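An added NumPy sketch (not from the slides) of Cauchy-Schwarz and of the decomposition $V = W \oplus W^\perp$ for a one-dimensional $W$:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(5)
w = rng.standard_normal(5)

# Cauchy-Schwarz: |<v, w>| <= <v,v>^(1/2) <w,w>^(1/2)
assert abs(v @ w) <= np.sqrt(v @ v) * np.sqrt(w @ w) + 1e-12

# V = W (+) W-perp for W = span{w}: split v into orthogonal pieces.
u = w / np.linalg.norm(w)        # unit vector spanning W
v_par = (v @ u) * u              # component of v in W
v_perp = v - v_par               # component of v in W-perp
assert abs(v_perp @ u) < 1e-12   # the two components are orthogonal
assert np.allclose(v, v_par + v_perp)
```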
Norm

◮ A norm is a function $\| \cdot \| : V \to \mathbb{R}$ such that for all $a \in \mathbb{R}$ and $v, w \in V$:
  1. $\|v\| \ge 0$, and $\|v\| = 0$ if and only if $v = 0$
  2. $\|av\| = |a| \, \|v\|$
  3. $\|v + w\| \le \|v\| + \|w\|$
◮ Can define a norm from an inner product: $\|v\| = \langle v, v \rangle^{1/2}$.
Metric

◮ A metric is a function $d : V \times V \to \mathbb{R}$ such that for all $v, w, x \in V$:
  1. $d(v, w) \ge 0$, and $d(v, w) = 0$ if and only if $v = w$
  2. $d(v, w) = d(w, v)$
  3. $d(v, w) \le d(v, x) + d(x, w)$
◮ Can define a metric from a norm: $d(v, w) = \|v - w\|$.
Basis

◮ $B = \{ v_1, \ldots, v_n \}$ is a basis of $V$ if every $v \in V$ can be uniquely decomposed as
  $v = a_1 v_1 + \cdots + a_n v_n$
  for some $a_1, \ldots, a_n \in \mathbb{R}$.
◮ An orthonormal basis is a basis that is orthogonal ($\langle v_i, v_j \rangle = 0$ for $i \neq j$) and normalized ($\|v_i\| = 1$).
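A minimal added sketch: with an orthonormal basis, the coefficients of the decomposition are simply $a_i = \langle v, v_i \rangle$.

```python
import numpy as np

# An orthonormal basis of R^2: the standard basis rotated by 45 degrees.
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([1.0, -1.0]) / np.sqrt(2)

v = np.array([3.0, 5.0])
a1, a2 = v @ v1, v @ v2                    # coefficients a_i = <v, v_i>
assert np.allclose(v, a1 * v1 + a2 * v2)   # unique decomposition recovers v
```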
Hilbert Spaces
Hilbert Space, overview

◮ Goal: to understand Hilbert spaces (complete inner product spaces) and to make sense of the expression
  $f = \sum_{i=1}^{\infty} \langle f, \phi_i \rangle \phi_i, \qquad f \in \mathcal{H}$
◮ Need to talk about:
  1. Cauchy sequences
  2. Completeness
  3. Density
  4. Separability
Cauchy Sequence

◮ Recall: $\lim_{n \to \infty} x_n = x$ if for every $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that $\|x - x_n\| < \epsilon$ whenever $n \ge N$.
◮ $(x_n)_{n \in \mathbb{N}}$ is a Cauchy sequence if for every $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that $\|x_m - x_n\| < \epsilon$ whenever $m, n \ge N$.
◮ Every convergent sequence is a Cauchy sequence (why?)
Completeness

◮ A normed vector space $V$ is complete if every Cauchy sequence converges (to a limit in $V$).
◮ Examples:
  1. $\mathbb{Q}$ is not complete.
  2. $\mathbb{R}$ is complete (axiom).
  3. $\mathbb{R}^n$ is complete.
  4. Every finite-dimensional normed vector space (over $\mathbb{R}$) is complete.
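An added illustration of why $\mathbb{Q}$ is not complete: Newton's iteration below produces a Cauchy sequence of rationals whose limit, $\sqrt{2}$, lies outside $\mathbb{Q}$.

```python
from fractions import Fraction

# Newton iteration for sqrt(2): a Cauchy sequence of rationals
# whose limit, sqrt(2), is NOT rational -- so Q is not complete.
x = Fraction(1)
for _ in range(6):
    x = (x + 2 / x) / 2      # each iterate is still an exact rational
print(float(x))              # 1.41421356..., the limit exists only in R
```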
Hilbert Space

◮ A Hilbert space is a complete inner product space.
◮ Examples:
  1. $\mathbb{R}^n$
  2. every finite-dimensional inner product space
  3. $\ell^2 = \{ (a_n)_{n=1}^{\infty} \mid a_n \in \mathbb{R}, \ \sum_{n=1}^{\infty} a_n^2 < \infty \}$
  4. $L^2([0,1]) = \{ f : [0,1] \to \mathbb{R} \mid \int_0^1 f(x)^2 \, dx < \infty \}$
Density

◮ $Y$ is dense in $X$ if $\overline{Y} = X$ (the closure of $Y$ is all of $X$).
◮ Examples:
  1. $\mathbb{Q}$ is dense in $\mathbb{R}$.
  2. $\mathbb{Q}^n$ is dense in $\mathbb{R}^n$.
  3. Weierstrass approximation theorem: polynomials are dense in the continuous functions (with the supremum norm, on compact domains).
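An added numerical illustration of Weierstrass's theorem: a least-squares Chebyshev fit (not the best uniform approximant, but enough to show the sup-norm error vanishing as the degree grows).

```python
import numpy as np

# Approximate the continuous function f(x) = |x - 0.5| on [0, 1]
# by polynomials of growing degree.
x = np.linspace(0.0, 1.0, 2001)
f = np.abs(x - 0.5)

for deg in [2, 6, 10, 20]:
    c = np.polynomial.chebyshev.Chebyshev.fit(x, f, deg)
    print(deg, np.max(np.abs(f - c(x))))   # sup-norm error shrinks with degree
```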
Separability

◮ $X$ is separable if it has a countable dense subset.
◮ Examples:
  1. $\mathbb{R}$ is separable.
  2. $\mathbb{R}^n$ is separable.
  3. $\ell^2$ and $L^2([0,1])$ are separable.
Orthonormal Basis

◮ A Hilbert space has a countable orthonormal basis if and only if it is separable.
◮ Can write:
  $f = \sum_{i=1}^{\infty} \langle f, \phi_i \rangle \phi_i \quad \text{for all } f \in \mathcal{H}.$
◮ Examples:
  1. A basis of $\ell^2$ is $(1, 0, 0, \ldots), (0, 1, 0, \ldots), (0, 0, 1, 0, \ldots), \ldots$
  2. A basis of $L^2([0,1])$ is $1, \ \sqrt{2} \sin(2\pi n x), \ \sqrt{2} \cos(2\pi n x)$ for $n \in \mathbb{N}$.
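An added sketch of the expansion $f = \sum_i \langle f, \phi_i \rangle \phi_i$ in the Fourier basis of $L^2([0,1])$, with the inner products approximated by quadrature on a grid (an assumption of this illustration, not part of the slides):

```python
import numpy as np

# Grid quadrature approximates the L^2 inner product <f, g> on [0, 1].
x = np.linspace(0.0, 1.0, 4096, endpoint=False)
dx = x[1] - x[0]
f = x * (1 - x)                       # an element of L^2([0, 1])

# Truncated expansion  f ~ sum_i <f, phi_i> phi_i  in the Fourier basis.
basis = [np.ones_like(x)]
for n in range(1, 20):
    basis += [np.sqrt(2) * np.sin(2 * np.pi * n * x),
              np.sqrt(2) * np.cos(2 * np.pi * n * x)]

approx = sum((phi @ f) * dx * phi for phi in basis)
print(np.max(np.abs(f - approx)))     # small: partial sums converge to f
```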
Functionals and Operators (Matrices)
Maps

Next we are going to review basic properties of maps on a Hilbert space:
◮ functionals $\Psi : \mathcal{H} \to \mathbb{R}$
◮ linear operators $A : \mathcal{H} \to \mathcal{H}$, such that $A(af + bg) = aAf + bAg$, with $a, b \in \mathbb{R}$ and $f, g \in \mathcal{H}$.
Representation of Continuous Functionals

Let $\mathcal{H}$ be a Hilbert space and $g \in \mathcal{H}$; then
  $\Psi_g(f) = \langle f, g \rangle, \qquad f \in \mathcal{H},$
is a continuous linear functional.

Riesz representation theorem: every continuous linear functional $\Psi$ on $\mathcal{H}$ can be written uniquely in the form
  $\Psi(f) = \langle f, g \rangle$
for some appropriate element $g \in \mathcal{H}$.
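In $\mathbb{R}^n$ the representer $g$ can be read off from the functional's values on the standard basis; an added sketch with a hypothetical functional `psi`:

```python
import numpy as np

# In R^n every linear functional is psi(x) = <x, g>; recover g from
# psi's values on the standard basis (psi here is a made-up example).
psi = lambda x: 2 * x[0] - x[1] + 4 * x[2]

g = np.array([psi(e) for e in np.eye(3)])   # g_i = psi(e_i)
x = np.array([1.0, 2.0, 3.0])
assert np.isclose(psi(x), x @ g)            # psi(x) = <x, g>
```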
Matrix

◮ Every linear operator $L : \mathbb{R}^n \to \mathbb{R}^m$ can be represented by an $m \times n$ matrix $A$.
◮ If $A \in \mathbb{R}^{m \times n}$, the transpose of $A$ is the matrix $A^\top \in \mathbb{R}^{n \times m}$ satisfying
  $\langle Ax, y \rangle_{\mathbb{R}^m} = (Ax)^\top y = x^\top A^\top y = \langle x, A^\top y \rangle_{\mathbb{R}^n}$
  for every $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$.
◮ $A$ is symmetric if $A^\top = A$.
Eigenvalues and Eigenvectors

◮ Let $A \in \mathbb{R}^{n \times n}$. A nonzero vector $v \in \mathbb{R}^n$ is an eigenvector of $A$ with corresponding eigenvalue $\lambda \in \mathbb{R}$ if $Av = \lambda v$.
◮ Symmetric matrices have real eigenvalues.
◮ Spectral theorem: let $A$ be a symmetric $n \times n$ matrix. Then there is an orthonormal basis of $\mathbb{R}^n$ consisting of eigenvectors of $A$.
◮ Eigendecomposition: $A = V \Lambda V^\top$, or equivalently,
  $A = \sum_{i=1}^{n} \lambda_i v_i v_i^\top.$
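An added NumPy check of the spectral theorem for a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # a symmetric matrix

lam, V = np.linalg.eigh(A)              # real eigenvalues, orthonormal V
assert np.allclose(A, V @ np.diag(lam) @ V.T)   # A = V Lambda V^T
assert np.allclose(V.T @ V, np.eye(4))          # columns are orthonormal
```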
Singular Value Decomposition

◮ Every $A \in \mathbb{R}^{m \times n}$ can be written as
  $A = U \Sigma V^\top,$
  where $U \in \mathbb{R}^{m \times m}$ is orthogonal, $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal, and $V \in \mathbb{R}^{n \times n}$ is orthogonal.
◮ Singular system:
  $A v_i = \sigma_i u_i, \qquad A^\top u_i = \sigma_i v_i,$
  $A A^\top u_i = \sigma_i^2 u_i, \qquad A^\top A v_i = \sigma_i^2 v_i.$
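An added NumPy check of the decomposition and the singular system relations:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)        # A = U Sigma V^T

u0, v0, s0 = U[:, 0], Vt[0], s[0]            # first singular triplet
assert np.allclose(A @ v0, s0 * u0)          # A v_i = sigma_i u_i
assert np.allclose(A.T @ u0, s0 * v0)        # A^T u_i = sigma_i v_i
```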
Matrix Norm

◮ The spectral norm of $A \in \mathbb{R}^{m \times n}$ is
  $\|A\|_{\mathrm{spec}} = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A A^\top)} = \sqrt{\lambda_{\max}(A^\top A)}.$
◮ The Frobenius norm of $A \in \mathbb{R}^{m \times n}$ is
  $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2} = \sqrt{\sum_{i=1}^{\min\{m,n\}} \sigma_i^2}.$
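Both norms can be computed from the singular values; an added check:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)       # singular values of A

assert np.isclose(np.linalg.norm(A, 2), s[0])                       # spectral
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))  # Frobenius
```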
Positive Definite Matrix

A real symmetric matrix $A \in \mathbb{R}^{m \times m}$ is positive definite if
  $x^\top A x > 0 \quad \text{for all } x \in \mathbb{R}^m, \ x \neq 0.$
A positive definite matrix has positive eigenvalues.
Note: for positive semi-definite matrices, $>$ is replaced by $\ge$ (and the restriction $x \neq 0$ is dropped).
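An added sketch: $B B^\top + I$ is positive definite by construction, and its eigenvalues are indeed positive.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)          # B B^T is PSD; adding I makes it PD

assert np.all(np.linalg.eigvalsh(A) > 0)   # positive eigenvalues
x = rng.standard_normal(4)
assert x @ A @ x > 0                       # x^T A x > 0 for this nonzero x
```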
Linear Operators
Linear Operator

◮ An operator $L : \mathcal{H}_1 \to \mathcal{H}_2$ is linear if it preserves the linear structure.
◮ A linear operator $L : \mathcal{H}_1 \to \mathcal{H}_2$ is bounded if there exists $C > 0$ such that
  $\|Lf\|_{\mathcal{H}_2} \le C \|f\|_{\mathcal{H}_1}$ for all $f \in \mathcal{H}_1$.
◮ A linear operator is continuous if and only if it is bounded.
Adjoint and Compactness

◮ The adjoint of a bounded linear operator $L : \mathcal{H}_1 \to \mathcal{H}_2$ is the bounded linear operator $L^* : \mathcal{H}_2 \to \mathcal{H}_1$ satisfying
  $\langle Lf, g \rangle_{\mathcal{H}_2} = \langle f, L^* g \rangle_{\mathcal{H}_1}$ for all $f \in \mathcal{H}_1$, $g \in \mathcal{H}_2$.
◮ $L$ is self-adjoint if $L^* = L$. Self-adjoint operators have real eigenvalues.
◮ A bounded linear operator $L : \mathcal{H}_1 \to \mathcal{H}_2$ is compact if the image of the unit ball in $\mathcal{H}_1$ has compact closure in $\mathcal{H}_2$.
Spectral Theorem for Compact Self-Adjoint Operators

◮ Let $L : \mathcal{H} \to \mathcal{H}$ be a compact self-adjoint operator. Then there exists an orthonormal basis of $\mathcal{H}$ consisting of eigenfunctions of $L$,
  $L \phi_i = \lambda_i \phi_i,$
and the only possible limit point of the $\lambda_i$ as $i \to \infty$ is $0$.
◮ Eigendecomposition:
  $L = \sum_{i=1}^{\infty} \lambda_i \langle \phi_i, \cdot \rangle \phi_i.$
Probability Theory

Probability Space

A triple $(\Omega, \mathcal{A}, P)$, where $\Omega$ is a set, $\mathcal{A}$ is a sigma-algebra, i.e. a family of subsets of $\Omega$ such that
◮ $\Omega, \emptyset \in \mathcal{A}$,
◮ $A \in \mathcal{A} \ \Rightarrow \ \Omega \setminus A \in \mathcal{A}$,
◮ $A_i \in \mathcal{A}, \ i = 1, 2, \ldots \ \Rightarrow \ \cup_{i=1}^{\infty} A_i \in \mathcal{A}$,
and $P$ is a probability measure, i.e. a function $P : \mathcal{A} \to [0, 1]$ such that
◮ $P(\Omega) = 1$ (hence $P(\emptyset) = 0$),
◮ sigma-additivity: if $A_i \in \mathcal{A}, \ i = 1, 2, \ldots$ are disjoint, then
  $P\big( \cup_{i=1}^{\infty} A_i \big) = \sum_{i=1}^{\infty} P(A_i).$
Real Random Variables (RV)

A measurable function $X : \Omega \to \mathbb{R}$, i.e. one for which the preimage $X^{-1}(I)$ of every open subset $I \subset \mathbb{R}$ belongs to the sigma-algebra $\mathcal{A}$.
◮ Law of a random variable: the probability measure on $\mathbb{R}$ defined by
  $\rho(I) = P(X^{-1}(I))$ for all open subsets $I \subset \mathbb{R}$.
◮ Probability density function of a probability measure $\rho$ on $\mathbb{R}$: a function $p : \mathbb{R} \to \mathbb{R}$ such that
  $\int_I d\rho(x) = \int_I p(x) \, dx$
for all open subsets $I \subset \mathbb{R}$.
Convergence of Random Variables

Let $X_i$, $i = 1, 2, \ldots$ be a sequence of random variables.
◮ Convergence in probability: for all $\epsilon \in (0, \infty)$,
  $\lim_{i \to \infty} P(|X_i - X| > \epsilon) = 0.$
◮ Almost sure convergence:
  $P\left( \lim_{i \to \infty} X_i = X \right) = 1.$
Law of Large Numbers

Let $X_i$, $i = 1, 2, \ldots$ be a sequence of independent copies of a random variable $X$.

Weak law of large numbers: for all $\epsilon \in (0, \infty)$,
  $\lim_{n \to \infty} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - E[X] \right| > \epsilon \right) = 0.$

Strong law of large numbers:
  $P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = E[X] \right) = 1.$
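An added simulation of the law of large numbers for Uniform(0,1) samples, where $E[X] = 1/2$:

```python
import numpy as np

# Empirical means of i.i.d. Uniform(0,1) samples approach E[X] = 0.5.
rng = np.random.default_rng(0)
for n in [10, 1000, 100000]:
    x = rng.uniform(0.0, 1.0, size=n)
    print(n, abs(x.mean() - 0.5))    # deviation from E[X] shrinks with n
```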