  1. Machine Learning - MT 2017 2. Mathematical Basics Christoph Haase University of Oxford October 11, 2017

  2. About this lecture
◮ No Machine Learning without rigorous mathematics
◮ This should be the most boring lecture
◮ Serves as reference for notation used throughout the course
◮ If there are any holes, make sure to fill them sooner rather than later
◮ Attempt Problem Sheet 0 to see where you stand

  3. Outline
Today’s lecture
◮ Linear algebra
◮ Calculus
◮ Probability theory

  4. Linear algebra
We will mostly work in the real vector space:
◮ Scalar: single number r ∈ R
◮ Vector: array of numbers x = (x_1, ..., x_D) ∈ R^D of dimension D
◮ Matrix: two-dimensional array A ∈ R^{m×n} written as

    A = [ a_{1,1}  a_{1,2}  ···  a_{1,n}
          a_{2,1}  a_{2,2}  ···  a_{2,n}
            ⋮        ⋮       ⋱     ⋮
          a_{m,1}  a_{m,2}  ···  a_{m,n} ]

◮ a vector x is an R^{D×1} matrix
◮ A_{i,j} denotes the entry a_{i,j}
◮ A_{i,:} denotes the i-th row
◮ A_{:,i} denotes the i-th column
◮ A^T is the transpose of A such that (A^T)_{i,j} = A_{j,i}
◮ A is symmetric if A = A^T
◮ A ∈ R^{n×n} is diagonal if A_{i,j} = 0 for all i ≠ j
◮ I_n is the n×n diagonal matrix such that (I_n)_{i,i} = 1
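As a concrete illustration, the notation above maps directly onto NumPy arrays; the following minimal sketch (the array names A, x and S are arbitrary choices for this example) shows scalars, vectors, matrices, indexing, transposition and the identity matrix:

```python
import numpy as np

r = 3.0                                    # scalar r ∈ R
x = np.array([1.0, 2.0, 3.0])              # vector x ∈ R^D with D = 3
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # matrix A ∈ R^{2×3}

print(A[0, 1])        # entry A_{1,2} (NumPy indices start at 0)
print(A[0, :])        # first row A_{1,:}
print(A[:, 0])        # first column A_{:,1}
print(A.T)            # transpose, (A^T)_{i,j} = A_{j,i}

S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(np.allclose(S, S.T))                 # True: S is symmetric
print(np.diag([1.0, 2.0, 3.0]))            # a diagonal matrix
print(np.eye(3))                           # identity matrix I_3
```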

  5. Operations on matrices
◮ Addition: C = A + B s.t. C_{i,j} = A_{i,j} + B_{i,j} with A, B, C ∈ R^{m×n}
  ◮ associative: A + (B + C) = (A + B) + C
  ◮ commutative: A + B = B + A
◮ Scalar multiplication: B = r · A s.t. B_{i,j} = r · A_{i,j}
◮ Multiplication: C = A · B s.t.

    C_{i,j} = Σ_{1 ≤ k ≤ n} A_{i,k} · B_{k,j}

  with A ∈ R^{m×n}, B ∈ R^{n×p}, C ∈ R^{m×p}
  ◮ associative: A · (B · C) = (A · B) · C
  ◮ not commutative in general: A · B ≠ B · A
  ◮ distributive wrt. addition: A · (B + C) = A · B + A · C
  ◮ (A · B)^T = B^T · A^T
◮ v and w are orthogonal if v^T · w = 0
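A minimal NumPy sketch (the matrices A, B and vectors v, w below are arbitrary examples) checks the listed properties numerically:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = 2.5

C = A + B                        # elementwise addition, C_{i,j} = A_{i,j} + B_{i,j}
print(np.allclose(A + B, B + A))           # True: addition is commutative

print(r * A)                     # scalar multiplication

P = A @ B                        # matrix product, P_{i,j} = Σ_k A_{i,k} · B_{k,j}
print(np.allclose(A @ B, B @ A))           # False here: multiplication is not commutative
print(np.allclose((A @ B).T, B.T @ A.T))   # True: (A · B)^T = B^T · A^T

v = np.array([1.0, 0.0])
w = np.array([0.0, 2.0])
print(v @ w == 0.0)              # True: v and w are orthogonal, v^T · w = 0
```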

  6. Eigenvectors, eigenvalues, determinant, linear independence, inverses
◮ v ∈ R^n is an eigenvector of A ∈ R^{n×n} with eigenvalue λ ∈ R if A · v = λ · v
◮ A is positive (negative) definite if all eigenvalues are strictly greater (smaller) than zero
◮ The determinant of A ∈ R^{n×n} with eigenvalues λ_1, ..., λ_n is det(A) = λ_1 · λ_2 ··· λ_n
◮ v^(1), ..., v^(n) ∈ R^D are linearly independent if

    Σ_{1 ≤ i ≤ n} r_i · v^(i) = 0   implies   r_1 = ··· = r_n = 0

◮ A ∈ R^{n×n} is invertible if there is A^{-1} ∈ R^{n×n} s.t. A · A^{-1} = A^{-1} · A = I_n
◮ Note that:
  ◮ A is invertible if and only if the rows of A are linearly independent
  ◮ equivalently, if and only if det(A) ≠ 0
  ◮ If A is invertible then A · x = b has the solution x = A^{-1} · b
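These quantities are all available through NumPy's linear algebra routines; the sketch below (using an arbitrarily chosen matrix A) verifies the eigenvalue equation, the determinant as the product of eigenvalues, and the use of the inverse to solve A · x = b:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Eigenvalues and eigenvectors: A · v = λ · v for each column v of V
eigvals, V = np.linalg.eig(A)
print(eigvals)                                        # [2. 3.]
print(np.allclose(A @ V[:, 0], eigvals[0] * V[:, 0])) # True

# Positive definite: all eigenvalues strictly greater than zero
print(np.all(eigvals > 0))

# Determinant equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))

# Inverse and solving A · x = b
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))
b = np.array([4.0, 9.0])
x = np.linalg.solve(A, b)        # numerically preferable to computing A_inv @ b
print(np.allclose(A @ x, b))
```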

  7. Vector norms
Vector norms allow us to talk about the length of vectors
◮ The L_p norm of v = (v_1, ..., v_D) ∈ R^D is given by

    ‖v‖_p = ( Σ_{1 ≤ i ≤ D} |v_i|^p )^{1/p}

◮ Properties of L_p (which actually hold for any norm):
  ◮ ‖v‖_p = 0 implies v = 0
  ◮ ‖v + w‖_p ≤ ‖v‖_p + ‖w‖_p
  ◮ ‖r · v‖_p = |r| · ‖v‖_p for all r ∈ R
◮ Popular norms:
  ◮ Manhattan norm L_1
  ◮ Euclidean norm L_2
  ◮ Maximum norm L_∞, where ‖v‖_∞ = max_{1 ≤ i ≤ D} |v_i|
◮ Vectors v, w ∈ R^D are orthonormal if v and w are orthogonal and ‖v‖_2 = ‖w‖_2 = 1
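A short NumPy sketch (with arbitrarily chosen vectors v, w, e1, e2) illustrates the L_p norms and the listed norm properties:

```python
import numpy as np

v = np.array([3.0, -4.0])

# L_p norms from the definition and via np.linalg.norm
p = 3
print(np.sum(np.abs(v) ** p) ** (1.0 / p))   # ‖v‖_3 from the definition
print(np.linalg.norm(v, ord=1))              # Manhattan norm L_1 -> 7.0
print(np.linalg.norm(v, ord=2))              # Euclidean norm L_2 -> 5.0
print(np.linalg.norm(v, ord=np.inf))         # maximum norm  L_∞ -> 4.0

# Triangle inequality and absolute homogeneity
w = np.array([1.0, 2.0])
print(np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w))
print(np.isclose(np.linalg.norm(-2.0 * v), 2.0 * np.linalg.norm(v)))

# Orthonormal pair: orthogonal and both of unit L_2 norm
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
print(e1 @ e2 == 0.0 and np.isclose(np.linalg.norm(e1), 1.0)
      and np.isclose(np.linalg.norm(e2), 1.0))
```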

  8. Calculus
Functions of one variable f : R → R
◮ First derivative:

    f'(x) = d/dx f(x) = lim_{h → 0} (f(x + h) − f(x)) / h

◮ f'(x*) = 0 means that x* is a critical or stationary point
  ◮ Can be a local minimum, a local maximum, or a saddle point
  ◮ Global minima are local minima x* with smallest f(x*)
  ◮ Second derivative test to (partially) decide the nature of a critical point
◮ Differentiation rules:

    d/dx x^n = n · x^(n−1)        d/dx a^x = a^x · ln(a)        d/dx log_a(x) = 1 / (x · ln(a))

    (f + g)' = f' + g'            (f · g)' = f' · g + f · g'

◮ Chain rule: if f = h(g) then f' = h'(g) · g'
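The limit definition suggests a simple numerical check: approximate f'(x) by a difference quotient with a small h and compare against the differentiation rules. The sketch below does exactly that (the helper numerical_derivative and the example functions are purely illustrative choices):

```python
import numpy as np

def numerical_derivative(f, x, h=1e-6):
    """Illustrative helper: approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 3                        # f(x) = x^n with n = 3
print(numerical_derivative(f, 2.0))         # ≈ 3 · 2^2 = 12 (power rule)

g = lambda x: 2.0 ** x                      # f(x) = a^x with a = 2
print(numerical_derivative(g, 1.0))         # ≈ 2^1 · ln(2)
print(2.0 * np.log(2.0))

# Chain rule: f = h(g) with h(u) = u^2, g(x) = sin(x), so f'(x) = 2 sin(x) cos(x)
comp = lambda x: np.sin(x) ** 2
print(numerical_derivative(comp, 0.5))
print(2.0 * np.sin(0.5) * np.cos(0.5))

# A critical point of f(x) = x^2 at x* = 0 (here a global minimum)
sq = lambda x: x ** 2
print(abs(numerical_derivative(sq, 0.0)) < 1e-5)
```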

  9. Calculus
Functions of multiple variables f : R^m → R
◮ Partial derivative of f(x_1, ..., x_m) in direction x_i at a = (a_1, ..., a_m):

    ∂f/∂x_i (a) = lim_{h → 0} (f(a_1, ..., a_i + h, ..., a_m) − f(a_1, ..., a_i, ..., a_m)) / h

◮ Gradient (assuming f is differentiable everywhere):

    ∇_x f = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_m)   s.t.   ∇_x f(a) = (∂f/∂x_1(a), ..., ∂f/∂x_m(a))

  ◮ Points in the direction of steepest ascent
  ◮ a is a critical point if ∇_x f(a) = 0
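The same finite-difference idea extends component-wise to partial derivatives and the gradient; a minimal sketch follows (the helper numerical_gradient and the example function f are illustrative choices, not fixed by the course):

```python
import numpy as np

def numerical_gradient(f, a, h=1e-6):
    """Illustrative helper: approximate ∇f(a) component-wise by difference quotients."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h
        grad[i] = (f(a + e) - f(a)) / h     # ∂f/∂x_i at a
    return grad

# f(x_1, x_2) = x_1^2 + 3 x_1 x_2, so ∇f = (2 x_1 + 3 x_2, 3 x_1)
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
a = np.array([1.0, 2.0])
print(numerical_gradient(f, a))             # ≈ [8. 3.]

# ∇f points in the direction of steepest ascent; the origin is a critical point of f
print(numerical_gradient(f, np.array([0.0, 0.0])))   # ≈ [0. 0.]
```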
