Machine Learning - MT 2017

2. Mathematical Basics

Christoph Haase
University of Oxford

October 11, 2017
About this lecture

◮ No Machine Learning without rigorous mathematics
◮ This should be the most boring lecture
◮ Serves as a reference for the notation used throughout the course
◮ If there are any holes, make sure to fill them sooner rather than later
◮ Attempt Problem Sheet 0 to see where you stand
Outline

Today's lecture
◮ Linear algebra
◮ Calculus
◮ Probability theory
Linear algebra

We will mostly work in the real vector space:
◮ Scalar: single number r ∈ R
◮ Vector: array of numbers x = (x_1, ..., x_D) ∈ R^D of dimension D
◮ Matrix: two-dimensional array A ∈ R^{m×n} written as

        | a_{1,1}  a_{1,2}  ···  a_{1,n} |
  A  =  | a_{2,1}  a_{2,2}  ···  a_{2,n} |
        |    ⋮        ⋮      ⋱      ⋮    |
        | a_{m,1}  a_{m,2}  ···  a_{m,n} |

◮ a vector x is an R^{D×1} matrix
◮ A_{i,j} denotes a_{i,j}
◮ A_{i,:} denotes the i-th row
◮ A_{:,i} denotes the i-th column
◮ A^T is the transpose of A such that (A^T)_{i,j} = A_{j,i}
◮ A is symmetric if A = A^T
◮ A ∈ R^{n×n} is diagonal if A_{i,j} = 0 for all i ≠ j
◮ I_n is the n×n diagonal matrix s.t. (I_n)_{i,i} = 1
(a short NumPy sketch of this notation follows below)
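A minimal NumPy sketch of how the notation above maps onto arrays. This is an illustration added for this handout rather than part of the slides; the concrete values of r, x and A are made up, and NumPy indices are 0-based while the slide notation is 1-based.

```python
import numpy as np

r = 3.0                                 # scalar r ∈ R
x = np.array([1.0, 2.0, 3.0])           # vector x ∈ R^D with D = 3
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])         # matrix A ∈ R^{2×3}

print(A[0, 1])                          # A_{1,2} (0-based indexing in NumPy)
print(A[0, :])                          # A_{1,:}, the first row
print(A[:, 1])                          # A_{:,2}, the second column
print(A.T.shape)                        # transpose A^T has shape (3, 2)

I3 = np.eye(3)                          # identity matrix I_3
D = np.diag([1.0, 2.0, 3.0])            # a diagonal matrix
print(np.allclose(D, D.T))              # diagonal matrices are symmetric
```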
Operations on matrices

◮ Addition: C = A + B s.t. C_{i,j} = A_{i,j} + B_{i,j}, with A, B, C ∈ R^{m×n}
  ◮ associative: A + (B + C) = (A + B) + C
  ◮ commutative: A + B = B + A
◮ Scalar multiplication: B = r · A s.t. B_{i,j} = r · A_{i,j}
◮ Multiplication: C = A · B s.t.

  C_{i,j} = Σ_{1 ≤ k ≤ n} A_{i,k} · B_{k,j}

  with A ∈ R^{m×n}, B ∈ R^{n×p}, C ∈ R^{m×p}
  ◮ associative: A · (B · C) = (A · B) · C
  ◮ not commutative in general: A · B ≠ B · A
  ◮ distributive wrt. addition: A · (B + C) = A · B + A · C
  ◮ (A · B)^T = B^T · A^T
◮ v and w are orthogonal if v^T · w = 0
(these properties are checked numerically in the sketch below)
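The algebraic properties listed above can be checked numerically. The sketch below is my own illustration (random matrices, made-up variable names), not part of the lecture; `@` is NumPy's matrix-multiplication operator.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))

print(np.allclose(B + C, C + B))                  # addition is commutative
print(np.allclose(2.5 * A, A * 2.5))              # scalar multiplication is elementwise

# Matrix multiplication: (A·B)_{i,j} = Σ_k A_{i,k} · B_{k,j}
print((A @ B).shape)                              # (2, 4): R^{2×3} times R^{3×4}
print(np.allclose(A @ (B + C), A @ B + A @ C))    # distributivity over addition
print(np.allclose((A @ B).T, B.T @ A.T))          # (A·B)^T = B^T·A^T

v = np.array([1.0, 0.0])
w = np.array([0.0, 2.0])
print(np.isclose(v @ w, 0.0))                     # v and w are orthogonal: v^T·w = 0
```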
Eigenvectors, eigenvalues, determinant, linear independence, inverses

◮ v ∈ R^n is an eigenvector of A ∈ R^{n×n} with eigenvalue λ ∈ R if A · v = λ · v
◮ A is positive (negative) definite if all eigenvalues are strictly greater (smaller) than zero
◮ The determinant of A ∈ R^{n×n} with eigenvalues λ_1, ..., λ_n is

  det(A) = λ_1 · λ_2 · · · λ_n

◮ v^(1), ..., v^(n) ∈ R^D are linearly independent if the only r_1, ..., r_n ∈ R with

  Σ_{1 ≤ i ≤ n} r_i · v^(i) = 0

  are r_1 = · · · = r_n = 0
◮ A ∈ R^{n×n} is invertible if there is A^{-1} ∈ R^{n×n} s.t. A · A^{-1} = A^{-1} · A = I_n
◮ Note that:
  ◮ A is invertible if and only if the rows of A are linearly independent
  ◮ equivalently, if and only if det(A) ≠ 0
◮ If A is invertible then A · x = b has the solution x = A^{-1} · b
(see the NumPy sketch below)
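A small NumPy sketch, added here as an illustration rather than taken from the slides, of eigenvalues, the determinant and inverses for one concrete positive definite matrix (the matrix A below is a made-up example).

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs satisfy A · v = λ · v
print(eigvals)                          # 3 and 1 (up to ordering): A is positive definite
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))

# det(A) equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))

# A is invertible since det(A) ≠ 0, so A · x = b has the solution x = A^{-1} · b
b = np.array([1.0, 0.0])
x = np.linalg.solve(A, b)               # solves A · x = b without forming A^{-1} explicitly
print(np.allclose(A @ x, b))
print(np.allclose(np.linalg.inv(A) @ A, np.eye(2)))
```

In practice `np.linalg.solve` is preferred over multiplying by an explicitly computed inverse; the inverse is shown above only to match the slide.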
Vector norms

Vector norms allow us to talk about the length of vectors.
◮ The L_p norm of v = (v_1, ..., v_D) ∈ R^D is given by

  ‖v‖_p = ( Σ_{1 ≤ i ≤ D} |v_i|^p )^{1/p}

◮ Properties of L_p (which actually hold for any norm):
  ◮ ‖v‖_p = 0 implies v = 0
  ◮ ‖v + w‖_p ≤ ‖v‖_p + ‖w‖_p
  ◮ ‖r · v‖_p = |r| · ‖v‖_p for all r ∈ R
◮ Popular norms:
  ◮ Manhattan norm L_1
  ◮ Euclidean norm L_2
  ◮ Maximum norm L_∞, where ‖v‖_∞ = max_{1 ≤ i ≤ D} |v_i|
◮ Vectors v, w ∈ R^D are orthonormal if v and w are orthogonal and ‖v‖_2 = ‖w‖_2 = 1
(a short NumPy sketch of these norms follows below)
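The norms above correspond to `numpy.linalg.norm` with different `ord` values. The sketch below is an illustration with made-up vectors, not part of the slides.

```python
import numpy as np

v = np.array([3.0, -4.0])
w = np.array([1.0, 2.0])

print(np.linalg.norm(v, ord=1))         # L_1 (Manhattan):  |3| + |-4| = 7
print(np.linalg.norm(v, ord=2))         # L_2 (Euclidean):  sqrt(9 + 16) = 5
print(np.linalg.norm(v, ord=np.inf))    # L_∞ (maximum):    max(3, 4) = 4

# Triangle inequality ‖v + w‖_p ≤ ‖v‖_p + ‖w‖_p, checked for p = 2
print(np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w))

# An orthonormal pair: orthogonal and both of unit L_2 norm
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(np.isclose(e1 @ e2, 0.0) and np.isclose(np.linalg.norm(e1), 1.0))
```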
Calculus

Functions of one variable f : R → R
◮ First derivative:

  f'(x) = d/dx f(x) = lim_{h→0} ( f(x + h) − f(x) ) / h

◮ f'(x*) = 0 means that x* is a critical or stationary point
◮ It can be a local minimum, a local maximum, or a saddle point
◮ Global minima are local minima x* with smallest f(x*)
◮ Second derivative test to (partially) decide the nature of a critical point
◮ Differentiation rules:

  d/dx x^n = n · x^{n−1}      d/dx a^x = a^x · ln(a)      d/dx log_a(x) = 1 / (x · ln(a))

  (f + g)' = f' + g'      (f · g)' = f' · g + f · g'

◮ Chain rule: if f = h(g) then f' = h'(g) · g'
(a small numerical sketch of the limit definition follows below)
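The limit definition of the derivative can be approximated by taking a small but finite h. The sketch below is my own illustration, not from the slides; it uses a central difference, (f(x + h) − f(x − h)) / (2h), for better numerical accuracy than the one-sided quotient above, and checks it against the rule d/dx x^3 = 3·x^2.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3
print(numerical_derivative(f, 2.0))                              # ≈ 12.0, since f'(x) = 3·x^2
print(abs(numerical_derivative(math.exp, 1.0) - math.e) < 1e-5)  # d/dx e^x = e^x at x = 1
```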
Calculus

Functions of multiple variables f : R^m → R
◮ Partial derivative of f(x_1, ..., x_m) in direction x_i at a = (a_1, ..., a_m):

  ∂f/∂x_i (a) = lim_{h→0} ( f(a_1, ..., a_i + h, ..., a_m) − f(a_1, ..., a_i, ..., a_m) ) / h

◮ Gradient (assuming f is differentiable everywhere):

  ∇_x f = ( ∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_m )   s.t.   ∇_x f(a) = ( ∂f/∂x_1(a), ..., ∂f/∂x_m(a) )

◮ Points in direction of steepest ascent
◮ Critical point if ∇_x f(a) = 0
(a componentwise finite-difference sketch follows below)
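A componentwise finite-difference approximation of the gradient, following the partial-derivative definition above. The function f and the point a are made-up examples, not from the slides.

```python
import numpy as np

def numerical_gradient(f, a, h=1e-6):
    """Approximate ∇f(a) by a central difference in each coordinate direction."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = h
        grad[i] = (f(a + step) - f(a - step)) / (2 * h)
    return grad

f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]   # f(x1, x2) = x1^2 + 3·x1·x2
a = np.array([1.0, 2.0])
print(numerical_gradient(f, a))             # ≈ [8., 3.], matching (2·x1 + 3·x2, 3·x1) at a
```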