Matrix differential calculus
10-725 Optimization, Geoff Gordon and Ryan Tibshirani
Review
• Matrix differentials: a solution to matrix-calculus pain
  ‣ a compact way of writing Taylor expansions, or …
  ‣ definition: df = a(x; dx) [+ r(dx)]
  ‣ a(x; ·) linear in its 2nd argument
  ‣ r(dx)/||dx|| → 0 as dx → 0
• d(·) is linear: passes through +, scalar *
• Generalizes Jacobian, Hessian, gradient, velocity
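A minimal numerical check of this definition (an illustration added here, not from the slides): for f(X) = tr(X^T X) the differential is df = 2 tr(X^T dX), and the remainder vanishes faster than ||dX||.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))

def f(X):
    return np.trace(X.T @ X)

def df(X, dX):                      # differential: linear in dX
    return 2.0 * np.trace(X.T @ dX)

for eps in [1e-1, 1e-2, 1e-3]:
    dX = eps * rng.standard_normal((3, 3))
    r = f(X + dX) - f(X) - df(X, dX)          # remainder r(dX)
    print(eps, r / np.linalg.norm(dX))        # ratio -> 0 as dX -> 0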
Review
• Chain rule
• Product rule
• Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, …
• Identities
  ‣ rules for working with ⊗, tr()
  ‣ trace rotation
• Identification theorems
Finding a maximum or minimum, or saddle point
ID for df(x), with f scalar-valued:
  scalar x:  df = a dx
  vector x:  df = a^T dx
  matrix X:  df = tr(A^T dX)
[Figure: a scalar function f(x) plotted against scalar x]
And so forth…
• Can't draw the picture for X a matrix, tensor, …
• But the same principle holds: set the coefficient of dX to 0 to find a min, max, or saddle point:
  ‣ if df = c(A; dX) [+ r(dX)], then
  ‣ max/min/saddle point iff A = 0
  ‣ for c(·; ·) any "product" (a dx, a^T dx, tr(A^T dX), …)
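As a concrete instance (illustration, not from the slides): for f(x) = ||Ax - b||^2 / 2 the differential is df = (Ax - b)^T A dx, so the identification theorem gives the gradient A^T(Ax - b), and setting that coefficient to zero yields the normal equations.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

def grad(x):                                  # coefficient of dx in df = (Ax - b)^T A dx
    return A.T @ (A @ x - b)

x_star = np.linalg.solve(A.T @ A, A.T @ b)    # "set the coefficient to zero"
print(np.linalg.norm(grad(x_star)))           # ~0: stationary point (here, the minimum)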
Ex: Infomax ICA
• Training examples x_i ∈ ℝ^d, i = 1:n
• Transformation y_i = g(Wx_i)
  ‣ W ∈ ℝ^{d×d}
  ‣ g(z) = an elementwise squashing function, e.g. the logistic sigmoid 1/(1 + e^{-z})
• Want: W that makes the components of y_i (approximately) independent
[Figure: scatter plots of x_i, Wx_i, and y_i]
Volume rule
• If y = f(x) with f invertible and J = ∂y/∂x, densities transform as
  ‣ P_y(y) = P_x(x) / |det J|
• (volumes scale by |det J|, so densities scale by its inverse)
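A quick numerical sanity check of the rule (added here, not on the slide): push a standard-normal point through a linear map y = Ax, whose output density is known in closed form, and compare with P_x(x)/|det J|.

import numpy as np

def gauss_pdf(v, cov):                        # density of N(0, cov) at v
    d = len(v)
    quad = v @ np.linalg.solve(cov, v)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 2))               # linear map y = A x, so J = A
x = rng.standard_normal(2)                    # x ~ N(0, I)
y = A @ x

p_y_rule  = gauss_pdf(x, np.eye(2)) / abs(np.linalg.det(A))   # volume rule
p_y_exact = gauss_pdf(y, A @ A.T)                             # exact density of y = A x
print(p_y_rule, p_y_exact)                                    # the two agree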
Ex: Infomax ICA
• y_i = g(Wx_i)
  ‣ dy_i = diag(g'(Wx_i)) W dx_i = J_i dx_i
• Method: max_W Σ_i -ln P(y_i)   (maximize the entropy of the outputs)
  ‣ where P(y_i) = P(x_i) / |det J_i|   (volume rule)
[Figure: scatter plots of Wx_i and y_i]
Gradient
• L = Σ_i ln |det J_i|   (the W-dependent part of Σ_i -ln P(y_i))
  ‣ y_i = g(Wx_i),  dy_i = J_i dx_i
Gradient
• J_i = diag(u_i) W,  with u_i = g'(Wx_i), v_i = g''(Wx_i)
• dJ_i = diag(u_i) dW + diag(v_i) diag(dW x_i) W
• dL = Σ_i tr(J_i^{-1} dJ_i) = Σ_i tr([W^{-T} + (v_i ⊘ u_i) x_i^T]^T dW)   (⊘ = elementwise division)
  ‣ so G = Σ_i [W^{-T} + (v_i ⊘ u_i) x_i^T]
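A finite-difference check of this differential, assuming g is the logistic sigmoid (so u = y (1 - y) and v ⊘ u = 1 - 2y, the Bell & Sejnowski case); variable names here are illustrative.

import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 50
X = rng.standard_normal((d, n))               # columns are the examples x_i
W = rng.standard_normal((d, d))

def L(W):
    Y = 1.0 / (1.0 + np.exp(-(W @ X)))        # y_i = g(W x_i), logistic g
    U = Y * (1 - Y)                           # u_i = g'(W x_i)
    # ln|det J_i| = sum_j ln u_ij + ln|det W|  since J_i = diag(u_i) W
    return np.sum(np.log(U)) + n * np.log(abs(np.linalg.det(W)))

def G(W):                                     # gradient, i.e. dL = tr(G^T dW)
    Y = 1.0 / (1.0 + np.exp(-(W @ X)))
    return n * np.linalg.inv(W).T + (1 - 2 * Y) @ X.T

dW = 1e-6 * rng.standard_normal((d, d))
print(L(W + dW) - L(W), np.trace(G(W).T @ dW))   # the two should agree closely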
Natural gradient
• L(W): ℝ^{d×d} → ℝ,  dL = tr(G^T dW)
• step: S = arg max_S M(S) = tr(G^T S) - ||S W^{-1}||_F^2 / 2
  ‣ scalar case: M = gs - s^2 / 2w^2
• M = tr(G^T S) - tr(S W^{-1} W^{-T} S^T) / 2
• dM = tr(G^T dS) - tr([S W^{-1} W^{-T}]^T dS) = tr([G - S W^{-1} W^{-T}]^T dS)
  ‣ dM = 0  ⇒  S = G W^T W   (the natural-gradient step)
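A quick check that this stationary point really is the maximizer of M (illustration, not from the slides): M is concave in S, so no perturbation of S = G W^T W improves it.

import numpy as np

rng = np.random.default_rng(4)
d = 3
W = rng.standard_normal((d, d))
G = rng.standard_normal((d, d))

def M(S):
    return np.trace(G.T @ S) - 0.5 * np.linalg.norm(S @ np.linalg.inv(W), 'fro') ** 2

S_star = G @ W.T @ W                          # claimed maximizer (natural-gradient step)
best = M(S_star)
for _ in range(1000):                         # random perturbations never improve M
    assert M(S_star + 0.1 * rng.standard_normal((d, d))) <= best
print("S = G W^T W maximizes M")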
ICA natural gradient
• natural-gradient step: [W^{-T} + C] W^T W = (I + C W^T) W   (no matrix inverse needed)
• start with W_0 = I
[Figure: scatter plots of Wx_i and y_i]
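Putting the pieces together, a minimal sketch of the resulting algorithm, under the same logistic-sigmoid assumption (so C = (1/n) Σ_i (1 - 2y_i) x_i^T and the step is (I + C W^T) W); the mixing setup, learning rate, and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(5)
d, n = 2, 2000
S_true = rng.laplace(size=(d, n))             # independent super-Gaussian sources
A_mix = rng.standard_normal((d, d))           # unknown mixing matrix
X = A_mix @ S_true                            # observed mixtures x_i (columns)

W = np.eye(d)                                 # start with W_0 = I
eta = 0.05
for _ in range(1000):
    Z = W @ X
    Y = 1.0 / (1.0 + np.exp(-Z))              # y_i = g(W x_i)
    step = (np.eye(d) + (1 - 2 * Y) @ Z.T / n) @ W   # (I + C W^T) W
    W = W + eta * step

print(W @ A_mix)    # ideally close to a scaled permutation: sources recovered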
ICA on natural image patches
[Figure: ICA results on natural image patches]
More info
• Minka's cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
• Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics, 2nd ed. Wiley, 1999. http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X
• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.
Newton's method
10-725 Optimization, Geoff Gordon and Ryan Tibshirani
Nonlinear equations
• x ∈ ℝ^d, f: ℝ^d → ℝ^d, differentiable
  ‣ solve: f(x) = 0
• Taylor: f(x + dx) ≈ f(x) + J(x) dx
  ‣ J: the Jacobian of f
• Newton: x ← x - J(x)^{-1} f(x)
[Figure: a 1-D function and the tangent-line step toward its root]
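A minimal sketch of the vector iteration on an illustrative 2-D system (not from the slides):

import numpy as np

def f(x):                                     # solve f(x) = 0: unit circle meets the line x0 = x1
    return np.array([x[0]**2 + x[1]**2 - 1.0,
                     x[0] - x[1]])

def J(x):                                     # Jacobian of f
    return np.array([[2*x[0], 2*x[1]],
                     [1.0,    -1.0]])

x = np.array([1.0, 0.3])                      # initial guess
for _ in range(10):
    x = x - np.linalg.solve(J(x), f(x))       # Newton step: x <- x - J(x)^{-1} f(x)
print(x)                                      # ~ (0.7071, 0.7071)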
Error analysis
• near a root x* with nonsingular Jacobian, the error roughly squares at each step: ||x_{k+1} - x*|| = O(||x_k - x*||^2)   (quadratic convergence)
dx = x*(1-x*phi)   (Newton's method for f(x) = 1/x - φ; the root is 1/φ ≈ 0.618…)
0: 0.7500000000000000
1: 0.5898558813281841
2: 0.6167492604787597
3: 0.6180313181415453
4: 0.6180339887383547
5: 0.6180339887498948
6: 0.6180339887498949
7: 0.6180339887498948
8: 0.6180339887498949
*: 0.6180339887498948
(the number of correct digits roughly doubles each iteration)
Bad initialization
 1.3000000000000000
-0.1344774409873226
-0.2982157033270080
-0.7403273854022190
-2.3674743431148597
-13.8039236412225819
-335.9214859516196157
-183256.0483360671496484
-54338444778.1145248413085938
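Both traces come from the same scalar iteration, Newton's method for f(x) = 1/x - φ; a few lines reproduce them (illustration added here):

phi = (1 + 5 ** 0.5) / 2                # golden ratio; the root is 1/phi

def newton_trace(x, steps=9):
    for k in range(steps):
        print(f"{k}: {x:.16f}")
        x = x + x * (1 - x * phi)       # dx = x*(1 - x*phi); no division needed
    return x

newton_trace(0.75)                      # converges; correct digits roughly double
newton_trace(1.3)                       # bad initialization: the iterates diverge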
Minimization
• x ∈ ℝ^d, f: ℝ^d → ℝ, twice differentiable
  ‣ find: x with f'(x) = 0   (a min, max, or saddle point)
• Newton: apply root-finding to f':  x ← x - (f''(x))^{-1} f'(x)
Descent
• Newton step: d = -(f''(x))^{-1} f'(x)
• Gradient step: -g = -f'(x)
• Taylor: df = g^T dx
• Let t > 0, set dx = t d
  ‣ df = -t g^T (f''(x))^{-1} g
• So: if f''(x) ≻ 0, the Newton direction is a descent direction (df < 0 for small t)
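A small numeric illustration of that conclusion (not from the slides): with a positive-definite Hessian, the Newton direction makes a negative inner product with the gradient, so a short step decreases f.

import numpy as np

rng = np.random.default_rng(6)
d = 4
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)             # a positive-definite "Hessian"
g = rng.standard_normal(d)              # a gradient

d_newton = -np.linalg.solve(H, g)       # Newton direction d = -H^{-1} g
print(g @ d_newton)                     # = -g^T H^{-1} g < 0, so a descent direction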
Steepest descent
• g = f'(x), H = f''(x)
• H-norm: ||d||_H = √(d^T H d)
[Figure: the point x with the normalized steepest-descent step x + Δx_nsd and the Newton step x + Δx_nt]
Newton w/ line search
• Pick x_1
• For k = 1, 2, …
  ‣ g_k = f'(x_k);  H_k = f''(x_k)                 (gradient & Hessian)
  ‣ d_k = -H_k \ g_k                               (Newton direction)
  ‣ t_k = 1                                        (backtracking line search)
  ‣ while f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 2:  t_k = β t_k   (β < 1)
  ‣ x_{k+1} = x_k + t_k d_k                        (step)
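A direct transcription of this loop into Python; the test function (a smooth, strictly convex example minimized at the origin) is my own illustration.

import numpy as np

def damped_newton(f, grad, hess, x1, beta=0.5, iters=20):
    x = x1
    for _ in range(iters):
        g, H = grad(x), hess(x)               # gradient & Hessian
        d = -np.linalg.solve(H, g)            # Newton direction (H \ g)
        t = 1.0                               # backtracking line search
        while f(x + t * d) > f(x) + t * (g @ d) / 2:
            t = beta * t
        x = x + t * d                         # step
    return x

f    = lambda x: np.sum(np.cosh(x))           # strictly convex, minimized at x = 0
grad = lambda x: np.sinh(x)
hess = lambda x: np.diag(np.cosh(x))
print(damped_newton(f, grad, hess, np.array([3.0, -1.0])))   # ~ [0, 0]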
Properties of damped Newton
• Affine invariant: suppose g(x) = f(Ax + b), A invertible
  ‣ x_1, x_2, … from Newton on g(·)
  ‣ y_1, y_2, … from Newton on f(·)
  ‣ if y_1 = Ax_1 + b, then y_k = Ax_k + b for all k
• Convergent:
  ‣ if f is bounded below, f(x_k) converges
  ‣ if f is strictly convex with bounded level sets, x_k converges
  ‣ typically a quadratic rate in a neighborhood of x*
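A numerical check of the affine-invariance claim, reusing damped_newton, f, grad, and hess from the sketch above (the map A, b here is illustrative):

import numpy as np

A = np.array([[2.0, 1.0], [0.5, 3.0]])        # invertible affine map y = A x + b
b = np.array([0.3, -0.7])

g_f    = lambda x: f(A @ x + b)               # g(x) = f(Ax + b)
g_grad = lambda x: A.T @ grad(A @ x + b)
g_hess = lambda x: A.T @ hess(A @ x + b) @ A

y1 = np.array([3.0, -1.0])                    # starting point for Newton on f
x1 = np.linalg.solve(A, y1 - b)               # matching start for Newton on g
for k in range(1, 6):                         # compare iterates x_k and y_k
    xk = damped_newton(g_f, g_grad, g_hess, x1, iters=k)
    yk = damped_newton(f, grad, hess, y1, iters=k)
    print(np.linalg.norm(A @ xk + b - yk))    # ~0: y_k = A x_k + b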