
Lecture 17: Swarm Intelligence 3 / Classical Optimization I



  1. 15-382 COLLECTIVE INTELLIGENCE - S18. LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I. INSTRUCTOR: GIANNI A. DI CARO

  2. PSO: SWARM COOPERATION + MEMORY + INERTIA. Decision-making / search strategy: P for an individual particle / agent (individual); P for the swarm / neighborhood
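To make the decision-making strategy concrete, here is a minimal sketch of the standard PSO velocity/position update (inertia term w·v, memory term pulling toward the particle's personal best, cooperation term pulling toward the swarm/neighborhood best). The coefficient values, the sphere test function, and the swarm size are illustrative assumptions, not taken from the slides.

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One standard PSO update for a single particle (x, v, pbest, gbest are lists).

    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)   (inertia + memory + cooperation)
    x <- x + v
    """
    r1, r2 = random.random(), random.random()
    v = [w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
         for vi, xi, pb, gb in zip(v, x, pbest, gbest)]
    x = [xi + vi for xi, vi in zip(x, v)]
    return x, v

def sphere(x):
    """Illustrative objective to minimize: f(x) = sum(x_i^2), optimum at the origin."""
    return sum(xi * xi for xi in x)

random.seed(42)
dim, n_particles = 2, 20
X = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
V = [[0.0] * dim for _ in range(n_particles)]
P = [x[:] for x in X]               # personal best positions
g = min(P, key=sphere)              # swarm (global) best position

for _ in range(100):
    for i in range(n_particles):
        X[i], V[i] = pso_step(X[i], V[i], P[i], g)
        if sphere(X[i]) < sphere(P[i]):
            P[i] = X[i][:]
    g = min(P, key=sphere)

print("best found:", [round(v, 4) for v in g], "f =", round(sphere(g), 8))
```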

  3. A TYPICAL SWARM-LEVEL BEHAVIOR…

  4. WHAT IF WE HAVE ONE SINGLE AGENT… • PSO leverages the presence of a swarm: the outcome is truly a collective behavior • If left alone, each individual agent would behave like a hill-climber: it moves in the direction of a local optimum and then has quite a hard time escaping it • A single agent doesn’t look impressive 🤕 … How can a single agent be smarter?

  5. A (FIRST) GENERAL APPROACH • Iteratively move from the current point, yₖ₊₁ = yₖ + βₖ p, for some direction p and step size βₖ • What is a good direction p? • How small / large / constant / variable should the step size βₖ be? • How do we check that we are at the minimum? • This could work for a local optimum, but what about finding the global optimum?

  6. IF WE HAVE THE FUNCTION (AND ITS DERIVATIVES) • From the 1st-order Taylor series: g(y) ≈ g(y₀) + ∇g(y₀)ᵀ(y − y₀), the equation of the tangent plane to (the graph of) g at y₀; the vector (∇g(y₀), −1) is normal to this tangent plane • The partial derivatives determine the slope of the plane • At each point y, the gradient vector is orthogonal to the isocontours {y : g(y) = d} • ⇒ The gradient vector points in the direction of maximal change of the function at the point. The magnitude of the gradient is the (max) rate of change of the function

  7. GRADIENTS AND RATE OF CHANGE • The directional derivative of a function g: ℝⁿ ⟼ ℝ is the rate of change of the function along a given direction w (where w is a unit-norm vector): D_w g(y) = lim_{h→0} [g(y + hw) − g(y)] / h • E.g., for a function of two variables and a direction w = (w_x, w_y): D_w g(x, y) = lim_{h→0} [g(x + h w_x, y + h w_y) − g(x, y)] / h • Partial derivatives are directional derivatives along the vectors of the canonical basis of ℝⁿ
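As a quick numeric sanity check of the definition, the sketch below compares a finite-difference estimate of the directional derivative with the dot product of the analytic gradient and the direction (the relation stated as the theorem on the next slide). The function g(x, y) = x² + 3xy + 2y² and the chosen point/direction are assumptions for the example.

```python
import math

def g(x, y):
    """Illustrative function g: R^2 -> R."""
    return x**2 + 3.0 * x * y + 2.0 * y**2

def grad_g(x, y):
    """Analytic gradient of g."""
    return (2.0 * x + 3.0 * y, 3.0 * x + 4.0 * y)

def directional_derivative(f, x, y, w, h=1e-6):
    """Central finite-difference estimate of the rate of change of f along unit vector w."""
    wx, wy = w
    return (f(x + h * wx, y + h * wy) - f(x - h * wx, y - h * wy)) / (2.0 * h)

w = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))   # unit direction at 45 degrees
x0, y0 = 1.0, -0.5

numeric = directional_derivative(g, x0, y0, w)
gx, gy = grad_g(x0, y0)
analytic = gx * w[0] + gy * w[1]                    # gradient . w (next slide's theorem)

print(f"finite-difference D_w g = {numeric:.6f}")
print(f"gradient . w            = {analytic:.6f}")
```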

  8. GRADIENTS AND RATE OF CHANGE • Theorem: given a direction w (unit norm) and a differentiable function g, at each point y₀ of the function domain the following relation holds between the gradient vector and the directional derivative: D_w g(y₀) = ∇g(y₀) · w = ‖∇g(y₀)‖ cos θ, where θ is the angle between ∇g(y₀) and w • Corollary 1: |D_w g(y₀)| ≤ ‖∇g(y₀)‖, with equality only when the gradient is parallel to the direction w • Corollary 2: the directional derivative at a point attains its maximum value when the direction w is the same as the gradient vector (θ = 0) • ⇒ At each point, the gradient vector corresponds to the direction of maximal change of the function • ⇒ The norm of the gradient corresponds to the max rate of change at the point

  9. GRADIENT DESCENT / ASCENT • Move in the direction opposite to (minimization), or aligned with (maximization), the gradient vector, i.e. the direction of maximal change of the function
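A minimal sketch of the descent rule with a fixed step size; the quadratic objective f(x, y) = (x − 1)² + 2(y + 0.5)² and the step value are illustrative assumptions. Flipping the sign of the update turns descent into ascent.

```python
def grad_f(x, y):
    """Analytic gradient of the illustrative f(x, y) = (x - 1)^2 + 2*(y + 0.5)^2."""
    return 2.0 * (x - 1.0), 4.0 * (y + 0.5)

def gradient_descent(x, y, step=0.1, iters=200):
    """Fixed-step descent: repeatedly move against the gradient."""
    for _ in range(iters):
        gx, gy = grad_f(x, y)
        x -= step * gx      # use '+=' here instead to perform gradient ascent
        y -= step * gy
    return x, y

print(gradient_descent(5.0, 5.0))   # converges near the minimizer (1, -0.5)
```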

  10. ONLY LOCAL OPTIMA • The final local optimum depends on where we start from (for non-convex functions)

  11. ~ MOTION IN A POTENTIAL FIELD • A GD run is analogous to the motion of a mass in a potential field towards the minimum-energy configuration: at each point the gradient defines the attraction force, while the step size α scales the force to define the next point

  12. GRADIENT DESCENT ALGORITHM (MIN)
     1. Initialization
        (a) Definition of a starting point x₀
        (b) Definition of a tolerance parameter for convergence ε
        (c) Initialization of the iteration variable, k ← 0
     2. Computation of a feasible direction for moving
        • dₖ ← −∇f(xₖ)
     3. Definition of the feasible (max) length of the move
        • αₖ ← argmin_α f(xₖ + α dₖ) (a 1-dimensional problem in α ∈ ℝ: the farthest point up to which f(xₖ + α dₖ) keeps decreasing)
     4. Move to the new point in the direction of gradient descent
        • xₖ₊₁ ← xₖ + αₖ dₖ
     5. Check for convergence
        • If ‖αₖ dₖ‖ < ε [or if αₖ ≤ c·α₀, where c > 0 is a small constant] (i.e., the gradient has become ≈ 0)
          (a) Output: x* = xₖ₊₁, f(x*) = f(xₖ₊₁)
          (b) Stop
        • Otherwise, k ← k + 1, and go to Step 2
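Below is a runnable sketch of the algorithm, assuming an illustrative anisotropic quadratic objective and a simple golden-section search as the 1-D minimizer for Step 3 (the slides do not fix a particular 1-D method; Fibonacci, Newton, secant, etc. are mentioned later as alternatives).

```python
import math

def f(x):
    """Illustrative anisotropic quadratic with minimum at (1, -2)."""
    return (x[0] - 1.0)**2 + 5.0 * (x[1] + 2.0)**2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 10.0 * (x[1] + 2.0)]

def line_search(f, x, d, alpha_max=10.0, tol=1e-8):
    """Golden-section search for the alpha minimizing f(x + alpha*d) on [0, alpha_max]."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 0.0, alpha_max
    while b - a > tol:
        c1 = b - phi * (b - a)
        c2 = a + phi * (b - a)
        if f([xi + c1 * di for xi, di in zip(x, d)]) < f([xi + c2 * di for xi, di in zip(x, d)]):
            b = c2
        else:
            a = c1
    return (a + b) / 2.0

def gradient_descent(x0, eps=1e-6, max_iter=1000):
    x, k = list(x0), 0
    while k < max_iter:
        d = [-gi for gi in grad_f(x)]                        # Step 2: steepest-descent direction
        alpha = line_search(f, x, d)                         # Step 3: 1-D minimization
        x_new = [xi + alpha * di for xi, di in zip(x, d)]    # Step 4: move
        if math.dist(x_new, x) < eps:                        # Step 5: ||alpha_k d_k|| < eps
            return x_new, f(x_new), k
        x, k = x_new, k + 1
    return x, f(x), k

x_star, f_star, iters = gradient_descent([5.0, 5.0])
print("x* =", [round(v, 6) for v in x_star], " f(x*) =", round(f_star, 10), " iterations:", iters)
```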

  13. STEP BEHAVIOR • If the gradient at y₀ points toward the local minimum, then a line search determines a step size βₖ that takes us directly to the minimum • However, this lucky case only happens for perfectly conditioned functions, or for a restricted set of points • Solving a 1-D optimization problem at each step can be computationally heavy, so approximate methods may be preferred • In the general case, consecutive moving directions dₖ are perpendicular to each other: β* minimizes g(yₖ + β dₖ), so dg/dβ must be zero at β*; but dg/dβ = ∇g(yₖ₊₁)·dₖ, hence the gradient at the new point (the next move direction) is orthogonal to dₖ

  14. ILL-CONDITIONED PROBLEMS (figure: ill-conditioned vs. well-conditioned problem) • If the function is very anisotropic, the problem is said to be ill-conditioned, since the gradient vector doesn’t point in the direction of the local minimum, resulting in a zig-zagging trajectory • Ill-conditioning can be quantified by the ratio between the largest and smallest eigenvalues of the Hessian matrix (the matrix of the second partial derivatives)
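As an illustration of that last point, the sketch below computes the eigenvalues of the (constant) Hessian of two assumed quadratics, one isotropic and one strongly anisotropic, and reports their ratio (the condition number); the example functions are not from the slides.

```python
import math

def hessian_eigs_2x2(a, b, c):
    """Eigenvalues of the symmetric 2x2 Hessian [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    delta = math.sqrt(((a - c) / 2.0)**2 + b * b)
    return mean - delta, mean + delta

# f(x, y) = x^2 + y^2       -> Hessian [[2, 0], [0, 2]]   (isotropic, well conditioned)
# f(x, y) = x^2 + 100*y^2   -> Hessian [[2, 0], [0, 200]] (anisotropic, ill conditioned)
examples = [("x^2 + y^2", (2.0, 0.0, 2.0)),
            ("x^2 + 100*y^2", (2.0, 0.0, 200.0))]

for name, (a, b, c) in examples:
    lo, hi = hessian_eigs_2x2(a, b, c)
    print(f"{name:14s}  eigenvalues = ({lo:5.1f}, {hi:5.1f})  condition number = {hi / lo:5.1f}")
```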

  15. WHAT ABOUT A CONSTANT STEP SIZE? (figure: α too large → divergence; small, good α → convergence)

  16. WHAT ABOUT A CONSTANT STEP SIZE? (figure: α too large → divergence; α too low → slow convergence) • Adapting the step size may be necessary to avoid either too slow progress or overshooting the target minimum
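A small numeric illustration of this trade-off, assuming the 1-D objective f(x) = x² (so ∇f = 2x), which is not from the slides: with a constant step α the iteration is x ← (1 − 2α)x, so a tiny α converges slowly, a moderate α converges fast, and α > 1 diverges.

```python
def run_constant_step(alpha, x0=1.0, iters=20):
    """Gradient descent on f(x) = x^2 (grad f = 2x) with a constant step size alpha."""
    x = x0
    for _ in range(iters):
        x = x - alpha * 2.0 * x       # x <- (1 - 2*alpha) * x
    return x

for alpha in (0.01, 0.4, 1.1):
    print(f"alpha = {alpha:4.2f}  ->  |x| after 20 steps = {abs(run_constant_step(alpha)):.3e}")
# alpha = 0.01: slow convergence, alpha = 0.40: fast convergence, alpha = 1.10: divergence
```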

  17. EXAMPLE: SUPERVISED MACHINE LEARNING • Loss function, sum of squared errors over the n labeled training samples (examples): L(θ) = ½ Σⱼ₌₁ⁿ (y⁽ʲ⁾·θ − z⁽ʲ⁾)² • z⁽ʲ⁾ = known correct value for sample j • y⁽ʲ⁾·θ = linear hypothesis function, θ vector of parameters • Goal: find the value of the parameter vector θ such that the loss (errors in classification / regression) is minimized over the training set • Any analogy with PSO? • Gradient: if the averaging factor 1/2n is used, the gradient-descent update action becomes: θ ← θ − (α/n) Σⱼ₌₁ⁿ (y⁽ʲ⁾·θ − z⁽ʲ⁾) y⁽ʲ⁾
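A minimal sketch of full-batch gradient descent with the update rule above, written in the slide's notation (feature vectors y⁽ʲ⁾, targets z⁽ʲ⁾, parameter vector θ); the synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import random

def predict(theta, y):
    """Linear hypothesis: the dot product y . theta."""
    return sum(ti * yi for ti, yi in zip(theta, y))

def batch_gd_step(theta, Y, Z, alpha):
    """theta <- theta - (alpha/n) * sum_j (y_j . theta - z_j) * y_j."""
    n = len(Y)
    grad = [0.0] * len(theta)
    for y, z in zip(Y, Z):
        err = predict(theta, y) - z
        for i, yi in enumerate(y):
            grad[i] += err * yi / n
    return [ti - alpha * gi for ti, gi in zip(theta, grad)]

# Synthetic data: z = 0.5 + 2*y1 - 3*y2 + small noise; the constant feature y0 = 1 acts as a bias.
random.seed(0)
Y = [[1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(100)]
Z = [0.5 * y[0] + 2.0 * y[1] - 3.0 * y[2] + random.gauss(0, 0.01) for y in Y]

theta = [0.0, 0.0, 0.0]
for _ in range(2000):
    theta = batch_gd_step(theta, Y, Z, alpha=0.2)
print("estimated theta:", [round(t, 3) for t in theta])   # close to [0.5, 2.0, -3.0]
```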

  18. STOCHASTIC GRADIENT DESCENT
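A hedged sketch of the stochastic variant: instead of averaging the gradient over all n samples, each update uses the gradient of a single randomly picked example. The data, learning rate, and epoch count are again illustrative assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_step(theta, y, z, alpha=0.05):
    """theta <- theta - alpha * (y . theta - z) * y, using one sample (y, z) at a time."""
    err = dot(theta, y) - z
    return [ti - alpha * err * yi for ti, yi in zip(theta, y)]

# Same kind of synthetic linear-regression data as in the previous sketch.
random.seed(1)
Y = [[1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(100)]
Z = [0.5 * y[0] + 2.0 * y[1] - 3.0 * y[2] + random.gauss(0, 0.01) for y in Y]

theta = [0.0, 0.0, 0.0]
for epoch in range(100):
    for j in random.sample(range(len(Y)), len(Y)):   # one shuffled pass over the data per epoch
        theta = sgd_step(theta, Y[j], Z[j])
print("SGD estimate:", [round(t, 3) for t in theta])  # close to [0.5, 2.0, -3.0]
```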

  19. ADAPTIVE STEP SIZE? • The problem min_α f(xₖ + α dₖ) can be tackled using any algorithm for one-dimensional search: Fibonacci, golden section, Newton's method, secant method, ... • If f ∈ C², it is possible to compute the Hessian Hₖ, and αₖ can be determined analytically (writing ∇ₖ = ∇f(xₖ)):
     – Let yₖ = α dₖ be the (feasible) small displacement applied at xₖ, such that from the Taylor series about xₖ, truncated at the 2nd order (quadratic approximation): f(xₖ + yₖ) ≈ f(xₖ) + yₖᵀ∇ₖ + ½ yₖᵀ Hₖ yₖ
     – In the algorithm, dₖ is the direction of steepest descent, yₖ = −α∇ₖ, such that: f(xₖ − α∇ₖ) ≈ f(xₖ) − α ∇ₖᵀ∇ₖ + ½ α² ∇ₖᵀ Hₖ ∇ₖ
     – We are looking for the α minimizing f ⇒ first-order conditions must be satisfied, df/dα = 0: df/dα ≈ −∇ₖᵀ∇ₖ + α ∇ₖᵀ Hₖ ∇ₖ = 0 ⇒ α = αₖ ≈ (∇ₖᵀ∇ₖ) / (∇ₖᵀ Hₖ ∇ₖ)
     – The updating rule of the gradient descent algorithm becomes: xₖ₊₁ ← xₖ − [(∇ₖᵀ∇ₖ) / (∇ₖᵀ Hₖ ∇ₖ)] ∇ₖ
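A sketch of this update rule on an assumed quadratic objective, where the 2nd-order Taylor model is exact and αₖ is therefore the exact line-search step; the specific function and Hessian are illustrative, not from the slides.

```python
def grad(x):
    """Gradient of the illustrative quadratic f(x, y) = x^2 + 5*y^2 (minimum at the origin)."""
    return [2.0 * x[0], 10.0 * x[1]]

H = [[2.0, 0.0], [0.0, 10.0]]          # constant Hessian of the quadratic

def quadratic_form(H, v):
    """v^T H v for a small dense matrix."""
    n = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(n) for j in range(n))

x = [4.0, 1.0]
for k in range(20):
    g = grad(x)
    gg = sum(gi * gi for gi in g)                 # grad_k^T grad_k
    gHg = quadratic_form(H, g)                    # grad_k^T H_k grad_k
    if gg < 1e-18:                                # gradient (numerically) zero: stop
        break
    alpha = gg / gHg                              # analytic step from the 2nd-order model
    x = [xi - alpha * gi for xi, gi in zip(x, g)]
print("final point:", [round(v, 6) for v in x])   # converges to the minimizer (0, 0)
```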

  20. CONDITIONS ON THE HESSIAN MATRIX • Update rule: xₖ₊₁ ← xₖ − [(∇ₖᵀ∇ₖ) / (∇ₖᵀ Hₖ ∇ₖ)] ∇ₖ • The value found for αₖ is a sound estimate of the "correct" value of α as long as the quadratic approximation of the Taylor series is an accurate approximation of f(x) • At the beginning of the iterations ‖yₖ‖ will be quite large, so the approximation will be inaccurate; getting closer to the minimum, ‖yₖ‖ will decrease accordingly, and the accuracy will keep increasing • If f is a quadratic function, the Taylor series is exact and we can use = instead of ≈ ⇒ αₖ is exact at each iteration • The Hessian matrix is the matrix of 2nd-order partial derivatives. E.g., for f: X ⊆ ℝ³ ⟼ ℝ, the Hessian matrix computed at a point x*:
     H|ₓ* = [ ∂²f/∂x₁²     ∂²f/∂x₁∂x₂   ∂²f/∂x₁∂x₃
              ∂²f/∂x₂∂x₁   ∂²f/∂x₂²     ∂²f/∂x₂∂x₃
              ∂²f/∂x₃∂x₁   ∂²f/∂x₃∂x₂   ∂²f/∂x₃²   ]   (all derivatives evaluated at x*)
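As an illustration, the sketch below approximates such a 3×3 Hessian with central finite differences for an assumed test function f(x₁, x₂, x₃) = x₁²x₂ + x₂x₃² + x₁x₃ (not from the slides); the result comes out numerically symmetric, as the theorem on the next slide states.

```python
def f(x):
    """Assumed test function of three variables."""
    return x[0]**2 * x[1] + x[1] * x[2]**2 + x[0] * x[2]

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at the point x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp = list(x); xpp[i] += h; xpp[j] += h
            xpm = list(x); xpm[i] += h; xpm[j] -= h
            xmp = list(x); xmp[i] -= h; xmp[j] += h
            xmm = list(x); xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4.0 * h * h)
    return H

x_star = [1.0, 2.0, -1.0]
for row in hessian_fd(f, x_star):
    print([round(v, 3) for v in row])
# Analytic Hessian at (1, 2, -1): [[4, 2, 1], [2, 0, -2], [1, -2, 4]]  (symmetric)
```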

  21. PROPERTIES OF THE HESSIAN MATRIX • Theorem (Schwarz): if a function is C² (twice differentiable with continuity) in ℝⁿ ⇒ the order of derivation is irrelevant ⇒ the Hessian matrix is symmetric • Theorem (quadratic form of a matrix): given a square (n×n), symmetric matrix H, the associated quadratic form is defined as the function q(x) = ½ xᵀHx. The matrix is said:
     – Positive definite (convex) if xᵀHx > 0, ∀x ∈ ℝⁿ, x ≠ 0
     – Positive semi-definite if xᵀHx ≥ 0, ∀x ∈ ℝⁿ
     – Negative definite (concave) if xᵀHx < 0, ∀x ∈ ℝⁿ, x ≠ 0
     – Negative semi-definite if xᵀHx ≤ 0, ∀x ∈ ℝⁿ
     – Indefinite (saddle) if xᵀHx > 0 for some x and xᵀHx < 0 for other x
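A small sketch of how definiteness can be checked in practice for a symmetric 2×2 Hessian via the signs of its eigenvalues (all positive → positive definite / convex; all negative → negative definite / concave; mixed signs → indefinite / saddle); the example matrices are illustrative assumptions.

```python
import math

def eigs_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    delta = math.sqrt(((a - c) / 2.0)**2 + b * b)
    return mean - delta, mean + delta

def classify(a, b, c, tol=1e-12):
    """Classify the quadratic form of [[a, b], [b, c]] from the signs of its eigenvalues."""
    lo, hi = eigs_sym_2x2(a, b, c)
    if lo > tol:
        return "positive definite (convex)"
    if hi < -tol:
        return "negative definite (concave)"
    if lo < -tol and hi > tol:
        return "indefinite (saddle)"
    return "positive semi-definite" if hi > tol else "negative semi-definite"

for H in [(2.0, 0.0, 3.0), (-1.0, 0.0, -4.0), (1.0, 0.0, -1.0), (1.0, 1.0, 1.0)]:
    print(H, "->", classify(*H))
```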
