

  1. Probability and Statistics for Computer Science. "All models are wrong, but some models are useful." (George Box; credit: Wikipedia). Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 04.30.2020

  2. Contents
     - Markov chain
     - Motivation
     - Definition of Markov model
     - Graph representation: Markov chain
     - Transition probability matrix
     - The stationary Markov chain
     - The PageRank algorithm

  3. Project review: Why do we have the exercises in Part I.2? What are expected for each exercise? What do the notations mean?

  4. CS 361 SP 2020 project: (1) Stochastic First Order Optimization (65 pts). Stochastic first order approximation: h(x), x* = ? What is this task? What do we have to do? What does this have to do with optimization?

  5. ⇒ Root finding: h(x) = 0, e.g. ∇f(x*) = 0 ⇒ optimization. In the context of optimization, the parameters are x, e.g. the separating hyperplane of a classifier, ∑_i c_i x_i - b = 0.

  6. ⇒ Suppose we don't know h(x), but we know g(x) = h(x) + Z, where Z is random noise independent of x, and E[g(x)] = h(x). Is h(x) random? What is E[Z]? E[g(x)] = E[h(x) + Z] = h(x) + E[Z] = h(x), so E[Z] = 0.
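Slide 6's claim, that independent zero-mean noise leaves the expectation unchanged (E[g(x)] = h(x)), is easy to check numerically. A minimal sketch, where the quadratic h (borrowed from the handout's polynomial example) and the Gaussian noise are illustrative choices, not part of the assignment:

```python
import random

def h(x):
    # an illustrative deterministic function, h(x) = (x - 5)(x + 3)
    return (x - 5) * (x + 3)

def g(x):
    # noisy observation g(x) = h(x) + z, with z independent of x and E[z] = 0
    return h(x) + random.gauss(0.0, 1.0)

random.seed(0)
x = 2.0
n = 100_000
estimate = sum(g(x) for _ in range(n)) / n
print(estimate)  # sample mean of g(x); should be close to h(2) = -15
```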

  7. CS 361: Probability and Statistics for Computer Science (Spring 2020). Stochastic First Order Optimization.

1 Stochastic Approximation

Root-finding is simply the process of finding where h(x) = 0. For simple polynomials (e.g. h(x) = (x - 5)(x + 3)), this is very easy. However, this is not easy for all functions. For instance, say we want to optimize a machine learning algorithm. We can define f(x) to be the error function for an algorithm, which we want as small as possible. In order to do so, we would need to find the root of the derivative of the error function (i.e., h(x) = f'(x)), since this is where the minimum of the error function might be.

Additionally, we may have to worry about noise. Say we want to find where h(x) = 0, but finding the true value of h(x) at some x is extremely expensive or impossible. On the bright side, we have access to a "noisy" version of h that we call g(x). In other words, g(x) = h(x) + z. You cannot control the additive noise z or predict it, but you can assume that it is independent of x, and E[z] = 0. Stochastic approximation (SA) is the process of root-finding on a noisy function g(x).

1.1 Stochastic Approximation in a simple setting

For stochastic approximation to be effective, we need a sequence of positive learning rates that we denote as {η_n}_{n ≥ 1}. In the following exercises, we will perform stochastic approximation on h(x), having access only to a noisy version y = h(x) + z. In order to find a good sequence of learning rates, we need to make the following assumptions:

1. The function h has a unique root x* (i.e., h(x*) = 0 for a unique x*). This unique root is a positive zero crossing of h. In other words:
   h(x) > 0 for x > x*,  h(x) < 0 for x < x*,  h(x*) = 0.

2. y has a finite upper and lower bound. In other words, we have bounded noise:
   P(|y| < c) = 1 where c ≥ 0   (1.1)

3. The noise is independent of x:
   for all x: E[y] = h(x), P(z | x) = P(z)   (1.2)

4. The learning rates η_n do not approach 0 too quickly or too slowly. More formally:
   ∑_{n=1}^∞ η_n = ∞,  ∑_{n=1}^∞ η_n² = c   (1.3)
   for some positive c. In other words, the sum of the learning rates is unbounded, but the sum of their squares is bounded.

Exercise 1. (4 points) Propose a family of learning rates that satisfies assumption 4 (a formal proof is not needed). Hint: try providing a range of values for α in n^α that would satisfy the constraints.

Now that we have a sequence of learning rates, we can move on to stochastic approximation. The algorithm is defined as follows, where X_n is the n-th approximation of x*:

- Let X_1 be some initial value or guess; then update X_{n+1} = X_n - η_n Y_n, where η_n is the learning rate.
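The iteration just described, X_{n+1} = X_n - η_n Y_n, can be sketched in a few lines. Everything specific below (the linear h, the uniform bounded noise, the starting guess) is an illustrative assumption; η_n = 1/n is one family of rates satisfying assumption 4:

```python
import random

def h(x):
    # illustrative target with a unique positive zero-crossing at x* = 2
    return x - 2.0

def noisy_h(x):
    # Y = h(x) + z: bounded, zero-mean noise independent of x (assumptions 2 and 3)
    return h(x) + random.uniform(-1.0, 1.0)

random.seed(1)
x = 10.0                       # X_1: an arbitrary initial guess
for n in range(1, 20_001):
    eta = 1.0 / n              # sum(eta_n) diverges, sum(eta_n^2) converges (assumption 4)
    x = x - eta * noisy_h(x)   # X_{n+1} = X_n - eta_n * Y_n
print(x)  # should end up close to the root x* = 2
```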

  8. X_n is the n-th approximation of x*: X_{n+1} = X_n - η_n Y_n. Will X_n → x* as n → ∞? Will this happen stochastically, i.e., will lim_{n→∞} E[(X_n - x*)²] = 0?

  9. 1.2 Stochastic approximation convergence: the statement and the steps of the proof.
     - There are steps which are elaborated; some are too complicated for the project.
     - We selected some of them for you to use as part of the exercises.
     - Some of the intermediate results are provided.
     - We'd like you to learn about conditional expectation!

  10. Stochastic First Order Optimization (handout continued). [Handwritten note: Y_n is one sequence of random variables; X_n is another.]

- For n = 1, 2, ... perform the following iteration:
  X_{n+1} = X_n - η_n Y_n   (1.4)
  where Y_n = h(X_n) + Z_n (i.e., the noisy version of h(X_n)), just as mentioned previously.

1.2 Convergence proof of SA

Now that SA is defined, we want to show that it actually works. To do so, we can define an expression for the error, and show that this expression converges to 0. Define the Mean Squared Error at step n as follows:
  e_n² = E[(X_n - x*)²]   (1.5)

Exercise 2.
1. (4 points) Prove the following relationship for any two continuous random variables u, v:
  E_u[f(u)] = E_v[ E_{u|v}[f(u) | v] ]   (1.6)
Do not assume any kind of independence. We can summarize this relation as E[A] = E[E[A | B]]. Hint: it requires the notion of conditional expectation (E_{u|v}[f(u) | v]). Here is a resource to learn about conditional expectation; you are free to find and use others.

Ultimately, we want to show that the mean squared error will converge to 0 as the number of steps approaches ∞. To do so, we'll need the following relationships [1]:
  e_{n+1}² = e_n² - 2 η_n ρ_n + η_n² E[Y_n²]   (1.7)
where ρ_n = E[(X_n - x*) h(X_n)], and Y_n is still the noisy version of h(X_n). This shows us the relationship between two subsequent iterations of the mean squared error.
  e_{n+1}² = e_1² - 2 ∑_{i=1}^n η_i ρ_i + ∑_{i=1}^n η_i² E[Y_i²]   (1.8)

2. (3 points) Knowing that the noise is bounded (1.1), show that E[Y_n²] is also bounded.
3. (3 points) Given that E[Y_n²] is bounded, show that ∑_{i=1}^n η_i² E[Y_i²] is bounded. Hint: use (1.3).
4. (2 points) Let b_n := |X_1| + c ∑_{i=1}^n η_i and d_n = min_{|x| < b_n} h(x) / (x - x*). For this problem, the actual values of d_n and b_n are unimportant; we can show that ∑_{i=1}^∞ η_i d_i = ∞ and ∑_{i=1}^∞ η_i d_i e_i² < ∞. Using these two facts, and assuming e_n² converges, finalize the proof by proving the following [2]:
  lim_{n→∞} e_n² = 0

2 Stochastic First Order Optimization

2.1 Review
The goal of optimization is to find the x* that minimizes f(x). However, f is again either unknown or very expensive to collect, but we have access to the noisy version g(x):
  E[g(x)] = f(x)   (2.1)
We also assume that we have access to the gradient of g, which is also noisy:
  E[∇g(x)] = ∇f(x)   (2.2)

[1] Extra Credit Ex. 1 asks you to prove the statements we provided without proof in Exercise 2 and may help increase your mathematical understanding of error bounding.
[2] Before continuing, you may consider attempting Extra Credit Ex. 2, 3, and 4. These exercises ask you to analyze some properties of SA and the order of convergence under SA settings. Again, these are not required.
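The convergence statement lim_{n→∞} e_n² = 0 can be observed empirically by averaging (X_n - x*)² over many independent SA runs. A sketch under illustrative assumptions (h(x) = x with root x* = 0, uniform bounded noise, η_n = 1/n), none of which come from the handout:

```python
import random

def run_sa(steps, x0=5.0):
    # one SA trajectory on h(x) = x (root x* = 0) with eta_n = 1/n
    x = x0
    for n in range(1, steps + 1):
        y = x + random.uniform(-1.0, 1.0)  # Y_n = h(X_n) + Z_n
        x = x - (1.0 / n) * y              # X_{n+1} = X_n - eta_n * Y_n
    return x

random.seed(2)
trials = 2000
mses = {}
for steps in (10, 100, 1000):
    # Monte Carlo estimate of e_n^2 = E[(X_n - x*)^2]
    mses[steps] = sum(run_sa(steps) ** 2 for _ in range(trials)) / trials
print(mses)  # the estimated MSE should shrink as the number of steps grows
```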

  11. Toward the contradiction: suppose lim_{n→∞} e_n² = c > 0. Then for every ε > 0 there is an N such that for n > N, |e_n² - c| < ε, so e_n² > c - ε. Taking ε = c/2 gives e_n² > c/2 for all n > N.

  12. Since ρ_i ≥ d_i e_i², if e_i² > c/2 for all i > N, then ∑_i η_i d_i e_i² ≥ (c/2) ∑_i η_i d_i = ∞, contradicting ∑_{i=1}^∞ η_i d_i e_i² < ∞. So the limit c must be 0.

  13. Relating e_{n+1}² to e_n²:
      e_n² = E[(X_n - x*)²]
      e_{n+1}² = E[(X_{n+1} - x*)²] = E[(X_n - η_n Y_n - x*)²] = E( E[(X_n - η_n Y_n - x*)² | X_n] )

  14. Conditional expectation for discrete RVs. We have seen this: E[X] = ∑_x x p(X = x). Similarly, E[X | Y = y] = ∑_x x p(X = x | Y = y).

  15. The mean of E[X | Y]: the law of iterated expectations. Let g(y) = E[X | Y = y] = ∑_x x p(x | y). Then
      E[E[X | Y]] = E[g(Y)] = ∑_y g(y) p(y) = ∑_y ∑_x x p(x | y) p(y) = ∑_y ∑_x x p(x, y) = ∑_x x p(x) = E[X],
      using p(x | y) = p(x, y) / p(y).
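The discrete law of iterated expectations on this slide can be verified on a tiny joint pmf; the numbers below are made up purely for illustration:

```python
# a made-up joint pmf p(x, y) for X, Y taking values in {0, 1}
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# marginal p(y) and the direct expectation E[X]
p_y = {y: sum(v for (x, yy), v in p.items() if yy == y) for y in (0, 1)}
e_x = sum(x * v for (x, y), v in p.items())

# g(y) = E[X | Y = y] = sum_x x * p(x | y), with p(x | y) = p(x, y) / p(y)
g = {y: sum(x * p[(x, y)] for x in (0, 1)) / p_y[y] for y in (0, 1)}

# E[g(Y)] = sum_y g(y) p(y); the law says this equals E[X]
e_g = sum(g[y] * p_y[y] for y in (0, 1))
print(e_x, e_g)  # both should be 0.7
```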

  16. What about E[X | Y] for continuous random variables?
      E[X | Y = y] = ∫ x p(x | y) dx, where p(x | y) is a conditional density.
      E[f(X) | Y = y] = ∫ f(x) p(x | y) dx.
      For Ex. 2, when X and Y are continuous RVs, p(x | y) = p(x, y) / p(y) and we use E[E[f(X) | Y]].

  17. Random stick-breaking example. Take a stick of length 1 and break it at a uniformly chosen point Y; then break what is left at a uniformly chosen point X ≤ Y. What is f_Y(y)? E[Y] = 1/2.
      E[X | Y = y] = ∫_0^y x (1/y) dx = y/2, so E[X | Y] = Y/2.
      E[X] = E[E[X | Y]] = E[Y/2] = 1/4.
      Does it matter whether we keep the left piece or the right piece?
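The stick-breaking answer E[X] = 1/4 can be checked by simulation; the sketch below simply samples the two uniform breaks described on the slide:

```python
import random

random.seed(3)
n = 200_000
total = 0.0
for _ in range(n):
    y = random.random()         # first break: Y ~ Uniform(0, 1)
    x = random.uniform(0.0, y)  # second break on what is left: X | Y=y ~ Uniform(0, y)
    total += x
mean_x = total / n
print(mean_x)  # should be close to E[X] = E[E[X | Y]] = E[Y / 2] = 1/4
```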

  18. CS 361 SP 2020 project: (1) Stochastic First Order Optimization (65 pts).
      - Stochastic first order approximation
      - Stochastic first order optimization: GD and SGD
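To connect SA to the GD/SGD topic on this slide: running the SA iteration on a noisy gradient (i.e., taking h = ∇f) is exactly stochastic gradient descent. A minimal one-dimensional sketch, where the objective f(x) = (x - 3)² and the Gaussian gradient noise are made-up illustrative choices:

```python
import random

def grad_f(x):
    # true gradient of the illustrative objective f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

def noisy_grad(x):
    # stochastic gradient: E[noisy_grad(x)] = grad_f(x), as in (2.2)
    return grad_f(x) + random.gauss(0.0, 1.0)

random.seed(4)
x = 0.0
for n in range(1, 10_001):
    x = x - (1.0 / n) * noisy_grad(x)  # SGD step with eta_n = 1/n
print(x)  # should approach the minimizer x* = 3
```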
