CS70: Jean Walrand: Lecture 31. Nonlinear Regression

1. Review: joint distribution, LLSE
2. Quadratic Regression
3. Definition of Conditional Expectation
4. Properties of CE
5. Applications: Diluting, Mixing, Rumors
6. CE = MMSE

Review: Definitions

Let X and Y be RVs on Ω.
◮ Joint Distribution: Pr[X = x, Y = y]
◮ Marginal Distribution: Pr[X = x] = ∑_y Pr[X = x, Y = y]
◮ Conditional Distribution: Pr[Y = y | X = x] = Pr[X = x, Y = y] / Pr[X = x]
◮ LLSE: L[Y|X] = a + bX, where a, b minimize E[(Y − a − bX)^2].

We saw that

L[Y|X] = E[Y] + (cov(X, Y)/var[X]) (X − E[X]).

Nonlinear Regression: Motivation

There are many situations where a good guess about Y given X is not linear. E.g., (diameter of object, weight), (school years, income), (PSA level, cancer risk).

Our goal: explore estimates Ŷ = g(X) for nonlinear functions g(·). Recall the non-Bayesian and Bayesian viewpoints.

Quadratic Regression

Definition: The quadratic regression of Y over X is the random variable

Q[Y|X] = a + bX + cX^2

where a, b, c are chosen to minimize E[(Y − a − bX − cX^2)^2].

Derivation: We set to zero the derivatives w.r.t. a, b, c. We get

0 = E[Y − a − bX − cX^2]
0 = E[(Y − a − bX − cX^2) X]
0 = E[(Y − a − bX − cX^2) X^2]

We solve these three equations in the three unknowns (a, b, c).

Note: These equations imply that E[(Y − Q[Y|X]) h(X)] = 0 for any h(X) = d + eX + fX^2. That is, the estimation error is orthogonal to all quadratic functions of X. Hence, Q[Y|X] is the projection of Y onto the space of quadratic functions of X.

Conditional Expectation

Definition: Let X and Y be RVs on Ω. The conditional expectation of Y given X is defined as

E[Y|X] = g(X), where g(x) := E[Y|X = x] := ∑_y y Pr[Y = y | X = x].

Fact: E[Y|X = x] = ∑_ω Y(ω) Pr[ω | X = x].

Proof: E[Y|X = x] = E[Y|A] with A = {ω : X(ω) = x}.

Deja vu, all over again?

Have we seen this before? Yes. Is anything new? Yes: the idea of defining g(x) = E[Y|X = x] and then E[Y|X] = g(X). Big deal? Quite! Simple but most convenient.

Recall that L[Y|X] = a + bX is a function of X. This is similar: E[Y|X] = g(X) for some function g(·). In general, g(X) is not linear, i.e., not a + bX. It could be that g(X) = a + bX + cX^2. Or that g(X) = 2 sin(4X) + exp{−3X}. Or something else.
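The three equations in the quadratic regression slide above can be solved numerically. Below is a minimal sketch, not part of the lecture, that fits a, b, c from simulated data (the data-generating model is an arbitrary illustrative choice) by solving the empirical versions of those equations with NumPy, and then checks that the error is orthogonal to 1, X, and X^2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a nonlinear relationship (an illustrative choice).
n = 100_000
X = rng.normal(size=n)
Y = np.sin(2 * X) + 0.3 * X**2 + 0.1 * rng.normal(size=n)

# The empirical versions of 0 = E[(Y - a - bX - cX^2) X^k], k = 0, 1, 2,
# are exactly the least-squares equations for regressing Y on [1, X, X^2].
A = np.column_stack([np.ones(n), X, X**2])
(a, b, c), *_ = np.linalg.lstsq(A, Y, rcond=None)
Q = a + b * X + c * X**2                     # sample version of Q[Y|X]

# Projection property: the estimation error is (nearly) orthogonal
# to every quadratic function of X, i.e., to 1, X and X^2.
err = Y - Q
print("a, b, c:", a, b, c)
print("E[err * X^k], k = 0, 1, 2:",
      [float(np.mean(err * X**k)) for k in range(3)])
```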
Properties of CE

E[Y|X = x] = ∑_y y Pr[Y = y | X = x]

Theorem
(a) X, Y independent ⇒ E[Y|X] = E[Y];
(b) E[aY + bZ | X] = aE[Y|X] + bE[Z|X];
(c) E[Y h(X) | X] = h(X) E[Y|X], ∀ h(·);
(d) E[h(X) E[Y|X]] = E[h(X) Y], ∀ h(·);
(e) E[E[Y|X]] = E[Y].

Proof:

(a), (b) Obvious.

(c) E[Y h(X) | X = x] = ∑_ω Y(ω) h(X(ω)) Pr[ω | X = x]
= ∑_ω Y(ω) h(x) Pr[ω | X = x] = h(x) E[Y|X = x].

(d) E[h(X) E[Y|X]] = ∑_x h(x) E[Y|X = x] Pr[X = x]
= ∑_x h(x) ∑_y y Pr[Y = y | X = x] Pr[X = x]
= ∑_x h(x) ∑_y y Pr[X = x, Y = y]
= ∑_{x,y} h(x) y Pr[X = x, Y = y] = E[h(X) Y].

(e) Let h(X) = 1 in (d).

Note that (d) says that

E[(Y − E[Y|X]) h(X)] = 0.

We say that the estimation error Y − E[Y|X] is orthogonal to every function h(X) of X. We call this the projection property. More about this later.

Application: Calculating E[Y|X]

Let X, Y, Z be i.i.d. with mean 0 and variance 1. We want to calculate

E[2 + 5X + 7XY + 11X^2 + 13X^3 Z^2 | X].

We find

E[2 + 5X + 7XY + 11X^2 + 13X^3 Z^2 | X]
= 2 + 5X + 7X E[Y|X] + 11X^2 + 13X^3 E[Z^2 | X]
= 2 + 5X + 7X E[Y] + 11X^2 + 13X^3 E[Z^2]
= 2 + 5X + 11X^2 + 13X^3 (var[Z] + E[Z]^2)
= 2 + 5X + 11X^2 + 13X^3.

Application: Diluting

At each step, pick a ball from a well-mixed urn and replace it with a blue ball. Let X_n be the number of red balls in the urn at step n. What is E[X_n]?

Given X_n = m, X_{n+1} = m − 1 w.p. m/N (if you pick a red ball) and X_{n+1} = m otherwise. Hence,

E[X_{n+1} | X_n = m] = m − (m/N) = m (N − 1)/N, so E[X_{n+1} | X_n] = ρ X_n,

with ρ := (N − 1)/N. Consequently,

E[X_{n+1}] = E[E[X_{n+1} | X_n]] = ρ E[X_n], n ≥ 1
⇒ E[X_n] = ρ^(n−1) E[X_1] = N ((N − 1)/N)^(n−1), n ≥ 1.
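Here is a small exact check, not part of the slides, of property (d) and of the projection property; the joint pmf values are arbitrary illustrative numbers.

```python
import numpy as np

# Joint pmf Pr[X = x, Y = y] on a small grid (arbitrary illustrative numbers).
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([-1.0, 0.0, 3.0])
P = np.array([[0.10, 0.05, 0.15],
              [0.20, 0.10, 0.10],
              [0.05, 0.15, 0.10]])     # rows: x, columns: y; entries sum to 1

px = P.sum(axis=1)                     # Pr[X = x]
g = (P * ys).sum(axis=1) / px          # g(x) = E[Y | X = x]

def h(x):                              # any function of X works here
    return x**2 + 1.0

# Property (d): E[h(X) E[Y|X]] = E[h(X) Y]
lhs = np.sum(h(xs) * g * px)
rhs = np.sum(P * np.outer(h(xs), ys))
print(lhs, rhs)                        # equal up to rounding

# Projection property: E[(Y - E[Y|X]) h(X)] = 0
orth = sum(P[i, j] * (ys[j] - g[i]) * h(xs[i])
           for i in range(len(xs)) for j in range(len(ys)))
print(orth)                            # ~ 0
```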
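The diluting result E[X_n] = N ((N − 1)/N)^(n−1) can also be checked by a short simulation. This is a sketch, not from the lecture; the values of N, the number of steps, and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, steps, trials = 20, 30, 20_000

counts = np.zeros(steps)
for _ in range(trials):
    red = N                           # the urn starts with N red balls
    for n in range(steps):
        counts[n] += red              # record X_n (index 0 corresponds to X_1)
        if rng.random() < red / N:    # a red ball is picked...
            red -= 1                  # ...and replaced by a blue one
sim = counts / trials

theory = N * ((N - 1) / N) ** np.arange(steps)   # N * rho^(n-1)
print(np.max(np.abs(sim - theory)))              # small for large `trials`
```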
Diluting

Here is a plot of E[X_n] versus n (figure omitted).

Diluting

By analyzing E[X_{n+1} | X_n], we found that

E[X_n] = N ((N − 1)/N)^(n−1), n ≥ 1.

Here is another argument for that result. Consider one particular red ball, say ball k. At each step, it remains red w.p. (N − 1)/N, namely when another ball is picked. Thus, the probability that it is still red at step n is ((N − 1)/N)^(n−1). Let

Y_n(k) = 1{ball k is red at step n}.

Then X_n = Y_n(1) + ··· + Y_n(N). Hence,

E[X_n] = E[Y_n(1) + ··· + Y_n(N)] = N E[Y_n(1)] = N Pr[Y_n(1) = 1] = N ((N − 1)/N)^(n−1).

Application: Mixing

At each step, pick a ball from each well-mixed urn and transfer them to the other urn. Let X_n be the number of red balls in the bottom urn at step n. What is E[X_n]?

Given X_n = m, X_{n+1} = m + 1 w.p. p and X_{n+1} = m − 1 w.p. q, where p = (1 − m/N)^2 (a blue ball goes up and a red ball comes down) and q = (m/N)^2 (a red ball goes up and a blue ball comes down). Thus,

E[X_{n+1} | X_n] = X_n + p − q = X_n + 1 − 2X_n/N = 1 + ρ X_n, ρ := 1 − 2/N.

Mixing

Here is the plot of E[X_n] versus n (figure omitted).

Application: Mixing

We saw that E[X_{n+1} | X_n] = 1 + ρ X_n, ρ := 1 − 2/N. Hence,

E[X_{n+1}] = 1 + ρ E[X_n]
E[X_2] = 1 + ρ N (since E[X_1] = N: all the red balls start in the bottom urn)
E[X_3] = 1 + ρ (1 + ρ N) = 1 + ρ + ρ^2 N
E[X_4] = 1 + ρ (1 + ρ + ρ^2 N) = 1 + ρ + ρ^2 + ρ^3 N
E[X_n] = 1 + ρ + ··· + ρ^(n−2) + ρ^(n−1) N.

Hence,

E[X_n] = (1 − ρ^(n−1))/(1 − ρ) + ρ^(n−1) N, n ≥ 1.

Application: Going Viral

Consider a social network (e.g., Twitter). You start a rumor (e.g., Walrand is really weird). You have d friends. Each of your friends retweets w.p. p. Each of your friends has d friends, etc. Does the rumor spread? Does it die out (mercifully)? In this example, d = 4 (figure omitted).
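Before turning to the going-viral analysis, here is a quick Monte Carlo check of the mixing formula E[X_n] = (1 − ρ^(n−1))/(1 − ρ) + ρ^(n−1) N. The simulation and its parameter values are illustrative, not part of the slides; it assumes all N red balls start in the bottom urn.

```python
import numpy as np

rng = np.random.default_rng(2)
N, steps, trials = 20, 40, 20_000
rho = 1 - 2 / N

counts = np.zeros(steps)
for _ in range(trials):
    m = N                                   # all N red balls start in the bottom urn
    for n in range(steps):
        counts[n] += m                      # record X_n (index 0 corresponds to X_1)
        p = (1 - m / N) ** 2                # blue goes up, red comes down: m -> m + 1
        q = (m / N) ** 2                    # red goes up, blue comes down: m -> m - 1
        u = rng.random()
        if u < p:
            m += 1
        elif u < p + q:
            m -= 1
sim = counts / trials

k = np.arange(steps)                        # k = n - 1
theory = (1 - rho**k) / (1 - rho) + rho**k * N
print(np.max(np.abs(sim - theory)))         # small for large `trials`
```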
Application: Going Viral

Let X_n be the number of people who tweet the rumor at level n of the tree, with X_1 = 1 (you).

Fact: Let X = ∑_{n=1}^∞ X_n. Then E[X] < ∞ iff pd < 1.

Proof: Given X_n = k, X_{n+1} = B(kd, p). Hence, E[X_{n+1} | X_n = k] = kpd. Thus, E[X_{n+1} | X_n] = pd X_n. Consequently, E[X_n] = (pd)^(n−1), n ≥ 1.

If pd < 1, then E[X_1 + ··· + X_n] ≤ (1 − pd)^(−1) ⇒ E[X] ≤ (1 − pd)^(−1).
If pd ≥ 1, then for all C one can find n s.t. E[X] ≥ E[X_1 + ··· + X_n] ≥ C.
In fact, one can show that pd ≥ 1 ⇒ Pr[X = ∞] > 0.

Application: Going Viral

An easy extension: Assume that everyone has an independent number D_i of friends, with E[D_i] = d. Then the same fact holds.

To see this, note that given X_n = k, and given the numbers of friends D_1 = d_1, ..., D_k = d_k of these X_n people, one has X_{n+1} = B(d_1 + ··· + d_k, p). Hence,

E[X_{n+1} | X_n = k, D_1 = d_1, ..., D_k = d_k] = p (d_1 + ··· + d_k).

Thus, E[X_{n+1} | X_n = k, D_1, ..., D_k] = p (D_1 + ··· + D_k).

Consequently, E[X_{n+1} | X_n = k] = E[p (D_1 + ··· + D_k)] = pdk.

Finally, E[X_{n+1} | X_n] = pd X_n, and E[X_{n+1}] = pd E[X_n]. We conclude as before.

Application: Wald's Identity

Here is an extension of an identity we used in the last slide.

Theorem (Wald's Identity)
Assume that X_1, X_2, ... and Z are independent, where Z takes values in {0, 1, 2, ...} and E[X_n] = µ for all n ≥ 1. Then,

E[X_1 + ··· + X_Z] = µ E[Z].

Proof: E[X_1 + ··· + X_Z | Z = k] = µ k. Thus, E[X_1 + ··· + X_Z | Z] = µ Z. Hence,

E[X_1 + ··· + X_Z] = E[µ Z] = µ E[Z].

CE = MMSE

Theorem
E[Y|X] is the 'best' guess about Y based on X. Specifically, g(X) := E[Y|X] is the function of X that minimizes E[(Y − g(X))^2].

Proof: Let h(X) be any function of X. Then

E[(Y − h(X))^2] = E[(Y − g(X) + g(X) − h(X))^2]
= E[(Y − g(X))^2] + E[(g(X) − h(X))^2] + 2 E[(Y − g(X))(g(X) − h(X))].

But E[(Y − g(X))(g(X) − h(X))] = 0 by the projection property. Thus,

E[(Y − h(X))^2] ≥ E[(Y − g(X))^2].

E[Y|X] and L[Y|X] as projections

L[Y|X] is the projection of Y on {a + bX : a, b ∈ ℜ}: LLSE.
E[Y|X] is the projection of Y on {g(X), g(·) : ℜ → ℜ}: MMSE.
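A short simulation, not from the slides, of the fixed-degree going-viral model checks that E[X_n] = (pd)^(n−1); the values of p, d, and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, d = 0.2, 4                     # retweet probability and number of friends
levels, trials = 8, 50_000

means = np.zeros(levels)
for _ in range(trials):
    x = 1                         # X_1 = 1: you tweet the rumor
    for n in range(levels):
        means[n] += x
        # Each of the x tweeters has d friends; each friend retweets w.p. p.
        x = rng.binomial(x * d, p) if x > 0 else 0
means /= trials

theory = (p * d) ** np.arange(levels)        # (pd)^(n-1), with E[X_1] = 1
print(np.round(means, 3))
print(np.round(theory, 3))
```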
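Wald's identity can be checked the same way. In this sketch (not from the lecture), Z is Poisson and the X_i are exponential; these distributions are an arbitrary choice, since only the independence and the common mean µ matter.

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 100_000
mu, lam = 2.0, 3.0                 # E[X_i] = mu, E[Z] = lam (illustrative values)

Z = rng.poisson(lam, size=trials)  # Z in {0, 1, 2, ...}, independent of the X_i
totals = np.array([rng.exponential(mu, size=z).sum() for z in Z])

# Wald: E[X_1 + ... + X_Z] = mu * E[Z]
print(totals.mean(), mu * lam)     # both close to 6
```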
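Finally, the CE = MMSE and projection statements can be illustrated exactly on a small discrete joint pmf (the same kind of arbitrary example as above): the mean squared error of E[Y|X] is no larger than that of L[Y|X], which in turn is no larger than that of the constant guess E[Y].

```python
import numpy as np

# Arbitrary illustrative joint pmf Pr[X = x, Y = y].
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([-1.0, 0.0, 3.0])
P = np.array([[0.10, 0.05, 0.15],
              [0.20, 0.10, 0.10],
              [0.05, 0.15, 0.10]])

px = P.sum(axis=1)
EX = np.sum(px * xs)
EY = np.sum(P.sum(axis=0) * ys)
varX = np.sum(px * xs**2) - EX**2
cov = np.sum(P * np.outer(xs, ys)) - EX * EY

g = (P * ys).sum(axis=1) / px                 # E[Y | X = x]: the MMSE estimate
L = EY + cov / varX * (xs - EX)               # L[Y | X = x]: the LLSE estimate

def mse(pred):                                # E[(Y - pred(X))^2], pred given per x
    return sum(P[i, j] * (ys[j] - pred[i])**2
               for i in range(len(xs)) for j in range(len(ys)))

# MMSE <= LLSE <= constant guess E[Y]
print(mse(g), mse(L), mse(np.full(3, EY)))
```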