  1. Statistical Aspects of Quantum Computing
     Yazhen Wang, Department of Statistics, University of Wisconsin-Madison
     http://www.stat.wisc.edu/~yzwang
     Near-term Applications of Quantum Computing, Fermilab, December 6-7, 2017

  2. Outline
     • Statistical learning with quantum annealing
     • Statistical analysis of quantum computing data

  3-5. Statistics and Optimization
     MLE/M-estimation, non-parametric smoothing, ...
     • Stochastic optimization problem: min_θ L(θ; X_n), where L(θ; X_n) = (1/n) ∑_{i=1}^n ℓ(θ; X_i)
     • The minimization solution gives an estimator or a classifier.
       Examples: ℓ(θ; X_i) = log pdf; residual sum of squares; loss + penalty
     Take g(θ) = E[L(θ; X_n)] = E[ℓ(θ; X_1)]
     • Optimization problem: min_θ g(θ)
     • The minimization solution defines the true parameter value.
     Goals: use the data X_n to
     (i) evaluate estimators/classifiers (the minimization solutions): computing
     (ii) study the estimators/classifiers statistically: inference
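To make the stochastic optimization view concrete, here is a minimal Python sketch. It assumes a squared-error loss ℓ(θ; X_i) = (X_i − θ)² and simulated Gaussian data (both illustrative choices, not from the slides), so the empirical minimizer of L is the sample mean and the minimizer of g(θ) = E[ℓ(θ; X_1)] is the true mean.

```python
import numpy as np

# Minimal sketch of estimation as stochastic optimization (illustrative:
# the squared-error loss and the simulated data are assumptions, not
# anything specified on the slides).

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=1000)   # data X_1, ..., X_n

def L(theta, X):
    """Empirical risk L(theta; X_n) = (1/n) sum_i l(theta; X_i), squared-error loss."""
    return np.mean((X - theta) ** 2)

# Brute-force minimization over a grid of candidate theta values, for clarity.
grid = np.linspace(-5.0, 5.0, 10001)
theta_hat = grid[np.argmin([L(t, X) for t in grid])]

print("estimator (minimizer of the empirical risk L):", theta_hat)
print("true parameter (minimizer of g(theta) = E[l(theta; X_1)]):", 2.0)
```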

  6-8. Computer Power Demand
     (Figure slides: demand for computing power, driven by Big Data from scientific studies and computational applications.)

  9-12. Learning Examples
     Machine learning and compressed sensing
     • Matrix completion, matrix factorization, tensor decomposition, phase retrieval, neural networks.
     Neural network: layers in a chain structure; each layer is a function of the layer preceding it.
     Layer j: h_j = g_j(a_j h_{j-1} + b_j), where (a_j, b_j) are the weights and g_j is the activation function (sigmoid, softmax, or rectifier).
     (Figures: history of neural networks; dog vs. cat classification.)
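As a small illustration of the chain structure h_j = g_j(a_j h_{j-1} + b_j), the Python sketch below stacks a few layers; the layer sizes, random weights, and sigmoid activation are assumptions made for the example, not values from the slides.

```python
import numpy as np

# Sketch of the chain structure h_j = g_j(A_j h_{j-1} + b_j).
# Layer sizes, random weights, and the sigmoid activation are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]                                  # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    """Apply each layer to the output of the layer preceding it."""
    h = x
    for A, b in zip(weights, biases):
        h = sigmoid(A @ h + b)                        # h_j = g_j(A_j h_{j-1} + b_j)
    return h

print(forward(rng.normal(size=4)))
```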

  13-17. Gradient Descent Algorithms: Solve min_θ g(θ)
     Gradient descent algorithm
     • Start at an initial value x_0 and iterate
       x_k = x_{k-1} − δ ∇g(x_{k-1}),   δ = learning rate, ∇ = derivative (gradient) operator
     • Continuous curve X_t approximating the discrete iterates {x_k : k ≥ 0}:
       differential equation  Ẋ_t + ∇g(X_t) = 0,   where Ẋ_t = dX_t/dt
     Accelerated gradient descent algorithm (Nesterov)
     • Start at initial values x_0 and y_0 = x_0, and iterate
       x_k = y_{k-1} − δ ∇g(y_{k-1}),   y_k = x_k + ((k−1)/(k+2)) (x_k − x_{k-1})
     • Continuous curve X_t approximating the discrete iterates {x_k : k ≥ 0}:
       differential equation  Ẍ_t + (3/t) Ẋ_t + ∇g(X_t) = 0,   where Ẍ_t = d²X_t/dt²
     Convergence to the minimization solution occurs at rate 1/k (or 1/t) as k, t → ∞;
     for the accelerated case the rate improves to 1/k² (or 1/t²).
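The plain and accelerated iterations above can be compared numerically. The sketch below runs both on an assumed toy quadratic g(x) = x'Qx/2 with an arbitrary learning rate δ; the printed objective values only illustrate the 1/k versus 1/k² behavior, they are not results from the talk.

```python
import numpy as np

# Plain gradient descent vs Nesterov's accelerated variant on a toy
# quadratic g(x) = 0.5 * x'Qx (Q and the learning rate are illustrative).

Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x                                # gradient of g
delta = 0.05                                          # learning rate
x0 = np.array([5.0, 5.0])

def gd(x0, steps):
    x = x0.copy()
    for _ in range(steps):
        x = x - delta * grad(x)                       # x_k = x_{k-1} - delta * grad g(x_{k-1})
    return x

def nesterov(x0, steps):
    x_prev, y = x0.copy(), x0.copy()
    for k in range(1, steps + 1):
        x = y - delta * grad(y)                       # x_k = y_{k-1} - delta * grad g(y_{k-1})
        y = x + (k - 1) / (k + 2) * (x - x_prev)      # momentum step
        x_prev = x
    return x

def g(x):
    return 0.5 * x @ Q @ x

for steps in (10, 50, 200):
    print(f"k={steps:4d}  g(x_k): plain={g(gd(x0, steps)):.2e}  accelerated={g(nesterov(x0, steps)):.2e}")
```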

  18-21. Stochastic Gradient Descent
     Stochastic optimization: min_θ L(θ; X_n),   X_n = (X_1, ..., X_n)
     • Gradient descent algorithm to compute x_k iteratively:
       x_k = x_{k-1} − δ ∇L(x_{k-1}; X_n),   ∇L(θ; X_n) = (1/n) ∑_{i=1}^n ∇ℓ(θ; X_i)
     Big Data: expensive to evaluate all the ∇ℓ(θ; X_i) at each iteration
     • Replace ∇L(θ; X_n) by
       ∇L̂_m(θ; X*_m) = (1/m) ∑_{j=1}^m ∇ℓ(θ; X*_j),   m ≪ n,
       where X*_m = (X*_1, ..., X*_m) is a subsample of X_n (a minibatch or bootstrap sample).
     Stochastic gradient descent algorithm:
       x*_k = x*_{k-1} − δ ∇L̂_m(x*_{k-1}; X*_m)
     Continuous curve X*_t approximating the discrete iterates {x*_k : k ≥ 0}:
       X*_t obeys a stochastic differential equation.
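A minimal minibatch SGD sketch for the same squared-error loss used in the earlier example; the minibatch size m, learning rate δ, number of iterations, and simulated data are all illustrative assumptions.

```python
import numpy as np

# Minibatch stochastic gradient descent for l(theta; x) = (x - theta)^2.
# The sample size n, minibatch size m, and learning rate are illustrative.

rng = np.random.default_rng(0)
n, m, delta = 100_000, 32, 0.05
X = rng.normal(loc=2.0, scale=1.0, size=n)            # full data X_n

def grad_l(theta, x):
    """Gradient of l(theta; x) = (x - theta)^2 with respect to theta."""
    return -2.0 * (x - theta)

theta = 0.0                                           # starting value x*_0
for k in range(2000):
    batch = rng.choice(X, size=m, replace=False)      # subsample X*_m of X_n
    g_hat = np.mean(grad_l(theta, batch))             # (1/m) sum_j grad l(theta; X*_j)
    theta = theta - delta * g_hat                     # x*_k = x*_{k-1} - delta * grad L_hat_m

print("SGD estimate:", theta, " full-sample mean:", X.mean())
```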

  22-23. Gradient Descent vs Stochastic Gradient Descent
     (Figure slides: gradient descent; stochastic gradient descent.)

  24-25. Statistical Analysis of Gradient Descent (Wang, 2017)
     Continuous curve model
     • Stochastic differential equation: dX*_t + ∇g(X*_t) dt + σ(X*_t) dW_t = 0,   W_t = Brownian motion
     • For the accelerated case: a second-order stochastic differential equation
     Asymptotic distributions of the stochastic gradient descent iterates as m, n → ∞ are obtained via these stochastic differential equations.
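One way to look at the continuous-curve model dX*_t = −∇g(X*_t) dt − σ(X*_t) dW_t is to simulate it with an Euler-Maruyama discretization; the quadratic g, constant σ, step size, and horizon in the sketch below are assumptions for illustration, not the paper's specification.

```python
import numpy as np

# Euler-Maruyama simulation of dX*_t = -grad g(X*_t) dt - sigma(X*_t) dW_t.
# The quadratic g, constant sigma, step size, and horizon are illustrative.

rng = np.random.default_rng(0)
grad_g = lambda x: x                      # g(x) = x^2 / 2 (assumed)
sigma = lambda x: 0.3                     # constant noise level (assumed)
dt, T = 0.01, 10.0

X = 5.0                                   # starting point X*_0
for _ in range(int(T / dt)):
    dW = rng.normal(scale=np.sqrt(dt))    # Brownian increment over [t, t + dt]
    X = X - grad_g(X) * dt - sigma(X) * dW

print("simulated X*_T at T =", T, ":", X)
```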
