∂ifferentiate everything: A lesson from deep learning. Lei Wang (王磊), Institute of Physics, CAS. https://wangleiphy.github.io
Differentiable programming sits at the overlap of quantum many-body computation, deep learning, and quantum computing.
Differentiable Programming: Software 2.0 (Andrej Karpathy, https://medium.com/@karpathy/software-2-0-a64152b37c35). Traditional programming: Input + Program → Computer → Output. Machine learning: Input + Output → Computer → Program. Writing Software 2.0 by gradient search in the program space.
Benefits of Software 2.0 (Karpathy, same post):
• Computationally homogeneous
• Simple to bake into silicon
• Constant running time
• Constant memory usage
• Highly portable & agile
• Modules can meld into an optimal whole
• Better than humans
Writing Software 2.0 by gradient search in the program space.
Demo: Inverse Schrödinger Problem. Given the ground-state density, how do we design the potential? $\left[-\tfrac{1}{2}\tfrac{\partial^2}{\partial x^2} + V(x)\right]\Psi(x) = E\,\Psi(x)$
https://math.mit.edu/~stevenj/18.336/adjoint.pdf
https://github.com/QuantumBFS/SSSS/blob/master/1_deep_learning/schrodinger.py
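A minimal sketch of how such a demo can be set up (this is not the linked schrodinger.py; the grid size, target potential, loss, and step size are illustrative assumptions): discretize the Hamiltonian, obtain the ground state from a differentiable eigensolver, and update V(x) by gradient descent on the density mismatch.

```python
import jax
import jax.numpy as jnp

n, L = 128, 10.0                       # grid points and box size (assumed values)
x = jnp.linspace(-L / 2, L / 2, n)
dx = x[1] - x[0]

# second-derivative operator by central finite differences
lap = (jnp.diag(-2.0 * jnp.ones(n)) +
       jnp.diag(jnp.ones(n - 1), 1) +
       jnp.diag(jnp.ones(n - 1), -1)) / dx**2

def ground_state_density(V):
    """Ground-state density of H = -1/2 d^2/dx^2 + V(x)."""
    H = -0.5 * lap + jnp.diag(V)
    _, vecs = jnp.linalg.eigh(H)       # eigh is differentiable in JAX
    psi0 = vecs[:, 0]
    return psi0**2 / dx                # normalized density

# target: ground-state density of a harmonic trap (illustrative choice)
rho_target = ground_state_density(0.5 * x**2)

def loss(V):
    return jnp.sum((ground_state_density(V) - rho_target)**2)

V = jnp.zeros(n)                       # start from a flat potential
for step in range(500):
    V = V - 1e2 * jax.grad(loss)(V)    # plain gradient descent (illustrative step size)
```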
What is under the hood?
What is deep learning? Compose differentiable components into a program, e.g. a neural network, then optimize it with gradients.
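As a toy illustration of that definition (the sizes, data, and learning rate below are made up for the example), this sketch composes two differentiable layers into a program and optimizes it with gradients:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = {"W1": 0.1 * jax.random.normal(k1, (1, 16)),
          "b1": jnp.zeros(16),
          "W2": 0.1 * jax.random.normal(k2, (16, 1)),
          "b2": jnp.zeros(1)}

def net(params, x):
    h = jnp.tanh(x @ params["W1"] + params["b1"])   # differentiable component 1
    return h @ params["W2"] + params["b2"]          # differentiable component 2

x = jnp.linspace(-3, 3, 64).reshape(-1, 1)
y = jnp.sin(x)                                      # toy regression target

def loss(params):
    return jnp.mean((net(params, x) - y) ** 2)

for step in range(1000):
    grads = jax.grad(loss)(params)                  # gradients of the whole program
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```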
Automatic differentiation on a computation graph ("comb graph"): data $x_1 \to x_2 \to x_3 = \mathcal{L}$ (loss), with weights $\theta_1, \theta_2$ feeding $x_2$ and $x_3$. Define the adjoint variable $\bar{x} = \partial\mathcal{L}/\partial x$. Seed $\bar{\mathcal{L}} = 1$ and pull the adjoint back through the graph: $\bar{x}_3 = \bar{\mathcal{L}}\,\partial\mathcal{L}/\partial x_3$, $\bar{x}_2 = \bar{x}_3\,\partial x_3/\partial x_2$, $\bar{\theta}_1 = \bar{x}_2\,\partial x_2/\partial\theta_1$, $\bar{\theta}_2 = \bar{x}_3\,\partial x_3/\partial\theta_2$.
Automatic differentiation on a computation graph: for a general directed acyclic graph (here $x_1$ feeds both $x_2$ and $x_3$, which feed the loss $\mathcal{L}$ with parameter $\theta$), the adjoint of a node collects messages from all of its children, e.g. $\bar{x}_1 = \bar{x}_2\,\partial x_2/\partial x_1 + \bar{x}_3\,\partial x_3/\partial x_1$, and in general $\bar{x}_i = \sum_{j:\,\text{child of }i} \bar{x}_j\,\partial x_j/\partial x_i$ with $\bar{\mathcal{L}} = 1$. Message passing for the adjoint at each node.
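One bare-bones way to implement this message passing in plain Python (class and function names are invented for the illustration): each node records its parents together with the local partial derivatives, and the backward pass seeds the loss adjoint with 1 and accumulates adjoints in reverse topological order.

```python
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, d self / d parent)
        self.adjoint = 0.0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(loss):
    loss.adjoint = 1.0                      # seed: d loss / d loss = 1
    order, seen = [], set()
    def visit(n):                           # topological order by depth-first search
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)
    visit(loss)
    for n in reversed(order):
        for parent, local_grad in n.parents:
            parent.adjoint += n.adjoint * local_grad   # adjoint message passing

# Example: L = x * theta + x, so dL/dx = theta + 1, dL/dtheta = x
x, theta = Node(2.0), Node(3.0)
L = add(mul(x, theta), x)
backward(L)
print(x.adjoint, theta.adjoint)   # 4.0 2.0
```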
Advantages of automatic differentiation:
• Accurate to machine precision
• Same computational complexity as the function evaluation (Baur-Strassen theorem '83)
• Supports higher-order gradients
Applications of AD: computing forces (Sorella and Capriotti, J. Chem. Phys. '10); quantum optimal control, where a forward pass evolves $|\Psi_0\rangle$ through unitaries $e^{-i\delta t H}$ and a backward pass yields gradients of the control amplitudes $u_k$ (Leung et al., PRA '17); and variational Hartree-Fock (Tamayo-Mendoza et al., ACS Cent. Sci. '18). In each case a forward pass (evolution) is followed by a backward pass (gradient).
More applications: protein folding, where a structure X is imputed from a sequence s via Langevin dynamics and the whole pipeline is differentiated end to end (Ingraham et al., ICLR '19); structural optimization by neural reparameterization, where a CNN parameterizes the design, the forward pass evaluates the physics objective (compliance) under the design constraints (displacement), and gradients flow back through both (Hoyer et al., 1909.04240).
Coil design in fusion reactors (stellarator) McGreivy et al 2009.00196
Computation graph (McGreivy et al., 2009.00196): from the coil parameters to the total cost. Differentiable programming is more than training neural networks.
Is AD a black magic box, just the chain rule, or functional differential geometry? (See the JAX autodiff cookbook: https://colab.research.google.com/github/google/jax/blob/master/notebooks/autodiff_cookbook.ipynb.) Differentiating a general computer program (rather than a neural network) calls for a deeper understanding of the technique.
Reverse versus forward mode: the chain rule factorizes the gradient as $\frac{\partial\mathcal{L}}{\partial\theta} = \frac{\partial\mathcal{L}}{\partial x_n}\frac{\partial x_n}{\partial x_{n-1}}\cdots\frac{\partial x_2}{\partial x_1}\frac{\partial x_1}{\partial\theta}$. Reverse-mode AD evaluates this product from the loss side, as vector-Jacobian products of primitives.
• Backtraces the computation graph
• Needs to store intermediate results
• Efficient for graphs with large fan-in (many inputs, few outputs)
Backpropagation = reverse-mode AD applied to neural networks.
Reverse versus forward mode: the same factorization $\frac{\partial\mathcal{L}}{\partial\theta} = \frac{\partial\mathcal{L}}{\partial x_n}\frac{\partial x_n}{\partial x_{n-1}}\cdots\frac{\partial x_2}{\partial x_1}\frac{\partial x_1}{\partial\theta}$, now evaluated from the input side. Forward-mode AD composes Jacobian-vector products of primitives.
• Same cost scaling as the function evaluation
• No storage overhead
• Efficient for graphs with large fan-out (few inputs, many outputs)
Less efficient for scalar outputs, but useful for higher-order derivatives.
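For concreteness, here is how the two modes look with JAX's vjp/jvp primitives (the function f is an arbitrary example): reverse mode pulls a cotangent vector back through the Jacobian, forward mode pushes a tangent vector through it.

```python
import jax
import jax.numpy as jnp

def f(x):                      # R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# Reverse mode: vector-Jacobian product v^T J (one backward pass per output cotangent)
y, f_vjp = jax.vjp(f, x)
cotangent = jnp.array([1.0, 0.0])
(grad_x,) = f_vjp(cotangent)
print(grad_x)                  # a row of the Jacobian: d y[0] / d x

# Forward mode: Jacobian-vector product J v (one forward pass per input tangent)
tangent = jnp.array([1.0, 0.0, 0.0])
y, jvp_out = jax.jvp(f, (x,), (tangent,))
print(jvp_out)                 # a column of the Jacobian: d y / d x[0]
```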
How to think about AD?
• AD is modular, and one can control its granularity
• Benefits of writing customized primitives:
 • Reducing memory usage
 • Increasing numerical stability
 • Calling external libraries written agnostically of AD (or even a quantum processor, https://github.com/PennyLaneAI/pennylane)
Example of primitives: ~200 vector-Jacobian products cover most of NumPy in HIPS/autograd, https://github.com/HIPS/autograd/blob/master/autograd/numpy/numpy_vjps.py. Loops, conditionals, sorting, and permutations are also differentiable.
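A hedged sketch of writing one's own primitive, here with JAX's custom_vjp (HIPS/autograd's defvjp plays the same role): a hand-written backward rule for log(1+e^x) that stays numerically stable for large |x|, illustrating the stability and granularity points above.

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x              # save x as the residual for the backward pass

def log1pexp_bwd(x, g):
    # d/dx log(1 + e^x) = sigmoid(x), written to avoid overflow for large |x|
    return (g * jnp.where(x >= 0,
                          1.0 / (1.0 + jnp.exp(-x)),
                          jnp.exp(x) / (1.0 + jnp.exp(x))),)

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(log1pexp)(2.0))         # sigmoid(2) ~ 0.8808
```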
Differentiable programming tools: HIPS/autograd, SciML, …
Differentiable Scientific Computing
• Many scientific computations (FFT, eigensolvers, SVD!) are differentiable
• ODE integrators are differentiable with O(1) memory
• Differentiable ray tracers and differentiable fluid simulations
• Differentiable Monte Carlo / tensor networks / functional RG / dynamical mean-field theory / density functional theory / Hartree-Fock / coupled cluster / Gutzwiller / molecular dynamics…
Differentiate through domain-specific computational processes to solve learning, control, optimization, and inverse problems.
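As one small illustration of a differentiable linear-algebra primitive (the matrix is arbitrary, and the identity assumes distinct, nonzero singular values): differentiating the nuclear norm through an SVD reproduces the analytic result $\partial\|A\|_*/\partial A = U V^\top$.

```python
import jax
import jax.numpy as jnp

def nuclear_norm(A):
    # sum of singular values; jnp.linalg.svd carries differentiation rules
    return jnp.sum(jnp.linalg.svd(A, compute_uv=False))

A = jnp.array([[3.0, 1.0],
               [1.0, 2.0],
               [0.0, 1.0]])

g = jax.grad(nuclear_norm)(A)

# for distinct, nonzero singular values the gradient equals U @ V^T
U, s, Vt = jnp.linalg.svd(A, full_matrices=False)
print(jnp.allclose(g, U @ Vt, atol=1e-5))   # True
```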
Differentiable eigensolver for the inverse Schrödinger problem: computation graph $V \to H \to \Psi \to \mathcal{L}$, with matrix diagonalization as a differentiable step. Useful for the inverse Kohn-Sham problem (Jensen & Wasserman '17).
Differentiable eigensolver: $H\Psi = E\Psi$. What happens if $H \to H + dH$? Forward mode: perturbation theory. Reverse mode: how should I change $H$ given $\partial\mathcal{L}/\partial\Psi$ and $\partial\mathcal{L}/\partial E$? Inverse perturbation theory! Hamiltonian engineering via differentiable programming. https://github.com/wangleiphy/DL4CSRC/tree/master/2-ising See also Fujita et al., PRB '18.
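A quick numerical check of both statements (the Hamiltonian and the perturbation are random, purely for illustration): reverse mode recovers the Hellmann-Feynman result $\partial E_0/\partial H = |\psi_0\rangle\langle\psi_0|$, and forward mode reproduces first-order perturbation theory.

```python
import jax
import jax.numpy as jnp

A = jax.random.normal(jax.random.PRNGKey(0), (6, 6))
H = (A + A.T) / 2                       # a random symmetric "Hamiltonian"

def ground_energy(H):
    return jnp.linalg.eigh(H)[0][0]     # lowest eigenvalue

# Reverse mode: dE0/dH_ij = psi_i psi_j (Hellmann-Feynman)
dE_dH = jax.grad(ground_energy)(H)
w, v = jnp.linalg.eigh(H)
psi0 = v[:, 0]
print(jnp.allclose(dE_dH, jnp.outer(psi0, psi0), atol=1e-5))   # True

# Forward mode = first-order perturbation theory: dE0 = <psi0| dH |psi0>
B = jax.random.normal(jax.random.PRNGKey(1), (6, 6))
dH = (B + B.T) / 2
_, dE0 = jax.jvp(ground_energy, (H,), (dH,))
print(jnp.allclose(dE0, psi0 @ dH @ psi0, atol=1e-5))          # True
```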
Differentiable ODE integrators: "Neural ODE" (Chen et al., 1806.07366). Dynamical systems $\frac{dx}{dt} = f_\theta(x, t)$ and the principle of least action $S = \int \mathcal{L}(q_\theta, \dot{q}_\theta, t)\, dt$. Classical and quantum control; optics, (quantum) mechanics, field theory…
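A minimal sketch with JAX's odeint (the dynamics, loss, and parameter value are invented for the example): reverse-mode gradients of the integration are obtained by the adjoint method rather than by storing every solver step.

```python
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

def dynamics(x, t, theta):
    # dx/dt = f_theta(x, t); here a simple damped oscillator (illustrative)
    return jnp.array([x[1], -theta * x[0] - 0.1 * x[1]])

def loss(theta):
    x0 = jnp.array([1.0, 0.0])
    ts = jnp.linspace(0.0, 10.0, 100)
    xs = odeint(dynamics, x0, ts, theta)      # differentiable ODE integration
    return xs[-1, 0] ** 2                     # drive the final position to zero

print(jax.grad(loss)(2.0))                    # gradient w.r.t. theta via the adjoint method
```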
Quantum optimal control: $i\,\frac{dU}{dt} = H U$ (https://qucontrol.github.io/krotov/v1.0.0/11_other_methods.html). Differentiable programming (Neural ODE) for unified, flexible, and efficient quantum control. Compared with existing approaches: no gradient: slow; forward mode: not scalable; reverse mode w/ discretized steps: piecewise-constant assumption.
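A deliberately simplified, GRAPE-like toy under exactly that piecewise-constant assumption (single qubit; the detuning, pulse count, and step sizes are made up): reverse-mode AD supplies the control gradients, whereas a Neural-ODE treatment would instead differentiate through a continuous-time integrator.

```python
import jax
import jax.numpy as jnp

# Pauli matrices
sx = jnp.array([[0.0, 1.0], [1.0, 0.0]], dtype=jnp.complex64)
sz = jnp.array([[1.0, 0.0], [0.0, -1.0]], dtype=jnp.complex64)
I2 = jnp.eye(2, dtype=jnp.complex64)

dt, delta = 0.1, 1.0                           # time step and detuning (assumed)
psi0 = jnp.array([1.0, 0.0], dtype=jnp.complex64)
target = jnp.array([0.0, 1.0], dtype=jnp.complex64)

def step(psi, u):
    # exact 2x2 propagator exp(-i dt (delta/2 sz + u/2 sx)) for a traceless Hamiltonian
    omega = jnp.sqrt((delta / 2) ** 2 + (u / 2) ** 2)
    n_z, n_x = (delta / 2) / omega, (u / 2) / omega
    U = jnp.cos(omega * dt) * I2 - 1j * jnp.sin(omega * dt) * (n_z * sz + n_x * sx)
    return U @ psi

def infidelity(controls):
    psi = psi0
    for k in range(controls.shape[0]):         # piecewise-constant pulses
        psi = step(psi, controls[k])
    return 1.0 - jnp.abs(jnp.vdot(target, psi)) ** 2

controls = 0.5 * jnp.ones(50)                  # initial pulse sequence
for _ in range(200):
    controls = controls - 0.5 * jax.grad(infidelity)(controls)
```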
Differentiable functional optimization: the brachistochrone problem (Johann Bernoulli, 1696). Minimize the travel time $T = \int_{x_0}^{x_1} \sqrt{\frac{1 + (dy/dx)^2}{2 g (y - y_0)}}\, dx$ (with $y$ measured downward from the release point). https://github.com/QuantumBFS/SSSS/tree/master/1_deep_learning/brachistochrone
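A sketch of the direct, differentiable-programming attack on this functional (the grid, endpoints, and step size are illustrative; y is measured downward from the release point): discretize y(x), evaluate the time integral with a midpoint rule, and let AD provide dT/dy.

```python
import jax
import jax.numpy as jnp

g = 9.8
x = jnp.linspace(0.0, 1.0, 101)                 # fixed horizontal grid
y_start, y_end = 0.0, 1.0                       # endpoints, y measured downward

def travel_time(y_interior):
    y = jnp.concatenate([jnp.array([y_start]), y_interior, jnp.array([y_end])])
    dx, dy = jnp.diff(x), jnp.diff(y)
    ds = jnp.sqrt(dx**2 + dy**2)                # segment arc length
    y_mid = (y[1:] + y[:-1]) / 2                # midpoint depth below the release point
    v = jnp.sqrt(2 * g * jnp.maximum(y_mid, 1e-6))   # speed from energy conservation
    return jnp.sum(ds / v)

y_interior = jnp.linspace(y_start, y_end, 101)[1:-1]   # straight-line initial guess
for _ in range(2000):
    y_interior = y_interior - 1e-3 * jax.grad(travel_time)(y_interior)
# y_interior now approximates the cycloid solution
```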
Differentiable Programming Tensor Networks Liao, Liu, LW, Xiang, 1903.09650, PRX ‘19 https://github.com/wangleiphy/tensorgrad
"Tensor network is the 21st century's matrix" (Mario Szegedy). Tensor networks $\Psi$ connect to quantum circuit architecture and simulation, neural network parametrization, and probabilistic graphical models.
Differentiate through the tensor renormalization group (Levin, Nave, PRL '07): the computation graph repeats {contraction, truncated SVD} over the RG depth, mapping the inverse temperature $\beta$ to $\ln Z$. Physical observables then follow as gradients: the free energy $-\ln Z/\beta$, the energy $-\partial\ln Z/\partial\beta$, and the specific heat $\beta^2\,\partial^2\ln Z/\partial\beta^2$ all agree with the exact 2D Ising results over $\beta \approx 0.40$ to $0.50$. Compute physical observables as gradients of the tensor network contraction.
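To keep the idea self-contained, here is the same "observables as gradients" pattern on a 1D Ising transfer-matrix contraction instead of the 2D TRG (the chain length and temperature are illustrative): ln Z comes from a tensor contraction, and the energy and specific heat come from jax.grad.

```python
import jax
import jax.numpy as jnp

N = 64                                          # chain length (illustrative)

def lnZ(beta):
    # transfer matrix of the 1D Ising model, contracted around a periodic ring
    T = jnp.array([[jnp.exp(beta), jnp.exp(-beta)],
                   [jnp.exp(-beta), jnp.exp(beta)]])
    M, logs = jnp.eye(2), 0.0
    for _ in range(N):
        M = M @ T
        norm = jnp.linalg.norm(M)               # rescale to avoid overflow
        M, logs = M / norm, logs + jnp.log(norm)
    return logs + jnp.log(jnp.trace(M))         # ln Tr(T^N)

energy = lambda beta: -jax.grad(lnZ)(beta) / N                      # energy per site
heat = lambda beta: beta**2 * jax.grad(jax.grad(lnZ))(beta) / N     # specific heat per site

beta = 0.44
print(energy(beta), heat(beta))
print(-jnp.tanh(beta))                          # exact large-N energy per site, for comparison
```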
Differentiable spin glass solver: the tensor network contraction maps the couplings and fields to the optimal energy, and the optimal configuration follows as optimal configuration = ∂(optimal energy)/∂(field). Liu, LW, Zhang, 2008.06888. https://github.com/TensorBFS/TropicalTensors.jl
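A brute-force toy version of this identity (a tiny instance with random couplings, standing in for the tropical tensor-network contraction): by the envelope theorem, differentiating the minimal energy with respect to the local fields reads off the ground-state spin configuration.

```python
import itertools
import jax
import jax.numpy as jnp

n = 8                                           # tiny instance, enumerable exactly
J = jnp.triu(jax.random.normal(jax.random.PRNGKey(0), (n, n)), 1)   # couplings J_ij, i < j

# all 2^n spin configurations, s_i = +/-1
configs = jnp.array(list(itertools.product([-1, 1], repeat=n)), dtype=jnp.float32)

def ground_energy(h):
    energies = -jnp.einsum('ci,ij,cj->c', configs, J, configs) - configs @ h
    return jnp.min(energies)                    # optimal energy as a function of the fields

h = 0.1 * jax.random.normal(jax.random.PRNGKey(1), (n,))   # small symmetry-breaking fields
grad_h = jax.grad(ground_energy)(h)             # dE_min/dh_i = -s_i at the optimum
s_from_grad = -jnp.sign(grad_h)

# check against the brute-force argmin
energies = -jnp.einsum('ci,ij,cj->c', configs, J, configs) - configs @ h
print(jnp.all(s_from_grad == configs[jnp.argmin(energies)]))   # True
```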
Differentiable iPEPS optimization (Liao, Liu, LW, Xiang, PRX '19). Before: the energy gradient had to be assembled by hand as a sum of many environment diagrams. Now, with differentiable programming, it comes out of the contraction automatically. Benchmark (energy relative error vs bond dimension D = 2 to 7): the present work improves on the simple update, the full update, Corboz [34], and Vanderstraeten [35], reaching the best variational energy to date (cf. Vanderstraeten et al., PRB '16) in about one week on a single GPU (Nvidia P100). https://github.com/wangleiphy/tensorgrad
Differentiable iPEPS optimization, continued: infinite-size tensor networks (Liao, Liu, LW, Xiang, PRX '19; same energy relative error vs D benchmark) compared with finite-size neural-network states on a 10×10 cluster (Carleo & Troyer, Science '17). Further progress on challenging physical problems such as frustrated magnets, fermions, and thermodynamics: Chen et al. '19, Xie et al. '20, Tang et al. '20, …
Recommended reading
More recommended reading