Model-Free Stochastic Perturbative Adaptation and Optimization



  1. Model-Free Stochastic Perturbative Adaptation and Optimization
     Gert Cauwenberghs, Johns Hopkins University, gert@jhu.edu
     520.776 Learning on Silicon, http://bach.ece.jhu.edu/gert/courses/776

  2. Outline
     • Model-Free Learning
       – Model Complexity
       – Compensation of Analog VLSI Mismatch
     • Stochastic Parallel Gradient Descent
       – Algorithmic Properties
       – Mixed-Signal Architecture
       – VLSI Implementation
     • Extensions
       – Learning of Continuous-Time Dynamics
       – Reinforcement Learning
     • Model-Free Adaptive Optics
       – AdOpt VLSI Controller
       – Adaptive Optics "Quality" Metrics
       – Applications to Laser Communication and Imaging

  3. The Analog Computing Paradigm
     • Local functions are efficiently implemented with minimal circuitry, exploiting the physics of the devices.
     • Excessive global interconnects are avoided:
       – Currents or charges are accumulated along a single wire.
       – Voltage is distributed along a single wire.
     Pros:
       – Massive Parallelism
       – Low Power Dissipation
       – Real-Time, Real-World Interface
       – Continuous-Time Dynamics
     Cons:
       – Limited Dynamic Range
       – Mismatches and Nonlinearities (WYDINWYG)

  4. Effect of Implementation Mismatches
     [Diagram: a system with inputs, outputs, parameters p_i, and a reference defining the performance error ε(p).]
     Associative element:
       – Mismatches can be properly compensated by adjusting the parameters p_i accordingly, provided sufficient degrees of freedom are available to do so.
     Adaptive element:
       – Does not require a precise implementation.
       – The accuracy of the polarity (rather than the amplitude) of the implemented parameter update increments Δp_i is the performance-limiting factor.

  5. Example: LMS Rule
     A linear perceptron under supervised learning,
       y_i(k) = Σ_j p_ij x_j(k)
       ε = ½ Σ_k Σ_i ( y_i^target(k) − y_i(k) )²
     with gradient descent,
       Δp_ij = −η ∂ε(k)/∂p_ij = η x_j(k) ( y_i^target(k) − y_i(k) )
     reduces to an incremental outer-product update rule, with scalable, modular implementation in analog VLSI.
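A minimal NumPy sketch of this LMS outer-product update. The network sizes, learning rate, and random training data are illustrative assumptions, not values from the slides; only the update form Δp_ij = η x_j (y_i^target − y_i) follows the slide.

```python
import numpy as np

# Minimal sketch of the LMS rule for a linear perceptron y = P x.
# Sizes, learning rate, and random training data are illustrative.
rng = np.random.default_rng(0)
n_in, n_out, eta = 8, 3, 0.05

P_true = rng.normal(size=(n_out, n_in))   # unknown target mapping
P = np.zeros((n_out, n_in))               # learned parameters p_ij

for k in range(2000):
    x = rng.normal(size=n_in)             # input pattern x_j(k)
    y_target = P_true @ x                 # supervised target
    y = P @ x                             # perceptron output y_i(k)
    e = y_target - y                      # output error
    P += eta * np.outer(e, x)             # incremental outer-product update

print("parameter error:", np.linalg.norm(P - P_true))
```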

  6. Incremental Outer-Product Learning in Neural Nets
     Multi-layer perceptron:
       x_i = f( Σ_j p_ij x_j )
     Outer-product learning update:
       Δp_ij = η x_j · e_i
     with the error term e_i determined by the learning rule:
       – Hebbian (Hebb, 1949): e_i = x_i
       – LMS rule (Widrow-Hoff, 1960): e_i = f′_i · ( x_i^target − x_i )
       – Backpropagation (Werbos, Rumelhart, LeCun): e_j = f′_j · Σ_i p_ij e_i
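A short sketch showing the same outer-product form Δp_ij = η x_j e_i applied in a two-layer perceptron, with the LMS-style error at the output layer and the backpropagated error at the hidden layer. The architecture, tanh nonlinearity, and toy regression target are illustrative assumptions.

```python
import numpy as np

# Generic outer-product update delta_W = eta * outer(e, x) in a two-layer
# perceptron: LMS-style error at the output, backpropagated error at the
# hidden layer.  Sizes, nonlinearity, and target are illustrative.
rng = np.random.default_rng(1)
eta = 0.1
W1 = rng.normal(scale=0.5, size=(4, 2))      # hidden-layer weights
W2 = rng.normal(scale=0.5, size=(1, 4))      # output-layer weights

def f(a):  return np.tanh(a)                 # neuron nonlinearity
def fp(a): return 1.0 - np.tanh(a) ** 2      # its derivative f'

for k in range(5000):
    x0 = rng.uniform(-1, 1, size=2)          # input pattern
    target = np.array([np.tanh(2.0 * x0[0] - x0[1])])   # toy regression target
    a1 = W1 @ x0; x1 = f(a1)                 # hidden activations
    a2 = W2 @ x1; x2 = f(a2)                 # output activations
    e2 = fp(a2) * (target - x2)              # LMS-style output error
    e1 = fp(a1) * (W2.T @ e2)                # backpropagated hidden error
    W2 += eta * np.outer(e2, x1)             # same outer-product form
    W1 += eta * np.outer(e1, x0)             # at every layer
```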

  7. Gradient Descent Learning
     Minimize ε(p) by iterating
       p_i(k+1) = p_i(k) − η ∂ε(k)/∂p_i
     from calculation of the gradient:
       ∂ε/∂p_i = Σ_l Σ_m (∂ε/∂y_l)(∂y_l/∂x_m)(∂x_m/∂p_i)
     Implementation problems:
       – Requires an explicit model of the internal network dynamics.
       – Sensitive to model mismatches and noise in the implemented network and learning system.
       – The amount of computation typically scales strongly with the number of parameters.

  8. Gradient-Free Approach to Error-Descent Learning
     Avoid the model sensitivity of gradient descent by observing the parameter dependence of the performance error on the network directly, rather than calculating gradient information from a pre-assumed model of the network.
     Stochastic approximation:
       – Multi-dimensional Kiefer-Wolfowitz (Kushner & Clark, 1978)
       – Function Smoothing Global Optimization (Styblinski & Tang, 1990)
       – Simultaneous Perturbation Stochastic Approximation (Spall, 1992)
     Hardware-related variants:
       – Model-Free Distributed Learning (Dembo & Kailath, 1990)
       – Noise Injection and Correlation (Anderson & Kerns; Kirk et al., 1992-93)
       – Stochastic Error Descent (Cauwenberghs, 1993)
       – Constant Perturbation, Random Sign (Alspector et al., 1993)
       – Summed Weight Neuron Perturbation (Flower & Jabri, 1993)

  9. Stochastic Error-Descent Learning
     Minimize ε(p) by iterating
       p(k+1) = p(k) − µ ε̂(k) π(k)
     from observation of the error difference in the direction of π(k):
       ε̂(k) = ½ [ ε( p(k) + π(k) ) − ε( p(k) − π(k) ) ]
     with random, uncorrelated binary components of the perturbation vector π(k):
       π_i(k) = ±σ ;  E( π_i(k) π_j(l) ) ≈ σ² δ_ij δ_kl
     Advantages:
       – No explicit model knowledge is required.
       – Robust in the presence of noise and model mismatches.
       – The computational load is significantly reduced.
       – Allows simple, modular, and scalable implementation.
       – Convergence properties are similar to those of exact gradient descent.
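A minimal sketch of this stochastic error-descent iteration: sample a random ±σ perturbation vector, observe the error difference at p ± π, and step along −µ ε̂ π. The quadratic test error and its dimension are illustrative stand-ins for a real network error surface; note that each update needs only two error evaluations, independent of the number of parameters.

```python
import numpy as np

# Minimal sketch of stochastic error descent on an illustrative quadratic
# error surface.  Two error evaluations per update, for any parameter count.
rng = np.random.default_rng(2)
dim, sigma, mu = 50, 0.1, 0.5
p_opt = rng.normal(size=dim)                     # unknown optimum

def eps(p):                                      # stand-in performance error
    return np.sum((p - p_opt) ** 2)

p = np.zeros(dim)
for k in range(2000):
    pi = sigma * rng.choice([-1.0, 1.0], size=dim)   # pi_i = +/- sigma
    eps_hat = 0.5 * (eps(p + pi) - eps(p - pi))      # directional error estimate
    p -= mu * eps_hat * pi                           # perturbative update

print("final error:", eps(p))
```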

  10. Stochastic Perturbative Learning Cell Architecture
      [Block diagram, partitioned into LOCAL and GLOBAL sections: each local cell stores p_i(t), adds the perturbation φ(t) π_i(t), and accumulates the globally broadcast update −η ε̂(t); the network evaluates ε( p(t) + φ(t) π(t) ), and a z⁻¹ delay separates the two perturbation phases in the global error computation.]
      p(k+1) = p(k) − µ ε̂(k) π(k)
      ε̂(k) = ½ [ ε( p(k) + π(k) ) − ε( p(k) − π(k) ) ]
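A behavioral sketch of the local/global split suggested by this cell architecture: each cell keeps its parameter and perturbation polarity locally, while only the scalar error difference is broadcast globally. The class and function names are illustrative and not taken from the chip; the modulation waveform φ(t) is reduced here to two explicit ±1 phases.

```python
import numpy as np

# Behavioral sketch of the perturbative learning cell array: parameters and
# perturbations live in the cells; only eps_hat is broadcast globally.
class PerturbativeCell:
    def __init__(self, p0, sigma):
        self.p = p0                 # locally stored parameter
        self.sigma = sigma          # perturbation amplitude
        self.pi = sigma             # current local perturbation

    def draw_perturbation(self, rng):
        self.pi = self.sigma * rng.choice([-1.0, 1.0])   # local random polarity
        return self.pi

    def perturbed(self, phase):     # phase = +1 or -1
        return self.p + phase * self.pi

    def update(self, mu, eps_hat):  # eps_hat arrives on a global broadcast
        self.p -= mu * eps_hat * self.pi                  # polarity-gated step

def iterate(cells, eps, mu, rng):
    """One learning iteration: two network evaluations, one global broadcast."""
    for c in cells:
        c.draw_perturbation(rng)
    e_plus = eps(np.array([c.perturbed(+1) for c in cells]))
    e_minus = eps(np.array([c.perturbed(-1) for c in cells]))
    eps_hat = 0.5 * (e_plus - e_minus)       # single global scalar
    for c in cells:
        c.update(mu, eps_hat)                # each cell updates locally

# Usage with a toy quadratic standing in for the network error:
rng = np.random.default_rng(6)
target = np.linspace(-1.0, 1.0, 32)
cells = [PerturbativeCell(0.0, 0.01) for _ in range(32)]
for k in range(2000):
    iterate(cells, lambda p: np.sum((p - target) ** 2), mu=20.0, rng=rng)
```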

  11. Stochastic Perturbative Learning Circuit Cell
      [Circuit schematic: the perturbation π_i, with levels Vσ+ and Vσ−, is coupled through capacitor C_perturb to form p_i(t) + φ(t) π_i(t); the parameter is held on storage capacitor C_store and incremented or decremented by a charge pump (enables EN_p, EN_n; bias voltages V_bp, V_bn) whose polarity POL is derived from sign(ε̂) and π_i.]

  12. Charge Pump Characteristics
      [Figure: (a) charge-pump circuit with enables EN_p and EN_n, gate bias voltages V_bp and V_bn, adaptation current I_adapt, charge increment ΔQ_adapt, and stored voltage V_stored on capacitor C; (b) measured voltage increments and decrements ΔV_stored (log scale, roughly 10⁻⁵ to 10⁰ V) versus gate voltage V_bn or V_bp (0 to 0.6 V), for pulse widths Δt = 0, 23 µs, 1 ms, and 40 ms.]

  13. Supervised Learning of Recurrent Neural Dynamics
      [Architecture diagram: a six-neuron continuous-time recurrent network, dx/dt = F(p, x, y) with output z = G(x), with weights W_11 … W_66, thresholds θ_1 … θ_6, offset weights W_off, and reference current I_ref; binary quantization Q(·) of the outputs, teacher forcing with target trajectories x_i^T(t), and perturbations π_1^H … π_6^H (plus π_0^H) applied through update activation and probe multiplexing.]
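A sketch of the same perturbative error descent applied to the weight matrix of a small continuous-time recurrent network, with the accumulated trajectory error as the observed cost. The network equation, Euler integration, and sinusoidal target trajectory are illustrative assumptions; teacher forcing and binary quantization from the slide are not modeled, and only the perturbative outer loop follows the learning rule of the earlier slides.

```python
import numpy as np

# Stochastic error descent on the weights of a small continuous-time
# recurrent network dx/dt = -x + tanh(W x + b); the trajectory error is
# the only quantity the learning rule observes.
rng = np.random.default_rng(3)
N, dt, T = 3, 0.05, 200
t = np.arange(T) * dt
x_target = np.stack([np.sin(2 * np.pi * 0.2 * t + ph)
                     for ph in (0.0, 2.1, 4.2)], axis=1)

def trajectory_error(W):
    """Accumulated squared error of the simulated trajectory."""
    x, err = np.zeros(N), 0.0
    for k in range(T):
        x = x + dt * (-x + np.tanh(W @ x + 0.1))   # Euler step of the dynamics
        err += np.sum((x - x_target[k]) ** 2)
    return err

W = rng.normal(scale=0.1, size=(N, N))
sigma, mu = 0.02, 0.05
print("initial trajectory error:", trajectory_error(W))
for it in range(2000):
    pi = sigma * rng.choice([-1.0, 1.0], size=(N, N))        # matrix perturbation
    eps_hat = 0.5 * (trajectory_error(W + pi) - trajectory_error(W - pi))
    W -= mu * eps_hat * pi                                    # perturbative update
print("final trajectory error:", trajectory_error(W))
```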

  14. The Credit Assignment Problem, or How to Learn from Delayed Rewards
      [Block diagram: a system with parameters p_i maps inputs to outputs; an adaptive critic converts the external reinforcement r(t) into an internal reinforcement r*(t).]
      – External, discontinuous reinforcement signal r(t).
      – Adaptive critics:
        • Discrimination Learning (Grossberg, 1975)
        • Heuristic Dynamic Programming (Werbos, 1977)
        • Reinforcement Learning (Sutton and Barto, 1983)
        • TD(λ) (Sutton, 1988)
        • Q-Learning (Watkins, 1989)

  15. Reinforcement Learning (Barto and Sutton, 1983)
      Locally tuned, address-encoded neurons:
        χ(t) ∈ {0, …, 2ⁿ − 1} : n-bit address encoding of the state space
        y(t) = y_χ(t) : classifier output
        q(t) = q_χ(t) : adaptive critic
      Adaptation of classifier and adaptive critic:
        y_k(t+1) = y_k(t) + α r̂(t) e_k(t) y_k(t)
        q_k(t+1) = q_k(t) + β r̂(t) e_k(t)
      – eligibilities: e_k(t+1) = λ e_k(t) + (1 − λ) δ_kχ(t)
      – internal reinforcement: r̂(t) = r(t) + γ q(t) − q(t−1)
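A minimal actor-critic sketch in the spirit of this slide, with address-encoded states, eligibility traces, and an internal reinforcement signal r̂ = r + γ q(t) − q(t−1). The 1-D random-walk task, the noisy action rule, and the simplified additive update forms are illustrative assumptions, not the exact rules implemented on the chip.

```python
import numpy as np

# Actor-critic sketch with address-encoded states: the active state chi
# selects one neuron, whose classifier output y_k and critic value q_k are
# adapted through eligibility traces and the internal reinforcement r_hat.
rng = np.random.default_rng(4)
n_states, alpha, beta, gamma, lam = 16, 0.2, 0.2, 0.95, 0.8
y = np.zeros(n_states)          # classifier outputs y_k (action preferences)
q = np.zeros(n_states)          # adaptive critic values q_k
e_y = np.zeros(n_states)        # actor eligibility trace
e_q = np.zeros(n_states)        # critic eligibility trace

state = n_states // 2
for t in range(20000):
    chi = state                                         # active neuron address
    u = 1 if y[chi] + 0.2 * rng.normal() >= 0 else -1   # binary control action
    new_state = int(np.clip(state + u, 0, n_states - 1))
    r = 1.0 if new_state == n_states - 1 else 0.0       # reward only at the goal
    onehot = (np.arange(n_states) == chi).astype(float)
    e_y = lam * e_y + (1 - lam) * u * onehot            # action-gated eligibility
    e_q = lam * e_q + (1 - lam) * onehot                # state eligibility
    r_hat = r + gamma * q[new_state] - q[chi]           # internal reinforcement
    y += alpha * r_hat * e_y                            # classifier (actor) update
    q += beta * r_hat * e_q                             # critic update
    if r > 0.0:                                         # goal reached: restart
        state, e_y[:], e_q[:] = n_states // 2, 0.0, 0.0
    else:
        state = new_state
```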

  16. Reinforcement Learning Classifier for Binary Control
      [Chip architecture: an array of 64 reinforcement learning neurons, each holding a classifier output y_k, adaptive critic value q_k, and state eligibility e_k, addressed by horizontal and vertical select lines (SEL_hor, SEL_vert) derived from the quantized state (x_1(t), x_2(t)); update (UPD), hysteresis (HYST), and lock (LOCK) circuitry with bias voltages V_bp, V_bn, V_αp, V_αn, V_δ; the action network produces the binary action u(t) from y(t) = ±1.]

  17. A Biological Adaptive Optics System
      [Figure: the human eye and brain, with the cornea, iris, lens, zonule fibers, retina, optic nerve, and brain labeled.]

  18. Wavefront Distortion and Adaptive Optics
      • Imaging: defocus, motion
      • Laser beam: beam wander/spread, intensity fluctuations

  19. Adaptive Optics: Conventional Approach
      – Performs phase conjugation
        • assumes the intensity is unaffected
      – Complex:
        • requires an accurate wavefront phase sensor (Shack-Hartmann, Zernike nonlinear filter, etc.)
        • computationally intensive control system

  20. Adaptive Optics: Model-Free Integrated Approach
      [Figure: an incoming wavefront reflected from a wavefront corrector with N elements u_1, …, u_n, …, u_N.]
      – Optimizes a direct measure J of optical performance (a "quality metric").
      – No (explicit) model information is required:
        • any type of quality metric J and wavefront corrector (MEMS, LC, …)
        • no need for a wavefront phase sensor
      – Tolerates imprecision in the implementation of the updates:
        • system-level precision is limited by the accuracy of the measured J
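A sketch of this model-free control loop: the corrector drive vector u is updated by parallel perturbative stochastic gradient ascent on the scalar quality metric J(u). The Strehl-like metric below, defined through a hidden random phase aberration, is an illustrative stand-in for a real photodetector measurement; no model of the optics enters the update.

```python
import numpy as np

# Model-free wavefront control: perturb all N corrector elements at once,
# read the quality metric twice, and ascend the measured metric.
rng = np.random.default_rng(5)
N, sigma, mu = 37, 0.02, 5.0                     # e.g. a 37-element corrector
phase_aberration = rng.normal(scale=0.2, size=N) # unknown to the controller

def J(u):                                        # measured quality metric
    residual = u - phase_aberration              # residual wavefront error
    return np.exp(-np.sum(residual ** 2))        # Strehl-like sharpness

u = np.zeros(N)
for k in range(5000):
    pi = sigma * rng.choice([-1.0, 1.0], size=N) # parallel random perturbation
    dJ = 0.5 * (J(u + pi) - J(u - pi))           # two metric readings
    u += mu * dJ * pi                            # ascend the measured metric

print("final quality metric:", J(u))
```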

  21. Adaptive Optics Controller Chip: Optimization by Parallel Perturbative Stochastic Gradient Descent
      [Block diagram: the wavefront corrector Φ(u) shapes the image; a performance-metric sensor measures J(u); the AdOpt VLSI wavefront controller closes the loop by updating the corrector drive vector u.]
