Compiling Deep Nets
Scott Sanner
Goal of this talk
• Will not evangelize deep networks / their successes
  – Go to ICML, NIPS, Silicon Valley, read the tech news
  – “Just believe”
• But deep nets do not solve all problems – yet
  – They lack techniques for handling arbitrary queries
  – With compilations, that could change
Probabilistic Inference with Arbitrary Queries
Why Deep Nets and not Graphical Models?
Graphical Models Revisited
• HMM / Chain-CRF ↔ LSTM-based RNN
• (Cond.) Ising Model ↔ Convolutional NN
Graphical Models vs. Deep Nets

Graphical Models
• Structured
• Convex parameter learning (if exponential family and no latent variables)
• Latent-variable models are more niche
  – Mixture models, LDA, Bayesian models
• Arbitrary exact inference P(Q|E): intractable (unless compiled)

Deep (Generative) Neural Networks
(Not just learning, also planning/control – Wu, Say, Sanner; NIPS-17)
• Also structured
• Convex? What’s that?
  – Adam, RMSProp work well; see (Neural Taylor Expansion, ICML-17)
• It’s all about the latent (hidden) layer representation
  – Massively overparameterized, which helps with non-convexity
  – But exacerbates overfitting – need novel regularizers (dropout, batch norm)
• Arbitrary exact inference P(Q|E): unknown (can we compile?)

(Margin note: maybe we could cross-pollinate some of this back to GMs?)
Should we all switch to Deep Nets?
• Not quite yet…
• Deep nets are much more specialized than the general motivation for graphical models
• In order to answer general P(Q|E):
  – First need a deep generative model (many flavors)
  – But most currently do inference via sampling
  – How to do arbitrary exact inference?
• Compilations are required to support such inference
Remainder of Talk
• Deep Generative Models
• Arithmetic and Continuous Decision Diagrams
  – Where my focus has been
  – Support marginalization for queries
    • Though it is really hard to bound inference complexity (treewidth is a discrete graphical model notion)
  – Not the only option, but we need continuous compilations
• Compiling Deep Generative Models to DDs
Deep Generative Models
Alphabet Soup: GANs, VAEs, etc.
Vanilla ReLU Deep Network Structure
• Figure: input layer → hidden layer of rectified linear units → output layer
• Note: ReLU is just a piecewise linear function
(Slide from Buser Say)
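To make the “piecewise linear” remark concrete, here is a minimal sketch (not from the slides; the two-layer architecture and all weights are illustrative) of a ReLU forward pass: a composition of affine maps and max(0, ·) is itself piecewise linear.

```python
import numpy as np

def relu(z):
    # ReLU is just the piecewise-linear function max(0, z), applied elementwise.
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    # One hidden layer of rectified linear units followed by a linear output:
    # a composition of affine maps and max(0, .) stays piecewise linear.
    h = relu(W1 @ x + b1)
    return W2 @ h + b2

# Illustrative (made-up) weights for a 2-input, 2-hidden-unit, 1-output net.
x  = np.array([1.0, -2.0])
W1 = np.array([[0.5, -1.0], [2.0, 0.3]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.0])
print(forward(x, W1, b1, W2, b2))
```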
Generative Adversarial Networks (GANs)
• Generator + discriminator framework
  – “Fake data” comes from the generative model
• Can capture complex distributions through “refined” backpropagation
  – For fictitious image generation, can generate clearer images than autoencoders minimizing RMSE
(Slide from Ga Wu)
Variational Auto-Encoders (VAEs)
• Optimize a variational lower bound on P(X)
• Two-way mapping
  – Encoder: P(Z|X)
  – Decoder: P(X|Z) – the generative model
• Reparameterization trick
  – A sample from N(μ, σ²) can be written as μ + σ·ε with ε ~ N(0, 1)
  – Separates the deterministic reasoning from the stochastic part
(Slide from Ga Wu)
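A minimal sketch of the reparameterization trick described above (NumPy stand-in; in a real VAE, mu and sigma would be encoder network outputs): the randomness is isolated in a standard-normal draw so the rest of the computation stays deterministic.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # Sample z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, 1):
    # the stochastic part is confined to eps, so mu and sigma (outputs of a
    # deterministic encoder) can be reasoned about and differentiated through.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng   = np.random.default_rng(0)
mu    = np.array([0.0, 1.0])       # encoder mean (illustrative values)
sigma = np.array([1.0, 0.5])       # encoder std  (illustrative values)
print(reparameterize(mu, sigma, rng))
```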
Deep Autoregressive Networks
• Standard graphical model structure (e.g., a directed model over E, B, A, X)
  – Except that the conditional probabilities are deep networks
• Some recent, more complex variants
  – WaveNet, PixelCNN, PixelRNN
• Note: cannot use standard message-passing algorithms with deep-net factors
  – But we might use decision diagrams
(Slide from Ga Wu)
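A small sketch (with hand-coded, illustrative conditionals rather than the cited models) of the autoregressive factorization the slide refers to: log p(x) = Σ_i log p(x_i | x_<i), where in a deep autoregressive network each conditional would be a neural net.

```python
import numpy as np

# Hand-coded conditionals for two binary variables; in a deep autoregressive
# network each of these would be the output of a neural net given the prefix.
def p_x0(prefix, x0):
    return 0.7 if x0 == 1 else 0.3                # P(x0)

def p_x1(prefix, x1):
    p1 = 0.9 if prefix[0] == 1 else 0.2           # P(x1 = 1 | x0)
    return p1 if x1 == 1 else 1.0 - p1

def log_prob(x, conditionals):
    # Autoregressive chain rule: log p(x) = sum_i log p(x_i | x_{<i}).
    return sum(np.log(c(x[:i], x[i])) for i, c in enumerate(conditionals))

print(np.exp(log_prob([1, 0], [p_x0, p_x1])))     # 0.7 * 0.1 = 0.07
```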
Decision Diagrams
Alphabet Soup: ADDs, AADDs, XADDs
Function Representation (ADDs)
• Why not a directed acyclic graph (DAG)? → the Algebraic Decision Diagram (ADD)

  a b c | F(a,b,c)
  0 0 0 |  0.00
  0 0 1 |  0.00
  0 1 0 |  0.00
  0 1 1 |  1.00
  1 0 0 |  0.00
  1 0 1 |  1.00
  1 1 0 |  0.00
  1 1 1 |  1.00

• The ADD exploits context-specific independence (CSI) and shared substructure.
Trees vs. ADDs
• Figure: decision diagrams for AND, OR, XOR over x_1, x_2, x_3
• Trees can compactly represent AND / OR
  – But not XOR (linear as an ADD, exponential as a tree)
  – Why? Trees must represent every path
Binary Operations (ADDs)
• Why do we order variable tests?
• Ordering enables efficient binary operations…
  – Result: ADD operations can avoid state enumeration
• Figure: applying a binary operation to two ordered ADDs over a, b, c
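A minimal sketch of the ordered Apply operation alluded to above (the tuple node layout and the absence of an operation cache are simplifying assumptions, not the representation of any particular ADD package): because both operands respect the same variable order, the result is built by descending on the earliest-tested variable instead of enumerating all states.

```python
# A node is either a float leaf or a tuple (var_index, high_child, low_child).
def var(node):
    return node[0] if isinstance(node, tuple) else float('inf')

def apply_op(f, g, op):
    # Recursive Apply on two ordered diagrams: recurse on the earliest-tested
    # variable of f and g; a real implementation would also memoize (f, g)
    # pairs in an operation cache to keep the construction polynomial.
    if not isinstance(f, tuple) and not isinstance(g, tuple):
        return op(f, g)                              # both leaves: combine
    v = min(var(f), var(g))
    fh, fl = (f[1], f[2]) if var(f) == v else (f, f)
    gh, gl = (g[1], g[2]) if var(g) == v else (g, g)
    hi, lo = apply_op(fh, gh, op), apply_op(fl, gl, op)
    return hi if hi == lo else (v, hi, lo)           # merge redundant tests

# Example: sum the indicator diagrams for variable a (index 0) and b (index 1).
f = (0, 1.0, 0.0)
g = (1, 1.0, 0.0)
print(apply_op(f, g, lambda u, w: u + w))   # (0, (1, 2.0, 1.0), (1, 1.0, 0.0))
```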
ADD Inefficiency
• Are ADDs enough? Or do we need more compactness?
• Ex. 1: Additive reward/utility functions
  – R(a,b,c) = R(a) + R(b) + R(c) = 4a + 2b + c
  – The ADD needs a full diagram over a, b, c with leaves 7, 6, 5, 4, 3, 2, 1, 0
• Ex. 2: Multiplicative value functions
  – V(a,b,c) = V(a) · V(b) · V(c) = γ^(4a + 2b + c)
  – The ADD needs leaves γ^7, γ^6, …, γ^1, γ^0
Affine ADD (AADD)  (Sanner, McAllester, IJCAI-05)
• Define a new decision diagram – the Affine ADD
• Edges are labeled by an offset (c) and a multiplier (b): a decision node a has edges <c1, b1> to F1 and <c2, b2> to F2
• Semantics: if (a) then (c1 + b1·F1) else (c2 + b2·F2)
Affine ADD (AADD)
• Maximize sharing by normalizing internal nodes to the range [0, 1]
  – A top-level affine transform recovers the original range
• Example: if (a) then (4) else (2)
  – Unnormalized: node a with edges <4, 0> and <2, 0> to the leaf 0
  – Normalized: top-level transform <2, 2> over node a with edges <1, 0> (true) and <0, 0> (false) to the leaf 0
AADD Examples (automatically constructed!)
• Back to our previous examples…
• Ex. 1: Additive reward/utility functions
  – R(a,b) = R(a) + R(b) = 2a + b
  – Top-level transform <0, 3>; node a with edges <2/3, 1/3> (true) and <0, 1/3> (false);
    node b with edges <1, 0> (true) and <0, 0> (false) to the single leaf 0
• Ex. 2: Multiplicative value functions
  – V(a,b) = V(a) · V(b) = γ^(2a + b), with γ < 1
  – Top-level transform <γ^3, 1 − γ^3>; node a with edges <0, (γ^2 − γ^3)/(1 − γ^3)> and
    <(γ − γ^3)/(1 − γ^3), (1 − γ)/(1 − γ^3)>; node b again with edges <1, 0> and <0, 0> to the leaf 0
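A quick numerical check (a sketch, using the edge labels read off the additive example above) that the AADD semantics <c, b> : c + b·child, with internal nodes normalized to [0, 1] and a top-level affine transform, reproduces R(a, b) = 2a + b.

```python
# Normalized AADD for R(a,b) = 2a + b: single leaf 0, node b, node a,
# and the top-level affine transform <0, 3> recovering the range [0, 3].
def eval_b(b):
    c, m = (1.0, 0.0) if b else (0.0, 0.0)    # edges <1,0> (true) / <0,0> (false)
    return c + m * 0.0                        # both edges point to the leaf 0

def eval_a(a, b):
    c, m = (2/3, 1/3) if a else (0.0, 1/3)    # edges <2/3,1/3> / <0,1/3>
    return c + m * eval_b(b)

def R(a, b):
    return 0.0 + 3.0 * eval_a(a, b)           # top-level transform <0, 3>

for a in (0, 1):
    for b in (0, 1):
        assert abs(R(a, b) - (2*a + b)) < 1e-12
print("AADD evaluation matches 2a + b on all assignments")
```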
ADDs vs. AADDs
• Additive functions: ∑_{i=1..n} x_i
• Note: no context-specific independence, but subdiagrams are shared – result size O(n^2)
ADDs vs. AADDs
• Additive functions: ∑_i 2^i x_i
  – The best-case separation: exponential as an ADD vs. linear as an AADD
ADDs vs. AADDs
• Additive functions: ∑_{i=0..n-1} F(x_i, x_{(i+1) mod n})
• Figure: cyclic structure over x_1 … x_7 – the pairwise factoring is evident in the AADD structure
But we want to compile deep networks
Hidden layers are continuous
ReLU Deep Nets are Piecewise Linear!
• E.g., see the MILP compilation of ReLU deep nets for optimization (Say, Wu, Zhou, Sanner; IJCAI-17)
• Figure: input/output and hidden rectified linear units; ReLU is just a piecewise linear function
(Slide from Buser Say)
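For reference, one standard way such compilations encode a single ReLU unit z = max(0, wᵀx + b) as mixed-integer linear constraints is the big-M formulation below (a generic textbook encoding, not necessarily the exact formulation of the cited IJCAI-17 paper), where M bounds |wᵀx + b| and the binary δ selects the active linear piece:

```latex
\begin{aligned}
  & z \ge w^\top x + b, \qquad z \ge 0, \\
  & z \le w^\top x + b + M(1 - \delta), \qquad z \le M\,\delta, \qquad \delta \in \{0, 1\}
\end{aligned}
```

With δ = 1 these constraints force z = wᵀx + b (the active piece); with δ = 0 they force z = 0, so an MILP solver can reason exactly about the piecewise-linear network.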
Case → XADD

V = case
  x1 + k > 100 ∧ x2 + k > 100                                : 0
  x1 + k > 100 ∧ x2 + k ≤ 100                                : x2
  x1 + k ≤ 100 ∧ x2 + k > 100                                : x1
  x1 + x2 + k > 100 ∧ x1 + k ≤ 100 ∧ x2 + k ≤ 100 ∧ x2 > x1  : x2
  x1 + x2 + k > 100 ∧ x1 + k ≤ 100 ∧ x2 + k ≤ 100 ∧ x2 ≤ x1  : x1
  x1 + x2 + k ≤ 100                                          : x1 + x2
  …                                                          : …

Sanner et al. (UAI-11); Sanner and Abbasnejad (AAAI-12); Zamani, Sanner et al. (AAAI-12)
Compactness of (X)ADDs
• The XADD is linear in the number of decisions φ_1 … φ_5
• The flat case version has an exponential number of partitions!
XADD Maximization
• Example: max of an XADD rooted at y > 0 (leaf y) and one rooted at x > 0 (leaf x)
  – The result tests y > 0 and x > 0, and compares the leaves with a new test x > y
• Symbolic max may introduce new decision tests
• Operations exploit structure: O(|f||g|)
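A small sketch of the maximization above (using sympy, with a flat case-list representation rather than a true XADD): every partition of f is crossed with every partition of g, and a new decision test comparing the two leaf expressions is introduced where neither dominates.

```python
import sympy as sp

x, y = sp.symbols('x y')

# Two case functions as (condition, value) partitions, as in the slide:
# f = max(x, 0) branching on x > 0, g = max(y, 0) branching on y > 0.
f = [(x > 0, x), (x <= 0, sp.Integer(0))]
g = [(y > 0, y), (y <= 0, sp.Integer(0))]

def casemax(f, g):
    # Cross every partition of f with every partition of g; where the leaf
    # expressions are incomparable, introduce the new decision test vf > vg.
    out = []
    for cf, vf in f:
        for cg, vg in g:
            out.append((sp.And(cf, cg, vf > vg), vf))
            out.append((sp.And(cf, cg, vf <= vg), vg))
    return out

for cond, val in casemax(f, g):
    if cond is not sp.false:          # drop trivially infeasible partitions
        print(cond, '->', val)
```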
Maintaining XADD Orderings
• Max may get decisions out of order
  – Global decision ordering (root → leaf): x > y, then y > 0, then x > 0
  – In the max example, the newly introduced node x > y appears below y > 0 and x > 0 – out of order!
Maintaining XADD Orderings
• Substitution may get decisions out of order
  – Global decision ordering (root → leaf): x > y, then y > 0, then x > z
  – Applying σ = { z / y } to a diagram rooted at y > 0 turns its x > z nodes into x > y nodes
  – The substituted nodes are now out of order!
Correcting XADD Ordering
• Obtain an ordered XADD from an unordered XADD
  – Key idea: binary operations maintain orderings
• If decision z is out of order at a node with children ID_1 and ID_0:
  – Inductively assume ID_1 and ID_0 are already ordered
  – Compute (indicator diagram for z with leaves 1/0) ⊗ ID_1  ⊕  (indicator for z with leaves 0/1) ⊗ ID_0
  – All operands are ordered, so applying ⊗ and ⊕ produces an ordered result with z in order!
Maintaining Minimality
• Example: under path constraints x > 0 and y > 0, the test x + y < 0 is always false
  – The node below it is unreachable
• If the decisions are linear, this can be detected with the feasibility checker of an LP solver, and the node pruned
• More subtle prunings are possible as well
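A sketch of that pruning check (using scipy's linprog; the epsilon handling of strict inequalities is a simplification) for the path in the example: the conjunction x > 0 ∧ y > 0 ∧ x + y < 0 has no solution, so the node below it can be pruned.

```python
import numpy as np
from scipy.optimize import linprog

def path_feasible(A_ub, b_ub, n_vars):
    # A node is reachable only if the linear decisions along its path,
    # written as A_ub @ v <= b_ub, are jointly satisfiable.  A zero-objective
    # LP answers exactly this; if it is infeasible, the node can be pruned.
    res = linprog(c=np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_vars, method='highs')
    return res.status == 0            # 0 = feasible/optimal, 2 = infeasible

# Path constraints from the slide: x > 0, y > 0, x + y < 0 (strict
# inequalities approximated here with a small epsilon slack).
eps = 1e-6
A = np.array([[-1.0,  0.0],           # -x      <= -eps   (x > 0)
              [ 0.0, -1.0],           #     -y  <= -eps   (y > 0)
              [ 1.0,  1.0]])          #  x + y  <= -eps   (x + y < 0)
b = np.array([-eps, -eps, -eps])
print(path_feasible(A, b, n_vars=2))  # False: the branch is unreachable
```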
What’s the Minimal Diagram?
• Example: equivalent XADDs over the decisions x > 6, x > 7, x > 8 with leaves 1, 2, 3
• Do we search through all possible node rotations to find the minimal one?
• Canonicity is still an open question!
Affine XADD?
• We’re working on it (the affine extension can be defined in different ways)
Compiling Deep Nets
Key idea: Compile with XADD Apply!
• Compile the deep net with its learned weights into an XADD mapping the state at time t to the state at time t+1
• Build bottom-up: each node is an “Apply” of the sum and max operations of its children (rectified linear units)
• Many more details depend on the source model, but this is the key idea permitting compilation and automated inference w.r.t. the deep generative model source
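A toy sketch of the bottom-up idea (using sympy with made-up weights standing in for learned ones; an actual XADD compilation would keep the decision-diagram structure rather than a flat symbolic expression): each hidden unit is built by applying a weighted sum followed by a max with 0, and the result is a piecewise-linear case function over the inputs that exact inference can then operate on.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# Tiny ReLU net with illustrative (made-up) "learned" weights.  Each hidden
# unit is constructed bottom-up by applying a weighted sum and then a max
# with 0 -- the same two operations an XADD-based compilation composes.
h1 = sp.Max(0, 2*x1 - x2 + 1)        # ReLU(w1 . x + b1)
h2 = sp.Max(0, -x1 + 3*x2)           # ReLU(w2 . x + b2)
y  = h1 + 2*h2 - 1                   # linear output layer

# The compiled object is piecewise linear in the inputs; flattening it into
# explicit (value, condition) partitions is the case-statement analogue of
# the XADD structure the talk builds.
y_cases = sp.piecewise_fold(y.rewrite(sp.Piecewise))
print(y_cases)
print(y.subs({x1: 1, x2: 1}))        # 2 + 2*2 - 1 = 5
```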