Knowledge Extraction from DBNs for Images
Son N. Tran and Artur d'Avila Garcez
Department of Computer Science, City University London
Contents
1. Introduction
2. Knowledge Extraction from DBNs
3. Experimental Results on Images
4. Conclusion and Future Work
Motivation
- Deep networks have shown good performance in image, audio, video and multimodal learning.
- We would like to know why, by studying the role of symbolic reasoning in DBNs. In particular, we would like to find out:
  - how knowledge is represented in deep architectures;
  - the relations between deep networks and a hierarchy of rules;
  - how knowledge can be transferred to analogous domains.
Restricted Boltzmann Machine
- Two-layer symmetric connectionist system [Smolensky, 1986]
- Represents a joint distribution $P(V, H)$
- Given training data, learning by Contrastive Divergence (CD) seeks to maximize $P(V) = \sum_h P(V, H)$
- It can be used to approximate the data distribution given new data (rather like an associative memory)
Restricted Boltzmann Machine (details)
- Generative model that can be trained to maximize the log-likelihood $\mathcal{L}(\theta \mid \mathcal{D}) = \log \prod_{x \in \mathcal{D}} P(v = x)$, where $\theta$ is the set of parameters (weights and biases) and $\mathcal{D}$ is a training set of size $n$
- $P(v = x) = \frac{1}{Z} \sum_h \exp(-E(v, h))$, where $E$ is the energy of the network model
- This log-likelihood is intractable, since it is not easy to compute the partition function $Z = \sum_{v,h} \exp(-E(v, h))$
- But it can be approximated efficiently using CD [Hinton, 2002]: $\Delta w_{ij} = \frac{1}{n} \sum_n (v_i h_j)^{\text{step 0}} - \frac{1}{n} \sum_n (v_i h_j)^{\text{step 1}}$ (see the sketch below)
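A minimal NumPy sketch of one CD-1 update, assuming binary units and omitting the bias terms for brevity; the function names, batch shapes, and learning rate are illustrative, not from the slides.

```python
# One Contrastive Divergence (CD-1) update for a binary RBM (biases omitted).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, rng, lr=0.1):
    """CD-1 weight update for a data batch v0 of shape (n, num_visible)."""
    n = v0.shape[0]
    ph0 = sigmoid(v0 @ W)                               # step 0: P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample hidden states
    pv1 = sigmoid(h0 @ W.T)                             # step 1: reconstruct visibles
    ph1 = sigmoid(pv1 @ W)                              # step 1: P(h=1 | v1)
    dW = (v0.T @ ph0 - pv1.T @ ph1) / n                 # <v_i h_j>_0 - <v_i h_j>_1
    return W + lr * dW

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 3))                  # 6 visible, 3 hidden units
v = rng.integers(0, 2, size=(20, 6)).astype(float)      # toy binary data
W = cd1_update(W, v, rng)
```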
Deep Belief Networks
- Deep Belief Networks [Hinton et al., 2006]: a stack of RBMs
- Greedily learns each pair of layers bottom-up with CD (as sketched below)
- Fine-tuning option 1: split the weight matrix into up and down weights (wake-sleep algorithm)
- Fine-tuning option 2: use as a feedforward neural network and update the weights using backpropagation
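A sketch of the greedy layer-wise pre-training, reusing the hypothetical `sigmoid`/`cd1_update` helpers from the CD sketch above; epoch count and layer sizes are illustrative.

```python
# Greedy layer-wise DBN pre-training: train an RBM, propagate the data
# through it, and train the next RBM on the resulting activations.
import numpy as np

def train_rbm(data, n_hidden, rng, epochs=10, lr=0.1):
    """Train one RBM with CD-1 and return its weight matrix."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        W = cd1_update(W, data, rng, lr)
    return W

def train_dbn(data, layer_sizes, rng):
    """Stack RBMs bottom-up, one per pair of adjacent layers."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden, rng)
        weights.append(W)
        x = sigmoid(x @ W)       # deterministic up-pass feeds the next RBM
    return weights

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(100, 16)).astype(float)
dbn_weights = train_dbn(data, [8, 4], rng)
```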
Deep Belief Networks (example)
- The lower-level layer is expected to capture low-level features (first hidden layer: edges)
- Higher-level layers combine features to learn progressively more abstract concepts (second hidden layer: shapes)
- Labels can be attached at the top RBM for classification (class layer: digits 0 to 9)
Rule Extraction from RBMs: related work
- [Pinkas, 1995]: rule extraction from symmetric networks using penalty logic; proved equivalence between conjunctive normal form and energy functions
- [Penning et al., 2011]: extraction of temporal logic rules from RTRBMs using sampling; extracts rules of the form $\mathit{hypothesis}_t \leftrightarrow \mathit{belief}_1 \wedge \dots \wedge \mathit{belief}_n \wedge \mathit{hypothesis}_{t-1}$
- [Son Tran and Garcez, 2012]: rule extraction using confidence-values, similar to penalty logic but maintaining the implicational form; extraction without sampling
Rule Extraction from RBMs (cont.)
- Both penalties [Pinkas, 1995] and confidence-values [Penning et al., 2011, Son Tran and Garcez, 2012] represent the reliability of a rule
- Inference with penalty logic amounts to optimizing a ranking function, and is thus similar to weighted-SAT
- In [Penning et al., 2011], the confidence-value is not used for inference, whilst the confidence-values extracted by our method can be used for hierarchical inference
Our method: partial-model extraction
- Extracts rules $c_j : h_j \leftrightarrow \bigwedge_{w_{pj} > 0} v_p \wedge \bigwedge_{w_{nj} < 0} \neg v_n$ with confidence $c_j = \sum_{w_{ij} > 0} w_{ij} - \sum_{w_{ij} < 0} w_{ij}$ (i.e. the sum of the absolute values of the weights); the same applies to visible units $v_i$
- Example (see the sketch below):
  $15 : h_0 \leftrightarrow v_1 \wedge \neg v_2 \wedge \neg v_3$
  $7 : h_1 \leftrightarrow v_1 \wedge v_2 \wedge \neg v_3$
- These rules are called partial-model because they capture only partially the architecture and behaviour of the network
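A minimal sketch of this extraction over a weight matrix, assuming a NumPy implementation where column $j$ of $W$ holds the weights into hidden unit $h_j$; the toy weights are hypothetical, chosen to reproduce the two example rules above.

```python
# Partial-model extraction: one rule per hidden unit, confidence c_j = sum |w_ij|.
import numpy as np

def extract_partial_rules(W, var_names):
    rules = []
    for j in range(W.shape[1]):
        body = " & ".join(name if W[i, j] > 0 else f"~{name}"
                          for i, name in enumerate(var_names))
        c = np.abs(W[:, j]).sum()   # positive weights minus negative weights
        rules.append((c, f"h{j} <-> {body}"))
    return rules

W_toy = np.array([[ 5.0,  2.0],
                  [-3.0,  4.0],
                  [-7.0, -1.0]])
for c, rule in extract_partial_rules(W_toy, ["v1", "v2", "v3"]):
    print(f"{c:.0f} : {rule}")
# 15 : h0 <-> v1 & ~v2 & ~v3
# 7  : h1 <-> v1 & v2 & ~v3
```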
Our method: complete-model extraction
- Confidence-vector for each hidden unit: $h_j = [\,|w_{1j}|, |w_{2j}|, \dots]$
- Complete rules: $c_j : h_j \leftrightarrow \bigwedge_{w_{ij} > 0} v_i \wedge \bigwedge_{w_{ij} < 0} \neg v_i$, annotated with the confidence-vector
- Example (sketched in code below):
  $15 : h_0 \leftrightarrow^{[5,3,7]} v_1 \wedge \neg v_2 \wedge \neg v_3$
  $7 : h_1 \leftrightarrow^{[2,4,1]} v_1 \wedge v_2 \wedge \neg v_3$
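The complete-model variant of the same sketch: each rule additionally keeps the confidence-vector $[\,|w_{1j}|, |w_{2j}|, \dots]$, so no weight information is lost. The toy matrix is the same hypothetical one as above.

```python
# Complete-model extraction: rules carry a per-literal confidence-vector.
import numpy as np

def extract_complete_rules(W, var_names):
    rules = []
    for j in range(W.shape[1]):
        body = " & ".join(name if W[i, j] > 0 else f"~{name}"
                          for i, name in enumerate(var_names))
        conf_vec = np.abs(W[:, j])          # [|w_1j|, |w_2j|, ...]
        rules.append((conf_vec.sum(), conf_vec, f"h{j} <-> {body}"))
    return rules

W_toy = np.array([[ 5.0,  2.0],
                  [-3.0,  4.0],
                  [-7.0, -1.0]])
for c, vec, rule in extract_complete_rules(W_toy, ["v1", "v2", "v3"]):
    print(f"{c:.0f} : {rule}  conf-vector {vec}")
# 15 : h0 <-> v1 & ~v2 & ~v3  conf-vector [5. 3. 7.]
# 7  : h1 <-> v1 & v2 & ~v3   conf-vector [2. 4. 1.]
```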
Inference
- Given a rule $c : h \leftrightarrow^{[w_1, w_2, \dots, w_n]} b_1 \wedge \neg b_2 \wedge \dots \wedge b_n$ and beliefs $\alpha_1 : b_1,\ \alpha_2 : \neg b_2,\ \dots,\ \alpha_n : b_n$, we infer $c_h : h$, where $c_h = f(c \times (w_1 \alpha_1 - w_2 \alpha_2 + \dots + w_n \alpha_n))$
- $\alpha_i : b_i$ means that $b_i$ is believed to hold with confidence $\alpha_i$
- $f$ is a monotonically non-decreasing function; we use either a sign-based function ($f(x) = 1$ if $x > 0$, otherwise $f(x) = 0$) or the logistic function. $f$ normalizes the confidence value to $[0, 1]$
- $c$ is the confidence of the rule; $c_h$ is the inferred confidence of $h$
- In partial-models, $w_i = c/n$. The inference is deterministic (but stochastic inference is possible); see the sketch below
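A sketch of this inference step, reading each $\alpha_i$ as the confidence attached to atom $b_i$ and applying the sign of the corresponding literal as in the formula; the rule is the complete-model example from the previous slide, and the belief values are illustrative, not from the slides.

```python
# Rule inference: c_h = f(c * (w_1*a_1 - w_2*a_2 + ... + w_n*a_n)).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer(c, weights, signs, alphas, f=sigmoid):
    """signs[i] is +1 for literal b_i and -1 for ~b_i; alphas are atom confidences."""
    s = sum(sgn * w * a for sgn, w, a in zip(signs, weights, alphas))
    return f(c * s)

# Rule 15 : h0 <->[5,3,7] v1 & ~v2 & ~v3, with hypothetical beliefs
# 0.9 : v1, 0.2 : v2, 0.1 : v3 (evidence that matches the rule body).
c_h = infer(c=15.0, weights=[5.0, 3.0, 7.0], signs=[+1, -1, -1],
            alphas=[0.9, 0.2, 0.1])
print(round(c_h, 3))  # close to 1, since the beliefs agree with the body
```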
Partial-model vs. Complete-model
- Partial model: equalizes the weights, which can help generalization; good if the weights are similar, but loses information otherwise
- Complete model: behaves very much like the network, but its rules are difficult to visualize; used as a baseline
- Example (see the illustration below):
  $2 : h_0 \leftrightarrow v_1 \wedge v_2$
  $2 : h_1 \leftrightarrow v_1 \wedge v_2$
  Both rules have the same confidence-value, but the first is a better match to $h_0$ than the second is to $h_1$
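A small numeric illustration of this information loss, using hypothetical weight columns (not from the slides) that both sum to the same confidence $c = 2$ and so yield identical partial-model rules, while their complete-model confidence-vectors still tell them apart.

```python
# Two hidden units whose weights sum to the same confidence c = 2, so both
# produce the partial-model rule 2 : h <-> v1 & v2.
import numpy as np

W_pair = np.array([[1.0, 1.9],
                   [1.0, 0.1]])   # columns: h0, h1 (illustrative values)
for j in range(W_pair.shape[1]):
    vec = np.abs(W_pair[:, j])
    print(f"h{j}: partial confidence = {vec.sum():.0f}, complete vector = {vec}")
# h0 weighs v1 and v2 equally, so the conjunction v1 & v2 describes it well;
# h1 is dominated by v1, so the same rule is a poorer fit.
```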
XOR problem
- Training data (truth table):
  x y z
  0 0 0
  0 1 1
  1 0 1
  1 1 0
- Trained RBM parameters:
  $W = \begin{bmatrix} -10.0600 & -9.8485 & 3.9304 \\ 9.6408 & 9.5271 & -7.5398 \\ -9.9315 & -9.8054 & 5.0645 \end{bmatrix}$, $\mathrm{visB} = [4.5196, -4.3642, 4.5371]^\top$
- Extracted rules:
  $25 : h_0 \leftrightarrow \neg x \wedge y \wedge z$
  $23 : h_1 \leftrightarrow x \wedge y \wedge \neg z$
  $27 : h_2 \leftrightarrow \neg x \wedge \neg y \wedge \neg z$
  $13 : \top \leftrightarrow x \wedge \neg y \wedge z$
- Taking $z$ as the target (ground truth), the combined, normalized rule is $0.999 : z \leftarrow (x \wedge \neg y) \vee (\neg x \wedge y)$
Logical inference vs. Stochastic inference
- A DBN with 784-500-500-2000 nodes (+10 label nodes) was trained on the MNIST handwritten digits dataset
- The figure shows the result of downward inference from the labels using the network (top) and using its complete model with a sigmoid function $f$ for logical inference (bottom)
- To reconstruct the images from the labels using the network, we run up-down inference several times; to reconstruct the images from the rules, Gibbs sampling is not used, and we go downwards once through the rules
System pruning
- One can use rule extraction to prune the network by removing hidden units corresponding to rules with low confidence-values (as sketched below)
- Figure: reconstruction of images from the pruned RBM with (a) 500 units, (b) 382 units, (c) 212 units, (d) 145 units
- Figure: classification accuracy of an SVM using features from the pruned RBMs
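A sketch of confidence-based pruning under the same NumPy conventions as above; the weight matrix and the percentile threshold are illustrative, not from the experiments.

```python
# Prune hidden units whose extracted rules have low confidence c_j.
import numpy as np

def prune_rbm(W, threshold):
    """Keep only the hidden units (columns) with rule confidence >= threshold."""
    confidences = np.abs(W).sum(axis=0)   # c_j = sum of absolute weights into h_j
    keep = confidences >= threshold
    return W[:, keep], keep

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 500))       # toy stand-in for a trained RBM
c = np.abs(W).sum(axis=0)
W_pruned, kept = prune_rbm(W, threshold=np.percentile(c, 25))  # drop weakest 25%
print(f"kept {kept.sum()} of {W.shape[1]} hidden units")       # ~375 of 500
```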
Transfer Learning
- Problems in machine learning:
  - data in the problem domain is limited;
  - data in the problem domain is difficult to label;
  - prior knowledge in the problem domain is hard to obtain.
- Solution: learn knowledge from unlabelled data in related domains, which are widely available, and transfer that knowledge to the problem domain.
Transferring Knowledge to Learn
- Source domain: MNIST handwritten digits
- Target domains: ICDAR (digit recognition), TiCC (writer recognition)
- Figure: sample images from (a) the MNIST dataset, (b) the ICDAR dataset, (c) the TiCC dataset
Experimental Results

Source : Target | SVM   | RBM   | PM Transfer | CM Transfer
MNIST : ICDAR   | 68.50 | 65.50 | 66.50       | 66.50
                | 38.14 | 50.00 | 50.51       | 51.55
MNIST : TiCC    | 72.94 | 78.82 | 79.41       | 81.18
                | 73.44 | 80.23 | 83.05       | 80.79
(each domain pair has two result rows, as in the original layout)

Figure: TiCC average accuracy vs. size of transferred knowledge