“de-li-de ji-we-ji …”
Pre-Wiring & Pre-Training: What does a neural network need to learn truly general identity rules?
Raquel G. Alhama, Willem Zuidema
The Empirical Data: Do infants generalize identity rules? [Marcus et al. 1999]
PARTICIPANTS: 7-month-old infants
FAMILIARIZATION: either ABA strings (“wi-je-wi le-di-le ji-li-ji …”) or ABB strings (“wi-je-je le-di-di ji-li-li …”)
TEST: strings built from novel syllables, from both grammars: “ba-po-ba (ABA), ko-ga-ga (ABB), ba-po-po (ABB), …”
The Empirical Data: Do infants generalize identity rules? [Marcus et al. 1999]
RESULT: Differential attention between grammars. Test items consistent with the familiarized grammar (e.g. ba-po-ba, which follows ABA like wi-je-wi) are treated differently from inconsistent ones (e.g. ba-po-po and ko-ga-ga, which follow ABB like wi-je-je).
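To make the stimulus design concrete, here is a minimal Python sketch that generates ABA and ABB triples with separate familiarization and test vocabularies. This is only an illustration: the syllable lists and function names are made up and do not come from the authors' materials.

```python
import itertools

# Illustrative syllable inventories; Marcus et al. used disjoint
# vocabularies for familiarization and test, so test items are novel.
FAM_A = ["wi", "le", "ji", "de"]
FAM_B = ["je", "di", "li", "we"]
TEST_A = ["ba", "ko"]
TEST_B = ["po", "ga"]

def make_items(a_syllables, b_syllables, grammar):
    """Generate all X-Y-Z triples for grammar 'ABA' or 'ABB'."""
    items = []
    for a, b in itertools.product(a_syllables, b_syllables):
        if grammar == "ABA":
            items.append((a, b, a))   # third syllable repeats the first
        elif grammar == "ABB":
            items.append((a, b, b))   # third syllable repeats the second
    return items

familiarization_aba = make_items(FAM_A, FAM_B, "ABA")
test_abb = make_items(TEST_A, TEST_B, "ABB")
print(familiarization_aba[:3])  # e.g. [('wi', 'je', 'wi'), ...]
print(test_abb[:3])             # e.g. [('ba', 'po', 'po'), ...]
```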
Modelling the Results [Marcus et al. 1999]
● Symbolic Cognition: “XYZ: X is the same as Z” (X = Z)
● Simple Recurrent Network (SRN)
  – Trained to predict the next syllable
  – Fails to predict novel (test) items [Evaluation: % correct in predicting the third syllable]
A generalizing solution is in the hypothesis space of the SRN; why doesn't it find it?
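For reference, a minimal sketch of a Simple Recurrent (Elman) Network trained on next-syllable prediction, in plain NumPy. The layer sizes, one-hot encoding, learning rate, and single-step gradient truncation are illustrative assumptions, not the settings of the original simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: one-hot syllable vectors, small hidden layer.
n_syllables, n_hidden, lr = 12, 20, 0.1

W_xh = rng.normal(0, 0.1, (n_hidden, n_syllables))  # input -> hidden
W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))     # context -> hidden
W_hy = rng.normal(0, 0.1, (n_syllables, n_hidden))  # hidden -> output

def one_hot(i):
    v = np.zeros(n_syllables)
    v[i] = 1.0
    return v

def train_step(sequence):
    """One pass over a sequence of syllable ids, predicting each next
    syllable; gradients are truncated at one time step (Elman-style)."""
    global W_xh, W_hh, W_hy
    h = np.zeros(n_hidden)
    for x_id, y_id in zip(sequence[:-1], sequence[1:]):
        x = one_hot(x_id)
        h_new = np.tanh(W_xh @ x + W_hh @ h)
        logits = W_hy @ h_new
        p = np.exp(logits - logits.max())
        p /= p.sum()
        d_logits = p - one_hot(y_id)                  # cross-entropy gradient
        d_h = (W_hy.T @ d_logits) * (1 - h_new ** 2)  # backprop through tanh
        W_hy -= lr * np.outer(d_logits, h_new)
        W_xh -= lr * np.outer(d_h, x)
        W_hh -= lr * np.outer(d_h, h)
        h = h_new  # context for the next time step

# Example: one ABA triple encoded as syllable ids.
train_step([0, 5, 0])
```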
Simulations with a Simple Recurrent Network
[Figure: proportion of statistically significant responses to the different grammar conditions, out of 400 runs of the model with different parameter settings]
What is missing in the SRN simulations?
● The SRN was simulated as a tabula rasa: it starts learning from a random state
● Pre-Wiring: what would be a more cognitively plausible initial state?
● Pre-Training: what is the role of prior experience?
Implementation: the Echo State Network (ESN, Jaeger 2001)
● Same hypothesis space as the SRN
● Reservoir Computing approach: only the weights in the output layer are trained (generally with Ridge Regression, but we use Gradient Descent)
● The weights in the reservoir are randomly initialized (with spectral radius < 1)
How can we pre-wire it for this task?
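A minimal ESN sketch along these lines: a random reservoir rescaled to a spectral radius below 1, with only the output (readout) weights trained. For brevity the readout here is fitted with ridge regression, whereas the slide notes the authors use gradient descent; all sizes and scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res, n_out = 12, 100, 12          # illustrative sizes
spectral_radius = 0.9

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_res = rng.normal(0, 1, (n_res, n_res))
# Rescale so the largest eigenvalue magnitude (spectral radius) is < 1.
W_res *= spectral_radius / max(abs(np.linalg.eigvals(W_res)))

def run_reservoir(inputs):
    """Collect reservoir states for a sequence of input vectors."""
    h = np.zeros(n_res)
    states = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_res @ h)
        states.append(h.copy())
    return np.array(states)

def train_readout(states, targets, ridge=1e-4):
    """Only the output weights are trained (here: ridge regression)."""
    return np.linalg.solve(states.T @ states + ridge * np.eye(n_res),
                           states.T @ targets).T   # shape (n_out, n_res)
```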
Pre-Wiring: Delay Line Memory
● DELAY LINE MEMORY: a mechanism that preserves the input by propagating it along a path with a delay (t=0 → t=1 → t=2 → …)
● Implementation:
  – “Feed-forward” structure in the reservoir
  – Strict or approximated copy
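One way to pre-wire such a delay line, sketched under assumptions: impose a feed-forward chain of identity (or noisy near-identity) blocks on the reservoir matrix, so that the block of units holding the input is copied one block further along the chain at every time step. The block layout, noise level, and function name below are illustrative, not the authors' exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)

def delay_line_reservoir(n_in, n_steps, approximate=False, noise=0.05):
    """Reservoir whose recurrent matrix propagates a block of n_in units
    forward by one block per time step (a delay line of length n_steps)."""
    n_res = n_in * n_steps
    W_res = np.zeros((n_res, n_res))
    for t in range(1, n_steps):
        src = slice((t - 1) * n_in, t * n_in)
        dst = slice(t * n_in, (t + 1) * n_in)
        block = np.eye(n_in)                       # strict copy
        if approximate:                            # approximated copy
            block += rng.normal(0, noise, (n_in, n_in))
        W_res[dst, src] = block
    # The input is written only into the first block of units.
    W_in = np.zeros((n_res, n_in))
    W_in[:n_in, :] = np.eye(n_in)
    return W_in, W_res
```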
Pre-Wiring: Delay Line Memory Does the model learn the generalized solution?
Simulations with the Delay Line
[Figure: results for the Original and the Extended condition]
Pre-Training
● There are many solutions that fit the training data (non-generalizing solutions)
● Where does the pressure to find a general solution come from?
  – Hypothesis: prior experience with environmental data may have created a domain-general bias for abstract solutions
→ PRE-TRAINING: Incremental Novelty Exposure
Pre-Training: Incremental Novelty Exposure
TRAINING: items of the form A_i B_j A_i, drawn from a vocabulary window that shifts over time: first {A1…A4} × {B1…B4}, then {A2…A5} × {B2…B5}, …, up to {A_{k-3}…A_k} × {B_{k-3}…B_k}.
TEST: items of the form C_i D_j C_i, built from entirely novel syllables {C1…C4} × {D1…D4}.
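A sketch of this training regime as a loop. The syllable names, window size, number of phases, and the train_model callback are all placeholders for illustration; only the shifting-window structure reflects the scheme above.

```python
def incremental_novelty_exposure(train_model, n_phases, window=4):
    """Pre-train on ABA-style triples while gradually shifting the vocabulary:
    at each phase one novel A and one novel B syllable enter the window and
    the oldest ones drop out, so the model keeps encountering new tokens."""
    # Hypothetical open-ended vocabularies a0, a1, ... and b0, b1, ...
    A = [f"a{i}" for i in range(n_phases + window)]
    B = [f"b{i}" for i in range(n_phases + window)]
    for phase in range(n_phases):
        a_window = A[phase:phase + window]
        b_window = B[phase:phase + window]
        items = [(a, b, a) for a in a_window for b in b_window]  # A_i B_j A_i
        train_model(items)          # placeholder call to train the network
    # Generalization is then tested on C_i D_j C_i with entirely new syllables.
```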
Simulations with Incremental Novelty Exposure
[Figure: prediction results, compared against the Random conditions]
Huge increase in % of correct predictions!
Conclusions
● Finally, simulations of a recurrent network successfully solving the task of Marcus et al. (1999)
● This simple learning problem might hold lessons for more complex architectures solving more complex tasks
● Crucial for success are:
  (i) Pre-Wiring with a structure that improves memory (cf. LSTM, Memory Networks)
  (ii) Pre-Training with training regimes that favour generalization (cf. Dropout)
Contact: rgalhama@uva.nl https://staff.fnwi.uva.nl/r.garridoalhama/
Effect of the Delay Line
[Figure: predictions without vs. with the Delay Line]