Finding Structure in Time
Jeffrey L. Elman
In Cognitive Science 14, 179–211 (1990)
presented by Dominic Seyler (dseyler2@illinois.edu)
Outline
• Motivation
• Method
• Experiments
  • Exclusive-Or
  • Structure in Letter Sequences
  • Discovering the Notion “Word”
  • Discovering Lexical Classes
• Conclusions
Motivation: The Problem with Time
• Previous methods of representing time
  • Associate the serial order of a temporal pattern with the dimensionality of the pattern vector
  • [ 0 1 0 0 1 ] <- first, second, third... event in temporal order
• There are several downsides to representing time this way
  • An input buffer is required so that all events can be presented at once
  • All input vectors must be the same length, sized for the longest possible temporal pattern
  • Most importantly: it cannot distinguish relative from absolute temporal position; the same pattern shifted in time yields two very different vectors:
    [ 0 1 1 1 0 0 0 0 0 ]
    [ 0 0 0 1 1 1 0 0 0 ]
An Alternative Way of Treating Time
• Don’t model time as an explicit part of the input
• Instead, let time be represented by the effect it has on processing
• The network allows its hidden units to see their own previous output
• These recurrent connections are what give the network memory
Approach: Recurrent Neural Network
• Augment the input with additional units (context units)
• When the input is processed sequentially, the context units hold the exact values of the hidden units from the previous time step
• The hidden units map the external input plus the previous internal state to the desired output
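A minimal sketch of the forward pass of such a network, assuming NumPy; the class name, layer sizes, and weight initialization are illustrative choices, not Elman's original settings.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class SimpleRecurrentNet:
        """Elman-style network: the hidden units receive the current input plus
        a copy of their own activations from the previous time step."""
        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in  = rng.normal(scale=0.1, size=(n_hidden, n_in))
            self.W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
            self.W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))
            self.context = np.zeros(n_hidden)          # context starts out "blank"

        def step(self, x):
            h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
            y = sigmoid(self.W_out @ h)
            self.context = h                           # copy hidden state back for the next step
            return y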
Exclusive-OR
• The XOR function cannot be learned by a simple two-layer network
• Temporal XOR: one input bit is presented at a time; the task is to predict the next bit
  • Input:  1 0 1 0 0 0
  • Output: 0 1 0 0 0 ?
• Training: run 600 passes through a 3,000-bit XOR sequence
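A sketch of how the training stream could be generated, assuming (as in the paper) that it consists of random bit pairs each followed by their XOR, so 1,000 triples give the 3,000-bit sequence; the function name is illustrative.

    import random

    def xor_stream(n_triples=1000, seed=0):
        """Concatenate triples (b1, b2, b1 XOR b2) into one long bit sequence."""
        random.seed(seed)
        bits = []
        for _ in range(n_triples):
            b1, b2 = random.randint(0, 1), random.randint(0, 1)
            bits += [b1, b2, b1 ^ b2]
        return bits

    stream = xor_stream()
    # training pairs for next-bit prediction: input = bits[t], target = bits[t + 1]
    pairs = list(zip(stream[:-1], stream[1:]))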
Exclusive-OR (cont.)
• It is only sometimes possible to predict the next bit correctly
  • After the first bit of a pair, there is a 50/50 chance
  • After two bits, the third bit is the XOR of the first and second
  • Example: having seen 1 0, the network can predict the next bit with certainty (1 XOR 0 = 1); the bit after that starts a new random pair, so error drops only on every third bit
Structure in Letter Sequences
• Idea: extend the prediction task from one-bit inputs to more complex, multi-bit vectors
• Method:
  • Map six letters (b, d, g, a, i, u) to a binary representation
  • Use the three consonants to create a random 1,000-letter sequence
  • Replace each consonant with a syllable by appending vowels: b -> ba; d -> dii; g -> guuu
  • Example input: dbgbddg -> diibaguuubadiidiiguuu
• Prediction task: given the bit representations of the characters in sequence, predict the next character
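A sketch of the sequence construction, assuming Python; the 6-bit letter codes below are illustrative placeholders for the paper's phonetic feature vectors.

    import random

    SYLLABLE = {'b': 'ba', 'd': 'dii', 'g': 'guuu'}    # consonant -> expanded syllable

    # illustrative 6-bit code per letter (the paper uses phonetic feature vectors)
    BITS = {'b': [1,0,1,0,0,1], 'd': [1,0,1,1,0,1], 'g': [1,0,1,0,1,1],
            'a': [0,1,0,0,1,1], 'i': [0,1,0,1,0,1], 'u': [0,1,0,1,1,1]}

    def make_sequence(n_consonants=1000, seed=0):
        random.seed(seed)
        consonants = random.choices('bdg', k=n_consonants)     # random consonant order
        letters = ''.join(SYLLABLE[c] for c in consonants)     # e.g. dbg -> diibaguuu
        return [BITS[ch] for ch in letters]                    # one 6-bit vector per letter

    vectors = make_sequence()
    # prediction task: input = vectors[t], target = vectors[t + 1]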
Structure in Letter Sequences (cont.)
• Since the consonants were ordered randomly, prediction error on them is high
• The vowels are not random, so the network can make use of previous information; error on them is low
• Takeaway: because the input is structured, the network can make partial predictions even where a complete prediction is not possible
Discovering the Notion “Word”
• Learning a language involves learning words
• Can the network learn “words” automatically when given a sequential stream of concatenated characters?
• Words are represented as concatenations of the bit vectors of their characters
• These word vectors are concatenated to form sentences
• Each character is then input sequentially and the network has to predict the following letter
  • Input:  manyyearsago
  • Output: anyyearsago?
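A sketch of how the character stream and prediction targets could be built, assuming Python; the lexicon and the 5-bit character encoding below are hypothetical stand-ins for the lexicon and letter vectors used in the paper.

    import random
    import numpy as np

    def char_vec(ch):
        # illustrative 5-bit code per letter (a = 00001, b = 00010, ...)
        code = ord(ch) - ord('a') + 1
        return np.array([int(b) for b in format(code, '05b')])

    LEXICON = ['many', 'years', 'ago', 'a', 'boy', 'and', 'girl', 'lived']   # hypothetical
    random.seed(0)
    stream = ''.join(random.choices(LEXICON, k=200))   # no word boundaries: "manyyearsago..."

    inputs  = [char_vec(c) for c in stream[:-1]]
    targets = [char_vec(c) for c in stream[1:]]        # task: predict the next letter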
Discovering the Notion “Word” (cont.)
• At the onset of each word, error is high
• As more of the word is received, error declines
• The error signal thus gives a good clue as to what the recurring sequences in the input are, and it correlates strongly with word boundaries
• The network can learn the boundaries of linguistic units from the input signal
Discovering Lexical Classes from Word Order
• Can the network learn the abstract structure that underlies sentences when only the surface forms (i.e., the words) are presented to it?
• Method
  • Define a set of category-to-word mappings (e.g., NOUN-HUMAN -> man, woman; VERB-PERCEPTION -> smell, see)
  • Use templates to create sentences (e.g., NOUN-HUMAN VERB-EAT NOUN-FOOD)
  • Map the words of each sentence (e.g., “woman eat bread”) to one-hot vectors (e.g., 00010 00100 10000)
• Task: given a word vector (“woman”), predict the next word (“eat”)
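A sketch of the sentence-generation and encoding step, assuming Python; only a small fragment of the category-to-word mappings and templates is shown, and the vocabulary size and one-hot ordering are illustrative.

    import random
    import numpy as np

    # fragment of the category-to-word mappings and sentence templates (illustrative)
    CATEGORIES = {
        'NOUN-HUM':  ['man', 'woman'],
        'VERB-EAT':  ['eat'],
        'NOUN-FOOD': ['bread', 'cookie'],
    }
    TEMPLATES = [('NOUN-HUM', 'VERB-EAT', 'NOUN-FOOD')]

    VOCAB = sorted({w for words in CATEGORIES.values() for w in words})

    def one_hot(word):
        v = np.zeros(len(VOCAB))
        v[VOCAB.index(word)] = 1.0
        return v

    def make_corpus(n_sentences=100, seed=0):
        random.seed(seed)
        words = []
        for _ in range(n_sentences):
            template = random.choice(TEMPLATES)
            words += [random.choice(CATEGORIES[cat]) for cat in template]
        return words

    corpus = make_corpus()
    # task: given one_hot(corpus[t]), predict one_hot(corpus[t + 1])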
Discovering Lexical Classes (cont.)
• Since the prediction task is nondeterministic, RMS error is not a fitting measure of what the network has learned
• Instead, save the hidden-unit vectors produced for each word in all possible contexts and average over them
• Perform hierarchical clustering on the averaged vectors
• The similarity structure of the internal representations is shown as a tree
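A sketch of the averaging-and-clustering step, assuming SciPy (and matplotlib for the plot); the function name and the distance/linkage choices are illustrative, not necessarily those of the paper.

    import numpy as np
    from collections import defaultdict
    from scipy.cluster.hierarchy import linkage, dendrogram

    def cluster_hidden_states(words, hidden_states):
        """words: the input words in order; hidden_states: the matching hidden-unit
        vectors. Average each word's vectors over all contexts, then cluster."""
        by_word = defaultdict(list)
        for w, h in zip(words, hidden_states):
            by_word[w].append(h)
        labels = sorted(by_word)
        means = np.array([np.mean(by_word[w], axis=0) for w in labels])
        Z = linkage(means, method='average', metric='euclidean')
        dendrogram(Z, labels=labels)    # the tree groups nouns/verbs and finer classes
        return Z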
Discovering Lexical Classes (cont.)
• The network has developed internal representations of the input vectors that reflect facts about the possible sequential ordering of the inputs
• The hidden-unit patterns are not word representations in the conventional sense, since the patterns also reflect the prior context
• Error in predicting the actual next word in a given context is high, but the network is able to predict the approximate likelihood of occurrence of classes and words
• A given node in the hidden layer participates in multiple concepts; only the activation pattern in its entirety is meaningful
Conclusions
• Networks can learn temporal structure implicitly
• Problems change their nature when expressed as temporal events (XOR could not previously be learned by a simple two-layer network)
• The error signal is a good indicator of where structure exists (error was high at the beginning of each word in a sentence)
• Increasing complexity does not necessarily result in worse performance (increasing the number of bits did not hurt performance)
• Internal representations can be hierarchical in nature (similarity was high among words within a class)