LONG SHORT-TERM MEMORY

Neural Computation 9(8):1735-1780, 1997

Sepp Hochreiter
Fakultät für Informatik, Technische Universität München, 80290 München, Germany
hochreit@informatik.tu-muenchen.de
http://www7.informatik.tu-muenchen.de/~hochreit

Jürgen Schmidhuber
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland
juergen@idsia.ch
http://www.idsia.ch/~juergen

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in the form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).
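As a rough numerical illustration of this statement (a sketch in Python that is not part of the original paper; the single-unit setup, the weights, and the constants are invented for the example), consider scaling an error signal backwards through a single self-recurrent logistic unit with recurrent weight w: after q steps the error has been multiplied by roughly (w * f'(net))^q, which vanishes for |w * f'(net)| < 1 and blows up for |w * f'(net)| > 1. The borderline case w * f'(net) = 1 already hints at the constant error flow that LSTM will enforce.

```python
# Toy illustration (not the paper's code): error back flow through a single
# self-recurrent unit with logistic activation f and recurrent weight w.
# The backpropagated error after q steps scales roughly like (w * f'(net))^q.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backpropagated_error(w, steps, net=0.0, initial_error=1.0):
    """Scale an error signal backwards through `steps` time steps."""
    error = initial_error
    for _ in range(steps):
        y = sigmoid(net)
        error *= w * y * (1.0 - y)   # local derivative times recurrent weight
    return error

for w in (1.0, 4.0, 8.0):            # f'(0) = 0.25, so w * f'(0) is 0.25, 1.0, 2.0
    print(f"w = {w}: error after 100 steps = {backpropagated_error(w, 100):.3e}")
# w = 1.0 -> about 6e-61 (vanishes), w = 4.0 -> 1.0 (constant), w = 8.0 -> about 1e30 (blows up)
```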
The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).
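Before moving on, here is a minimal forward-pass sketch of this idea in Python (illustrative only: the class and variable names are invented, tanh stands in for the paper's squashing functions g and h, and recurrent connections into the gates, bias terms, the learning algorithm, and the truncated gradient are all omitted; Section 4 and Appendix A.1 give the actual architecture and update rules).

```python
# Minimal sketch of one LSTM memory cell in the spirit of Section 4 / Appendix A.1.
# Names, plain Python floats, and the absence of training are choices made for
# this example, not the paper's specification.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MemoryCell:
    def __init__(self, w_in, w_cell, w_out):
        self.w_in, self.w_cell, self.w_out = w_in, w_cell, w_out
        self.state = 0.0                        # internal state s_c

    def step(self, x):
        in_gate = sigmoid(self.w_in * x)        # opens/closes write access
        out_gate = sigmoid(self.w_out * x)      # opens/closes read access
        g = math.tanh(self.w_cell * x)          # squashed cell input
        # Constant error carrousel: the state's self-connection is fixed at 1.0,
        # so error flowing back along `state` is neither scaled up nor down.
        self.state = 1.0 * self.state + in_gate * g
        return out_gate * math.tanh(self.state) # cell output y_c
```

The essential design choice is the fixed self-connection of 1.0 on `state` (the constant error carrousel): error flowing back along this connection is recycled unchanged, while the multiplicative gates decide when the cell may be written to or read from.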
Outline of the paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1) and explicit error flow formulae (A.2).

2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical (a toy sketch of this perturbation effect appears after the discussion of Bengio et al.'s approaches below).

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher-order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, bridging a time lag of 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also the beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).
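To make the perturbation argument from the "Time constants" paragraph concrete, here is a toy contrast in Python (invented for this text; it is neither Sun et al.'s model nor the paper's equations): a unit that unconditionally adds scaled net input to its old activation performs a random walk under irrelevant inputs, whereas a unit whose write access is closed by a multiplicative gate, as LSTM's input gate will do, keeps the stored value intact.

```python
# Toy sketch (not Sun et al.'s actual model): a unit that adds its (scaled)
# net input to its old activation at every step cannot protect a stored value
# from later, irrelevant inputs, whereas a multiplicative input gate can.
import random

random.seed(0)

def ungated(value_to_store, noise_steps, alpha=0.5):
    state = value_to_store
    for _ in range(noise_steps):
        state += alpha * random.gauss(0.0, 1.0)   # every net input perturbs the state
    return state

def gated(value_to_store, noise_steps, alpha=0.5):
    state = value_to_store
    for _ in range(noise_steps):
        gate = 0.0                                # gate closed: irrelevant inputs ignored
        state += gate * alpha * random.gauss(0.0, 1.0)
    return state

print(ungated(1.0, 1000))   # typically drifts far from 1.0 (random walk, std ~ alpha*sqrt(steps))
print(gated(1.0, 1000))     # stays exactly 1.0
```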
Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems.