LSTM: A Search Space Odyssey

  1. LSTM: A Search Space Odyssey. Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, 2015. Presenters: Yijun Tian, Zhenyu Liu

  2. Abstract ● In this paper, the authors analyze the performance of the vanilla LSTM and eight of its variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. ● Hyperparameters for each variant were optimized individually using random search, and their importance was gauged using fANOVA (a tool for assessing hyperparameter importance).

  3. Datasets ● TIMIT: the TIMIT Speech Corpus (speech recognition) ● IAM Online: the IAM Online Handwriting Database (handwriting recognition) ● JSB Chorales: a collection of 382 four-part harmonized chorales by J. S. Bach (polyphonic music modeling)

  4. Vanilla LSTM. N: number of LSTM blocks; M: input size.
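The slide shows only the figure from the paper; as a rough sketch (not the authors' code), the vanilla LSTM forward pass with peephole connections can be written as below. Variable and parameter names are hypothetical NumPy placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vanilla_lstm_step(x, y_prev, c_prev, params):
    """One time step of a vanilla LSTM block with peephole connections.

    x: input vector (size M); y_prev: previous block output (size N);
    c_prev: previous cell state (size N). params holds, for each gate/input
    key in {"z", "i", "f", "o"}: W (N x M input weights), R (N x N recurrent
    weights), p (N peephole weights), b (N biases).
    """
    W, R, p, b = params["W"], params["R"], params["p"], params["b"]
    z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])                      # block input
    i = sigmoid(W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"])    # input gate
    f = sigmoid(W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"])    # forget gate
    c = z * i + c_prev * f                                                  # cell state
    o = sigmoid(W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"])         # output gate
    y = np.tanh(c) * o                                                      # block output
    return y, c
```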

  5. LSTM Variants. The eight variants each change one aspect of the vanilla LSTM: NIG (no input gate), NFG (no forget gate), NOG (no output gate), NIAF (no input activation function), NOAF (no output activation function), NP (no peephole connections), CIFG (coupled input and forget gate), and FGR (full gate recurrence).

  6. Experiments ● Performed 27 random searches (one for each combination of the nine variants and three datasets). ● Each random search encompassed 200 trials, each sampling the following hyperparameters at random: number of LSTM blocks per hidden layer, learning rate, momentum, and standard deviation of Gaussian input noise.
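A minimal sketch of one random-search trial as described on this slide. The sampling ranges below are illustrative assumptions, not the exact ranges used in the paper.

```python
import random

def sample_hyperparameters():
    """Draw one random-search trial; ranges are assumed, not the paper's."""
    return {
        # number of LSTM blocks per hidden layer (log-uniform, assumed range)
        "hidden_size": int(round(10 ** random.uniform(1.3, 2.3))),
        # learning rate (log-uniform, assumed range)
        "learning_rate": 10 ** random.uniform(-6, -2),
        # momentum (assumed range)
        "momentum": 1 - 10 ** random.uniform(-2, 0),
        # standard deviation of Gaussian input noise (assumed range)
        "input_noise_std": random.uniform(0.0, 1.0),
    }

# 200 trials per (variant, dataset) combination, as described on the slide
trials = [sample_hyperparameters() for _ in range(200)]
```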

  7. Results: NOAF (no output activation function) and NFG (no forget gate) perform significantly worse than the vanilla LSTM.

  8. Results: Learning rate and network size are the most important hyperparameters.

  9. Conclusions and Insights ● None of the variants improves significantly upon the standard LSTM architecture. ● Coupling the input and forget gates (CIFG) and removing peephole connections (NP) simplify the LSTM without significantly hurting performance, making them attractive variants. ● The forget gate and the output activation function are the most critical components of the LSTM block. ● Learning rate and network size are the most important hyperparameters. ● Hyperparameter interactions show no apparent structure, so hyperparameters can be tuned almost independently.

  10. Take-home message: the most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on a variety of datasets. Thank you! Questions?
