Comparing Convolution Kernels and RNNs on a wide-coverage computational analysis of natural language Fabrizio Costa, Paolo Frasconi, Sauro Menchetti Dept. Systems and Computer Science Università di Firenze Massimiliano Pontil Dept. Information Engineering Università di Siena Related papers available from http://www.dsi.unifi.it/~paolo http://www.dsi.unifi.it/~costa
Overview • Incremental parsing of natural language – A ranking problem on labeled forests • Supervised learning of discrete structures – Recursive neural networks (RNNs) – Kernel-based approaches • New results with RNNs • Experimental comparison
Human vs computer parsing • Computer parsing: typically bottom up – “islands” are built at the beginning and are subsequently joined together • Human parsing: known to be left-to-right – E.g., perception of speech is sequential, reading is sequential, etc.
Strong incrementality hypothesis • The human parser maintains a connected structure that explains the first n−1 words • When the n-th word arrives it is attached to the existing structure [Figure: left context tree for “The servant of the actress”; the new word “who” is attached via a connection path (CP)]
Attachment ambiguity [Figure: two alternative attachments of “who” to the left context tree of “The servant of the actress”, e.g. low vs. high attachment]
Connection path ambiguity • Even for a fixed attachment point there may be several alternative legal connection paths (those matching the POS tag of the new word) [Figure: two alternative connection paths for attaching “his” after “The athlete realized …”]
A forest of alternatives • Given a dynamic grammar, a left context and a next word • Many legal trees can be formed by attaching a CP • One of them is correct, and we want to predict it
Supervised Learning of Discrete Structures • Lack of methods that handle recursive or relational structures such as trees and graphs “directly” • General approach: 1. Convert structures to real vectors 2. Apply known learning methods on the vectors • These steps can be elegantly merged within a more general theoretical framework: 1. Recursive neural networks (Goller & Küchler, IJCNN 96; Frasconi et al., TNN 98) 2. Kernel machines (Haussler 99; Collins & Duffy, NIPS 01, ACL 02)
Differences • Kernel-based methods map a tree into a vector φ(x) in a very high-dimensional, perhaps infinite, space • Bag-of-something kind of representation • Kernel choice is difficult (prior knowledge?) • RNNs map a tree into a low-dimensional vector, e.g. in ℝ^30 • Distributed representation • Task-driven: the RNN encoding in this case depends on the specific learning problem
Kernels • Given sets of nonterminals {A, B, …} and terminals {a, b, …} there are infinitely many possible subtrees [Figure: a few example subtrees over A, B, C, a] • φ_i(t): count of occurrences of subtree i in tree t • φ(t) = [φ_1(t), φ_2(t), φ_3(t), …] has infinite dimensionality, but • φ(t)^T φ(s) can be computed without actually enumerating all subtrees, by dynamic programming (Collins & Duffy, NIPS 2001)
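To make the dynamic-programming idea concrete, here is a minimal Python sketch of a Collins–Duffy-style subtree-counting kernel. It is an illustration, not the authors' implementation: the nested-tuple tree encoding, the helper names (production, all_nodes, tree_kernel) and the decay factor lam are assumptions made for the example.

```python
def production(node):
    """A node's label together with its children's labels (its grammar production)."""
    return (node[0], tuple(c[0] if isinstance(c, tuple) else c for c in node[1:]))

def all_nodes(t):
    """Collect every internal node of a nested-tuple tree; leaves are plain strings."""
    if not isinstance(t, tuple):
        return []
    nodes = [t]
    for c in t[1:]:
        nodes.extend(all_nodes(c))
    return nodes

def tree_kernel(t, s, lam=1.0):
    """K(t, s) = sum over node pairs (n1, n2) of C(n1, n2), the (weighted) number of
    common subtrees rooted at n1 and n2, computed by dynamic programming."""
    memo = {}

    def C(n1, n2):
        key = (id(n1), id(n2))
        if key in memo:
            return memo[key]
        # Different productions: no common subtree rooted at this pair.
        if production(n1) != production(n2):
            memo[key] = 0.0
            return 0.0
        # Same production: lam * product over child positions of (1 + C(child1, child2));
        # terminal children contribute a factor of 1.
        value = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            if isinstance(c1, tuple) and isinstance(c2, tuple):
                value *= 1.0 + C(c1, c2)
        memo[key] = value
        return value

    return sum(C(n1, n2) for n1 in all_nodes(t) for n2 in all_nodes(s))

# Example: two small parse trees sharing the NP subtree ("NP", ("D", "the"), ("N", "dog"))
t1 = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
t2 = ("NP", ("D", "the"), ("N", "dog"))
print(tree_kernel(t1, t2))
```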
Recursive neural networks • Recurrent networks can in principle realize arbitrarily complex dynamical systems • Skepticism: long-term dependencies cannot be easily learned • But trees are different! – Path lengths are O(log n) – Vanishing-gradient problems are not as serious for RNNs on trees [Figure: the sequence (A(B(DEFGH)C)) and the corresponding tree]
Recursive Neural Networks • Let’s introduce a representation vector X(v) ∈ ℝ^n for each vertex v in tree t • X(v) is computed bottom-up [Figure: example tree with nodes A, B, C, D]
Recursive Neural Networks • Base step: the representation of external nodes (“nil children”) is a constant, X(v) = X_0 [Figure: the example tree with X_0 attached to every nil child]
Recursive Neural Networks • Induction: the representation of the subtree rooted at v is a function of 1. the representations at the children of v 2. the symbol U(v) • X(v) = f(X(w_1), …, X(w_k), U(v)) • w_1, …, w_k are v’s children (k assigned) [Figure: X(v) computed from the children’s representations]
Recursive Neural Networks • What, more precisely, is f? X(v) = f(X(w_1), …, X(w_k), U(v)) • f is realized by an MLP with n outputs and nk + m inputs [Figure: MLP taking X(w_1), …, X(w_k) (n units each) and U(v) (m units) as input and producing X(v) (n units)]
Recursive Neural Networks • The computation continues bottom-up until the root r is reached • X(r) encodes the whole tree in a real vector and plays the same role as φ(t) [Figure: the root representation X(r) computed from U(r) and X(w_1), …, X(w_k)]
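As an illustration of the bottom-up computation just described, here is a small NumPy sketch. It is only a sketch under stated assumptions: a fixed maximum out-degree k, a random untrained weight matrix, and a hash-based label encoding stand in for the trained network described in the slides.

```python
import numpy as np

n, m, k = 30, 10, 2                 # state size, label-encoding size, max out-degree (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, n * k + m))   # MLP weights: n outputs, nk + m inputs
b = np.zeros(n)
X0 = np.zeros(n)                    # constant representation of "nil children"

def encode_label(symbol):
    """Illustrative m-dimensional encoding of the node symbol U(v)."""
    u = np.zeros(m)
    u[hash(symbol) % m] = 1.0
    return u

def encode(node):
    """X(v) = f(X(w_1), ..., X(w_k), U(v)), computed bottom-up.
    Trees are nested tuples ('S', child_1, ..., child_j); a leaf is a 1-tuple ('PRP',)."""
    children = [encode(c) for c in node[1:]]
    children += [X0] * (k - len(children))        # missing children are nil children
    inp = np.concatenate(children + [encode_label(node[0])])
    return np.tanh(W @ inp + b)

# X(r) for the root encodes the whole tree, playing the same role as phi(t):
tree = ("S", ("NP", ("PRP",)), ("VP", ("VBZ",)))
x_root = encode(tree)               # a vector in R^30
```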
Structure unfolding [Animation over several slides: the parse tree of “It has no bearing on …” (S, VP, NP, PP; PRP, VBZ, DT, NN, IN) is unfolded and encoded bottom-up, one layer at a time; the final frame adds the output network on top of the root representation]
Prediction phase [Animation: information flows bottom-up from the words “It has no bearing on …” through the recursive network to the output network]
Error correction [Animation: the error signal flows from the output network back down through the recursive network]
Disambiguation is a preference task
Learning preferences • Ranking: given a list of entities (x_1, …, x_r), find a corresponding list of integers (y_1, …, y_r), with y_i in [1, r], such that y_i is the rank of x_i • In total ranking: y_i ≠ y_j • In our case the preferred element x_1 gets y_1 = 1 and the other x_j get y_j = 0 – typically r = 120 (but it can go up to 2000) • Linear utility function: w^T x_1 − w^T x_j > 0 for j = 2, …, r • A set of constraints, similar to binary classification but on differences between vectors • Can be used with SVM and the Voted Perceptron: w^T [φ(x_1) − φ(x_j)] = Σ_{s∈SV} y_s [φ(x_1^(s)) − φ(x_j^(s))]^T [φ(x_1) − φ(x_j)]
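A minimal sketch of this pairwise-difference view of the ranking problem, using a simple perceptron-style update rather than the actual SVM/Voted Perceptron used in the experiments; the forests argument and the convention that row 0 holds the correct candidate are assumptions made for the example.

```python
import numpy as np

def rank_perceptron(forests, epochs=5, lr=1.0):
    """forests: list of (r_i, d) arrays of candidate feature vectors, row 0 = correct tree.
    Learns w so that w.x_1 - w.x_j > 0, updating on difference vectors when violated."""
    d = forests[0].shape[1]
    w = np.zeros(d)
    for _ in range(epochs):
        for X in forests:
            scores = X @ w
            j = int(scores.argmax())
            if j != 0:                      # the correct candidate is not ranked first
                w += lr * (X[0] - X[j])     # perceptron update on the difference vector
    return w
```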
Learning preferences • To get a differentiable version we use the softmax function: y_j = exp(w^T x_j) / Σ_k exp(w^T x_k) • Find w and the x_j by maximizing Σ_i Σ_j [ z_j log y_j + (1 − z_j) log(1 − y_j) ] • where z_1 = 1 and z_j = 0 for j > 1 • Gradients w.r.t. x_j are passed to the RNN, so in this sense x_j is an adaptive encoding
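The softmax preference layer can be sketched in a few lines of NumPy: the loss and its gradients with respect to w and to the candidate encodings x_j (the quantities passed back to the RNN) follow directly from the formulas above. Variable names and the small eps constant are illustrative; the returned loss is the negative of the objective on the slide, so it can be minimized.

```python
import numpy as np

def softmax_preference(X, w, eps=1e-12):
    """X: (r, n) candidate encodings, row 0 is the correct tree; w: (n,) weight vector."""
    scores = X @ w
    scores -= scores.max()                      # numerical stability
    y = np.exp(scores) / np.exp(scores).sum()   # y_j = exp(w.x_j) / sum_k exp(w.x_k)

    z = np.zeros(len(y)); z[0] = 1.0            # target: the first candidate is preferred
    loss = -(z * np.log(y + eps) + (1 - z) * np.log(1 - y + eps)).sum()

    # dL/dscores via the softmax Jacobian, then the chain rule back to w and X
    dL_dy = -(z / (y + eps) - (1 - z) / (1 - y + eps))
    dL_dscores = (np.diag(y) - np.outer(y, y)) @ dL_dy
    grad_w = X.T @ dL_dscores                   # gradient for the utility weights
    grad_X = np.outer(dL_dscores, w)            # gradients passed back to the RNN encoder
    return loss, grad_w, grad_X
```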
Experimental setup • Training on the WSJ section of the Penn treebank – a realistic corpus representative of natural language – large size (40,000 sentences, 1 million words) – uniform language (articles on economic subjects) – train on sections 2–21, test on section 23 • Note: we are not (yet) building a parser • Extending earlier results (Costa et al. 2000; Sturt et al., Cognition, in press)
Results [Bar chart: percentage of correctly predicted trees (scale 40–100%) for RNN, RNN500, LC and MA]
Selecting the right attachment [Chart: percentage correct vs. position (scale 70–100%), curves RNN and Freq] • Given the attachment site, the correct connection path is chosen 89% of the time
Reduced incremental trees • Example tree: left context plus connection path for “Jim saw the thief of a friend with …” [Figure: full incremental tree with the connection path highlighted]
Reduced incremental trees • Right frontier [Figure: the same tree with the right-frontier nodes highlighted]
Reduced incremental trees • Right frontier + c-commanding nodes [Figure: the same tree with right-frontier and c-commanding nodes highlighted]
Reduced incremental trees • Right frontier + c-commanding nodes + connection path [Figure: the same tree with right-frontier nodes, c-commanding nodes and the connection path highlighted]
Reduced incremental trees [Figure: the resulting reduced tree, containing only the nodes S, NP, VP, NP, NP, V, D, N, PP, P]
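A rough Python sketch of how a left-context tree might be pruned to this reduced form, under my reading of the slides: keep the spine of rightmost children (the right frontier) and, for every node on that spine, keep its other children only as bare labels (their c-commanding sisters), discarding their internal structure. Adding the connection path on top is omitted here; the nested-list encoding and the function name are assumptions.

```python
def reduce_tree(node):
    """node: nested list [label, child_1, ..., child_j]; a leaf is [label]."""
    if len(node) == 1:                    # leaf on the right frontier
        return [node[0]]
    reduced = [node[0]]
    for sister in node[1:-1]:             # left sisters of the right-frontier child
        reduced.append([sister[0]])       # keep only the node label
    reduced.append(reduce_tree(node[-1])) # recurse down the rightmost child
    return reduced

# Hypothetical left context with nonterminal/POS labels only:
tree = ["S", ["NP", ["D"], ["N"]], ["VP", ["V"], ["NP", ["D"], ["N"]]]]
print(reduce_tree(tree))   # ['S', ['NP'], ['VP', ['V'], ['NP', ['D'], ['N']]]]
```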
Results [Bar chart: percentage correctly predicted for the full-tree vs. reduced-tree representation (scale 80–100%)]
Data set partitioning (POS-tag based) [Chart: percentage correct (50–100%) and error reduction (0–50%) by POS-tag class: noun, other, verb, article, adjective, punctuation, conjunction, preposition, adverb]
Comparing RNN and VP • Regularization parameter λ = 0.5 (best value based on preliminary trials using a validation set) • Modularization into 10 POS-tag categories • Performance assessed at 100, 500, 2000, 10000, and 40000 training sentences • Small datasets: CPU(VP) ≈ k · CPU(RNN) • Larger datasets: – RNN learns in 1–2 epochs (about 3 days on a 2 GHz machine) – VP took over 2 months to complete 1 epoch