Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing
Daniel Fried and Dan Klein
Parsing by Local Decisions

[Figure: the tree for "The cat took a nap ." built incrementally, as the growing bracket sequence (S (NP The cat ) (VP ...]

The log-likelihood of a parse decomposes into local decisions:

\ell(\theta) = \log p(z \mid y; \theta) = \sum_t \log p(z_t \mid z_{1:t-1}, y; \theta)
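A minimal sketch of this static, gold-sequence objective, assuming a hypothetical transition-based parser interface (init_state, action_logits, apply) that is not any particular system's real API:

import torch

def static_oracle_loss(parser, sentence, gold_actions):
    # Summed negative log-probability of the gold action sequence:
    # -sum_t log p(z_t | z_{1:t-1}, y; theta), conditioning on the *gold* history.
    state = parser.init_state(sentence)
    loss = 0.0
    for gold_action in gold_actions:
        logits = parser.action_logits(state)       # scores for the legal next actions
        log_probs = torch.log_softmax(logits, dim=-1)
        loss = loss - log_probs[gold_action]
        state = state.apply(gold_action)           # advance along the gold derivation
    return loss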
Non-local Consequences

Loss-Evaluation Mismatch
[Figure: true tree z and predicted tree \hat{z} for "The cat took a nap ."; the two differ in bracketing despite sharing most local decisions.]
Training maximizes local log-likelihood, but evaluation uses a sequence-level loss: \Delta(z, \hat{z}) = -\mathrm{F1}(z, \hat{z})

Exposure Bias
True parse z: (S (NP The cat ...
Prediction \hat{z}: (S (NP (VP ??
At test time the model conditions on its own past predictions, which it never saw during training.

[Ranzato et al. 2016; Wiseman and Rush 2016]
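For concreteness, the sequence-level loss above can be computed from labeled bracket overlap. A minimal sketch, ignoring evalb's normalizations and assuming each tree is given as a collection of (label, start, end) spans:

def bracket_f1(gold_spans, pred_spans):
    # Labeled bracketing F1 between a gold and a predicted tree,
    # each represented as a set of (label, start, end) spans.
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# The sequence-level cost on this slide is then Delta(z, z_hat) = -bracket_f1(z, z_hat).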
Dynamic Oracle Training

Explore at training time; supervise each state with an expert policy.

True parse z: (S (NP The cat ...
Prediction \hat{z} (sample, or greedy): (S (NP (VP ...   -> addresses exposure bias
Oracle action z_t^*: chosen to maximize achievable F1 (typically)   -> addresses loss mismatch

\ell(\theta) = \sum_t \log p(z_t^* \mid \hat{z}_{1:t-1}, y; \theta)

[Goldberg & Nivre 2012; Ballesteros et al. 2016; inter alia]
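A rough sketch of this training loop, again with hypothetical parser and oracle interfaces (oracle.best_action is assumed to return the action maximizing achievable F1 from the current state), not any specific parser's real API:

import torch

def dynamic_oracle_loss(parser, oracle, sentence, gold_tree, sample=True):
    # Follow the model's own (sampled or greedy) actions, but supervise
    # every visited state with the expert/oracle action.
    state = parser.init_state(sentence)
    loss = 0.0
    while not state.is_final():
        logits = parser.action_logits(state)
        log_probs = torch.log_softmax(logits, dim=-1)
        oracle_action = oracle.best_action(state, gold_tree)   # expert supervision
        loss = loss - log_probs[oracle_action]
        if sample:
            model_action = torch.distributions.Categorical(logits=logits).sample().item()
        else:
            model_action = int(torch.argmax(logits))
        state = state.apply(model_action)                      # explore with the model
    return loss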
Dynamic Oracles Help!

Expert policies / dynamic oracles (mostly for dependency parsing): Daumé III et al., 2009; Ross et al., 2011; Choi and Palmer, 2011; Goldberg and Nivre, 2012; Chang et al., 2015; Ballesteros et al., 2016; Stern et al., 2017.

PTB Constituency Parsing F1:

System                                         Static Oracle   Dynamic Oracle
Coavoux and Crabbé, 2016                       88.6            89.0
Cross and Huang, 2016                          91.0            91.3
Fernández-González and Gómez-Rodríguez, 2018   91.5            91.7
What if we don't have a dynamic oracle? Use reinforcement learning.
Reinforcement Learning Helps! (in other tasks)

Machine translation: Auli and Gao, 2014; Ranzato et al., 2016; Shen et al., 2016; Xu et al., 2016; Wiseman and Rush, 2016; Edunov et al., 2017
Several other tasks, including CCG parsing and dependency parsing
Policy Gradient Training

Minimize the expected sequence-level cost (risk):

R(\theta) = \sum_{\hat{z}} p(\hat{z} \mid y; \theta) \, \Delta(z, \hat{z})

\nabla R(\theta) = \sum_{\hat{z}} p(\hat{z} \mid y; \theta) \, \Delta(z, \hat{z}) \, \nabla \log p(\hat{z} \mid y; \theta)

[Figure: true parse z and a predicted tree \hat{z} for "The man had an idea.", compared via \Delta(z, \hat{z})]

p(\hat{z} \mid y; \theta): addresses exposure bias (compute by sampling)
\Delta(z, \hat{z}): addresses loss mismatch (compute F1)
\nabla \log p(\hat{z} \mid y; \theta): compute in the same way as for the true tree

[Williams, 1992]
Policy Gradient Training

\nabla R(\theta) = \sum_{\hat{z}} p(\hat{z} \mid y; \theta) \, \Delta(z, \hat{z}) \, \nabla \log p(\hat{z} \mid y; \theta)

Input y: "The cat took a nap."
k candidate trees \hat{z}: samples from the model, plus the true tree z
[Figure: four candidate trees for the input, with varying bracketings]
\Delta(z, \hat{z}) (negative F1) per candidate: -89, -80, -80, -100 (the true tree)
Gradient for each candidate: \nabla \log p(\hat{z}_1 \mid y; \theta), \nabla \log p(\hat{z}_2 \mid y; \theta), \nabla \log p(\hat{z}_3 \mid y; \theta), \nabla \log p(z \mid y; \theta), each weighted by its cost.
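A minimal sketch of this estimator, assuming hypothetical helpers (parser.sample_tree, parser.tree_log_prob, tree.spans(), and the bracket_f1 sketch above) rather than any real parser API:

def policy_gradient_loss(parser, sentence, gold_tree, k=4):
    # Surrogate loss whose gradient matches the slide:
    # sum over candidates of Delta(z, z_hat) * grad log p(z_hat | y; theta).
    # Candidates are samples from the model plus the true tree.
    candidates = [parser.sample_tree(sentence) for _ in range(k - 1)]
    candidates.append(gold_tree)
    loss = 0.0
    for tree in candidates:
        cost = -bracket_f1(gold_tree.spans(), tree.spans())   # Delta(z, z_hat) = -F1
        log_prob = parser.tree_log_prob(sentence, tree)       # differentiable log p(z_hat | y)
        loss = loss + cost * log_prob
    return loss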
Experiments
Setup

Parsers (each trained with each of the training methods below):
‣ Span-Based [Cross & Huang, 2016]
‣ Top-Down [Stern et al., 2017]
‣ RNNG [Dyer et al., 2016]
‣ In-Order [Liu and Zhang, 2017]

Training:
‣ Static oracle
‣ Dynamic oracle
‣ Policy gradient
English PTB F1
[Bar chart: test F1 (scale 90-93) for static oracle, policy gradient, and dynamic oracle training of the Span-Based, Top-Down, RNNG-128, RNNG-256, and In-Order parsers.]
Training Efficiency
[Plot: PTB learning curves for the Top-Down parser; development F1 (89-92) against training epoch (5-45) for static oracle, dynamic oracle, and policy gradient training.]
French Treebank F1
[Bar chart: F1 (scale 80-84) for static oracle, policy gradient, and dynamic oracle training of the Span-Based, Top-Down, RNNG-128, RNNG-256, and In-Order parsers.]
Chinese Penn Treebank v5.1 F1
[Bar chart: F1 (scale 83-88) for static oracle, policy gradient, and dynamic oracle training of the Span-Based, Top-Down, RNNG-128, RNNG-256, and In-Order parsers.]
Conclusions

‣ Local decisions can have non-local consequences:
  ‣ Loss mismatch
  ‣ Exposure bias
‣ How to deal with the issues caused by local decisions?
  ‣ Dynamic oracles: efficient, but model specific
  ‣ Policy gradient: slower to train, but general purpose
Thank you!
For Comparison: A Novel Oracle for RNNG

Given a partial derivation such as (S (NP The man ) (VP had ..., choose the next action as follows (see the sketch below):

1. Close the current constituent if it is a true constituent of the gold tree, or if it could never become a true constituent.
2. Otherwise, open the outermost unopened true constituent at this position.
3. Otherwise, shift the next word.
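A rough sketch of this decision rule in code, with assumed helpers (state.open_constituent, gold_tree.is_constituent, gold_tree.is_reachable, gold_tree.outermost_unopened_label) standing in for the actual RNNG state bookkeeping:

def rnng_oracle_action(state, gold_tree):
    # 1. Close the current constituent if it is a gold constituent,
    #    or if no completion of it could ever be one.
    current = state.open_constituent()        # innermost currently-open bracket, or None
    if current is not None and (gold_tree.is_constituent(current)
                                or not gold_tree.is_reachable(current)):
        return ("CLOSE",)
    # 2. Otherwise, open the outermost gold constituent starting at this
    #    position that has not been opened yet.
    label = gold_tree.outermost_unopened_label(state)
    if label is not None:
        return ("OPEN", label)
    # 3. Otherwise, shift the next word.
    return ("SHIFT",)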