Distilling Knowledge for Search-based Structured Prediction
Yijia Liu*, Wanxiang Che, Huaipeng Zhao, Bing Qin, Ting Liu
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
Complex Model Wins [ResNet, 2015] [He+, 2017]
[Bar charts: Dependency Parsing (LAS) and NMT (BLEU) comparing Baseline, search SOTA, Distillation, and Ensemble.]
Classification vs. Structured Prediction
A classifier maps an input x to a single label y; a structured predictor maps x to a structure y = (y_1, y_2, ..., y_n).
Classification vs. Structured Prediction
Example: a classifier maps the sentence "I like this book" to a single label, while a structured predictor maps it to a full output structure (e.g. a parse tree or a translation).
Search-based Structured Prediction
[Figure: the output for "I like this book" is constructed step by step by searching a space of partial structures.]
A Classifier p(y | s) that Controls the Search Process
[Bar chart: p(y | "I", "like") over the candidate words book, I, like, love, the, this; the highest-scoring word extends the partial output "I like ..." in the search space.]
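To make the search process concrete, here is a minimal greedy-decoding sketch, assuming a PyTorch-style model that maps a search state to a vector of logits over candidate actions; greedy_search, state.apply, and state.is_final are illustrative names, not part of the original slides.

```python
import torch
import torch.nn.functional as F

def greedy_search(model, initial_state, max_steps=50):
    """Build the output step by step: at every state the classifier scores the
    candidate actions with p(y | s) and the best-scoring one is applied."""
    state = initial_state
    for _ in range(max_steps):
        if state.is_final():                     # hypothetical terminal test
            break
        probs = F.softmax(model(state), dim=-1)  # p(y | s) over candidate actions
        action = int(torch.argmax(probs))        # greedy choice, e.g. "this"
        state = state.apply(action)              # hypothetical transition function
    return state
```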
The Generic p(y | s) Learning Algorithm
At each state, maximize the likelihood of the reference action: argmax_θ Σ_y δ(y = "this") log p(y | "I", "like").
[Bar chart: the training target puts all probability mass on the reference word "this"; the state comes from the reference path in the search space.]
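A minimal sketch of this negative log-likelihood objective at a single state, under the same assumptions as the sketch above; nll_step_loss and ref_action are illustrative names.

```python
import torch
import torch.nn.functional as F

def nll_step_loss(model, state, ref_action):
    """Generic learning at one state: push all probability mass toward the
    single reference action ("this" in the running example)."""
    log_p = F.log_softmax(model(state), dim=-1)  # log p(y | s) over the candidates
    return -log_p[ref_action]                    # -log p(y = reference | s)
```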
Problems of the Generic Learning Algorithm
• Ambiguities in training data: "both this and the seem reasonable."
[Figure: at the state "I like ...", both the and this are acceptable continuations.]
Problems of the Generic Learning Algorithm
• Ambiguities in training data: "both this and the seem reasonable."
• Training and test discrepancy: "What if I made a wrong decision?"
[Figure: the search space, including a state reached by a wrong decision.]
Solutions in Previous Works
• Ambiguities in training data → Ensemble (Dietterich, 2000)
• Training and test discrepancy → Explore (Ross and Bagnell, 2010)
[Figure: the running example's search space.]
Where We Are: Knowledge Distillation
Knowledge distillation as a single technique for both problems: ambiguities in training data and the training/test discrepancy.
[Figure: the running example's search space.]
Knowledge Distillation
Learning from negative log-likelihood: argmax_θ Σ_y δ(y = "this") log p(y | "I", "like")
Learning from knowledge distillation: argmax_θ Σ_y q(y) log p(y | "I", "like")
q(y | s) is the output distribution of a teacher model (e.g. an ensemble).
On supervised data, the two objectives are interpolated:
argmax_θ (1 - α) Σ_y δ(y = "this") log p(y | "I", "like") + α Σ_y q(y) log p(y | "I", "like")
[Bar charts: the one-hot reference distribution vs. the teacher's soft distribution over book, I, like, love, the, this.]
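A sketch of this interpolated objective, again assuming a PyTorch-style student that returns per-state logits; distill_step_loss and teacher_probs are illustrative names, and alpha matches the α in the formula above.

```python
import torch
import torch.nn.functional as F

def distill_step_loss(student, state, teacher_probs, ref_action, alpha=0.8):
    """Interpolated objective on supervised data:
    (1 - alpha) * NLL toward the reference + alpha * cross-entropy against q."""
    log_p = F.log_softmax(student(state), dim=-1)
    nll = -log_p[ref_action]              # delta(y = reference) term
    kd = -(teacher_probs * log_p).sum()   # -sum_y q(y) log p(y | s)
    return (1.0 - alpha) * nll + alpha * kd
```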
Knowledge Distillation: from Where?
Learning from knowledge distillation: argmax_θ Σ_y q(y) log p(y | "I", "like")
Ensembles address ambiguities in training data (Dietterich, 2000): we use an ensemble of M structured predictors as the teacher q.
[Bar chart: the teacher's soft distribution over book, I, like, love, the, this.]
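One plausible way to form the teacher q from M single models, shown as a sketch; ensemble_teacher_probs is an illustrative name, and simple probability averaging is an assumption rather than a detail spelled out on the slide.

```python
import torch
import torch.nn.functional as F

def ensemble_teacher_probs(teacher_models, state):
    """Teacher distribution q(y | s): average the M single models' softmax outputs."""
    probs = [F.softmax(m(state), dim=-1) for m in teacher_models]
    return torch.stack(probs).mean(dim=0)
```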
KD on Supervised (Reference) Data
On states along the reference path, interpolate the one-hot reference target with the teacher's distribution:
(1 - α) Σ_y δ(y = "this") log p(y | "I", "like") + α Σ_y q(y) log p(y | "I", "like")
[Bar charts: the one-hot reference distribution and the teacher distribution over book, I, like, love, the, this; the states come from the reference path in the search space.]
KD on Explored Data
Exploration addresses the training/test discrepancy (Ross and Bagnell, 2010): we use the teacher q to explore the search space and learn from KD on the explored states, e.g. Σ_y q(y) log p(y | "I", "like", "the") at a state off the reference path.
[Bar chart: the teacher's distribution over book, I, like, love, the, this at the explored state.]
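A sketch of KD on explored data under the same assumptions as the earlier sketches: the teacher ensemble samples its own path through the search space and the student is trained against q on every visited state. kd_on_explored_data, state.apply, and state.is_final are illustrative names, and sampling from q as the exploration policy is an assumption.

```python
import torch
import torch.nn.functional as F

def kd_on_explored_data(student, teacher_models, initial_state, max_steps=50):
    """Let the teacher ensemble explore the search space and accumulate the
    distillation loss on every state it visits (no reference actions needed)."""
    state, loss = initial_state, torch.tensor(0.0)
    for _ in range(max_steps):
        # Teacher distribution q(y | s): average of the M single models.
        q = torch.stack([F.softmax(m(state), dim=-1) for m in teacher_models]).mean(dim=0)
        log_p = F.log_softmax(student(state), dim=-1)
        loss = loss - (q * log_p).sum()                    # KD term: -sum_y q(y) log p(y | s)
        action = int(torch.multinomial(q, num_samples=1))  # sample from q to explore
        state = state.apply(action)                        # hypothetical transition
        if state.is_final():                               # hypothetical terminal test
            break
    return loss
```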
We combine KD on reference and explored data
Experiments

Transition-based Dependency Parsing (LAS, Penn Treebank, Stanford dependencies):
  Baseline: 90.83
  Ensemble (20): 92.73
  Distill (reference, α = 1.0): 91.99
  Distill (exploration): 92.00
  Distill (both): 92.14
  Ballesteros et al. (2016) (dyn. oracle): 91.42
  Andor et al. (2016) (local, B=1): 91.02

Neural Machine Translation (BLEU, IWSLT 2014 de-en):
  Baseline: 22.79
  Ensemble (10): 26.26
  Distill (reference, α = 0.8): 24.76
  Distill (exploration): 24.64
  Distill (both): 25.44
  MIXER (Ranzato et al., 2015): 20.73
  Wiseman and Rush (2016) (local, B=1): 22.53
  Wiseman and Rush (2016) (global, B=1): 23.83
Analysis: Why Does the Ensemble Work Better?
• Examine the ensemble on the "problematic" states: optimal-yet-ambiguous and non-optimal.
[Figure: the running example's search space with an optimal-yet-ambiguous state and a non-optimal state.]
Analysis: Why Does the Ensemble Work Better?
• Examine the ensemble on the "problematic" states.
• Testbed: transition-based dependency parsing.
• Tool: a dynamic oracle, which returns the set of reference actions for a state.
• Evaluate the output distributions against the reference actions.

            optimal-yet-ambiguous   non-optimal
  Baseline         68.59               89.59
  Ensemble         74.19               90.90
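A sketch of how such an evaluation could be computed; reference_action_accuracy and oracle.reference_actions are hypothetical names, and scoring by whether the argmax action lies in the oracle's reference set is an assumption, since the slides do not spell out the exact metric.

```python
import torch
import torch.nn.functional as F

def reference_action_accuracy(model, states, oracle):
    """Fraction of 'problematic' states where the model's highest-probability
    action falls inside the dynamic oracle's reference-action set (assumed metric)."""
    correct = 0
    for state in states:
        probs = F.softmax(model(state), dim=-1)       # p(y | s)
        best = int(torch.argmax(probs))               # the model's top action
        if best in oracle.reference_actions(state):   # hypothetical oracle API
            correct += 1
    return 100.0 * correct / len(states)
```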
Analysis: Is It Feasible to Fully Learn from KD w/o NLL?
[Line plots: LAS (transition-based parsing) and BLEU (neural machine translation) as α ranges from 0 to 1.]
Fully learning from KD is feasible.
Analysis: Is Learning from KD Stable?
[Figures: transition-based parsing and neural machine translation.]
Conclusion
• We propose to distill an ensemble into a single model from both reference and exploration states.
• Experiments on transition-based dependency parsing and neural machine translation show that our distillation method significantly improves the single model's performance.
• Our analysis provides empirical justification for the distillation method.
Thanks and Q/A