Peter Izsak, Shira Guskin, Moshe Wasserblat
Intel AI Lab
EMC² Workshop @ NeurIPS 2019
Motivation
• Named Entity Recognition (NER) is a widely used Information Extraction task in many industrial applications and use cases
• Ramping up on a new domain can be difficult
  ▪ Lots of unlabeled data, little or no labeled data, often not enough to train a model with good performance

Solution A
• Hire a linguist or data scientist to tune or build a model
• Hire annotators to label more data, or buy a similar dataset
• Limited by time and compute resources

Solution B
• Pre-trained language models such as BERT, GPT, and ELMo perform well in low-resource scenarios
• They require large compute and memory resources and suffer from high inference latency
• Deploying such models in production or on edge devices is a major issue
Enhancing a Compact Model
• Approach:
  • Train a compact model (3M parameters) with guidance from a large pre-trained LM (sketched below)
  • Pre-trained word embeddings (non-shared embeddings)
  • Utilize both labeled and unlabeled data:
    • Knowledge Distillation
    • Pseudo-labeling
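A minimal sketch of what such a compact tagger could look like, assuming a BiLSTM over pre-trained word embeddings plus a character CNN and a Softmax tag layer; the hyperparameters and class names here are illustrative, not the exact NLP Architect implementation:

```python
# Hypothetical compact CNN-LSTM tagger (~small parameter budget); the slide only
# states 3M parameters, non-shared pre-trained word embeddings, and Softmax/CRF output.
import torch
import torch.nn as nn

class CompactTagger(nn.Module):
    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        # Word embeddings would be initialized from pre-trained vectors (not shared
        # with the teacher); a small character CNN captures sub-word features.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)  # per-token tag logits

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, word_len)
        b, s, w = chars.shape
        char_feats = self.char_cnn(
            self.char_emb(chars.view(b * s, w)).transpose(1, 2)
        ).max(dim=2).values.view(b, s, -1)           # max-pool over characters
        x = torch.cat([self.word_emb(words), char_feats], dim=-1)
        out, _ = self.lstm(x)
        return self.proj(out)                        # logits for Softmax/CRF decoding
```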
Model training setup

Models
• Teacher – BERT-base/large (110M/340M params.)
• Compact – LSTM-CNN with Softmax/CRF (3M params.)

Low-resource dataset simulation
• CoNLL 2003 (English) – PER/ORG/LOC/MISC
• Generate random training sets with labeled/unlabeled examples
• Train set sizes: 150/300/750/1500/3000 labels
• Report averaged F1 (20 experiments per train set size)

[Diagram: knowledge distillation and pseudo-labeling integrated into a single loss function. Unlabeled examples flow through the teacher model, whose soft targets feed a KL-divergence distillation loss; labeled examples (annotated labels) and teacher pseudo-labels feed the compact model's task loss.]

Training procedure
1. Fine-tune BERT with the labeled data
2. Train the compact model using the modified loss (a code sketch follows the equations):

$$L_{\text{task}} = \begin{cases} \text{CrossEntropy}(\hat{y}, y) & \text{labeled example} \\ \text{CrossEntropy}(\hat{y}, \hat{y}_{\text{teacher}}) & \text{unlabeled example} \end{cases}$$

$$L_{\text{distillation}} = \mathrm{KL}\big(\text{logits}_{\text{teacher}} \,\|\, \text{logits}_{\text{compact}}\big)$$

$$\text{Loss} = \alpha \cdot L_{\text{task}} + \beta \cdot L_{\text{distillation}}, \qquad \alpha + \beta = 1.0$$
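A hedged sketch of this combined loss in PyTorch, assuming per-token softmax outputs (the CRF variant and any temperature scaling are omitted); the function and argument names are illustrative:

```python
# Combined loss from the slide: task cross-entropy (gold labels for labeled
# examples, teacher pseudo-labels for unlabeled ones) plus a KL-divergence
# distillation term, mixed with weights alpha + beta = 1.
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, gold_tags, is_labeled,
                  alpha=0.5, beta=0.5):
    # student_logits, teacher_logits: (batch, seq_len, num_tags)
    # gold_tags: (batch, seq_len) gold tag ids (ignored where is_labeled is False)
    # is_labeled: (batch,) bool mask marking which examples carry annotations
    pseudo_tags = teacher_logits.argmax(dim=-1)            # teacher pseudo-labels
    targets = torch.where(is_labeled.unsqueeze(1), gold_tags, pseudo_tags)
    task = F.cross_entropy(student_logits.transpose(1, 2), targets)

    # KL(teacher || compact), matching the distillation term above.
    distill = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return alpha * task + beta * distill
```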
Compact model performance

[Figure: F1 of the compact model with BERT-base as teacher (left) and BERT-large as teacher (right) across training-set sizes; plots annotated with 12.9%, 6.1%, 18.9%, and 16%.]

Inference speed on CPU (speedup of the compact model over its teacher):
• Batch size: 1 / 32 / 64 / 128
• Speedup vs. BERT-large: 8.1-10.6 / 85.2-100.4 / 109.5-123.8 / 123.6-137.8
• Speedup vs. BERT-base: 3.3-4.3 / 28.6-33.7 / 40-45.2 / 49.9-55.6
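For context, one simple way batch-size-dependent CPU speedups like these could be measured is to time both models on identical workloads and take the latency ratio; the sketch below assumes generic PyTorch modules and a user-supplied batch constructor, not the benchmarking harness actually used for the numbers above:

```python
# Rough CPU latency benchmark: average forward-pass time per batch, then speedup.
import time
import torch

@torch.no_grad()
def avg_latency(model, make_batch, batch_size, runs=20):
    model.eval()
    batch = make_batch(batch_size)       # user-supplied; returns a tuple of model inputs
    start = time.perf_counter()
    for _ in range(runs):
        model(*batch)
    return (time.perf_counter() - start) / runs

def report_speedup(teacher, compact, make_teacher_batch, make_compact_batch):
    # Teacher and compact model take different input formats, hence two constructors.
    for bs in (1, 32, 64, 128):
        t = avg_latency(teacher, make_teacher_batch, bs)
        c = avg_latency(compact, make_compact_batch, bs)
        print(f"batch={bs:>3}  teacher={t*1e3:.1f} ms  compact={c*1e3:.1f} ms  "
              f"speedup={t / c:.1f}x")
```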
Takeaways
• Compact models perform on par with pre-trained LMs in low-resource scenarios, with superior inference speed and a compression rate of 36x-113x vs. BERT
• Compact models are therefore preferable to pre-trained LMs for deployment in such use cases
• Many directions to explore:
  • Compact model topology – how small/simple can we make the model?
  • Other NLP tasks and pre-trained LMs
  • Other ways to utilize unlabeled data
• Code available in Intel AI's NLP Architect open source library: NervanaSystems/nlp-architect