Learning Architectures and Loss Functions in Continuous Space
Fei Tian
Machine Learning Group, Microsoft Research Asia
Self-Introduction • Researcher @ MSRA Machine Learning Group • Joined in July, 2016 • Research Interests: • Machine Learning for NLP (especially NMT) • Automatic Machine Learning • More Information: https://ustctf.github.io
Outline • Overview • Efficiently optimizing continuous decisions • Loss Function Teaching • Continuous space for discrete decisions • Neural Architecture Optimization
Automatic Machine Learning
Automate every decision in machine learning: architectures, learning rate, depth, dropout, width, weight decay, batch size, temperature, …
Why Continuous Space?
• Life is easier if we have gradients
  • For example, we can use the large family of powerful gradient-based optimization algorithms
• Representations are compact
  • One-of-|W| (one-hot) representations of words vs. word embeddings
The Role of Continuous Space in AutoML
• For continuous decisions: how to efficiently optimize them?
  • Our work: Loss Function Teaching
• For discrete decisions: how to cast them into continuous space effectively and, more importantly, elegantly?
  • Our work: Neural Architecture Optimization
Learning to Teach with Dynamic Loss Functions Lijun Wu, Fei Tian, Yingce Xia, Tao Qin, Tie-Yan Liu NeurIPS 2018
Loss Function Teaching
• Recap: the loss function L(f_ω(x), y), where f_ω is the student model
• Typical examples:
  • Cross-entropy: L = −log p(x) · ȳ, with ȳ_j = 1_{j=y}
  • Maximum margin: L = max_{y′≠y} (log p_{y′} − log p_y)
• Learning objective of f_ω: minimize L
  • ω_t = ω_{t−1} − η · ∂L/∂ω_{t−1}
• Objective of loss function teaching: discover the best loss function L to train the student model f_ω
• Ultimate goal: improve the performance of f_ω
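Below is a minimal PyTorch sketch (not from the slides) of the two example losses and one SGD step of the student; the toy linear model, batch, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Toy student f_w: a linear classifier over 10 classes (all shapes are assumptions).
torch.manual_seed(0)
x = torch.randn(4, 16)                       # a batch of 4 inputs
y = torch.tensor([3, 1, 7, 0])               # ground-truth labels
w = torch.randn(16, 10, requires_grad=True)  # student parameters omega

log_p = F.log_softmax(x @ w, dim=-1)         # log p(x) from f_w(x)

# Cross-entropy: L = -log p_y(x)
ce_loss = -log_p[torch.arange(4), y].mean()

# Maximum margin: L = max_{y' != y} (log p_{y'} - log p_y)
masked = log_p.detach().clone()
masked[torch.arange(4), y] = float("-inf")   # exclude the true class
mm_loss = (masked.max(dim=-1).values - log_p[torch.arange(4), y].detach()).mean()

# One SGD step on the student: w_t = w_{t-1} - eta * dL/dw
ce_loss.backward()
with torch.no_grad():
    w -= 0.1 * w.grad                        # eta = 0.1
```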
Why Is It Called "Teaching"?
• If we view the model f_ω as the student, then L is the exam
• Good teachers are adaptive:
  • They set good exams according to the status of the students
• An analogy:
  • Data (x, y) is the textbook
  • Curriculum learning schedules the textbooks (data) according to the status of the student model
Can We Achieve Automatic Teaching?
• The first task: design a good decision space
• Our way: use another (parametric) neural network L_φ(f_ω(x), y) as the loss function
• The decision space: the coefficients φ
• It is continuous
Automatic Loss Function Teaching, cont.
• Assume the loss function itself is a neural network
  • L_φ(f_ω(x), y), with φ as its coefficients
• For example, a generalized cross-entropy loss
  • L_φ = σ(−log p(x)⊤ W ȳ + b)
  • φ = {W, b}
• A parametric teacher model μ_θ outputs φ
  • φ = μ_θ
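A minimal sketch of how such a parametric loss could be implemented in PyTorch, directly following the formula on this slide; the sigmoid squashing, the identity initialization of W, and the module layout are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParametricLoss(nn.Module):
    """Generalized cross-entropy: L_phi = sigma(-log p(x)^T W ybar + b).

    With W = I, b = 0 (and no squashing) this reduces to ordinary
    cross-entropy; the coefficients phi = {W, b} are what the teacher sets.
    """
    def __init__(self, num_classes: int):
        super().__init__()
        self.W = nn.Parameter(torch.eye(num_classes))  # phi: class-interaction matrix
        self.b = nn.Parameter(torch.zeros(1))          # phi: bias term

    def forward(self, logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        log_p = F.log_softmax(logits, dim=-1)                     # log p(x)
        y_onehot = F.one_hot(y, log_p.size(-1)).float()           # ybar
        scores = -(log_p.unsqueeze(1) @ self.W @ y_onehot.unsqueeze(2)).squeeze()
        return torch.sigmoid(scores + self.b).mean()              # sigma(. + b)

# Usage (hypothetical): loss = ParametricLoss(num_classes=10)(student_logits, labels)
```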
How to Be Adaptive?
• Extract a feature vector s_t at each training step t of the student model f_ω
• The coefficients are adaptive
  • φ_t = μ_θ(s_t), generating adaptive loss functions L_{φ_t}(f_ω(x), y)
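A sketch of one possible teacher network μ_θ, assuming a small MLP and a hand-picked three-dimensional state feature (e.g., training progress, training loss, dev loss); the real feature set and teacher architecture in the paper differ.

```python
import torch
import torch.nn as nn

class Teacher(nn.Module):
    """mu_theta: maps student-state features s_t to loss coefficients phi_t = {W_t, b_t}."""
    def __init__(self, state_dim: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        # Output W flattened plus one bias term.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, num_classes * num_classes + 1),
        )

    def forward(self, s_t: torch.Tensor):
        out = self.net(s_t)
        W_t = out[:-1].view(self.num_classes, self.num_classes)
        b_t = out[-1:]
        return W_t, b_t

# Assumed state features s_t: (training progress, current train loss, current dev loss).
teacher = Teacher(state_dim=3, num_classes=10)
s_t = torch.tensor([0.25, 1.9, 2.1])
W_t, b_t = teacher(s_t)          # phi_t, used to build the loss L_{phi_t}
```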
How to Optimize the Teacher Model?
• Hyper-gradient:
  ∂L_dev/∂φ = (∂L_dev/∂ω_T) · (∂ω_T/∂φ) = (∂L_dev/∂ω_T) · (∂ω_{T−1}/∂φ − η_{T−1} · ∂²L_train(ω_{T−1})/(∂ω_{T−1} ∂φ))
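A toy illustration of the hyper-gradient chain rule above with a single unrolled student update; the quadratic train and dev losses are placeholders standing in for L_train and L_dev, and the dimensions are arbitrary.

```python
import torch

# The student update w_T = w_{T-1} - eta * dL_train(w_{T-1}; phi)/dw is kept in the
# autograd graph (create_graph=True), so the dev-loss gradient flows back into phi.
torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)      # student parameters w_{T-1}
phi = torch.randn(5, requires_grad=True)    # loss-function coefficients phi
eta = 0.1

L_train = ((w * phi).sum() - 1.0) ** 2                      # placeholder phi-parameterized loss
g_w, = torch.autograd.grad(L_train, w, create_graph=True)   # keep graph for second-order terms
w_T = w - eta * g_w                                         # differentiable student update

L_dev = ((w_T - 0.5) ** 2).sum()                            # placeholder dev loss
hyper_grad, = torch.autograd.grad(L_dev, phi)               # dL_dev/dphi via the chain rule
```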
Experiments: Neural Machine Translation
[Bar chart: BLEU on WMT2014 English→German translation with a Transformer student; training with Cross Entropy, Reinforcement Learning, and L2T compared (BLEU values shown: 28.4, 28.7, 29.1, with L2T highest)]
Experiments: Image Classification
• Error rate (%) on CIFAR-10 / CIFAR-100:
  • ResNet-32: Cross Entropy 7.51 / 30.38, Large-Margin Softmax 7.01 / 30.12, L2T 6.56 / 29.25
  • Wide ResNet: Cross Entropy 3.80 / 19.93, Large-Margin Softmax 3.69 / 19.75, L2T 3.38 / 18.98
Till now…
• We talked about how to set continuous decisions for a particular AutoML task
• And how to optimize them effectively
• But what if the design space is discrete?
Neural Architecture Optimization Renqian Luo, Fei Tian, Tao Qin, En-Hong Chen, Tie-Yan Liu NeurIPS 2018
The Background: Neural Architecture Search
• There is probably no particular need to introduce the basics…
• Two mainstream families of algorithms:
  • Reinforcement learning and evolutionary computation
How to Cast the Problem into Continuous Space?
• Intuitive idea: map the (discrete) architectures into continuous embeddings → optimize the embeddings → revert back to architectures
• How to optimize?
  • With the help of a performance predictor function f
How Does NAO Work?
• Encoder: maps an architecture x to an embedding e_x in the embedding space of all architectures
• Performance predictor f: defines an output surface over that embedding space
• Gradient ascent on the surface: e_x′ = e_x + η · ∂f/∂e_x
• Decoder: maps the optimized embedding e_x′ back to an architecture x′
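A structural sketch of the encoder, performance predictor f, and the gradient-ascent step; the LSTM sizes, the token encoding of architectures, and the omission of attention and of the actual decoding step are simplifications, not NAO's exact implementation.

```python
import torch
import torch.nn as nn

class NAOSketch(nn.Module):
    """Encoder -> performance predictor f -> decoder, as in the diagram above."""
    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.predictor = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))      # f: e_x -> predicted accuracy
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)  # attention and decoding omitted
        self.out = nn.Linear(hidden, vocab_size)

    def encode(self, arch_tokens: torch.Tensor) -> torch.Tensor:
        # Architecture x given as a (batch, length) sequence of discrete tokens.
        _, (h, _) = self.encoder(self.embed(arch_tokens))
        return h[-1]                                              # embedding e_x

    def optimize_embedding(self, e_x: torch.Tensor, eta: float = 1.0, steps: int = 10):
        # Gradient ascent in embedding space: e' = e + eta * df/de, repeated a few steps;
        # the result is then fed to the decoder to recover an architecture x'.
        e = e_x.clone().detach().requires_grad_(True)
        for _ in range(steps):
            score = self.predictor(e).sum()
            grad, = torch.autograd.grad(score, e)
            e = (e + eta * grad).detach().requires_grad_(True)
        return e
```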
Why Does the Encoder (Including the Performance Predictor) Work? Two Tricks
• Normalize the performance into (0, 1)
  • Sometimes even with the empirical CDF
• Data augmentation: (x, y) → (x′, y), if architectures x and x′ are symmetric
• Together they improve the pairwise accuracy by 2% on CIFAR-10
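A toy sketch of the two tricks: empirical-CDF (rank-based) normalization of validation accuracies, and symmetry-based augmentation; the architecture encoding and the swap_node_inputs rule below are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

# Trick 1: map raw validation accuracies into (0, 1) via the empirical CDF,
# so the performance predictor regresses onto well-scaled targets.
accs = np.array([0.912, 0.934, 0.921, 0.945])
targets = (np.argsort(np.argsort(accs)) + 1) / (len(accs) + 1)   # ranks scaled into (0, 1)

# Trick 2: two token sequences that describe the same architecture share one target.
def swap_node_inputs(arch):
    """Hypothetical symmetry: swap the (input1, op1) and (input2, op2) halves of a node."""
    a = list(arch)
    a[0:2], a[2:4] = a[2:4], a[0:2]
    return tuple(a)

arch = (0, 3, 1, 5)                                              # toy encoding of one node
augmented = [(arch, targets[0]), (swap_node_inputs(arch), targets[0])]
```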
Why Does the Decoder (i.e., Perfect Recovery) Work?
• A sequence-to-sequence auto-encoder with an attention mechanism is easy to train
  • You can even obtain near-100 BLEU on the test set!
• So perturbations are sometimes needed to avoid trivial solutions (e.g., in unsupervised machine translation [1,2])
• The performance predictor f happens to act as the perturbation
1. Artetxe, Mikel, et al. "Unsupervised neural machine translation." ICLR 2018
2. Lample, Guillaume, et al. "Unsupervised machine translation using monolingual corpora only." ICLR 2018
Experiments: CIFAR-10
• Method, test error rate (%), resource (#GPUs × #hours):
  • ENAS: 2.89, 12
  • NAO-WS: 2.80, 7
  • AmoebaNet: 2.13, 3150 × 24
  • Hie-EA: 3.15, 300 × 24
  • NAO: 2.10, 200 × 24
Experiments: Transfer to CIFAR-100
Experiments: PTB Language Modelling
• Method, test perplexity, resource (#GPUs × #hours):
  • NASNet: 62.4, 1e4 CPU days
  • ENAS: 58.6, 12
  • NAO: 56.0, 300
  • NAO-WS: 56.4, 8
Experiments: Transfer to WikiText-2
Open Source • https://github.com/renqianluo/NAO
Thanks! We are hiring! Send me a message if you are interested: fetia@microsoft.com
The Panel Discussion
• What exactly does AutoML cover (neural architecture search, hyperparameter search, traditional machine learning models, etc.)?
• What is the relationship between AutoML and meta-learning?
• What are the limitations of NAS? How can human intervention be removed entirely?
• How does NAS relate to representation/transfer learning?
• How should we view "Random Search and Reproducibility for NAS"?
• RL, ES, or SGD: is gradient-based NAS the future?