

  1. Language and Vision EECS 442 – Prof. David Fouhey Winter 2019, University of Michigan http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/

  2. Administrivia
  • Last class!
  • Poster session later today
  • Turn in project reports anytime up until Sunday. We'll try to grade them as they come in.
  • Fill out course feedback forms if you haven't already
  • Enjoy your summers. Remember to relax (for everyone) and celebrate (for those graduating)

  3. Project Reports
  • Look at the syllabus for roughly what we're looking for. Make sure you cover everything.
  • Pictures (take up space and are really important): half my papers are pictures
  • Copy/paste your proposal and progress report in, smooth out the text, add a few results.

  4. Quick – what’s this? Dog image credit: T. Gupta

  5. Previously on EECS 442
  A linear classifier is a collection of scoring functions, one per class: the prediction is the vector y = Wx, where the jth component y_j is the "score" for class j. With the image's feature vector x = (56, 231, 24, 2, 1):
  • Cat weight vector (0.2, -0.5, 0.1, 2.0, 1.1) → cat score -96.8
  • Dog weight vector (1.5, 1.3, 2.1, 0.0, 3.2) → dog score 437.9
  • Hat weight vector (0.0, 0.3, 0.2, -0.3, -1.2) → hat score 61.95
  Diagram by: Karpathy, Fei-Fei

  6. Previously on EECS 442
  Converting scores to a "probability distribution": exponentiate each score, then normalize.
  • Cat score -0.9 → e^-0.9 = 0.41 → P(cat) = 0.11
  • Dog score 0.6 → e^0.6 = 1.82 → P(dog) = 0.49
  • Hat score 0.4 → e^0.4 = 1.49 → P(hat) = 0.40
  The exponentials sum to 3.72. Generally, P(class j) = exp(y_j) / Σ_l exp(y_l).
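To make the arithmetic concrete, here is a minimal numpy check of the softmax computation above, reproducing the slide's numbers:

    import numpy as np

    # Scores from the slide: cat = -0.9, dog = 0.6, hat = 0.4.
    scores = np.array([-0.9, 0.6, 0.4])

    # Softmax: exponentiate, then normalize so the outputs sum to 1.
    exp_scores = np.exp(scores)            # [0.41, 1.82, 1.49], sum = 3.72
    probs = exp_scores / exp_scores.sum()
    print(probs)                           # ~[0.11, 0.49, 0.40]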

  7. What’s a Big Issue? Is it a dog? Is it a hat?

  8. Take 2
  Same linear classifier as before: the prediction is the vector y = Wx, where the jth component y_j is the "score" for class j. With the image's feature vector x = (56, 231, 24, 2, 1):
  • Cat weight vector (0.2, -0.5, 0.1, 2.0, 1.1) → cat score -96.8
  • Dog weight vector (1.5, 1.3, 2.1, 0.0, 3.2) → dog score 437.9
  • Hat weight vector (0.0, 0.3, 0.2, -0.3, -1.2) → hat score 61.95
  Diagram by: Karpathy, Fei-Fei

  9. Take 2
  Converting scores to "probability distributions", now with a sigmoid per class:
  • Cat score -1.9 → sgm(-1.9) = 0.13 → P(cat) = 0.13
  • Dog score 1.2 → sgm(1.2) = 0.77 → P(dog) = 0.77
  • Hat score 0.9 → sgm(0.9) = 0.71 → P(hat) = 0.71
  77% dog, 71% hat, 13% cat?
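For contrast, a minimal numpy check of the sigmoid version: each class gets an independent probability, so the outputs need not sum to 1, which is exactly why "77% dog" and "71% hat" can hold at the same time:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Scores from the slide: cat = -1.9, dog = 1.2, hat = 0.9.
    scores = np.array([-1.9, 1.2, 0.9])
    print(sigmoid(scores))   # ~[0.13, 0.77, 0.71]; note these don't sum to 1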

  10. Hmm…
  • We'd like to say: "dog with a hat" or "husky wearing a hat" or something else.
  • Naïve approach (given N words to choose from and up to C words): how many? Σ_{j=1}^{C} N^j ≈ N^C classes to choose from.
  • N = 10k, C = 5 → 10^20 = 100 billion billion
  • Can't train 100 billion billion classifiers

  11. Hmm…
  • Pick an N-word dictionary, call the words class 1, …, N
  • New goal: emit a sequence of C N-way classification outputs
  • Dictionary could be:
    • All the words that appear in the training set
    • All the ASCII characters
  • Typically includes special "words": START, END, UNK (a sketch of building one follows)
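As referenced above, a hedged sketch of building such a dictionary; the toy captions and the helper names are invented for illustration:

    # Build a word dictionary from training captions, with special tokens
    # START, END, and UNK for out-of-vocabulary words.
    SPECIALS = ["START", "END", "UNK"]
    captions = ["a dog in a hat", "a husky wearing a hat"]   # toy training set

    vocab = SPECIALS + sorted({w for c in captions for w in c.split()})
    word_to_id = {w: i for i, w in enumerate(vocab)}

    def encode(sentence):
        unk = word_to_id["UNK"]
        ids = [word_to_id.get(w, unk) for w in sentence.split()]
        return [word_to_id["START"]] + ids + [word_to_id["END"]]

    print(encode("a cat in a hat"))   # "cat" falls back to UNK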

  12. Option 1 – Sequence Modeling
  Output at i is a linear transformation of the hidden state: y_i = W_YH h_i
  Hidden state at i is a linear function of the previous hidden state and the input at i, plus a nonlinearity: h_i = σ(W_HX x_i + W_HH h_{i-1})

  13. Option 1 – Sequence Modeling
  Can stack arbitrarily to create a function of multiple inputs with multiple outputs that's in terms of the parameters W_HX, W_HH, W_YH:
  y_i = W_YH h_i
  h_i = σ(W_HX x_i + W_HH h_{i-1})

  14. Option 1 – Sequence Modeling
  Can define a loss with respect to each output and differentiate with respect to all the weights: backpropagation through time.
  y_i = W_YH h_i
  h_i = σ(W_HX x_i + W_HH h_{i-1})
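A minimal numpy sketch of the forward pass these equations define; the dimensions and weights are made up, and the slide's σ is instantiated with the common tanh choice:

    import numpy as np

    # Vanilla RNN forward pass for the recurrence above:
    #   h_i = tanh(W_HX x_i + W_HH h_{i-1}),  y_i = W_YH h_i
    D, H, V = 8, 16, 10                   # input, hidden, output dims (made up)
    rng = np.random.default_rng(0)
    W_HX = rng.normal(0, 0.1, (H, D))
    W_HH = rng.normal(0, 0.1, (H, H))
    W_YH = rng.normal(0, 0.1, (V, H))

    xs = rng.normal(size=(5, D))          # a length-5 input sequence
    h = np.zeros(H)                       # h_0 = 0
    ys = []
    for x in xs:                          # the same weights are reused each step
        h = np.tanh(W_HX @ x + W_HH @ h)
        ys.append(W_YH @ h)               # one output per timestep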

  15. Captioning
  The image goes through a CNN to give a feature vector (∈ R^4096) that conditions the sequence. Unrolled, the inputs x_1 … x_5 are START, Dog, in, a, hat, and via hidden states h_0 … h_5 the outputs are Dog, in, a, hat, END.
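A hedged sketch of generation in this style: a stand-in CNN feature initializes the hidden state, then each emitted word is fed back in until END. All weight names, dimensions, and the greedy decoding choice are illustrative, not the lecture's exact model:

    import numpy as np

    V, H, D = 1000, 512, 4096                 # vocab, hidden, CNN feature dims
    rng = np.random.default_rng(0)
    W_init = rng.normal(0, 0.01, (H, D))      # maps the CNN feature to h_0
    W_emb  = rng.normal(0, 0.01, (V, H))      # word embeddings (one row per word)
    W_HH   = rng.normal(0, 0.01, (H, H))
    W_YH   = rng.normal(0, 0.01, (V, H))
    START, END = 0, 1

    cnn_feature = rng.normal(size=D)          # stand-in for a real CNN feature
    h = np.tanh(W_init @ cnn_feature)
    word, caption = START, []
    for _ in range(20):                       # cap the length in case END never comes
        h = np.tanh(W_emb[word] + W_HH @ h)   # input is the previous word
        scores = W_YH @ h
        word = int(np.argmax(scores))         # greedy choice (see sampling later)
        if word == END:
            break
        caption.append(word)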

  16. Captioning
  Each step: look at the input and the hidden state (more on that in a second) and decide the output; e.g., input "a" and hidden state h_3 give h_4 and the output "hat". Can learn through the CNN!

  17. Results Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Donahue et al. TPAMI, CVPR 2015.

  18. Results Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Donahue et al. TPAMI, CVPR 2015.

  19. Captioning – Looking at Each Step
  Why might this be better than doing billions of classification problems? (Same setup as before: input "a" and hidden state h_3 give h_4 and the output "hat", conditioned on the CNN feature ∈ R^4096.)

  20. What Goes On Inside?
  • Great repo for playing with RNNs (Char-RNN): https://github.com/karpathy/char-rnn
  • (Or search char-rnn numpy)
  • Tokens are just the characters that appear in the training set
  Result credit: A. Karpathy

  21. Sample Trained on Linux Code
  /*
   * If this error is set, we will need anything right after that BSD.
   */
  static void action_new_function(struct s_stat_info *wb)
  {
    unsigned long flags;
    int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT);
    buf[0] = 0xFFFFFFFF & (bit << 4);
    min(inc, slist->bytes);
    printk(KERN_WARNING "Memory allocated %02x/%02x, "
           "original MLL instead\n"),
    min(min(multi_run - s->len, max) * num_data_in),
    frame_pos, sz + first_seg);
    div_u64_w(val, inb_p);
    spin_unlock(&disk->queue_lock);
    mutex_unlock(&s->sock->mutex);
    mutex_unlock(&func->mutex);
    return disassemble(info->pending_bh);
  }
  Result credit: A. Karpathy

  22. Sample Trained on Names Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Result credit: A. Karpathy

  23. What Goes on Inside Outputs of an RNN. Blue to red shows timesteps where a given cell is active. What's this? Result credit: A. Karpathy

  24. What Goes on Inside Outputs of an RNN. Blue to red shows timesteps where a given cell is active. What's this? Result credit: A. Karpathy

  25. What Goes on Inside Outputs of an RNN. Blue to red shows timesteps where a given cell is active. What's this? Result credit: A. Karpathy

  26. Nagging Detail #1 – Depth
  What happens to really deep networks? Remember g^n for g ≠ 1: gradients explode / vanish.
  (Unrolled example: inputs START, D, E, E, P, _, L, E, A, R, N, with one hidden state per step, producing outputs D, E, E, P, _, L, E, A, R, N, END.)
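A one-liner makes the g^n intuition concrete:

    # Repeatedly multiplying by g != 1 blows up or dies out:
    for g in (0.9, 1.1):
        print(g, [round(g ** n, 4) for n in (1, 10, 50, 100)])
    # 0.9 -> [0.9, 0.3487, 0.0052, 0.0]           (vanishing)
    # 1.1 -> [1.1, 2.5937, 117.3909, 13780.6123]  (exploding)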

  27. Nagging Detail #1 – Depth
  • Typically use more complex units that better manage gradient flow (LSTM, GRU); a sketch follows
  • General strategy: pass the hidden state to the next timestep as unchanged as possible, only adding updates as necessary
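A minimal single-step LSTM cell, sketching the "keep the state mostly unchanged, add updates as needed" idea; biases are omitted and the shapes are illustrative:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h, c, Wf, Wi, Wo, Wg):
        """One LSTM step; each W* has shape (H, D + H), biases omitted."""
        z = np.concatenate([x, h])
        f = sigmoid(Wf @ z)          # forget gate: how much old state to keep
        i = sigmoid(Wi @ z)          # input gate: how much new content to add
        o = sigmoid(Wo @ z)          # output gate
        g = np.tanh(Wg @ z)          # candidate update
        c = f * c + i * g            # cell state changes additively, not by
        h = o * np.tanh(c)           #   repeated matrix multiplication
        return h, c

    H, D = 4, 3
    rng = np.random.default_rng(0)
    Wf, Wi, Wo, Wg = (rng.normal(0, 0.1, (H, D + H)) for _ in range(4))
    h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), Wf, Wi, Wo, Wg)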

  28. Nagging Detail #2
  Lots of captions are in principle possible!
  • A dog in a hat
  • A dog wearing a hat
  • Husky wearing a hat
  • Husky holding a camera, sitting in grass
  • A dog that's in a hat, sitting on a lawn with a camera

  29. Nagging Detail #2 – Sampling
  At the first step the model might say: Dog (P=0.3), A (P=0.2), Husky (P=0.15), ….
  • Pick proportional to the probability of each word
  • Can adjust a "temperature" parameter, exp(score/t), to equalize probabilities
  • exp(5) / exp(1) → 54.6
  • exp(5/5) / exp(1/5) → 2.2

  30. Effect of Temperature
  • Train on essays about startups and investing
  • Normal temperature: "The surprised in investors weren't going to raise money. I'm not the company with the time there are all interesting quickly, don't have to get off the same programmers. There's a super-angel round fundraising, why do you can do."
  • Low temperature: "is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same"
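A small sketch of temperature-scaled sampling, reproducing the slide's ratios; the two-word vocabulary is a toy:

    import numpy as np

    # Temperature-scaled sampling: softmax(scores / t). High t flattens the
    # distribution; low t sharpens it toward the argmax.
    def sample_word(scores, t, rng):
        p = np.exp(scores / t)
        p /= p.sum()
        return rng.choice(len(scores), p=p)

    scores = np.array([5.0, 1.0])        # toy two-word vocabulary
    for t in (1.0, 5.0):
        p = np.exp(scores / t)
        p /= p.sum()
        print(t, round(p[0] / p[1], 1))  # t=1.0 -> 54.6, t=5.0 -> 2.2
    print(sample_word(scores, t=1.0, rng=np.random.default_rng(0)))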

  31. Nagging Detail #2 – Sampling
  Having sampled "A" as the first word and fed it back in, the model's next distribution might be: Dog (P=0.4), Husky (P=0.3), ….

  32. Nagging Detail #2 – Sampling
  Each evaluation gives P(w_i | w_1, …, w_{i-1}); e.g., after "Dog" the model might say wearing (P=0.5), in (P=0.3), …. Can expand a finite tree of possibilities (beam search) and pick the most likely sequence; a sketch follows.
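The promised beam-search sketch; step(word, h) is a hypothetical one-step model returning log probabilities over the vocabulary and a new hidden state:

    import numpy as np

    def beam_search(step, h0, START, END, beam=3, max_len=20):
        beams = [([START], h0, 0.0)]            # (words, hidden state, log prob)
        for _ in range(max_len):
            candidates = []
            for words, h, lp in beams:
                if words[-1] == END:            # finished beams carry over
                    candidates.append((words, h, lp))
                    continue
                log_p, h2 = step(words[-1], h)
                for w in np.argsort(log_p)[-beam:]:    # top-k expansions
                    candidates.append((words + [int(w)], h2, lp + log_p[w]))
            # keep only the `beam` most likely partial sequences
            beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:beam]
        return beams[0][0]

    # Toy stand-in for the model: a fixed random table of log probabilities.
    rng = np.random.default_rng(0)
    table = np.log(rng.dirichlet(np.ones(6), size=6))  # 6-word toy vocabulary
    print(beam_search(lambda w, h: (table[w], h), None, START=0, END=1))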

  33. Nagging Detail #3 – Evaluation
  Computer: "A husky in a hat". Human: "A dog in a hat". How do you decide?
  1) Ask humans. Why might this be an issue?
  2) In practice: use something like precision (how many generated words appear in ground-truth sentences) or recall. Details are very important to prevent gaming (e.g., "A a a a a").
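A tiny example of how naive unigram precision can be gamed, which is why the details of metrics like this matter:

    # Unigram precision: fraction of generated words appearing in the
    # reference. Easy to game without further safeguards.
    def unigram_precision(generated, reference):
        ref = set(reference.lower().split())
        words = generated.lower().split()
        return sum(w in ref for w in words) / len(words)

    ref = "a dog in a hat"
    print(unigram_precision("a husky in a hat", ref))   # 0.8
    print(unigram_precision("A a a a a", ref))          # 1.0 -- gamed!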

  34. More General Sequence Models
  Can have multiple inputs and a single output: e.g., run the inputs "I loved my meal here" (x_1 … x_5) through hidden states h_0 … h_5, and the final state predicts "Positive Review".

  35. More General Sequence Models
  The output could be a feature vector! Run "What is the dog wearing" (x_1 … x_5) through the RNN and use the final hidden state h_5 as a representation of the question.

  36. More General Models
  Combining the question "What is the dog wearing" with the image yields a distribution over answers: Bat 0.03, Dolphin 0.00, Hat 0.5, …, Grass 0.2.
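A hedged sketch of this question-plus-image pipeline; all names and dimensions are invented for illustration:

    import numpy as np

    H, D, A = 512, 4096, 1000                 # hidden, image feature, answer dims
    rng = np.random.default_rng(0)
    W_img = rng.normal(0, 0.01, (H, D))
    W_ans = rng.normal(0, 0.01, (A, H))

    def answer_distribution(question_hidden, image_feature):
        # Combine question encoding and image feature, then classify answers.
        joint = np.tanh(question_hidden + W_img @ image_feature)
        scores = W_ans @ joint
        p = np.exp(scores - scores.max())     # softmax over answer classes
        return p / p.sum()

    q = rng.normal(size=H)                    # stand-in for the RNN's final h_5
    img = rng.normal(size=D)                  # stand-in for the CNN feature
    p = answer_distribution(q, img)
    print(p.shape, round(p.sum(), 3))         # (1000,) 1.0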
