Modeling Mutual Context of Object and Human Pose in Human-object Interaction Activities

Dec 03, 2023

SLIDE 1

Modeling Mutual Context of Object and Human Pose in Human-object Interaction Activities

Bangpeng Yao
Li Fei-Fei

Presented by Sahil Shah

SLIDE 2

Agenda

Introduction
Problem Formulation
Learning
Inference
Results

SLIDE 3

Agenda

Introduction
Problem Formulation
Learning
Inference
Results

SLIDE 4

Introduction

Note on author

– Pioneer of the ImageNet dataset
– Must-see TED talk in March 2015

SLIDE 5

Introduction

Problem: Detecting objects in cluttered scenes and estimating articulated human body parts, especially in human-object interaction activities

SLIDE 6

Introduction

SLIDE 7

Introduction

SLIDE 8

Introduction

Key insight: Mutual Context

– Automatically discover relevant poses
– Automatically discover spatial relationships
– Optimize for mutual co-occurrence of object and pose

SLIDE 9

Introduction

Contribution

– Builds on Prof. Gupta’s work
– First to use mutual context
– Jointly solves object detection & pose estimation

SLIDE 10

Agenda

Introduction
Problem Formulation
Learning
Inference
Results

SLIDE 11

Problem Formulation

Goal: Given an image of an HOI activity, we need to estimate the human pose (H), detect the object (O), and classify the HOI activity (A)

Model

– Hierarchical Random Field
– A, O and H contribute to the detection of each other
– H is a hidden variable
– Body parts {Pn} are found using feature-based detectors, and they compose to form H

SLIDE 12

Problem Formulation

Golf Swing
Tennis Forehand

SLIDE 13

Problem Formulation

SLIDE 14

Problem Formulation

Why do we need to learn the structure?

– The model captures important connections between the object and the body parts
– Which parts of the body should be connected to the overall pose (H) and the object (O)?

SLIDE 15

Problem Formulation

Model

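– Overall model: $\Psi = \sum_e w_e \phi_e$
– A, O, H: $\phi_e(A, O)$, $\phi_e(A, H)$, and $\phi_e(O, H)$

Counting co-occurrence frequencies

– Spatial relationships: $\phi_e(O, P_n)$ & $\phi_e(P_m, P_n)$

$\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$

– Compatibility: $\phi_e(H, P_n)$

$\mathrm{bin}(l_{P_n} - l_{P_1}) \cdot \mathrm{bin}(\theta_{P_n}) \cdot \mathcal{N}(s_{P_n})$

– Object & body parts: $\phi_e(O, f_O)$ and $\phi_e(P_n, f_{P_n})$

Shape context feature based detectors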
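
As a rough illustration of the first two bullets on this slide, the sketch below estimates label-agreement potentials by counting co-occurrences and then combines edge potentials into a weighted sum. The function names, the normalization by the total count, and the toy labels are assumptions made for this example; the slides do not specify them.

```python
from collections import Counter


def cooccurrence_potentials(label_pairs):
    """Estimate a potential for each label pair as its relative frequency.

    label_pairs: iterable of (label_a, label_b) tuples from training images.
    Normalizing by the total count is an assumption of this sketch.
    """
    counts = Counter(label_pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def model_score(weights, potentials):
    """Weighted sum over edges: each edge contributes weight * potential."""
    return sum(weights[e] * potentials[e] for e in potentials)


# Hypothetical (activity, object) labels from four training images.
pairs = [("tennis", "racket"), ("tennis", "racket"),
         ("golf", "club"), ("tennis", "ball")]
phi = cooccurrence_potentials(pairs)

# Hypothetical edge names and values, only to show the weighted sum.
score = model_score({"A-O": 2.0, "A-H": 1.0}, {"A-O": 0.5, "A-H": 0.25})
```

Here each edge carries a single scalar potential; the spatial and compatibility terms on the slide would need their own binning and Gaussian functions, which are not sketched.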

SLIDE 16

Agenda

Introduction
Problem Formulation
Learning
Inference
Results

SLIDE 17

Learning

Input and Output

Input: images with labeled objects, body parts & HOI
Model learning
Output: set of models, each for one human pose in a particular HOI activity

SLIDE 18

Learning

Overall Algorithm

SLIDE 19

Learning

Hill climbing structure learning

– Run for each pose in each HOI activity class
– Add/remove an edge and check for optima
– Keep a tabu list to avoid revisiting solutions
– Randomly initialize three times to avoid local optima
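
The search loop described on this slide can be sketched roughly as follows. This is a generic illustration, not the authors' implementation: the scoring function, the fixed step budget, the rule of moving to the best unvisited neighbor even when it does not improve the score, and the coin-flip initialization are all assumptions of this example.

```python
import random


def hill_climb(candidate_edges, score, start, max_steps=100):
    """Toggle one edge at a time, never revisiting a structure (tabu list)."""
    current = frozenset(start)
    best, best_score = current, score(current)
    tabu = {current}
    for _ in range(max_steps):
        # Neighbors differ from the current structure by exactly one edge.
        neighbors = [current ^ {e} for e in candidate_edges]
        neighbors = [n for n in neighbors if n not in tabu]
        if not neighbors:
            break
        # Assumption: move to the best unvisited neighbor, even if worse.
        current = max(neighbors, key=score)
        tabu.add(current)
        s = score(current)
        if s > best_score:
            best, best_score = current, s
    return best, best_score


def learn_structure(candidate_edges, score, restarts=3, seed=0):
    """Run the search from several random initial structures; keep the best."""
    rng = random.Random(seed)
    results = []
    for _ in range(restarts):
        start = [e for e in candidate_edges if rng.random() < 0.5]
        results.append(hill_climb(candidate_edges, score, start))
    return max(results, key=lambda r: r[1])


# Toy setup: the score is highest when the structure matches a hidden target.
edges = [("O", "P1"), ("O", "P2"), ("H", "P1"), ("H", "P2")]
target = frozenset({("O", "P1"), ("H", "P2")})
best, best_score = learn_structure(edges, lambda s: -len(s ^ target))
```

In practice the score would measure how well a candidate structure fits the training images for one pose of one activity; the toy score above only makes the example checkable.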

SLIDE 20

Learning

Max-margin for parameter estimation

– Maximize discrimination between different A
– Each A has subclasses, hence multiple models and multiple weight vectors
– Training sample: $(x_i, c_i, y(c_i))$, where $y$ maps $c_i$ to a class label
– F: $y(F(x_i)) = y(c_i)$, with $F(x_i) = \arg\max_r \{w_r \cdot x_i\}$, where $w_r$ is the weight vector for the r-th subclass
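
The decision rule on this slide (pick the subclass whose weight vector scores highest, then map that subclass to its activity class) can be illustrated with the short sketch below. It covers prediction only, not the max-margin training itself, and the weight values, feature vectors, and class names are made up for the example.

```python
def predict_subclass(weights, x):
    """Return the index r of the weight vector with the largest dot product with x."""
    scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def predict_class(weights, subclass_to_class, x):
    """Map the winning subclass to its activity class label."""
    return subclass_to_class[predict_subclass(weights, x)]


# Hypothetical setup: three subclasses, two of which belong to "golf".
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
subclass_to_class = ["golf", "golf", "tennis"]

label_a = predict_class(weights, subclass_to_class, [0.2, 0.9])
label_b = predict_class(weights, subclass_to_class, [-1.0, -1.0])
```

A sample counts as correctly classified when the predicted subclass maps to the same class label as the true subclass, even if the two subclasses differ.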

SLIDE 21

Learning

Overall Algorithm

SLIDE 22

Agenda

Introduction
Problem Formulation
Learning
Inference
Results

SLIDE 23

Inference

Given a test image (I), estimate the pose, detect the object, and classify the activity

– To detect the object (O), we maximize the likelihood of the models given that object, denoted $\max_{O,H} \Psi(A_k, O, H, I)$
– To detect the human pose (H), compute $\max_{O,H} \Psi(A_k, O, H, I)$ for each $A_k$ and select the one corresponding to the maximum likelihood (ML) score
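
A minimal sketch of this selection step is given below, assuming small, explicit candidate sets for the object and the pose and a scoring function that already has the test image bound to it. Exhaustive enumeration is an assumption of this example; the slides do not say how the maximization is carried out, and all names and scores here are hypothetical.

```python
from itertools import product


def infer(activities, object_candidates, pose_candidates, psi):
    """Return (score, activity, object, pose) with the highest model score.

    psi(a, o, h) is assumed to score one configuration on the test image.
    """
    best = None
    for a in activities:
        # Maximize over object and pose hypotheses for this activity.
        for o, h in product(object_candidates, pose_candidates):
            s = psi(a, o, h)
            if best is None or s > best[0]:
                best = (s, a, o, h)
    return best


# Hypothetical scores for a handful of configurations; others default to 0.
table = {
    ("tennis", "box_1", "pose_2"): 3.0,
    ("golf", "box_2", "pose_1"): 2.0,
}
result = infer(
    ["golf", "tennis"],
    ["box_1", "box_2"],
    ["pose_1", "pose_2"],
    lambda a, o, h: table.get((a, o, h), 0.0),
)
```

The winning tuple gives the activity label, the object hypothesis, and the pose together, which mirrors the joint estimation the slide describes.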

SLIDE 24

Inference

SLIDE 25

Agenda

Introduction
Problem Formulation
Learning
Inference
Results

SLIDE 26

Results

SLIDE 27

Results

SLIDE 28

Results

Object Detection

– Compare with two experiments

1. Sliding window as baseline
2. Pedestrian detector for human’s location context

SLIDE 29

Results

SLIDE 30

Results

Pose Estimation

SLIDE 31

Results

HOI classification

– Compare with SVM with BoW
– Compare with Gupta et al.

SLIDE 32

Results

Upper-left → object detection by mutual context
Lower-left → object detection by a scanning window
Upper-right → pose estimation by mutual context
Lower-right → pose estimation by the state-of-the-art pictorial structure method

SLIDE 33

Results

Upper-left → object detection by mutual context
Lower-left → object detection by a scanning window
Upper-right → pose estimation by mutual context
Lower-right → pose estimation by the state-of-the-art pictorial structure method

SLIDE 34

Thank you!