DYNAMIC FACIAL ANALYSIS: FROM BAYESIAN FILTERING TO RNN. Jinwei Gu. PowerPoint PPT Presentation.



SLIDE 1

Jinwei Gu, 2017/4/18

DYNAMIC FACIAL ANALYSIS: FROM BAYESIAN FILTERING TO RNN

with Xiaodong Yang, Shalini De Mello, and Jan Kautz

SLIDE 2

FACIAL ANALYSIS IN VIDEOS

Exploit temporal coherence to track facial features in videos

[Chart: performance of recent head/face tracking and 3D capture methods, including HyperFace (2016), DeepHeadPose (2015), and HeadPoseFromDepth (2015)]

SLIDE 3

CLASSICAL APPROACH: BAYESIAN FILTERING

It is challenging to design Bayesian filters specific to each task!

  • Spatial-Temporal RNN for Face Landmarks [ECCV 2016]
  • Tree-based DPM for Face Landmark Tracking [ICCV 2015]
  • Particle Filters for Head Pose Tracking [2010]

SLIDE 4

FROM BAYESIAN FILTERING TO RNN

Use RNNs to avoid hand-engineering trackers

π’π‘’βˆ’1 𝐒𝑒 𝐲𝑒 π²π‘’βˆ’1 Bayesian Filter 𝐳𝑒 π³π‘’βˆ’1 π’π‘’βˆ’1 𝐒𝑒 𝐲𝑒 π²π‘’βˆ’1 RNN (unfolded) 𝐳𝑒 π³π‘’βˆ’1

Input (Measurement) Output (Target) Hidden State

SLIDE 5

FROM BAYESIAN FILTERING TO RNN

Use RNNs to avoid hand-engineering trackers

SLIDE 6

AN EXAMPLE: KALMAN FILTERS VS. RNN

𝑦𝑒 = π‘‹π‘¦π‘’βˆ’1 + π‘œ1 𝑧𝑒 = π‘Šπ‘¦π‘’ + π‘œ2

Linear Kalman Filter state transition (process model) process noise measurement model measurement noise

𝑧𝑒 = 𝜏2(π‘Šβ„Žπ‘’ + 𝑐2) β„Žπ‘’ = 𝜏1(π‘‹β„Žπ‘’βˆ’1 + 𝑉𝑦𝑒 + 𝑐1)

Simple RNN (i.e., vanilla RNN) noisy input target

  • utput

noisy

  • bservation

estimated state
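The vanilla-RNN recurrence above can be written out directly. This is a minimal numpy sketch: the choice of tanh/identity activations, the toy dimensions, and the random weights are illustrative assumptions, not values from the slides.

```python
import numpy as np

def vanilla_rnn_step(h_prev, y_t, X, V, W, c1, c2):
    """One step of the simple (vanilla) RNN from the slide:
    h_t = tau1(X h_{t-1} + V y_t + c1), z_t = tau2(W h_t + c2).
    Here tau1 = tanh and tau2 = identity are illustrative choices."""
    h_t = np.tanh(X @ h_prev + V @ y_t + c1)  # hidden-state update
    z_t = W @ h_t + c2                        # output
    return h_t, z_t

# Toy dimensions (assumed): hidden size 4, input/output size 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4)) * 0.1
V = rng.normal(size=(4, 2)) * 0.1
W = rng.normal(size=(2, 4)) * 0.1
c1, c2 = np.zeros(4), np.zeros(2)

h = np.zeros(4)
for t in range(5):                 # unroll over a short noisy input sequence
    y_t = rng.normal(size=2)
    h, z = vanilla_rnn_step(h, y_t, X, V, W, c1, c2)
```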

SLIDE 7

AN EXAMPLE: KALMAN FILTERS VS. RNN

β„Žπ‘’ = 𝜏1(π‘‹β„Žπ‘’βˆ’1 + 𝑉𝑦𝑒 + 𝑐1) 𝑧𝑒 = 𝜏2(π‘Šβ„Žπ‘’ + 𝑐2)

Simple RNN (i.e., vanilla RNN) noisy input target

  • utput

𝑦𝑒 = π‘‹π‘¦π‘’βˆ’1 + 𝐿𝑒(𝑧𝑒 βˆ’ π‘Šπ‘¦π‘’βˆ’1)

Kalman Gain

𝑦𝑒 = (𝑋 βˆ’πΏπ‘’π‘Š)π‘¦π‘’βˆ’1 +𝐿𝑒𝑧𝑒

Linear Kalman Filter noisy Input target

  • utput

𝑨𝑒 = π‘Šπ‘¦π‘’

SLIDE 8

AN EXAMPLE: KALMAN FILTERS VS. RNN

Linear Kalman Filter:

  y_t = X y_{t-1} + L_t (z_t βˆ’ W y_{t-1})   (L_t: Kalman gain)
  y_t = (X βˆ’ L_t W) y_{t-1} + L_t z_t
  αΊ‘_t = W y_t

Noisy input: z_t; output: αΊ‘_t.

Simple RNN (i.e., vanilla RNN), assuming linear activation and no bias:

  y_t = X y_{t-1} + V z_t
  αΊ‘_t = W y_t

Noisy input: z_t; output: αΊ‘_t.

The two updates have the same form: the Kalman filter derives the gain L_t analytically from the process and measurement noise models, while the RNN learns the corresponding weights from data.
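The algebraic equivalence between the two Kalman-update forms, and their match with a linear bias-free RNN step, can be checked numerically. In this sketch the matrix sizes and the fixed gain L are illustrative assumptions; a real Kalman gain is computed from the noise covariances rather than chosen at random.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2                                # state and measurement dims (assumed)
X = rng.normal(size=(n, n))                # state-transition matrix
W = rng.normal(size=(m, n))                # measurement matrix
L = rng.normal(size=(n, m))                # stand-in Kalman gain L_t
y_prev = rng.normal(size=n)                # previous state estimate y_{t-1}
z = rng.normal(size=m)                     # current noisy measurement z_t

form1 = X @ y_prev + L @ (z - W @ y_prev)  # innovation form
form2 = (X - L @ W) @ y_prev + L @ z       # rearranged form
rnn = (X - L @ W) @ y_prev + L @ z         # linear RNN step with V = L
z_hat = W @ form2                          # output: ẑ_t = W y_t
assert np.allclose(form1, form2) and np.allclose(form2, rnn)
```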
SLIDE 9

A TOY EXAMPLE: TRACKING A MOVING CURSOR

Input: a noisy curve y(t); state: [x, x', x'']

Kalman Filter:

  y_t = (X βˆ’ L_t W) y_{t-1} + L_t z_t
  αΊ‘_t = W y_t

LSTM:

  y_t = LSTM(y_{t-1}, z_t)
  αΊ‘_t = W y_t
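A minimal runnable version of the Kalman side of this toy tracker: a 1-D constant-acceleration filter with state [x, x', x''] observing only a noisy position. The sinusoidal trajectory and the noise covariances Q and R are assumptions for illustration, not values from the slides.

```python
import numpy as np

dt = 0.1
X = np.array([[1, dt, 0.5 * dt**2],        # state transition (process model)
              [0, 1,  dt],
              [0, 0,  1]])
W = np.array([[1.0, 0.0, 0.0]])            # measurement model: observe position only
Q = 1e-3 * np.eye(3)                       # process-noise covariance (assumed)
R = np.array([[0.25]])                     # measurement-noise covariance (assumed)

rng = np.random.default_rng(2)
ts = np.arange(200) * dt
truth = np.sin(ts)                         # smooth "cursor" trajectory
zs = truth + rng.normal(scale=0.5, size=ts.size)   # noisy curve y(t)

y = np.zeros(3)                            # state estimate [x, x', x'']
P = np.eye(3)                              # estimate covariance
est = []
for z in zs:
    # Predict step
    y = X @ y
    P = X @ P @ X.T + Q
    # Update step with Kalman gain L_t
    L = P @ W.T @ np.linalg.inv(W @ P @ W.T + R)
    y = y + L @ (np.array([z]) - W @ y)
    P = (np.eye(3) - L @ W) @ P
    est.append(y[0])
est = np.array(est)
# The filtered track should lie much closer to the truth than the raw measurements.
```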

SLIDE 10

FACIAL ANALYSIS IN VIDEOS WITH RNN

Variants of RNN: FC-RNN*, LSTM, GRU
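For reference, the GRU cell in the list above can be sketched in a few lines of numpy. This is the standard GRU formulation from the literature, not the FC-GRU code used in the paper; the weight names, sizes, and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU step: update gate z, reset gate r, candidate state h~,
    and the gated convex combination of old and candidate state."""
    z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
    return (1.0 - z) * h + z * h_tilde             # new hidden state

# Tiny sanity run with random weights (dimensions are assumed).
rng = np.random.default_rng(3)
d_in, d_h = 2, 4
params = [rng.normal(size=s) * 0.1
          for s in [(d_h, d_in), (d_h, d_h), d_h] * 3]
h = np.zeros(d_h)
for _ in range(3):
    h = gru_step(h, rng.normal(size=d_in), *params)
```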

SLIDE 11

HEAD POSE FROM VIDEOS

Results on BIWI dataset

SLIDE 12

HEAD POSE FROM VIDEOS

[Video comparison: Input vs. RNN (Ours) vs. Per-Frame + KF]

SLIDE 13

LARGE SYNTHETIC DATASET MATTERS!

The SynHead Dataset

  • 10 high-quality 3D scans of head models
  • 51,096 head poses from 70 motion tracks
  • 510,960 RGB images in total
  • Accurate head pose and landmark annotations (2D/3D)
  • Available at: https://research.nvidia.com

(For comparison, the BIWI dataset has 24 videos and 15,678 frames in total.)

SLIDE 14

LARGE SYNTHETIC DATASET MATTERS!

The SynHead Dataset

SLIDE 15

FACIAL LANDMARKS FROM VIDEO

[Video comparison: Ground Truth vs. Estimated landmarks for HyperFace, Per-Frame, and RNN (Ours)]

SLIDE 16

MORE EXAMPLES

SLIDE 17

VARIANTS OF RNN FOR LANDMARK ESTIMATION

           FC-RNN         FC-LSTM        FC-GRU
fc6        0.7567, 0.10   0.7690, 0.13   0.7715, 0.15
fc7        0.7424, 0.06   0.7539, 0.06   0.7554, 0.36
fc6+fc7    0.7630, 0.28   0.7456, 0.27   0.7605, 0.19

(Latest results)

SLIDE 18

CO-PILOT DEMO IN THE CES KEYNOTE

(together with GazeNet by Shalini De Mello et al.)

SLIDE 19

DYNAMIC FACIAL ANALYSIS: FROM BAYESIAN FILTERING TO RNN

  • RNNs can be viewed as a variant of Bayesian filters
  • A general framework for leveraging temporal coherence in videos
  • Large synthetic datasets improve performance

The SynHead Dataset Available at: https://research.nvidia.com