Linking People in Videos with Their Names Using Coreference Resolution
Vignesh Ramanathan, Armand Joulin, Percy Liang, and Li Fei-Fei
Stanford University
Images from Ramanathan et al. (2014)
Presented by Yukun Zhu, CSC2523
Task
Example script: "Missy points to the larger kid. The big kid walks off. Other kids jeer."
- No labelled instances; the script is the only source of supervision
- Names include nominal expressions and pronouns
Previous Approach
On person naming:
- Multiple instance learning, using proper names from the script
- Videos and scripts are treated as bags of face tracks and names
- Unidirectional information flow from text to vision
On coreference resolution:
- One of the core tasks in NLP
- Can operate on language alone
- Not accurate enough
Problem Setup
Input:
- Videos with detected human tracks
- Script roughly aligned with video segments
- Names (including pronouns/nominals) from the script
- Cast names
Problem Setup
Output:
- Name assignment to human tracks in video
- Name assignment to human mentions in text
Proposed Method
$$C = \gamma_t\, C_{\text{track}}(Y) + \gamma_m\, C_{\text{mention}}(Z, R) + C_{\text{align}}(A, Y, Z)$$
- Name-track assignment: $Y \in \{0,1\}^{T \times P}$
- Name-mention assignment: $Z \in \{0,1\}^{M \times P}$
- Antecedent matrix: $R \in \{0,1\}^{M \times M}$
- Alignment matrix: $A \in \{0,1\}^{T \times M}$
$C_{\text{track}}(Y)$
- Cost of assigning names to tracks
- Based on video features only
- Formulated as the cost function of regression-based clustering:
$$C(Y; X, \lambda) = \min_{W} \sum_{t \in \tau} \|Y - XW\|_F^2 + \lambda \|W\|_F^2 = \operatorname{tr}\left(Y^\top \Pi(X, \lambda)\, Y\right)$$
Constraints:
- Each track is assigned to exactly one name
- A speaker should be aligned to at least one track
- A name not mentioned in a scene won't be aligned
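The trace form of the regression-based clustering cost can be checked numerically. The sketch below assumes the plain ridge-regression closed form with no bias term and no normalization, reading the sum over tracks as the row-wise expansion of the Frobenius norm; under those assumptions $\Pi(X, \lambda) = I - X(X^\top X + \lambda I)^{-1} X^\top$. The paper's exact definition of $\Pi$ may differ, and the function names and toy data here are hypothetical.

```python
import numpy as np

def ridge_cost_matrix(X, lam):
    """Pi such that min_W ||Y - XW||_F^2 + lam*||W||_F^2 = tr(Y^T Pi Y).

    Assumes no bias term and no normalization (a simplification).
    """
    n, d = X.shape
    gram = X.T @ X + lam * np.eye(d)
    return np.eye(n) - X @ np.linalg.solve(gram, X.T)

def track_cost_trace(Y, X, lam):
    """Cost evaluated through the closed-form trace expression."""
    return float(np.trace(Y.T @ ridge_cost_matrix(X, lam) @ Y))

def track_cost_direct(Y, X, lam):
    """Cost evaluated by solving for the optimal W explicitly."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    return float(np.sum((Y - X @ W) ** 2) + lam * np.sum(W ** 2))

# Toy data: 6 tracks with 3-dimensional features, 2 candidate names.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Y = np.eye(2)[rng.integers(0, 2, size=6)]  # one-hot name assignment per track

closed = track_cost_trace(Y, X, 0.1)
direct = track_cost_direct(Y, X, 0.1)
print(closed, direct)  # the two values should agree up to rounding
```

Because the cost is quadratic in $Y$ once $\Pi$ is fixed, minimizing it over a relaxed $Y$ is a quadratic program, which matches the optimization step described later in the slides.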
$C_{\text{mention}}(Z, R)$
- Depends on text only
- Proper mentions (68%) are trivial to map
- Pronouns/nominals alone are not informative
- Apply regression-based clustering to predict $R$
Constraints:
- Each mention has at most one antecedent
- Each mention is assigned to one name
- Gender consistency; no self-association of pronouns
- Connection constraint: $R_{m,m'} = 1 \Rightarrow Z_m = Z_{m'}$
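As an illustration of the mention constraints, the following sketch checks three of them on integer matrices: one name per mention, at most one antecedent per mention, and the connection constraint. It assumes that `R[m, m2] = 1` means mention `m2` is the antecedent of mention `m`; that orientation, the function name, and the toy matrices are assumptions. Gender consistency and the pronoun self-association rule are omitted because they need information the slides do not specify.

```python
import numpy as np

def mention_constraints_ok(Z, R):
    """Check a subset of the mention constraints on binary matrices.

    Z: (M, P) name-mention assignment, R: (M, M) antecedent matrix.
    Assumed orientation: R[m, m2] = 1 means m2 is the antecedent of m.
    """
    Z = np.asarray(Z)
    R = np.asarray(R)
    one_name = bool(np.all(Z.sum(axis=1) == 1))          # one name per mention
    one_antecedent = bool(np.all(R.sum(axis=1) <= 1))    # at most one antecedent
    # Connection constraint: linked mentions must share the same name row.
    same_name = all(np.array_equal(Z[m], Z[m2]) for m, m2 in np.argwhere(R == 1))
    return one_name and one_antecedent and same_name

# Toy example: 3 mentions, 2 names; mention 2 has mention 1 as antecedent.
R = [[0, 0, 0],
     [0, 0, 0],
     [0, 1, 0]]
Z_consistent = [[1, 0], [0, 1], [0, 1]]    # mentions 1 and 2 share a name
Z_inconsistent = [[1, 0], [0, 1], [1, 0]]  # mention 2 disagrees with its antecedent

ok = mention_constraints_ok(Z_consistent, R)
bad = mention_constraints_ok(Z_inconsistent, R)
print(ok, bad)
```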
$C_{\text{align}}(A, Y, Z)$
Intuition:
- An aligned track and mention should be assigned to the same name
- Tracks and mentions are ordered sequences through time
- Tracks and mentions are roughly aligned in time
Formulation:
- Soft connection penalty: $\min \|A^\top Y - Z\|_F^2$
- Monotonic constraint
- Mention mapping constraint
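The soft connection penalty is easy to evaluate for small matrices. The sketch below computes $\|A^\top Y - Z\|_F^2$ for a toy case with 3 tracks, 2 mentions, and 2 names; the function name and the matrices are hypothetical, and the monotonic and mention mapping constraints are not enforced here.

```python
import numpy as np

def align_cost(A, Y, Z):
    """Soft connection penalty ||A^T Y - Z||_F^2.

    A: (T, M) alignment, Y: (T, P) track names, Z: (M, P) mention names.
    """
    A, Y, Z = (np.asarray(v, dtype=float) for v in (A, Y, Z))
    return float(np.sum((A.T @ Y - Z) ** 2))

Y = [[1, 0], [0, 1], [0, 1]]   # track 0 -> name 0, tracks 1 and 2 -> name 1
Z = [[1, 0], [0, 1]]           # mention 0 -> name 0, mention 1 -> name 1

A_good = [[1, 0], [0, 1], [0, 0]]  # aligns each mention with a same-name track
A_bad = [[0, 1], [1, 0], [0, 0]]   # aligns each mention with a different-name track

good_cost = align_cost(A_good, Y, Z)
bad_cost = align_cost(A_bad, Y, Z)
print(good_cost, bad_cost)  # 0.0 4.0
```

The penalty is zero when every aligned track and mention agree on the name and grows with each disagreement, which is what lets information flow in both directions between the visual and textual sides.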
Optimization
$$\min\; \gamma_t\, C_{\text{track}}(Y) + \gamma_m\, C_{\text{mention}}(Z, R) + C_{\text{align}}(A, Y, Z) \quad \text{s.t.} \quad Y \in \mathcal{C}_Y,\; (Z, R) \in \mathcal{C}_{Z,R},\; A \in \mathcal{C}_A$$
- Relax $Y$, $R$, $Z$ to take values in $[0, 1]$
- Slack constraints on $Y$, $Z$
- Block coordinate descent
- Quadratic programming to optimize $Y$
- Quadratic programming to optimize $Z$, $R$
- Dynamic time warping to optimize $A$
- Round $Y$, $Z$ to integer matrices
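The alignment step can be illustrated with textbook dynamic time warping. This is a minimal sketch, not the paper's exact update: it assumes the per-pair cost of aligning track $t$ with mention $m$ is the squared distance between row $t$ of $Y$ and row $m$ of $Z$, a surrogate for the soft connection penalty, and it uses the standard DTW step set (advance the track, the mention, or both). The function name and toy inputs are hypothetical.

```python
import numpy as np

def dtw_align(Y, Z):
    """Monotonic alignment of tracks (rows of Y) to mentions (rows of Z).

    Returns a binary (T, M) matrix marking the warping path and its total cost.
    """
    Y = np.asarray(Y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    T, M = Y.shape[0], Z.shape[0]
    # Pairwise squared distances between track and mention name vectors.
    D = ((Y[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)

    acc = np.full((T, M), np.inf)
    acc[0, 0] = D[0, 0]
    for t in range(T):
        for m in range(M):
            if t == 0 and m == 0:
                continue
            best_prev = min(
                acc[t - 1, m] if t > 0 else np.inf,
                acc[t, m - 1] if m > 0 else np.inf,
                acc[t - 1, m - 1] if t > 0 and m > 0 else np.inf,
            )
            acc[t, m] = D[t, m] + best_prev

    # Backtrack from the last cell to recover the monotonic path.
    A = np.zeros((T, M), dtype=int)
    t, m = T - 1, M - 1
    A[t, m] = 1
    while (t, m) != (0, 0):
        candidates = []
        if t > 0 and m > 0:
            candidates.append((acc[t - 1, m - 1], t - 1, m - 1))
        if t > 0:
            candidates.append((acc[t - 1, m], t - 1, m))
        if m > 0:
            candidates.append((acc[t, m - 1], t, m - 1))
        _, t, m = min(candidates)
        A[t, m] = 1
    return A, float(acc[-1, -1])

# Toy example: tracks 0 and 1 carry name 0, track 2 carries name 1.
Y = [[1, 0], [1, 0], [0, 1]]
Z = [[1, 0], [0, 1]]
A, cost = dtw_align(Y, Z)
print(A.tolist(), cost)  # [[1, 0], [1, 0], [0, 1]] 0.0
```

In a block coordinate descent loop, a step like this would alternate with the two quadratic programs, each holding the other blocks fixed.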
Dataset
Quantitative Results
Name assignment to tracks in video.
- Random: randomly picks a name based on crude alignment
- Cour: weakly supervised method for name assignment
- BOJ: min $C_{\text{track}}$ without scene constraint
- OurUnidir: min $C_{\text{track}}$ with scene constraint
- OurUnicor: min $C_{\text{track}}$ with coreference constraints
- OurUnif: all tracks given equal values in the alignment matrix
- OurBidir: full model
Quantitative Results
Name assignment to mentions in text.
Qualitative Results
Errors
- Missing or low-resolution faces
- Errors in coreference resolution
Summary
Contribution:
- Joint person naming and coreference resolution
- New dataset
- State-of-the-art performance on both the visual and textual sides
Future work:
- Actions/attributes for alignment
V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking People in Videos with “Their” Names Using Coreference Resolution. In Computer Vision – ECCV 2014, pages 95–110. Springer International Publishing, Cham, Sept. 2014.