Towards Bridging Bottom-Up & Top-Down Vision with Hierarchical Compositional Models. UC Irvine. Iasonas Kokkinos, Center for Image and Vision Sciences, UCLA. Joint work with Alan Yuille.
High-Level Vision Goals • Given an image – Decide if it contains a car – Find its location – Find its extent – Find its structure
Two Main Approaches to Vision • Bottom-up: data driven, feature extraction, pattern recognition • Top-down: model driven, parameter estimation, analysis-by-synthesis
Motivation – Vision problems have both low- and high-level aspects (D. Mumford, Pattern Theory, 1995) – Synergy: joint treatment improves performance – Combined bottom-up and top-down processing
Talk Outline • Motivation • Deformations and Contours • Object Parsing • Appearance Information • Conclusions
Top-Down: Object Models • Deformable Models S(X) • Active Appearance Models
Joint Segmentation and Recognition • EM formulation – E-step: segmentation – M-step: deformable model fitting • AAM-based segmentation • Segmentation-based detection (Kokkinos & Maragos, PAMI 2008)
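To make the E/M alternation concrete, here is a minimal, self-contained sketch on a toy problem: a 1-D two-component Gaussian mixture stands in for the segmentation (E) / model-fitting (M) pair. It is only the generic EM skeleton, not the AAM-based formulation of Kokkinos & Maragos; all numbers are synthetic.

```python
import numpy as np

# Toy illustration of the E/M alternation: soft-assign samples to two
# components (E-step, the analogue of segmentation) and re-fit the model
# parameters given the assignment (M-step, the analogue of model fitting).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

mu = np.array([-1.0, 1.0])      # initial component means
sigma = np.array([1.0, 1.0])    # initial standard deviations
pi = np.array([0.5, 0.5])       # mixing weights

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for it in range(50):
    # E-step: posterior responsibility of each component for each sample
    resp = pi * gauss(x[:, None], mu, sigma)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the "model" given the soft assignment
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu, sigma, pi)
```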
Learning Deformation Models • AAM learning alternates E (deform) and M (update) steps • Figure: input images, edges & ridges, AAM fit, training set, learned deformation modes (Kokkinos and Yuille, ICCV 2007)
Bottom-Up: Contour-Based Image Description • Primal sketch contours: edges and ridges (edge tokens, ridge tokens) – Geometry & semantics
Talk Outline • Motivation • Contours, Deformations and Hierarchy • Object Parsing • Appearance Information • Conclusions
Hierarchical Compositional Models • Hierarchy: object, parts, contours, tokens • Top-down view: object generates tokens • Bottom-up view: object is composed from tokens
Inference for Structured Models • Graphical Models (Bayesian Networks / MRFs) – Encode random-variable dependencies with a graph – In high-level vision: random variables are part poses (e.g. location, orientation, scale); dependencies are kinematic constraints • Belief Propagation – Graph nodes ‘inform’ each other by sending messages – Converges after two passes through the graph (exact on trees)
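As a concrete illustration of the message passing, here is a minimal max-product belief propagation sketch on a three-node chain with discrete states (think quantized part poses). The potentials are random placeholders, not image evidence or kinematic terms.

```python
import numpy as np

# Max-product belief propagation on a chain x1 - x2 - x3, each variable
# taking K discrete states. Two passes (leaves -> root, root -> leaves)
# give the max-marginals for every node.
K = 4
rng = np.random.default_rng(1)
unary = [rng.random(K) for _ in range(3)]   # log-potentials per node
pair12 = rng.random((K, K))                 # log-compatibility of (x1, x2)
pair23 = rng.random((K, K))                 # log-compatibility of (x2, x3)

# Forward pass (root = node 3)
m1_to_2 = np.max(unary[0][:, None] + pair12, axis=0)
m2_to_3 = np.max((unary[1] + m1_to_2)[:, None] + pair23, axis=0)

# Backward pass
m3_to_2 = np.max(unary[2][None, :] + pair23, axis=1)
m2_to_1 = np.max((unary[1] + m3_to_2)[None, :] + pair12, axis=1)

# Max-marginals after the two passes; argmax gives each node's MAP state
b1 = unary[0] + m2_to_1
b2 = unary[1] + m1_to_2 + m3_to_2
b3 = unary[2] + m2_to_3
print(b1.argmax(), b2.argmax(), b3.argmax())
```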
Exploiting the Particular Setting – Sparse Image Representation • Bottom-up cues guide the search for objects. • No need to consider all node states as in BP – Hierarchical Object Representation • Quickly rule out unpromising solutions • Coarse-to-Fine detection
Compositional Detection • View production rules as composition rules • Build a parse tree for the object • Requires – Composition rules – Prioritized search
Composition of the ‘Back’ Structure
Composing Structures • How can we compose complex structures? – Gestalt rules (parallelism, similarity, ...) • How will we compose this? • How will we compose learned structures?
Canonical Rule Formulation • Combine a structure with one constituent at a time • Mechanical construction of composition rules • At most binary rules • Derivation cost: minus log-likelihood of observations
Composition as Climbing a Lattice • Introduce a binary vector indicating instantiated substructures – partial ordering among structures • Hasse diagram for a 3-partite structure: nodes 000 up to 111, ordered by inclusion • By acquiring a substructure, the structure climbs upwards
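A small sketch of the lattice view, under the assumption that a structure's state can be encoded as a bitmask over its constituents: each canonical binary rule adds one constituent and accumulates a minus-log-likelihood cost. The likelihood values below are made up for illustration.

```python
import math

# A structure's state is a bitmask over its constituents (the lattice nodes
# 000 ... 111 for a 3-part structure). A canonical binary rule adds one
# constituent at a time; the derivation cost accumulates the negative
# log-likelihood of each newly explained observation (hypothetical values).
N_PARTS = 3
part_likelihood = [0.8, 0.6, 0.9]   # hypothetical P(observation | part)

def compose(state, part_idx, cost):
    """Binary rule: (partial structure, one new constituent) -> larger structure."""
    assert not (state >> part_idx) & 1, "part already instantiated"
    new_state = state | (1 << part_idx)
    new_cost = cost - math.log(part_likelihood[part_idx])
    return new_state, new_cost

# Climb the lattice 000 -> 100 -> 110 -> 111, one constituent per step.
state, cost = 0b000, 0.0
for p in (2, 1, 0):
    state, cost = compose(state, p, cost)
    print(f"state={state:03b}  cost={cost:.3f}")
```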
Composition of the ‘Back’ Structure • Problem: too many options! (Combinatorial explosion)
Analogy: Building a Puzzle • Bottom-up solution: combine pieces until you build the car – Does not exploit the box's cover • Top-down solution: try fitting each piece to the box's cover – Most pieces are uniform/irrelevant • Bottom-up/top-down solution: form car-like structures, but use the cover to suggest combinations
Best-First Search • Dijkstra's algorithm – Prioritize based on ‘cost so far’ – For parsing: Knuth's Lightest Derivation • A* search – Also consider ‘cost to go’ – Approximate it with a heuristic cost • Figure: path from entry to exit, split into cost so far and (heuristic) cost to go
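For reference, a minimal A* sketch on a toy grid: the priority is cost so far plus an admissible heuristic estimate of the cost to go (Manhattan distance here); with a zero heuristic it reduces to Dijkstra-style best-first search on cost so far. This is generic shortest-path A*, not the parsing-specific variant.

```python
import heapq

# Minimal A* on a 4-connected grid: priority = cost so far + heuristic
# estimate of the cost to go (Manhattan distance, admissible here).
def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start)]          # (priority, cost_so_far, node)
    best = {start: 0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                ng = g + 1
                if ng < best.get((r, c), float("inf")):
                    best[(r, c)] = ng
                    heapq.heappush(frontier, (ng + h((r, c)), ng, (r, c)))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))   # expected shortest path length: 6
```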
‘Cost to go’ for Parsing • The Generalized A* Architecture (Felzenszwalb & McAllester) • Context: the complement needed to get to the goal • Recursive derivation of contexts
Heuristics for Parsing: Context Abstractions • A* requires a lower bound on the derivation cost • Derive contexts in a coarser domain (abstraction) – This lower-bounds the cost in the fine domain • Use it to prioritize the search (figure: KLD vs. A* priorities)
Abstractions via Structure Coarsening • Coarsening: identify (merge) nodes of the Hasse diagram • Figure: the fine lattice 000 ... 111 is coarsened so that one part suffices • Lower bounds the composition cost
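One way to read the coarsening as an admissible heuristic, sketched with made-up numbers: if in the coarsened lattice a single part suffices to reach the goal, the coarse cost-to-go pays only for the cheapest remaining part, which can never exceed the fine cost-to-go that pays for all remaining parts. This is an illustrative reading, not the exact construction in the paper.

```python
import math

# Illustrative lower bound from coarsening: coarse "one part suffices"
# cost vs. fine "all remaining parts" cost. Likelihoods are made up.
part_likelihood = [0.8, 0.6, 0.9]
part_cost = [-math.log(p) for p in part_likelihood]

remaining = [1, 2]                                         # parts not yet instantiated
fine_cost_to_go = sum(part_cost[i] for i in remaining)     # true cost to go
coarse_cost_to_go = min(part_cost[i] for i in remaining)   # coarsened bound

assert coarse_cost_to_go <= fine_cost_to_go   # admissible heuristic
print(coarse_cost_to_go, fine_cost_to_go)
```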
Coarse-Level Parsing • Bottom-up: KLD in the coarse domain • Top-down: coarse-domain contexts passed to the fine level
Fine-Level Parsing • Top-down guidance: heuristic from the coarse level • Bottom-up composition at the fine level
A* versus Best-First Parsing • A* parsing (figure: front, middle, and back parts composed into the object goal, across coarse and fine levels) • KLD parsing
Parsing & Localization Results - I
Parsing & Localization Results - II
Parsing & Localization Results - III • Figure: detection rate vs. false positives per image for Apples and Bottles, comparing Contour Segment Networks with our method using Berkeley edges and Lindeberg edges
UIUC Benchmark Results • 170 images, heavy clutter – KLD: typically ~10 seconds – A* search: ~1-2 seconds • Figure: recall vs. precision, comparing our method with Leibe et al., Fergus et al., and Agarwal and Roth
Talk Outline • Motivation • Contours, Deformations and Hierarchy • Object Parsing • Appearance Information • Conclusions
Are we missing something? • Appearance information • Main challenge: scale invariance for edges – Edges are intrinsically 1-D features
Scale Invariance without Scale Selection • Log-polar sampling & spatially varying filtering (scale space) – Turns scalings/rotations into translations • Fourier transform modulus: translation invariance (Kokkinos and Yuille, CVPR 2008)
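A rough sketch of the log-polar + Fourier-modulus idea (the spatially varying filtering step is omitted, and the grid sizes and radii are arbitrary choices, not the paper's parameters): resample a patch on a log-polar grid so that scalings and rotations become shifts of the resampled signal, then take the 2-D FFT magnitude, which is invariant to such shifts up to boundary effects.

```python
import numpy as np
from scipy.ndimage import map_coordinates

# Log-polar resampling around a keypoint followed by the FFT modulus.
# Scaling the patch shifts the signal along the log-radius axis, rotating
# it shifts the signal along the angle axis; the FFT magnitude discards
# those shifts.
def logpolar_descriptor(patch, center, r_min=2.0, r_max=20.0,
                        n_rad=32, n_ang=32):
    cy, cx = center
    log_r = np.linspace(np.log(r_min), np.log(r_max), n_rad)
    theta = np.linspace(0, 2 * np.pi, n_ang, endpoint=False)
    rr, tt = np.meshgrid(np.exp(log_r), theta, indexing="ij")
    rows = cy + rr * np.sin(tt)
    cols = cx + rr * np.cos(tt)
    # Bilinear resampling on the log-polar grid
    lp = map_coordinates(patch, [rows, cols], order=1, mode="nearest")
    # Translation-invariant representation of the log-polar signal
    return np.abs(np.fft.fft2(lp))

rng = np.random.default_rng(0)
patch = rng.random((64, 64))
desc = logpolar_descriptor(patch, center=(32, 32))
print(desc.shape)
```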
Descriptor Performance
Talk Outline • Motivation • Contours, Deformations and Hierarchy • Object Parsing • Appearance Information • Conclusions
Contributions • A* search framework for object parsing – Bottom-up information: production cost – Top-down information: heuristic function • Composition rules – Canonical rule formulation / Hasse diagrams – Integral angles (not covered) • Heuristics for parsing – Structure coarsening
Future Research – Compositional approach • Learning structures and hierarchies • Parsing and learning with alternative structures (ORs) • Reusable parts, multiple-class recognition – Revisit low- and mid-level vision problems • Segmentation • Boundary detection • Perceptual grouping – Scene parsing