CNN^2: Viewpoint Generalization via a Binocular Vision Wei-Da Chen and Shan-Hung Wu CS Department, National Tsing-Hua University Taiwan, R.O.C. wdchen@datalab.cs.nthu.edu.tw, shwu@cs.nthu.edu.tw
On Generalizability of CNNs • Convolutional Neural Networks (CNNs) have laid the foundation for many techniques in various applications • However, the 3D viewpoint generalizability of CNNs still falls far behind humans' visual capabilities
3D Viewpoint Generalizability (figure: objects seen from training viewpoints vs. unseen test viewpoints) • Humans can recognize objects at unseen angles • But CNNs cannot
Outline • Related work • CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling • Experiments
Voxel-Reconstruction Methods • E.g., the Perspective Transformer Networks (PTNs) by Yan et al. 16 • Learn 3D models directly
Cons • Require either • Voxel-level supervision, or • Omnidirectional images as input • Both are expensive to collect in practice
CapsuleNets (Hinton et al. 17, 18) • Different capsules are organized in a parse tree where lower-level capsules are dynamically routed to upper-level capsules using an agreement protocol • When the viewpoint changes, the "routes" change in a coordinated way
But… • People have found that CapsuleNets are hard to train • Capsules increase the number of model parameters • The iterative routing-by-agreement algorithm is time-consuming • It does not ensure the emergence of a correct parse tree (Peer et al. 18) • Not compatible with CNNs • and therefore cannot benefit from the rich CNN ecosystem
Outline • Related work • CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling • Experiments
Our Goals • A new model that • has improved 3D viewpoint generalizability • does not require expensive input and supervision • is CNN compatible
Observation: Humans understand the world using two eyes!
Binocular Images • Today, binocular images can be easily collected • The majority of people use smartphones, which are now usually equipped with two or more lenses • One can also extract two nearby frames from online videos to construct a large binocular image dataset
Binocular Solution 1 (LeCun et al. 14) (diagram: two images merged, then fed through Conv/Pooling layers to a classifier) • Stacks the two binocular images along the channel dimension and then feeds them to a regular CNN • But this does not model any prior of binocular vision
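A minimal sketch of this channel-stacking baseline, assuming NumPy arrays with hypothetical H x W x C shapes (the shapes and function name are illustrative, not from the paper):

```python
import numpy as np

def stack_binocular(left, right):
    """Stack left/right images along the channel axis (H x W x C -> H x W x 2C),
    so a regular CNN can consume the pair as one input."""
    assert left.shape == right.shape
    return np.concatenate([left, right], axis=-1)

left = np.zeros((32, 32, 3))   # placeholder left-eye image
right = np.ones((32, 32, 3))   # placeholder right-eye image
x = stack_binocular(left, right)
print(x.shape)  # (32, 32, 6)
```

Note that the CNN sees the two views only as extra channels, which is why no binocular prior (e.g., parallax) is modeled explicitly.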
Binocular Solution 2: Sol. 1 + Monodepth (Godard et al. 17) • Calculate the depth map explicitly, then add it as additional input channels
However… • The depth information is only a subset of the knowledge that can be learned from binocular vision • Studies in neuroscience have found that the human visual system can detect • Stereoscopic edges (Von Der Heydt et al. 00) • Foreground and background (Qiu and Von Der Heydt 05; Maruko et al. 08) • Illusory contours of objects (Von Der Heydt et al. 84; Anzai et al. 07)
Our Solution: CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling (architecture diagram: two pathways of repeated augment → Conv → CM Pooling blocks; the two pathway outputs are added and fed to a classifier)
Outline • Related work • CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling • Experiments
Dual Feedforward Pathways (diagram: the CNN^2 pathways alongside the human visual system — optic nerve, optic chiasm, lateral geniculate nucleus (LGN), visual cortex) • The left and right sides of the human visual system are known to have bias (Gotts et al. 13) • Filters/kernels in the left and right pathways can learn different (biased) features
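A hedged sketch of the dual-pathway idea: each pathway keeps its own untied weights (here a single matrix stands in for a whole Conv + CM-pooling stack), and the two pathway outputs are added before the classifier. All shapes and names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Untied weights: the left and right pathways can learn different
# (biased) features because they do not share parameters.
W_left = rng.normal(size=(6, 4))
W_right = rng.normal(size=(6, 4))

def pathway(features, W):
    """Stand-in for one feedforward pathway (Conv + CM-pooling stack)."""
    return features @ W

h_left = np.ones((1, 6))    # placeholder left-pathway features
h_right = np.ones((1, 6))   # placeholder right-pathway features

# The two pathway outputs are added, then fed to a classifier.
logits = pathway(h_left, W_left) + pathway(h_right, W_right)
print(logits.shape)  # (1, 4)
```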
Outline • Related work • CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling • Experiments
Dual Parallax Augmentation (1/2) Left path: h̃_L = concat(h_L, h_L − h_R) ∈ ℝ^{W×H×2C} Right path: h̃_R = concat(h_R, h_R − h_L) ∈ ℝ^{W×H×2C}, where h_L, h_R ∈ ℝ^{W×H×C} are the left- and right-pathway feature maps
Dual Parallax Augmentation (2/2) • Allows the filters/kernels in convolutional layers to recursively detect stereoscopic features at different abstraction levels by looking into the parallax • The small differences between the two input images at the pixel level and at shallow layers may add up to a big difference at a deeper layer
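The augmentation step above can be sketched as a single NumPy function: each pathway's feature map is concatenated with its parallax (the difference against the other pathway) along the channel axis, doubling the channels. Values and shapes are illustrative:

```python
import numpy as np

def augment(h_self, h_other):
    """Parallax augmentation for one pathway:
    concat(h, h - h_other) along channels, W x H x C -> W x H x 2C."""
    return np.concatenate([h_self, h_self - h_other], axis=-1)

h_L = np.full((8, 8, 4), 2.0)   # placeholder left feature map
h_R = np.full((8, 8, 4), 1.5)   # placeholder right feature map

h_L_aug = augment(h_L, h_R)     # left path:  concat(h_L, h_L - h_R)
h_R_aug = augment(h_R, h_L)     # right path: concat(h_R, h_R - h_L)
print(h_L_aug.shape)  # (8, 8, 8)
```

Applying this before every convolutional layer is what lets the parallax channels be re-derived at each abstraction level.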
Outline • Related work • CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling • Experiments
Concentric Multiscale (CM) Pooling (1/2) • Areas that are out of focus are blurred
Concentric Multiscale (CM) Pooling (2/2) (diagram: average pooling applied over concentric windows of multiple scales, scanned across the feature map)
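A simplified sketch of the multiscale idea, assuming same-size average pooling at a few concentric window sizes whose outputs are stacked along channels — giving clear-to-blurry versions of every region. The scale choices and zero padding are assumptions for illustration, not the paper's exact operator:

```python
import numpy as np

def avg_pool_same(img, k):
    """k x k mean filter with zero padding; output has the input's size."""
    H, W = img.shape
    p = k // 2
    padded = np.pad(img, p)
    out = np.empty_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def cm_pool(img, scales=(1, 3, 5)):
    """Stack average-pooled copies at concentric window sizes, so later
    convolutions can contrast clear (small-window) and blurry
    (large-window) features."""
    return np.stack([avg_pool_same(img, k) for k in scales], axis=-1)

img = np.arange(16.0).reshape(4, 4)
out = cm_pool(img)
print(out.shape)  # (4, 4, 3)
```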
Placed Before Convolution • Allows filters/kernels to contrast blurry features with clear features
Outline • Related work • CNN^2 • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling • Experiments
Datasets • ModelNet2D (grayscale) • SmallNORB (grayscale) • RGBD-Object (RGB)
Train/Test Setting
3D Viewpoint Generalization
Learning Efficiency
Backward Compatibility • CNN^2, by default, does not generalize to 2D rotated images • But it can be enhanced by existing work on 2D rotation generalizability
Takeaways • We propose CNN^2, which • gives improved 3D viewpoint generalizability • does not require expensive input or supervision • is compatible with CNNs and can benefit from the rich CNN ecosystem • Detects stereoscopic features beyond depth via: • Dual feedforward pathways • Dual parallax augmentation • Concentric Multiscale (CM) pooling from binocular images