Talk @Munich, October 11, 2017
Beyond detection: GANs and LSTMs to pay attention at human presence
Rita Cucchiara, Imagelab, Dipartimento di Ingegneria «Enzo Ferrari», University of Modena e Reggio Emilia, Italy
Agenda
Beyond Human Detection:
1) See humans
2) See what humans see
Use of GANs, iterative and recurrent neural architectures in Vision
Beyond (People) Detection
✓ 10 years of pedestrian detection [S. Zhang, R. Benenson, M. Omran, J. Hosang, B. Schiele, CVPR 2016]: about 70% accuracy on Caltech
✓ Many deep networks for pedestrian detection, CNNs + handcrafted features, e.g. FAST CFM [Hu, Wang, Shen, van den Hengel, Porikli, IEEE TCSVT 2017]: 9% miss rate on the Caltech reasonable dataset
✓ Object detectors: SSD [W. Liu et al., "SSD: Single Shot MultiBox Detector", 2017], YOLO, YOLOv2 [Redmon, Farhadi, arXiv 2017]: YOLOv2 reaches 78.6% mAP on VOC2007-12 at 40 fps
Still a margin of improvement..
Standard networks are not enough: challenges in new environments.
Embedded vision solutions with background subtraction and CNNs: real-time detection of people and AGVs in working areas on embedded NVIDIA boards at Imagelab.
If People Detection Is Solved..
GANS FOR UNDERSTANDING HUMAN PRESENCE UNDER EXTREME CONDITIONS (thanks to Matteo Fabbri and Simone Calderara; thanks to PANASONIC)
Attribute Classification
Example attributes: Male, Jacket, Black hair, Backpack, Plastic bag, Long trousers
Now CNNs can classify more than 50 attributes.
Problems with:
• Low resolution
• Occlusions and self-occlusions
Generative Adversarial Networks
"..a generative model G captures the data distribution, ..a discriminative model D estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake" [I. Goodfellow .. Y. Bengio, 2014]
Generator (CNN): maps noise to samples. Discriminator (CNN): tells real from generated.
A conditional generative model p(x | c) can be obtained by adding c as input to both G and D, e.g. c being a low-resolution or incomplete image.
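The minimax game above can be made concrete with its two loss terms. A minimal NumPy sketch (function names and the toy probabilities are illustrative, not from the talk):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes log D(x) + log(1 - D(G(z))): minimize the negative
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating form: G maximizes log D(G(z)), i.e. the
    # probability that D mistakes generated samples for real data
    return -np.mean(np.log(d_fake))

# d_* are discriminator output probabilities in (0, 1]
d_real = np.array([0.9, 0.8])   # D on training samples
d_fake = np.array([0.2, 0.3])   # D on generated samples
```

The better G fools D (d_fake closer to 1), the lower the generator loss; a perfect D on both batches drives the discriminator loss to 0.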
With a GAN from Noise..
RAP: A Richly Annotated Dataset for Pedestrian Attribute Recognition [http://rap.idealtest.org/]
Dataset dimension: 41,585 pedestrian samples (33,268 for training, 8,317 for testing)
Image resolution: from 36x92 to 344x554
Fabbri, Calderara, Cucchiara, "Generative Adversarial Models for People Attribute Recognition in Surveillance", IEEE AVSS 2017
Generative Adversarial Network for De-occlusion (or Super-Resolution)
Generator (encoder-decoder): occluded image → de-occluded (fake), compared with the original image via SSE
Discriminator: cross-entropy on fake vs. real
Datasets: occRAP and lowRAP, built from RAP by Imagelab
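The generator is thus trained on two signals at once: an SSE reconstruction term against the original RAP image and an adversarial cross-entropy term from the discriminator. A hedged NumPy sketch (the weight `lam` is a hypothetical hyper-parameter, not taken from the paper):

```python
import numpy as np

def sse(reconstructed, original):
    # pixel-wise sum of squared errors vs. the clean image
    return np.sum((reconstructed - original) ** 2)

def adversarial_term(d_fake):
    # cross-entropy pushing G to make D label its fakes as real
    return -np.mean(np.log(d_fake))

def generator_objective(reconstructed, original, d_fake, lam=0.01):
    # lam (hypothetical) balances reconstruction vs. fooling D
    return sse(reconstructed, original) + lam * adversarial_term(d_fake)
```

A perfect reconstruction that also fools the discriminator scores 0; either blurry output or an easily detected fake raises the objective.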
Selected Architecture
Generator (encoder-decoder):
• Encoder: input (3x160x64); SConv1..SConv4, 5x5 kernels, stride 2, with Batch-Norm and Leaky-ReLU: (256x80x32) → (256x40x16) → (512x20x8) → (1024x10x4)
• Decoder: TConv1..TConv4, 5x5 kernels, 2x upsampling, with Batch-Norm and ReLU, back up to the output (3x160x64)
Discriminator: input (3x160x64); SConv1..SConv4, 5x5 kernels, stride 2, Batch-Norm, Leaky-ReLU: (128x80x32) → (256x40x16) → (512x20x8) → (1024x10x4) → (1x1x1) classification (fake or real)
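As a sanity check on the tensor sizes above, the chain of stride-2 convolutions and x2 upsamplings can be traced with simple shape arithmetic (the decoder channel plan below is assumed to mirror the discriminator's and may differ in detail from the actual network):

```python
def down(c_out, shape):
    # 5x5 conv, stride 2: set channels to c_out, halve H and W
    _, h, w = shape
    return (c_out, h // 2, w // 2)

def up(c_out, shape):
    # 5x5 transposed conv with x2 upsampling: double H and W
    _, h, w = shape
    return (c_out, h * 2, w * 2)

shape = (3, 160, 64)                 # input crop (C, H, W)
for c in (128, 256, 512, 1024):      # encoder SConv1..SConv4
    shape = down(c, shape)
bottleneck = shape                   # (1024, 10, 4), as on the slide
for c in (512, 256, 128, 3):         # decoder TConv1..TConv4 (assumed plan)
    shape = up(c, shape)
output = shape                       # (3, 160, 64): reconstructed image
```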
RESULTS De-occlusion Super Resolution
The Complete Approach: De-occlusion and Super-resolution for Aspect Recognition
• Attribute classification network details — batch size: 8, GPU: 1080ti, training time: 24 hours
• Reconstruction GAN (for de-occlusion) details — batch size: 256, GPU: 1080ti, training time: 48 hours
• Super-resolution GAN (for image resolution) details — batch size: 128, GPU: 1080ti, training time: 72 hours
Attribute classification
✓ More than 75% precision and recall for 50 people attributes on RAP
✓ Acceptable results for occluded shapes and good results for low-resolution shapes
If People Detection Is Still Not Solved.. ..tracking without detection
TRACKING HUMANS IN THE WILD BY JUNCTIONS WITH CPM (thanks to Fabio Lanzi and Simone Calderara)
EU-ER-FESR 2015-2018
State-of-the-art: Recurrent Nets for object tracking
For long-term tracking [D. Zhang, H. Maei, X. Wang, Y.-F. Wang, Samsung/UCSB, arXiv 2017]:
- YOLO network for detection (fine-tuned on PascalVOC)
- NVIDIA GTX1080 GPU, 45 fps (Python, TensorFlow); ~70 fps with precomputed YOLO features
- Recurrence is provided by an LSTM
Very fast. Still very low accuracy..
Recurrence with CPM
✓ CPM (Convolutional Pose Machines)*: a sequence of convolutional nets that repeatedly produce 2D belief maps for the location of interesting parts (human junctions)
✓ A belief map is a non-parametric encoding of the spatial uncertainty of location.
✓ CPM learns implicit relationships between parts
✓ It is not recurrent but a multi-stage network, trained with backpropagation
*[S.-E. Wei, V. Ramakrishna, T. Kanade, Y. Sheikh, «Convolutional Pose Machines», CVPR 2016]
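The non-parametric belief map for one joint is just a 2D Gaussian centred on the (annotated or predicted) joint location; a minimal NumPy sketch (map size and sigma are illustrative choices):

```python
import numpy as np

def belief_map(h, w, joint, sigma=2.0):
    # 2D Gaussian peak encoding the spatial uncertainty of one joint
    ys, xs = np.mgrid[0:h, 0:w]
    jy, jx = joint
    return np.exp(-((ys - jy) ** 2 + (xs - jx) ** 2) / (2.0 * sigma ** 2))

# one such map per joint type ("nose", "neck", ...)
bmap = belief_map(46, 46, joint=(20, 30))
```

The map peaks at 1 exactly on the joint and decays smoothly around it, so later stages can reason about uncertain locations rather than hard coordinates.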
Without detection: Temporal CPM3
Imagelab: tracking multiple body parts with T-CPM (Temporal Convolutional Pose Machines). An iterative network (CPM) for predicting:
• the position of joints (H)
• their mutual association in space (P)
• their association in time (T)
Three Branches: Heatmaps, PAFs and TAFs
• Heatmaps model the part locations as Gaussian peaks in the map; one for each joint ("nose", "neck", "left-shoulder", ..)
• PAFs (Part Affinity Fields) assemble the detected joints: the score of a candidate limb is proportional to its alignment with the PAF associated with that type of limb (a PAF vector connects two nodes).
• TAFs (Temporal Affinity Fields) link the corresponding joints of the same person in consecutive frames, for an unknown number of people (a TAF vector connects the same node across time).
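The PAF limb score can be sketched as the average dot product between the candidate limb's unit direction and the field sampled along the segment, a simplified discretization of the line integral (the `(2, H, W)` field layout in (y, x) order is an assumption for this sketch):

```python
import numpy as np

def paf_score(paf, p1, p2, n_samples=10):
    # paf: (2, H, W) field of unit vectors (dy, dx) for one limb type.
    # Score a candidate limb p1 -> p2 by the mean alignment between the
    # limb direction and the field sampled along the segment.
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = (p2 - p1) / np.linalg.norm(p2 - p1)
    total = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        y, x = np.rint(p1 + t * (p2 - p1)).astype(int)
        total += paf[:, y, x] @ d
    return total / n_samples

# toy field: all affinity vectors point along +x
field = np.zeros((2, 12, 12))
field[1] = 1.0
aligned = paf_score(field, (5, 0), (5, 11))   # limb along +x: high score
crossed = paf_score(field, (0, 5), (11, 5))   # limb along +y: no support
```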
Visual Example
How to provide initial annotation?
GTA with the ScriptHook library:
• Access to native GTA functions
• Photorealistic, with plausible dynamics
• Lifelike entity AI
• Customizable
• Extracts all the information available to the game engine
T-CPM3 In Action on Tracking People in the Wild
The deep architecture and the software are property of Imagelab UNIMORE. We thank the Jump project, funded within the EU ER-FESR 2015-2020 program.
For Tracking, Action, Behavior Recognition
T-CPMs do not use recurrence but work on sequences of frames, refining with iterations of long convolutional layers.
- Problem: vanishing gradients
- Long Short-Term Memory architectures can be a solution for time iterations, but not for long time sequences
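The vanishing-gradient problem mentioned above can be seen in a few lines: backpropagation through time multiplies the gradient by the recurrent Jacobian once per frame, so with spectral radius below 1 the signal from distant frames decays geometrically (toy linear RNN, activation derivatives omitted):

```python
import numpy as np

W = 0.5 * np.eye(4)           # recurrent weights, spectral radius 0.5
grad = np.ones(4)             # gradient arriving at the last time step

norms = []
for _ in range(30):           # backprop through 30 frames
    grad = W.T @ grad         # one step of backprop through time
    norms.append(float(np.linalg.norm(grad)))

# after 30 steps almost no gradient reaches the earliest frames;
# LSTM gating mitigates this for moderate sequence lengths
```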
If target detection is not required..
SALIENCY DETECTION WITH LSTMS: THE SAM ARCHITECTURE (thanks to Marcella Cornia, Giuseppe Serra and Lorenzo Baraldi)
SALIENCY DETECTION @Imagelab: SAM
Benchmarks: MIT300 (Itti, Torralba et al.), more than 70 competitors since 2014; SALICON (Jiang et al., 2015), 10,000 images.
Saliency Attentive Model (SAM): ML-NET + LSTMs. M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "A Deep Multi-Level Network for Saliency Prediction", ICPR 2016
Number of images: 20,000 (10,000 training, 5,000 validation, 5,000 test)
GPU: NVIDIA K80 on the GALILEO supercomputer at CINECA
Training time: ~15 hours
Winner of the LSUN Challenge, CVPR 2017
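Saliency benchmarks such as MIT300, SALICON and the LSUN challenge score predictions with fixation-based metrics; one of the standard ones, Normalized Scanpath Saliency (NSS), is easy to sketch (the map and fixation points below are toy data):

```python
import numpy as np

def nss(saliency, fixations):
    # Normalized Scanpath Saliency: z-score the predicted map, then
    # average it at the human fixation points (higher is better)
    s = (saliency - saliency.mean()) / saliency.std()
    return float(np.mean([s[y, x] for y, x in fixations]))

pred = np.zeros((5, 5))
pred[2, 2] = 1.0               # toy prediction with one salient peak
```

A prediction whose peak coincides with the fixations scores high; fixations landing on flat, low-valued regions pull the score below zero.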
Ground truth vs. SAM on the Actions in the Eye (Hollywood2) dataset
Saliency in task-driven video
Bottom-up saliency, detected by ML-NET trained on SALICON, on the DR(EYE)VE dataset http://imagelab.ing.unimore.it/dreyeve
Saliency not driven by a task: as a passenger sees. Saliency trained on driving: as a driver sees.
SIFT-BASED REGISTRATION FRAME BY FRAME
Collected with SMI ETG 2w (frontal camera 720p/30fps + eye pupil cameras at 60fps) and GARMIN Virb X (1080p/25fps + GPS).
Some conclusions (if any)
✓ Computer vision is now a Deep Learning based discipline
✓ Computer vision systems cannot be built without GPUs (both in training and at run-time)
✓ Conv-Nets are fundamental bricks of new architectures
✓ Autoencoders: for image generation
✓ (Conditional) Generative Adversarial Networks: for low-resolution, occluded attribute recognition
✓ Multi-layer convolutional networks for emulating recurrence, as T-CPM3 for tracking
✓ Recurrent nets and Long Short-Term Memories for short-time analysis: saliency and video captioning
✓ … Computer Vision + Deep Architectures + GPUs
Thank you!
rita.cucchiara@unimore.it
http://imagelab.ing.unimore.it
Acknowledgements