When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous
Xi Sun, Xinshuo Weng, Kris Kitani
Robotics Institute, Carnegie Mellon University
IROS 2020
Motivation
• Use case 1: An autonomous vehicle identifies and locates its user
• Use case 2: An assistive robot tries to locate its target person for the first time
Proposed Task
Given a query IMU sequence from a person's smartphone, locate in the video the person that the IMU data comes from.
Why IMU?
• Inertial measurements (accelerometer and gyroscope readings) provide rich relative 3D motion information
• People often carry smart devices (smartphones and smartwatches) equipped with inertial sensors
• IMU data can be transmitted easily and selectively at low cost
• IMU data contains minimal biometric or privacy-sensitive information
Prior Work on Visual-Inertial Person Identification and Tracking
Formulated as a graph optimization problem:
• Node: predict the person's orientation with respect to the camera from a single image using a VGG16-based network
• Edge: estimate the person's 3D foot position at each frame to compute 3D velocity between pairs of node images
• Hand-crafted inertial features are matched with the visual data
Henschel, Roberto, Timo von Marcard, and Bodo Rosenhahn. "Simultaneous identification and tracking of multiple people using video and IMUs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
Proposed Approach
For the query IMU data from a person labeled by index n ∈ [N], where N is the number of people in the video, extract the inertial feature g_IMU.
For each candidate person in the video, extract the visual feature g_VIS.
Learn two mappings H_VIS : g_VIS → f and H_IMU : g_IMU → f such that transformed features from the same person lie close together in the joint feature space.
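A minimal sketch of the two mappings H_VIS and H_IMU as small embedding heads producing vectors in a shared space. The layer sizes, the 128-D joint space, and the input dimensions are illustrative assumptions, not the configuration used in the paper (which encodes sequences with LSTMs, as shown later).

```python
# Illustrative sketch only: two encoders mapping modality-specific features into a
# common joint feature space, where same-person embeddings should lie close together.
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    def __init__(self, in_dim, joint_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, joint_dim),
        )

    def forward(self, x):
        # L2-normalize so Euclidean distances in the joint space are comparable
        return nn.functional.normalize(self.net(x), dim=-1)

H_VIS = EmbeddingHead(in_dim=512)   # visual feature g_VIS -> f
H_IMU = EmbeddingHead(in_dim=256)   # inertial feature g_IMU -> f
```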
Framework
[Diagram: pipeline from the IMU input through Feature Extraction and Feature Encoding into the Joint Feature Space]
Inertial Feature Extraction
IMU data: 3D linear acceleration and angular velocity in the smartphone's local frame
Pre-processing:
1. Uniform resampling to a fixed number of samples synchronized with the video frames
2. Low-pass filtering for smoothing
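A sketch of this pre-processing, assuming a 100 Hz IMU stream resampled to 30 fps video timestamps; the moving-average filter and its window size are illustrative assumptions, since the slide does not specify the exact low-pass filter.

```python
# Resample the IMU stream to the video frame timestamps, then smooth each channel.
import numpy as np

def preprocess_imu(imu_t, imu_vals, video_t, smooth_win=5):
    """imu_t: (T,) IMU timestamps; imu_vals: (T, 6) accel+gyro; video_t: (F,) frame timestamps."""
    # 1. Uniform resampling synchronized with the video frames (per channel)
    resampled = np.stack(
        [np.interp(video_t, imu_t, imu_vals[:, c]) for c in range(imu_vals.shape[1])],
        axis=1,
    )
    # 2. Low-pass filtering (moving average) for smoothing
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.stack(
        [np.convolve(resampled[:, c], kernel, mode="same") for c in range(resampled.shape[1])],
        axis=1,
    )
    return smoothed
```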
Framework
[Diagram: visual inputs (bounding box size trajectory, human pose keypoint trajectory, TSP optical flow) and the IMU sequence feed the Feature Extraction and Feature Encoding stages toward the Joint Feature Space]
Visual Feature Extraction
1. Person detection with YOLOv3 and tracking with DeepSORT
2. Decompose person tracklets into Temporal Super-Pixels (TSPs)
3. Compute the average optical flow for each TSP
[Figure: person tracklets and temporal super-pixels with person labels]
Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767, 2018.
Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple online and realtime tracking with a deep association metric." 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017.
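A sketch of step 3, assuming a dense optical flow field and a per-pixel TSP label map are already available from off-the-shelf flow and TSP methods; variable names are illustrative.

```python
# Average the dense optical flow inside each temporal super-pixel (TSP) region.
import numpy as np

def average_flow_per_tsp(flow, tsp_labels):
    """flow: (H, W, 2) dense optical flow; tsp_labels: (H, W) integer TSP ids."""
    features = {}
    for tsp_id in np.unique(tsp_labels):
        mask = tsp_labels == tsp_id
        features[tsp_id] = flow[mask].mean(axis=0)  # mean (dx, dy) over the region
    return features  # {tsp_id: 2-D average flow vector}
```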
Temporal Super-Pixels as a Visual Representation of Human Motion
[Figure: mask pixels segmented into temporal superpixels at coarse and fine granularity]
Chang, Jason, Donglai Wei, and John W. Fisher. "A video representation using temporal superpixels." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
Ambiguity in 2D and 3D Feature Matching
Optical flow only captures a 2D projection of the person's 3D motion.
Factors that cause ambiguity when matching a 3D inertial feature to optical flow (similar inertial measurements but different optical flow):
1. The person's distance to the camera
2. The person's orientation relative to the camera
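A toy numerical illustration of the first factor, assuming a simple pinhole camera model with a made-up focal length: the same 3D walking speed produces image flow inversely proportional to depth, so identical inertial motion can look very different in 2D.

```python
# Depth ambiguity: image flow of a laterally moving point is f * v / Z (pinhole model).
focal_px = 1000.0          # assumed focal length in pixels
speed_m_per_frame = 0.04   # ~1.2 m/s lateral walking speed at 30 fps

for depth_m in (2.0, 5.0, 10.0):
    flow_px = focal_px * speed_m_per_frame / depth_m
    print(f"depth {depth_m:>4.1f} m -> image flow {flow_px:.1f} px/frame")
# Identical inertial motion, very different optical flow magnitude.
```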
Additional Visual Cues to Address Ambiguity
1. Bounding box size trajectory (generated from YOLO person detections)
2. Human pose keypoint trajectory (generated with AlphaPose)
• Left and right shoulder keypoint positions relative to the center of the bounding box
Xiu, Yuliang, et al. "Pose Flow: Efficient online pose tracking." BMVC, 2018.
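A sketch of computing these two cues from per-frame detections and pose keypoints. The COCO keypoint indices (5 = left shoulder, 6 = right shoulder) and the array layouts are assumptions about how the detector and pose outputs are organized.

```python
# Build the bounding-box size trajectory and shoulder positions relative to the box center.
import numpy as np

def visual_cues(boxes, keypoints):
    """boxes: (T, 4) as (x1, y1, x2, y2) per frame; keypoints: (T, K, 2) 2D joints."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    box_size_traj = np.stack([w, h], axis=1)                  # cue 1: (T, 2)
    center = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                       (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    shoulders = keypoints[:, [5, 6], :]                       # left/right shoulder
    rel_shoulders = shoulders - center[:, None, :]            # cue 2: (T, 2, 2)
    return box_size_traj, rel_shoulders.reshape(len(boxes), -1)
```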
Framework
[Diagram: each input stream is encoded separately in the Feature Encoding stage, the IMU sequence by Conv1D + LSTM-IMU, the bounding box size trajectory by LSTM-Box, the human pose keypoint trajectory by LSTM-Pose, and the TSP optical flow by LSTM-OpticalFlow, then fused by an FC layer into the joint feature space (positive branch)]
Framework
[Diagram: full training setup, in which the positive and negative visual branches (LSTM-Box, LSTM-Pose, LSTM-OpticalFlow + FC) share weights, and their embeddings are compared against the IMU embedding (Conv1D + LSTM-IMU) with a triplet loss in the joint feature space]
Learning the Visual-Inertial Feature Space
• LSTM encoders for the inertial and visual features
• Triplet loss on the Euclidean distance between the visual (positive and negative person) and inertial (query) feature embeddings
Visual and inertial feature encoders:
• g+_VIS(ξ_i): visual feature extracted with TSP_i from the positive person
• g−_VIS(ξ_j): visual feature extracted with TSP_j from the negative person
L(g^n_IMU, g+_VIS(ξ_i), g−_VIS(ξ_j)) = max( ||H_VIS(g+_VIS(ξ_i)) − H_IMU(g^n_IMU)||_2 − ||H_VIS(g−_VIS(ξ_j)) − H_IMU(g^n_IMU)||_2 + κ, 0 )
Predicted IMU source person in the video: the candidate with the smallest average feature distance between all of that candidate's TSPs and the query IMU embedding.
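A minimal sketch of the triplet objective and the prediction rule above, written with PyTorch. The margin value, batch handling, and the assumption that the encoders (H_VIS, H_IMU) have already produced joint-space embeddings are illustrative, not the paper's exact training setup.

```python
# Triplet loss between the query IMU embedding and positive/negative visual embeddings,
# and prediction of the IMU source by the smallest average distance over a person's TSPs.
import torch

def triplet_loss(f_imu, f_vis_pos, f_vis_neg, margin=0.2):
    """All inputs are embeddings already mapped into the joint feature space."""
    d_pos = torch.norm(f_vis_pos - f_imu, p=2, dim=-1)
    d_neg = torch.norm(f_vis_neg - f_imu, p=2, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

def predict_source(f_imu, candidate_tsp_embeddings):
    """candidate_tsp_embeddings: {person_id: (num_TSPs, D) embedded TSP features}.
    Returns the person whose TSPs have the smallest average distance to the query IMU."""
    avg_dist = {
        pid: torch.norm(feats - f_imu, p=2, dim=-1).mean().item()
        for pid, feats in candidate_tsp_embeddings.items()
    }
    return min(avg_dist, key=avg_dist.get)
```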
Experimental Setup
• We collected our own dataset for implementation and evaluation.
• We evaluate our framework on test videos with different numbers of people (2 to 5) in the scene.
• We expect task complexity to increase with the number of people in the video, since there are more potential false positives.
Data Collection
Video
• Recorded with an uncalibrated webcam mounted about 1 meter above the ground
• 30 fps, 1080p
IMU
• Recorded with a hand-held iPhone
• 100 Hz
[Table: total video length per number of people in the scene]
Results Compared to Baseline Methods
Method categories compared:
• Non-learning based
• Transform one modality to the other
• Learning a joint feature space
[Figure/table: prediction accuracy of each method category]
Optimal Temporal Window Size
A longer time window includes more feature information, which helps discriminate similar motions, but it reduces the number of usable samples because occlusions become more likely.
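A small sketch of how the frame-synchronized sequences could be cut into fixed-length temporal windows, where the window length is the hyperparameter discussed above; non-overlapping windows are an assumption made for simplicity.

```python
# Cut a frame-synchronized feature sequence into fixed-length temporal windows.
import numpy as np

def make_windows(seq, window_len):
    """seq: (T, D) frame-synchronized features -> (num_windows, window_len, D)."""
    num_windows = len(seq) // window_len
    return np.asarray(seq[: num_windows * window_len]).reshape(num_windows, window_len, -1)
```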
Ablation Study: Inertial Feature Extraction
IMU feature variants compared:
• Raw IMU accelerometer and gyroscope readings
• Estimated velocity from integrating the acceleration (either concatenated with, or replacing, the raw readings)
• Low-pass filtering (smoothing)
[Table: prediction accuracy with different IMU features]
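A sketch of the "estimated velocity from acceleration integration" variant, using a simple cumulative trapezoidal integration; gravity removal and drift handling are omitted, and the 30 fps time step is an assumption tied to the frame-synchronized sampling.

```python
# Integrate frame-synchronized acceleration into an estimated velocity trajectory.
import numpy as np

def integrate_velocity(accel, dt=1.0 / 30.0):
    """accel: (T, 3) linear acceleration -> (T, 3) estimated velocity (starting at 0)."""
    vel = np.zeros_like(accel, dtype=float)
    vel[1:] = np.cumsum(0.5 * (accel[1:] + accel[:-1]) * dt, axis=0)
    return vel
```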
Ablation Study: Additional Visual Cues
[Table: prediction accuracy with different visual features]
Visualizing Results
[Figure: per-person feature distance shown on a color scale from small to large]
Green: IMU source; White: predicted IMU source
A failure case is also shown.
Conclusions
Summary
• A visual-inertial dataset with common pedestrian activities
• The proposed framework identifies the IMU source in the video with 80.7% accuracy across varying numbers of people in the scene, without strict constraints on IMU placement
Future work
• More data collection: more people, more variation in people's motion, and more background scenes
• Extension to video recorded from a dynamic camera, for deployment on autonomous vehicles or mobile robots
Thank you!