

  1. Visual-Inertial Odometry and Object Mapping with Structural Constraints. Mo Shan and Nikolay Atanasov, Department of Electrical and Computer Engineering.

  2. SLAM • Simultaneous Localization And Mapping (SLAM): building a model of the environment (the map) and estimating the state of the robot moving within it (C. Cadena et al., 2016). Figure: SLAM framework.

  3. Factor graph • SLAM as a factor graph. Figure: Factor graph. Blue circles: robot poses; green circles: landmark positions; red circle: intrinsic parameters (K). u: odometry constraints, v: camera observations, c: loop closures, p: prior factors.

  4. Motivation Object-level semantics are important for • improving the performance of feature tracking • reducing drift via loop closure • obtaining compressed maps of objects for subsequent tasks Figure: An object map.

  5. Objective Given a robot equipped with an IMU and an RGB camera, localize the robot using visual-inertial odometry (VIO) and map the objects, composed of semantic landmarks, in the scene using: • inertial observations: linear acceleration and angular velocity • geometric measurements from geometric landmarks • semantic measurements from keypoints on objects

  6. State of the Art • Traditional VIO and SLAM approaches such as ORB-SLAM (Mur-Artal et al., 2017) and DSO (J. Engel et al., 2016) rely on geometric features, e.g., ORB, SIFT, but overlook objects • Learning-based approaches that use convolutional neural networks (CNNs) only regress the camera pose but do not produce meaningful maps • Initial attempts at object-level SLAM often use iterative optimization as well as complicated object CAD models

  7. Contribution We exploit object semantics to • obtain uncertainty estimates for the semantic feature locations • achieve probabilistic tracking of composite semantic features, i.e., at the object level • exploit object structure constraints (e.g., the wheels of a car should be neither too close to nor too far from each other) to obtain an accurate estimate

  8. Objects • Objects in the environment: $\mathcal{O} \triangleq \{(o_i, c_i)\}_{i=1}^{N_o}$ • Each object of class $c_i \in \mathcal{C}$ is defined by $N_s(c_i)$ semantic keypoints. • There also exist pairwise category-specific constraints arising from the shape prior

  9. Problem formulation Given measurements $\{{}^{i}z_t, {}^{g}z_t, {}^{c}z_t, {}^{s}z_t, {}^{b}z_t\}_{t=1}^{T}$, determine the sensor trajectory $\mathcal{X}$ and the object states $\mathcal{O}$ that maximize the measurement likelihood:
$$\max_{\mathcal{O}, \mathcal{X}} \sum_{t=1}^{T} \log\bigl( p({}^{i}z_t \mid \mathcal{X})\, p({}^{g}z_t \mid \mathcal{X})\, p({}^{c}z_t, {}^{b}z_t, {}^{s}z_t \mid \mathcal{O}, \mathcal{X}) \bigr) \qquad (1)$$
The likelihood terms above can be defined as Gaussian density functions: variances are determined by the measurement noise, and means are determined by the dynamic equations of motion over the $SE(3)$ Lie group and the camera perspective model.
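To make the Gaussian-likelihood statement concrete, a single geometric term can be written as below; $h(\cdot)$ and $V$ are generic placeholders for the (perspective) measurement function and noise covariance, not notation taken from the slides:

```latex
p({}^{g}z_t \mid \mathcal{X})
  = \mathcal{N}\bigl({}^{g}z_t;\ h(\mathcal{X}),\ V\bigr)
  \propto \exp\Bigl(-\tfrac{1}{2}\,\bigl\| {}^{g}z_t - h(\mathcal{X}) \bigr\|_{V^{-1}}^{2}\Bigr)
```

so maximizing the log-likelihood in (1) amounts to minimizing a sum of squared Mahalanobis-weighted inertial and reprojection residuals.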

  10. Front-end • We use a stacked hourglass convolutional network to extract mid-level semantic features and their uncertainties, which are then used for the probabilistic tracking of composite semantic features

  11. Keypoint detection • StarMap produces a heatmap for all keypoints. • The corresponding features are given as 3D locations in the canonical object view (CanViewFeature) • They are augmented with an additional depth channel (DepthMap) to lift the 2D keypoints to 3D Figure: StarMap.

  12. MC dropout Figure: StarMap.

  13. MC dropout The Monte Carlo estimate is named MC dropout and is defined as in Eq. (2):
$$\hat{y}_{mc} = \frac{1}{B} \sum_{i=1}^{B} \hat{y}_i, \qquad \hat{\eta}_{mc} = \frac{1}{B} \sum_{i=1}^{B} (\hat{y}_i - \hat{y}_{mc})^2 \qquad (2)$$
MC dropout approximately integrates over the model's weights and can be interpreted as a Bayesian approximation of a Gaussian process (Y. Gal, 2016).
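A minimal NumPy sketch of the estimator in Eq. (2): `model(x, training=True)` stands in for a keypoint-network forward pass with dropout kept active at test time; the function name and interface are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mc_dropout_estimate(model, x, num_samples=50):
    """Run B stochastic forward passes with dropout active and return the
    Monte Carlo mean and variance of the prediction, as in Eq. (2)."""
    samples = np.stack([model(x, training=True) for _ in range(num_samples)])
    y_mc = samples.mean(axis=0)                      # predictive mean
    eta_mc = ((samples - y_mc) ** 2).mean(axis=0)    # predictive variance
    return y_mc, eta_mc
```

The variance term is what provides the per-keypoint uncertainty used later for probabilistic tracking of the composite semantic features.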

  14. Object-level tracking • Use a Kalman Filter to fuse detection and tracking: a Kanade-Lucas-Tomasi (KLT) feature tracker provides the prediction and keypoint detection provides the update. • The state for object $i$ at time $t$ is
$$a_t^i = \bigl( x_t^{b},\, y_t^{1},\, \ldots,\, y_t^{N_{kp}} \bigr) \qquad (3)$$
where $x_t^{b} \triangleq (b_{x_1,t}, \dot{b}_{x_1,t}, b_{y_1,t}, \dot{b}_{y_1,t}, b_{x_2,t}, \dot{b}_{x_2,t}, b_{y_2,t}, \dot{b}_{y_2,t})$ contains the coordinates of the object bounding box and their velocities, and $y_t^{j} \triangleq (k_{x,t}, \dot{k}_{x,t}, k_{y,t}, \dot{k}_{y,t})$, $j \in 1 \ldots N_{kp}$, represents the coordinates and velocities of the semantic keypoints. • The tracker jointly tracks the bounding box and all $N_{kp}$ semantic keypoints on each car.
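A sketch of how such a per-object state and its constant-velocity transition matrix might be assembled; the two-corner bounding-box layout and the helper names are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def make_object_state(bbox_corners, keypoints):
    """Stack bounding-box corners and semantic keypoints into the layout of
    Eq. (3): (x, x_dot, y, y_dot) per tracked point, zero initial velocity."""
    pts = np.vstack([np.asarray(bbox_corners).reshape(2, 2),  # (x1,y1),(x2,y2)
                     np.asarray(keypoints)])                  # (N_kp, 2)
    state = np.zeros(4 * len(pts))
    state[0::4] = pts[:, 0]   # x coordinates
    state[2::4] = pts[:, 1]   # y coordinates
    return state

def constant_velocity_transition(num_points, dt=1.0):
    """Block-diagonal F: each (coordinate, velocity) pair evolves as
    coord_{t+1} = coord_t + dt * vel_t, vel_{t+1} = vel_t."""
    block = np.array([[1.0, dt], [0.0, 1.0]])
    return np.kron(np.eye(2 * num_points), block)
```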

  15. Notation • We denote the global frame by $\{G\}$, the IMU frame by $\{I\}$, and the camera frame by $\{C\}$. • The transformation from $\{I\}$ to $\{C\}$ is specified by a translation ${}^{C}_{I}p \in \mathbb{R}^3$ and a unit quaternion ${}^{C}_{I}\bar{q}$ using the left-handed JPL convention • Alternatively, via a transformation matrix:
$${}^{C}_{I}T \triangleq \begin{bmatrix} {}^{C}_{I}R & {}^{C}_{I}p \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3) \qquad (4)$$
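A small sketch of assembling and applying the homogeneous transform in Eq. (4), assuming the rotation is already available as a 3x3 matrix:

```python
import numpy as np

def se3_matrix(R, p):
    """Build the 4x4 homogeneous transform of Eq. (4) from a rotation
    matrix R (e.g. C_I_R) and a translation vector p (e.g. C_I_p)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(p).ravel()
    return T

def transform_point(T, x):
    """Map a 3D point x through the SE(3) transform T."""
    return T[:3, :3] @ np.asarray(x) + T[:3, 3]
```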

  16. Back-end • EKF prediction:
$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^{\top} + Q_k$$
• EKF update:
$$\tilde{y}_k = z_k - h(\hat{x}_{k|k-1}), \quad S_k = H_k P_{k|k-1} H_k^{\top} + R_k, \quad K_k = P_{k|k-1} H_k^{\top} S_k^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k, \qquad P_{k|k} = (I - K_k H_k) P_{k|k-1}$$
• where
$$F_k = \left.\frac{\partial f}{\partial x}\right|_{\hat{x}_{k-1|k-1},\, u_k}, \qquad H_k = \left.\frac{\partial h}{\partial x}\right|_{\hat{x}_{k|k-1}}$$
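A compact NumPy sketch of these generic EKF equations; the Jacobians F and H are passed in already evaluated at the linearization points defined on this slide:

```python
import numpy as np

def ekf_predict(x, P, f, F, Q, u):
    """EKF prediction: propagate the mean through f and the covariance
    through the Jacobian F."""
    return f(x, u), F @ P @ F.T + Q

def ekf_update(x, P, z, h, H, R):
    """EKF update: correct the predicted state with measurement z."""
    y = z - h(x)                      # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```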

  17. VIO background • The state of the IMU is defined as
$${}^{I}x \triangleq \bigl( {}^{I}\bar{q},\; b_g,\; {}^{I}v,\; b_a,\; {}^{I}p \bigr) \in \mathbb{R}^{16} \qquad (5)$$
• Our objective: estimate the true state ${}^{I}x$ with an estimate ${}^{I}\hat{x}$:
$${}^{I}\hat{x} \triangleq \bigl( {}^{I}\hat{\bar{q}},\; \hat{b}_g,\; {}^{I}\hat{v},\; \hat{b}_a,\; {}^{I}\hat{p} \bigr) \in \mathbb{R}^{16} \qquad (6)$$
• The IMU error state is:
$${}^{I}\tilde{x} \triangleq \bigl( {}^{I}\tilde{\theta},\; \tilde{b}_g,\; {}^{I}\tilde{v},\; \tilde{b}_a,\; {}^{I}\tilde{p} \bigr) \in \mathbb{R}^{15} \qquad (7)$$
• ${}^{I}\tilde{\theta}$ is the angle-axis representation of the quaternion error ${}^{I}\tilde{\bar{q}}$, and ${}^{I}\tilde{\bar{q}} \simeq \bigl[\tfrac{1}{2}\tilde{\theta}^{\top}\;\; 1\bigr]^{\top}$
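A small sketch of the error-state bookkeeping implied by Eqs. (5)-(7): splitting the 15-dim error vector into its blocks and forming the small-angle quaternion with vector part first and scalar last, consistent with the JPL convention mentioned on slide 15; the function names are illustrative only:

```python
import numpy as np

def split_error_state(dx):
    """Partition the 15-dim IMU error state of Eq. (7) into
    (dtheta, dbg, dv, dba, dp)."""
    return np.split(np.asarray(dx), [3, 6, 9, 12])

def small_angle_quaternion(dtheta):
    """Quaternion for a small orientation error: ~[0.5*dtheta, 1],
    renormalized to unit length."""
    dq = np.concatenate([0.5 * np.asarray(dtheta), [1.0]])
    return dq / np.linalg.norm(dq)
```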

  18. State augmentation • Keep a history of camera poses of length $W + 1$. The camera state and error state are:
$${}^{C}x \triangleq \bigl( {}^{C}\bar{q},\; {}^{C}p \bigr), \qquad {}^{C}\tilde{x} \triangleq \bigl( {}^{C}\tilde{\theta},\; {}^{C}\tilde{p} \bigr) \in \mathbb{R}^{6(W+1)} \qquad (8)$$
• The complete state and error state at time $t$ are:
$$x_t \triangleq \bigl( {}^{I}x_t,\; {}^{C}x_{t-W:t} \bigr), \qquad \tilde{x}_t \triangleq \bigl( {}^{I}\tilde{x}_t,\; {}^{C}\tilde{x}_{t-W:t} \bigr) \qquad (9)$$

  19. Prediction • We can discretize the state estimate dynamics to obtain the prediction step for the IMU state mean • The linearized continuous-time IMU error-state dynamics satisfy:
$${}^{I}\dot{\tilde{x}} = F(t)\, {}^{I}\tilde{x} + G(t)\, n_I \qquad (10)$$
• The propagated covariance of the IMU state is
$$P^{II}_{t+1|t} = \Phi_t P^{II}_{t|t} \Phi_t^{\top} + Q_t \qquad (11)$$
• where $Q = E\bigl[ n_I n_I^{\top} \bigr]$ is the continuous-time noise covariance and
$$\Phi_t = \Phi(t, t+1) = \exp\Bigl( \int_{t}^{t+1} F(\tau)\, d\tau \Bigr), \qquad Q_t = \int_{t}^{t+1} \Phi(t+1, \tau)\, G\, Q\, G^{\top}\, \Phi(t+1, \tau)^{\top}\, d\tau$$

  20. Prediction • The covariance matrix after augmentation with a new camera state is
$$P_{t+1|t} = \begin{bmatrix} I_{15+6(W+1)} \\ J_t \end{bmatrix} P_{t+1|t} \begin{bmatrix} I_{15+6(W+1)} \\ J_t \end{bmatrix}^{\top} \qquad (12)$$
• This yields the Gaussian pdf $p({}^{i}z_t \mid \mathcal{X})$ in (1)
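A minimal sketch of the augmentation in Eq. (12); J is assumed to be the Jacobian of the newly added camera pose with respect to the current state:

```python
import numpy as np

def augment_covariance(P, J):
    """Covariance augmentation of Eq. (12): stacking the identity over the
    Jacobian J makes the new camera pose enter the filter fully correlated
    with the existing state."""
    A = np.vstack([np.eye(P.shape[0]), J])   # [I_{15+6(W+1)}; J_t]
    return A @ P @ A.T
```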

  21. EKF vs MSCKF • EKF: Many features constrain one state. • MSCKF: One feature constrains many states. Figure: Comparison of EKF, MSCKF.

  22. Update • The measurement model relating a landmark $\ell \in \mathcal{L}$ to its observation $z_t$ in camera frame $\{C_t\}$ is:
$$z_t = \pi\bigl( {}^{C_t}R^{\top} (\ell - {}^{C_t}p) \bigr) + n_t \qquad (13)$$
• The estimate ${}^{g}\hat{\ell}_j$ is used to define a residual $r_j$ via first-order Taylor series linearization of ${}^{g}z^{j}_{t-W:t}$ based on (13):
$$r_j = {}^{g}z^{j}_{t-W:t} - {}^{g}\hat{z}^{j}_{t-W:t} \approx H^{j}_{x}\,\tilde{x} + H^{j}_{\ell}\,\tilde{\ell}_j + n_j \qquad (14)$$
• MSCKF update, giving $p({}^{g}z_t \mid \mathcal{X})$ in (1), where the columns of $A$ span the left nullspace of $H^{j}_{\ell}$ so that the landmark error is eliminated:
$$r^{j}_{o} = A^{\top} r_j \approx A^{\top} H^{j}_{x}\, \tilde{x} + A^{\top} n_j = H^{j}_{o}\,\tilde{x} + n^{j}_{o} \qquad (15)$$
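A sketch of the nullspace projection in Eq. (15) using an SVD to obtain a basis A for the left nullspace of the landmark Jacobian; this is one standard way to compute it, not necessarily the authors' exact implementation:

```python
import numpy as np

def msckf_nullspace_projection(r, Hx, Hl, tol=1e-10):
    """Project residual and pose Jacobian onto the left nullspace of the
    landmark Jacobian Hl, removing the dependence on the landmark error."""
    U, s, _ = np.linalg.svd(Hl, full_matrices=True)
    rank = int(np.sum(s > tol))
    A = U[:, rank:]                 # basis of the left nullspace of Hl
    return A.T @ r, A.T @ Hx        # r_o and H_o of Eq. (15)
```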

  23. Constrained filtering • MSCKF with persistent object states:
$$x_t = \bigl( {}^{I}\tilde{x}_t,\; {}^{C}\tilde{x}_{t-W:t},\; {}^{C_1}\ell^{\vee}_1,\; \ldots,\; {}^{C_k}\ell^{\vee}_k \bigr) \qquad (16)$$
• The original measurement model in EKF SLAM, as in Eq. (13), is $z = H x_t + n$, where $x_t$ is the state vector defined in Eq. (16). The measurement model can be augmented to
$$\begin{bmatrix} z \\ d \end{bmatrix} = \begin{bmatrix} H \\ D \end{bmatrix} x_t + \begin{bmatrix} n \\ n_c \end{bmatrix} \qquad (17)$$
where the constraint is enforced as $D x_t + n_c = d$, and $n_c$ is noise with covariance $\Sigma_c$.
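A sketch of treating the structural constraint as a soft pseudo-measurement stacked under the real measurement, as in Eq. (17), followed by a standard linear Kalman update; this is an illustrative implementation of the stacked model, not the authors' code:

```python
import numpy as np

def constrained_update(x, P, z, H, R, d, D, Sigma_c):
    """Stack the constraint D x + n_c = d under the measurement z = H x + n
    (Eq. (17)) and apply one Kalman update to the augmented system."""
    z_aug = np.concatenate([z, d])
    H_aug = np.vstack([H, D])
    R_aug = np.block([
        [R, np.zeros((R.shape[0], Sigma_c.shape[1]))],
        [np.zeros((Sigma_c.shape[0], R.shape[1])), Sigma_c],
    ])
    S = H_aug @ P @ H_aug.T + R_aug
    K = P @ H_aug.T @ np.linalg.inv(S)
    x_new = x + K @ (z_aug - H_aug @ x)
    P_new = (np.eye(len(x)) - K @ H_aug) @ P
    return x_new, P_new
```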

  24. Constrained filtering • Landmark annotations: $\ell_p \sim \mathcal{N}(\mu_p, \Sigma_p)$, $\ell_q \sim \mathcal{N}(\mu_q, \Sigma_q)$ • The Euclidean distance is $d = \|\ell_p - \ell_q\|_2$, where $\Delta\ell = \ell_p - \ell_q \sim \mathcal{N}(\mu_p - \mu_q,\ \Sigma_p + \Sigma_q)$. • The covariance of $d$ is $A(\Sigma_p + \Sigma_q)A^{\top}$, where $A$ is the Jacobian of the $L_2$ norm. Figure: Pairwise constraints.
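A short sketch of this first-order uncertainty propagation: at the mean difference, the Jacobian of the $L_2$ norm is the unit row vector $A = \Delta\ell^{\top}/\|\Delta\ell\|$; the function below is illustrative only:

```python
import numpy as np

def pairwise_distance_stats(mu_p, Sigma_p, mu_q, Sigma_q):
    """Propagate landmark uncertainty onto the pairwise distance
    d = ||l_p - l_q|| via the Jacobian of the L2 norm (slide 24)."""
    delta = np.asarray(mu_p) - np.asarray(mu_q)
    d = np.linalg.norm(delta)
    A = (delta / d).reshape(1, -1)                  # Jacobian of the norm
    var_d = (A @ (Sigma_p + Sigma_q) @ A.T).item()  # variance of d
    return d, var_d
```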

  25. Constrained filtering • Constrained filtering can fuse all available sources of information (S. Tully et al., 2012) Figure: Posterior with equality and inequality constraints.

  26. Quantitative Comparison Enforcing constraints can keep the points close to the ground truth even with large measurement noise. Figure: Left: 640 × 480 image, bird's-eye view. Right: RMSE comparison between Hybrid VIO and OrcVIO in Gazebo simulation.

  27. Qualitative evaluation • Gazebo simulation using real-world IMU data • Reconstruction for 22 cars • Drift in Z is large due to insufficient movement

  28. Qualitative evaluation • Semantic keypoint detection using StarMap. Upper row: successes. Lower row: failures.

  29. Qualitative evaluation • Semantic feature detection on real-world dataset

  30. Qualitative evaluation Reconstruction snapshot on real-world dataset Figure: Visualization of reconstruction.
