  1. Make3D: Learning 3D Scene Structure from a Single Still Image. Ashutosh Saxena, Min Sun, and Andrew Ng. Presented by Ian Endres, CS598, February 5, 2009.

  2. Overview
     Goal: Infer 3D models from monocular cues
     ◮ Segment into planar patches
     ◮ Build model from depth maps
     ◮ Estimate orientation/location of patches
     ◮ Construct 3D model

  3. Properties to Model
     ◮ Single image: depth from features, connectedness, coplanarity
     ◮ Multiple images: depths from triangulation
     ◮ Objects: object A is above object B; object orientation

  4. Superpixel Model
     ◮ Treat each superpixel as a plane
     ◮ Model the scene as a 3D mesh of polygons
     ◮ Use Felzenszwalb and Huttenlocher's segmenter
     ◮ Goal: determine the location and orientation of each superpixel

  5. Superpixel Parameters
     ◮ α ∈ R^3 parameterizes a plane:
       ◮ α/|α| is the unit normal of the plane
       ◮ 1/|α| is the distance from the camera center to the plane
       ◮ Thus q^T α = 1 for any point q ∈ R^3 on the plane
     ◮ R_i ∈ R^3: unit-length ray pointing from the camera center to pixel i
       on the image plane (using a "reasonable guess" of the camera's
       intrinsic parameters)
     ◮ d_i = 1/(R_i^T α) is the distance of point i (having ray R_i) from
       the camera center if it lies on the plane described by α
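This parameterization can be sketched in a few lines of numpy; the plane, point, and ray values below are invented for illustration:

```python
import numpy as np

# Hypothetical plane with unit normal n at distance D from the camera center.
# The slide's parameterization packs both into alpha = n / D, so that
# |alpha| = 1/D and alpha/|alpha| recovers the unit normal n.
n = np.array([0.0, 0.0, 1.0])   # assumed unit normal
D = 2.0                          # assumed distance from camera center
alpha = n / D

# Any point q on the plane satisfies q^T alpha = 1 (here, the plane z = 2).
q = np.array([0.3, -0.7, 2.0])
assert np.isclose(q @ alpha, 1.0)

# For a unit-length viewing ray R_i, the depth along the ray is
# d_i = 1/(R_i^T alpha); the recovered 3D point d_i * R_i lies on the plane.
R_i = np.array([0.0, 0.6, 0.8])
d_i = 1.0 / (R_i @ alpha)
point = d_i * R_i
assert np.isclose(point @ alpha, 1.0)
```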

  6. Features
     ◮ Monocular features: x_i ∈ R^524
       ◮ Filter responses + shape, computed for each superpixel
       ◮ Additional contextual information from neighbors, at 3 scales;
         uses features from the largest superpixel neighbor in each bin
         (i.e. S1C)
     ◮ Boundary features: ε_ij ∈ {0, 1}^14
       ◮ Segmentations based on 7 different properties (including color,
         texture, and edges) at 2 scales
       ◮ For each segmentation k, if superpixels i, j fall on the same
         segment, ε_ij(k) = 1, otherwise 0
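The boundary feature follows directly from its definition. In this toy sketch the segment labels are invented, and only 4 of the 14 segmentations are shown:

```python
import numpy as np

def boundary_features(labels_i, labels_j):
    """eps_ij(k) = 1 iff superpixels i and j fall on the same segment of
    over-segmentation k. labels_* give a superpixel's segment label in each
    segmentation (14 in the paper, 4 in this toy example)."""
    return (np.asarray(labels_i) == np.asarray(labels_j)).astype(int)

# Made-up segment labels for two neighboring superpixels: they agree in
# segmentations 0, 2, and 3 but are split apart in segmentation 1.
labels_i = [3, 1, 7, 2]
labels_j = [3, 4, 7, 2]
eps_ij = boundary_features(labels_i, labels_j)
```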

  7. Models
     ◮ P(y_ij | ε_ij; ψ): models the confidence that superpixels i, j belong
       to the same planar surface (0 for boundary/fold, 1 for planar)
     ◮ P(α | X, v, y, R; θ): models the depth and orientation parameters of
       the superpixels, composed of:
       ◮ f_1(α_i | X_i, v_i, R_i; θ): plane parameters as a function of a
         single superpixel i's features
       ◮ f_2(α_i, α_j | y_ij, R_i, R_j): plane parameters as a function of
         edge features between superpixels i, j
     ◮ P(v_i | x_i; φ_r): models each pixel's ability to predict the
       parameters of its associated superpixel

  8. Occlusion Boundary and Fold Model
     ◮ A simple edge detector is not sufficient for detecting 3D
       discontinuities (consider a shadow)
     ◮ y_ij ∈ {0, 1}, where 0 indicates a boundary/fold and 1 indicates the
       same planar surface
     ◮ y_ij hand-labeled in 50 images
     ◮ P(y_ij | ε_ij; ψ) = 1 / (1 + exp(−ψ^T ε_ij)), learned using logistic
       regression
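A sketch of this logistic model; the weights ψ below are invented (in the paper, ψ is fit by logistic regression on the 50 hand-labeled images, and ε_ij is 14-dimensional):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(y_ij = 1 | eps_ij; psi): probability that superpixels i and j lie on
# the same planar surface. psi here is made up for illustration.
psi = np.array([0.8, 0.5, 1.2, -0.3])
eps_ij = np.array([1, 1, 0, 1])          # toy boundary features
p_same_plane = sigmoid(psi @ eps_ij)     # sigmoid(psi^T eps_ij)
```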

  9. Unary Depth Model (f_1)
     ◮ Predict depth d̂ = x^T θ_r as a function of features x
     ◮ Penalize using the relative error (d̂ − d)/d; noting that
       d_{i,s_i} = 1/(R_{i,s_i}^T α_i), this gives:
       f_1(α_i | X_i, v_i, R_i; θ) = exp( − Σ_{s_i} v_{i,s_i} |R_{i,s_i}^T α_i (x_{i,s_i}^T θ_r) − 1| )
     ◮ The r in θ_r indicates one of 11 rows in the image
     ◮ Parameters are learned from the pseudo log-likelihood of P(α | ...).
       Since f_2(·) does not depend on θ_r, this gives:
       θ_r* = argmin_{θ_r} Σ_i Σ_{s_i} v_{i,s_i} |d_{i,s_i}^{−1} (x_{i,s_i}^T θ_r) − 1|
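A toy evaluation of this term for one superpixel. Since R_s^T α = 1/d_s and x_s^T θ_r = d̂_s, each pixel contributes the weighted relative depth error v_s |d̂_s/d_s − 1|; all values below are randomly generated stand-ins:

```python
import numpy as np

# Made-up per-pixel quantities for one superpixel.
rng = np.random.default_rng(0)
n_pix = 5
d_true = rng.uniform(2.0, 10.0, n_pix)                     # 1/(R_s^T alpha)
d_hat = d_true * (1.0 + 0.1 * rng.standard_normal(n_pix))  # x_s^T theta_r
nu = rng.uniform(0.5, 1.0, n_pix)                          # confidence v_s

# f_1 = exp(-sum_s v_s |d_hat_s / d_s - 1|): perfect predictions give
# f_1 = 1, and the value decays as the relative errors grow.
neg_log_f1 = np.sum(nu * np.abs(d_hat / d_true - 1.0))
f1 = np.exp(-neg_log_f1)
```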

  10. Depth Prediction Confidence (v)
     ◮ Given a model d̂_i = x_i^T θ_r for predicting depth, build a model to
       predict its expected error
     ◮ Thus, learn to estimate |d_i − x_i^T θ_r| as 1 / (1 + exp(−φ_r^T x_i))
     ◮ This (ideally) can predict how well a feature predicts the depth of a
       pixel
     ◮ Presumably, v_i = 1 − 1 / (1 + exp(−φ_r^T x_i)), indicating confidence
       in the prediction ability
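A sketch of this confidence model with invented weights φ_r: the logistic output estimates the depth error, so the confidence is one minus that estimate, and pixels with high predicted error get low weight in f_1:

```python
import numpy as np

def confidence(phi_r, x_i):
    """v_i = 1 - sigmoid(phi_r^T x_i), where sigmoid(phi_r^T x_i) is a
    (hypothetical) learned estimate of the depth error |d_i - x_i^T theta_r|."""
    predicted_error = 1.0 / (1.0 + np.exp(-(phi_r @ x_i)))
    return 1.0 - predicted_error

phi_r = np.array([2.0, -1.0, 0.5])          # made-up weights for row r
x_reliable = np.array([-1.0, 1.0, 0.0])     # features suggesting low error
x_unreliable = np.array([1.5, -0.5, 1.0])   # features suggesting high error
v_hi = confidence(phi_r, x_reliable)
v_lo = confidence(phi_r, x_unreliable)
```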

  11. Superpixel Interaction Models (f_2)
     ◮ f_2(α_i, α_j | y_ij, R_i, R_j) = Π_{(s_i, s_j) ∈ N} h_{s_i,s_j}(α_i, α_j | y_ij, R_i, R_j)
     ◮ s_i, s_j are pixels from superpixels i, j respectively, chosen
       according to the figure depending on the property to be modeled
       (i.e. connectivity, planarity, linearity)
     ◮ h(·) also depends on the property

  12. Connectivity and Co-planarity
     Neighboring superpixels tend to be connected if there is no occlusion:
     ◮ Uses pairs of neighboring pixels (s_i, s_j) chosen along the
       boundaries of superpixels i, j
     ◮ h_{s_i,s_j} = exp( − y_ij |d_{i,s_i} − d_{j,s_j}| / √(d_{i,s_i} d_{j,s_j}) )
     Neighboring superpixels tend to belong to the same plane if there is no
     fold:
     ◮ A pair (s_i'', s_j'') is chosen from the centers of superpixels i, j
       respectively
     ◮ h_{s_j''} = exp( − y_ij |(R_{j,s_j''}^T α_i − R_{j,s_j''}^T α_j) d̂_{s_j''}| )
     ◮ This penalizes the distance between s_j'' and its projection onto
       plane i
     ◮ h_{s_i'',s_j''} = h_{s_i''}(·) h_{s_j''}(·)
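Both h terms can be sketched numerically; the plane, ray, and depth values below are invented. Note how y_ij = 0 (an occlusion or fold) switches the penalty off entirely:

```python
import numpy as np

def h_connect(d_i, d_j, y_ij):
    """Connectivity: penalize the fractional gap between the depths of a
    boundary pixel pair, downweighted by y_ij across occlusions."""
    return np.exp(-y_ij * np.abs(d_i - d_j) / np.sqrt(d_i * d_j))

def h_coplanar(R_j, alpha_i, alpha_j, d_hat, y_ij):
    """Co-planarity: penalize how far the center pixel along ray R_j is from
    its projection onto plane i, downweighted by y_ij across folds."""
    return np.exp(-y_ij * np.abs((R_j @ alpha_i - R_j @ alpha_j) * d_hat))

R = np.array([0.0, 0.0, 1.0])        # toy viewing ray
alpha = np.array([0.0, 0.0, 0.5])    # toy plane z = 2
same_plane = h_coplanar(R, alpha, alpha, 2.0, 1.0)   # identical planes
```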

  13. Co-linearity
     Superpixels lying on a straight line are likely to lie on the same
     plane:
     ◮ Same penalty as the co-planar term, except superpixels i, j need not
       be adjacent
     ◮ Also, y_ij is computed from lines in the image instead of the
       occlusion/fold model

  14. Inference
     ◮ α* = argmax_α log P(α | X, v, y, R; θ)
          = argmax_α log (1/Z) Π_i f_1(α_i | X_i, v_i, R_i; θ) Π_{i,j} f_2(α_i, α_j | y_ij, R_i, R_j)
     ◮ Each term results in an L1 norm of a linear function of α
     ◮ Solved via a Newton method with a smooth approximation of the L1 norm
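This structure can be sketched on a toy problem: A and b below are made-up stand-ins for the stacked f_1/f_2 penalty terms, |t| is smoothed to √(t² + ε), and instead of the paper's Newton method the toy uses the closely related iteratively reweighted least squares update:

```python
import numpy as np

# Each term is an L1 penalty |a_k^T alpha - b_k|, linear in alpha; A and b
# are invented stand-ins for the real stacked penalty terms.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
alpha_true = np.array([0.1, -0.2, 0.5])
b = A @ alpha_true                  # consistent system: optimum is alpha_true

eps = 1e-6
def objective(alpha):
    """Smoothed L1 objective: sum_k sqrt((a_k^T alpha - b_k)^2 + eps)."""
    r = A @ alpha - b
    return np.sum(np.sqrt(r**2 + eps))

# IRLS: weight each term by 1/sqrt(r_k^2 + eps), then solve the weighted
# normal equations; repeating drives the smoothed L1 objective down.
alpha = np.zeros(3)
for _ in range(25):
    r = A @ alpha - b
    w = 1.0 / np.sqrt(r**2 + eps)
    Aw = A * w[:, None]
    alpha = np.linalg.solve(A.T @ Aw, Aw.T @ b)
```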

  15. Experiments
     ◮ Depth maps from a laser scanner, plus corresponding images
       (400 training, 134 test)
     ◮ Images of urban and natural scenes, taken during the daytime
     ◮ 588 additional test images from the internet (no depth maps)
     ◮ Evaluation: predict depths, then render the 3D model
       ◮ % qualitatively correct
       ◮ % major planes correctly identified
       ◮ Average log10 depth error: |log10 d − log10 d̂|
       ◮ Relative depth error: |d − d̂| / d
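The two quantitative depth metrics are straightforward to write down; the depth arrays below are toy values:

```python
import numpy as np

def log10_error(d, d_hat):
    """Average |log10 d - log10 d_hat| over all pixels."""
    return np.mean(np.abs(np.log10(d) - np.log10(d_hat)))

def relative_error(d, d_hat):
    """Average |d - d_hat| / d over all pixels."""
    return np.mean(np.abs(d - d_hat) / d)

d = np.array([2.0, 5.0, 10.0])       # toy ground-truth depths
d_hat = np.array([2.2, 4.5, 11.0])   # toy predicted depths
err_log = log10_error(d, d_hat)
err_rel = relative_error(d, d_hat)
```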

  16. Performance

  17. Results 1

  18. Other Tasks
     ◮ 3D models from multiple images: adds an extra term (f_3) which
       penalizes depth discrepancies when 3D correspondences exist between
       images
     ◮ Incorporating object information:
       ◮ Object A is on top of object B
       ◮ Object A is connected to object B (such as a person's feet on the
         ground)
       ◮ Object A has a known orientation (such as people standing upright)
