Deep Single-View 3D Object Reconstruction with Visual Hull Embedding
Hanqing Wang¹,², Jiaolong Yang², Wei Liang¹, Xin Tong²
¹Beijing Institute of Technology, Beijing, China   ²Microsoft Research Asia, Beijing, China
AAAI 2019
Single-View 3D Reconstruction
• Input: a single RGB(D) image
• Output: the corresponding 3D representation
Single-View 3D Reconstruction
• Deep learning-based methods: [Girdhar ECCV’16], [Choy ECCV’16]
• Other works: Yan NIPS’16; Wu NIPS’16; Tulsiani CVPR’17; Zhu ICCV’17…
Single-View 3D Reconstruction
• Problems of existing deep learning-based methods:
  1. Arbitrary-view images vs. canonical-view-aligned 3D shapes (generation or reconstruction?)
  2. Unsatisfactory results: missing shape details; plausible shapes that are nonetheless inconsistent with the input images
Core Idea
• Goal: reconstruct the object precisely, consistent with the given image
• Idea: explicitly embed the 3D-2D projection geometry into the network
• Approach: estimate a single-view visual hull inside the network
(Figure: multi-view visual hull vs. single-view visual hull)
Our Approach • Perspective camera model • Volumetric shape representation • Method overview
Components
(Figure: pipeline of 2D encoders, a pose regressor, 2D/3D decoders and a 3D encoder, with stages labeled (a)-(e))
• (a) V-Net: coarse shape prediction
• (b) P-Net: object pose and camera parameter estimation
• (c) S-Net: silhouette prediction
• (d) PSVH layer: visual hull generation
• (e) R-Net: coarse shape refinement
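As a rough illustration of how these five components fit together in one forward pass, here is a minimal Python sketch. The function and argument names are hypothetical, not the authors' code; each sub-network is passed in as a callable.

```python
# Hypothetical wiring of the five components (not the authors' released code).
def forward(image, v_net, p_net, s_net, psvh_layer, r_net):
    coarse = v_net(image)                # (a) V-Net: coarse voxel occupancy probabilities
    pose = p_net(image)                  # (b) P-Net: 6-D pose [theta1, theta2, theta3, tu, tv, tz]
    silhouette = s_net(image)            # (c) S-Net: per-pixel object probability map
    hull = psvh_layer(silhouette, pose)  # (d) PSVH layer: probabilistic single-view visual hull
    refined = r_net(coarse, hull)        # (e) R-Net: coarse shape refined with the visual hull
    return refined, coarse, pose, silhouette
```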
Projection Details
The relationship between a 3D point $(X, Y, Z)$ and its projected pixel location $(u, v)$ on the image is
$$Z \, [u, v, 1]^T = K \left( R \, [X, Y, Z]^T + t \right), \qquad (1)$$
where $K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}$ is the camera intrinsic matrix, $R \in SO(3)$ is the rotation matrix generated by three Euler angles $[\theta_1, \theta_2, \theta_3]$, and $t = [t_X, t_Y, t_Z]^T \in \mathbb{R}^3$ is the translation vector.
For translation we estimate $t_Z$ and a 2D vector $[t_u, t_v]$ which centralizes the object on the image plane, and obtain $t$ via $t = [\tfrac{t_u}{f} t_Z, \; \tfrac{t_v}{f} t_Z, \; t_Z]^T$. In summary, we parameterize the pose as a 6-D vector $p = [\theta_1, \theta_2, \theta_3, t_u, t_v, t_Z]^T$.
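Below is a minimal NumPy sketch of the projection in Eq. (1) and of how a probabilistic single-view visual hull could be sampled from a predicted silhouette and pose. The Euler-angle convention, the nearest-neighbour sampling, and all names are assumptions for illustration; the paper's PSVH layer is a network layer rather than this standalone routine.

```python
# Illustrative NumPy sketch of Eq. (1) and a probabilistic single-view visual hull.
# Names and conventions are assumptions, not the authors' implementation.
import numpy as np

def intrinsics(f, u0, v0):
    """Camera intrinsic matrix K with focal length f and principal point (u0, v0)."""
    return np.array([[f, 0, u0],
                     [0, f, v0],
                     [0, 0, 1]], dtype=np.float64)

def rotation_from_euler(theta1, theta2, theta3):
    """Rotation R in SO(3) from three Euler angles (x-y-z convention assumed here)."""
    c1, s1 = np.cos(theta1), np.sin(theta1)
    c2, s2 = np.cos(theta2), np.sin(theta2)
    c3, s3 = np.cos(theta3), np.sin(theta3)
    Rx = np.array([[1, 0, 0], [0, c1, -s1], [0, s1, c1]])
    Ry = np.array([[c2, 0, s2], [0, 1, 0], [-s2, 0, c2]])
    Rz = np.array([[c3, -s3, 0], [s3, c3, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def translation_from_pose(tu, tv, tz, f):
    """Recover t = [tu/f * tz, tv/f * tz, tz]^T from the 6-D pose parameterization."""
    return np.array([tu / f * tz, tv / f * tz, tz])

def project(points, K, R, t):
    """Eq. (1): Z [u, v, 1]^T = K (R [X, Y, Z]^T + t); points is Nx3, returns Nx2 pixels."""
    cam = points @ R.T + t            # rotate and translate into the camera frame
    pix = cam @ K.T                   # apply intrinsics; third column is the depth Z
    return pix[:, :2] / pix[:, 2:3]   # perspective divide

def single_view_visual_hull(voxel_centers, silhouette, K, R, t):
    """Each voxel takes the silhouette probability at its projected pixel
    (nearest-neighbour sampling; voxels projecting outside the image get 0)."""
    uv = np.round(project(voxel_centers, K, R, t)).astype(int)
    h, w = silhouette.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    probs = np.zeros(len(voxel_centers))
    probs[inside] = silhouette[uv[inside, 1], uv[inside, 0]]
    return probs
```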
Network Architecture • Overview:
Training Loss
We use the binary cross-entropy loss to train V-Net, S-Net, and R-Net. Let $p_n$ be the estimated probability at location $n$; the loss is defined as
$$l = -\frac{1}{N} \sum_n \left( p_n^* \log p_n + (1 - p_n^*) \log(1 - p_n) \right), \qquad (2)$$
where $p_n^*$ is the target probability.
For P-Net, we use the $L_1$ regression loss to train the network:
$$l = \sum_{i=1,2,3} \alpha \, |\theta_i - \theta_i^*| + \sum_{j=u,v} \beta \, |t_j - t_j^*| + \gamma \, |t_Z - t_Z^*|, \qquad (3)$$
where we set $\alpha = 1$, $\gamma = 1$, $\beta = 0.01$.
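A minimal NumPy sketch of these two losses follows; the function names and the numerical clipping constant are illustrative additions, not from the paper.

```python
# Illustrative NumPy versions of Eqs. (2) and (3); not the authors' code.
import numpy as np

def voxel_bce_loss(p, p_star, eps=1e-7):
    """Binary cross-entropy averaged over all N locations (Eq. 2)."""
    p = np.clip(p, eps, 1.0 - eps)  # clipping added here only to avoid log(0)
    return -np.mean(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))

def pose_l1_loss(pose, pose_star, alpha=1.0, beta=0.01, gamma=1.0):
    """L1 regression loss on the 6-D pose p = [theta1, theta2, theta3, tu, tv, tZ] (Eq. 3)."""
    err = np.abs(np.asarray(pose) - np.asarray(pose_star))
    return alpha * err[:3].sum() + beta * err[3:5].sum() + gamma * err[5]
```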
Experiments
• Object categories: car, airplane, chair, sofa
• Datasets:
  • 3D-R2N2 dataset – rendered ShapeNet objects
  • PASCAL 3D+ dataset – real images manually associated with a limited set of CAD models
Experiments
• Implementation details:
  • Network implemented in TensorFlow
  • Input image size: 128x128x3
  • Output voxel grid: 32x32x32
• Running time:
  • ~18 ms per image (i.e., running at about 55 fps)
  • (Tested with a batch of 24 images on an NVIDIA Tesla M40 GPU)
Experiments • Results on the 3D-R2N2 dataset (rendered ShapeNet objects) • Ablation study:
Experiments • Results on the 3D-R2N2 dataset (rendered ShapeNet objects) • Ablation study:
Experiments • Results on the 3D-R2N2 dataset (rendered ShapeNet objects)
Experiments
• Results on the PASCAL 3D+ dataset (real images)
Summary
• A novel 3D reconstruction neural network structure
• Embedding domain knowledge (3D-2D perspective geometry) into a DNN
• Performing reconstruction jointly with segmentation and pose estimation
• A novel, GPU-friendly Probabilistic Single-view Visual Hull (PSVH) layer
