Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya DEEP LEARNING WORKSHOP Dublin City University 27-28 April 2017 Object Segmentation Day 2 Lecture 7
Object Segmentation Define the accurate boundaries of all objects in an image 2
Semantic Segmentation Label every pixel! Don’t differentiate instances (cows) Classic computer vision problem Slide Credit: CS231n 3
Instance Segmentation Detect instances, give category, label pixels “simultaneous detection and segmentation” (SDS) Slide Credit: CS231n 4
Object Segmentation: Datasets Pascal Visual Object Classes: 20 classes, ~5,000 images; Pascal Context: 540 classes, ~10,000 images 5
Object Segmentation: Datasets SUN RGB-D: 19 classes, ~10,000 images; Microsoft COCO: 80 classes, ~300,000 images 6
Object Segmentation: Datasets CityScapes: 30 classes, ~25,000 images; ADE20K: >150 classes, ~22,000 images 7
Semantic Segmentation Slide Credit: CS231n CNN COW Extract patch Run through a CNN Classify center pixel Repeat for every pixel 8
Semantic Segmentation Slide Credit: CS231n CNN Run “fully convolutional” network to get all pixels at once 9
Semantic Segmentation Slide Credit: CS231n CNN Problem 1: Smaller output due to pooling 10
Learnable upsampling Long et al. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015 Learnable upsampling! Slide Credit: CS231n 11
Reminder: Convolutional Layer Slide Credit: CS231n Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 12
Reminder: Convolutional Layer Slide Credit: CS231n Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 Dot product between filter and input 14
Reminder: Convolutional Layer Slide Credit: CS231n Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 15
Reminder: Convolutional Layer Slide Credit: CS231n Typical 3 x 3 convolution, stride 2 pad 1 Input: 4 x 4 Output: 2 x 2 Dot product between filter and input 17
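The output sizes quoted on the convolution reminder slides follow from the usual size formula. A minimal sketch, assuming square inputs and kernels; `conv_output_size` is a hypothetical helper name, not something from the slides:

```python
def conv_output_size(n, kernel=3, stride=1, pad=1):
    """Spatial size of a convolution output for an n x n input.

    Assumes a square input, a square kernel and symmetric zero padding.
    """
    return (n + 2 * pad - kernel) // stride + 1


# 3 x 3 convolution, stride 1, pad 1: a 4 x 4 input stays 4 x 4
same = conv_output_size(4, kernel=3, stride=1, pad=1)
# 3 x 3 convolution, stride 2, pad 1: a 4 x 4 input shrinks to 2 x 2
halved = conv_output_size(4, kernel=3, stride=2, pad=1)
print(same, halved)
```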
Learnable Upsample: Deconvolutional Layer Slide Credit: CS231n 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 18
Learnable Upsample: Deconvolutional Layer Slide Credit: CS231n 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter values 19
Learnable Upsample: Deconvolutional Layer Slide Credit: CS231n 3 x 3 “deconvolution”, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter values Sum where output overlaps 20
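The two steps on these slides ("input gives weight for filter values", "sum where output overlaps") can be sketched in plain Python. This is an illustrative sketch, not the CS231n code: `transposed_conv2d` is a hypothetical name, and it assumes one extra output row and column (`output_padding=1`), since with stride 2, pad 1 and a 3 x 3 kernel the formula `(n - 1) * stride - 2 * pad + kernel` alone gives 3 x 3 rather than the 4 x 4 shown.

```python
def transposed_conv2d(x, kernel, stride=2, pad=1, output_padding=1):
    """Single-channel transposed convolution on nested lists.

    Each input value scales a copy of the kernel, the copies are pasted
    `stride` pixels apart, and overlapping regions are summed.
    Assumption: output_padding <= pad, so cropping is enough.
    """
    assert output_padding <= pad
    n, k = len(x), len(kernel)
    full = (n - 1) * stride + k          # size before cropping the padding
    canvas = [[0.0] * full for _ in range(full)]
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    # input value weights the filter; overlaps accumulate
                    canvas[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    end = full - pad + output_padding    # crop `pad` at the start, less at the end
    return [row[pad:end] for row in canvas[pad:end]]


x = [[1, 2], [3, 4]]
ones = [[1, 1, 1]] * 3
out = transposed_conv2d(x, ones)
for row in out:
    print(row)
```

With an all-ones kernel the largest value lands where all four pasted copies overlap.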
Learnable Upsample: Deconvolutional Layer Warning: Checkerboard effect when kernel size is not divisible by the stride Source: distill.pub 21
Learnable Upsample: Deconvolutional Layer Warning: Checkerboard effect when kernel size is not divisible by the stride (e.g. stride = 2, kernel_size = 3) Source: distill.pub 22
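One way to see where the checkerboard comes from is to count, in 1-D, how many pasted kernel copies cover each output position. A small sketch (`overlap_counts` is a hypothetical helper; no padding is cropped):

```python
def overlap_counts(n_in, kernel_size, stride):
    """Number of kernel copies covering each output position (1-D, uncropped)."""
    length = (n_in - 1) * stride + kernel_size
    counts = [0] * length
    for i in range(n_in):
        for a in range(kernel_size):
            counts[i * stride + a] += 1
    return counts


uneven = overlap_counts(4, kernel_size=3, stride=2)  # 3 is not divisible by 2
even = overlap_counts(4, kernel_size=4, stride=2)    # 4 is divisible by 2
print(uneven)
print(even)
```

With kernel size 3 the interior alternates between one and two contributions, which is the uneven overlap behind the artifact; with kernel size 4 the interior is covered uniformly.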
Semantic Segmentation Slide Credit: CS231n Noh et al. Learning Deconvolution Network for Semantic Segmentation. ICCV 2015 “Regular” VGG “Upside down” VGG 23
Better Upsampling: Subpixel Rearrange features in previous convolutional layer to form a higher resolution output Shi et al. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. CVPR 2016 24
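The rearrangement can be sketched as a pure index shuffle from a `(C * r * r, H, W)` feature volume to a `(C, H * r, W * r)` output. This is a minimal sketch on nested lists; the function name and the channel ordering (channel `c * r * r + i * r + j` fills sub-pixel offset `(i, j)`) are assumptions made for illustration:

```python
def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) nested lists into (C, H*r, W*r).

    No arithmetic is done on the values: each group of r*r channels is
    interleaved spatially to form one higher-resolution channel.
    """
    channels_in = len(x)
    h, w = len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0
    c_out = channels_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for col in range(w):
                        out[c][y * r + i][col * r + j] = src[y][col]
    return out


# Four 1 x 1 channels become one 2 x 2 channel
features = [[[0]], [[1]], [[2]], [[3]]]
upsampled = pixel_shuffle(features, r=2)
print(upsampled)
```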
Semantic Segmentation CNN Problem 2: Blobby segmentations High-level features (e.g. conv5 layer) from a pretrained classification network are the input for the segmentation branch 25
Skip Connections Slide Credit: CS231n Skip connections = Better results “skip connections” Long et al. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015 Recovering low level features from early layers 26
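A rough sketch of the fusion idea: a coarse score map from a deep layer is upsampled and summed with a finer score map from an earlier layer. Nearest-neighbour upsampling is used here only as a simple stand-in for the learnable upsampling discussed earlier, and both function names are hypothetical:

```python
def upsample2x_nearest(score):
    """Double the resolution of a 2-D score map by repeating each value."""
    out = []
    for row in score:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_skip(coarse, fine):
    """Add an upsampled coarse score map to a finer one (element-wise)."""
    up = upsample2x_nearest(coarse)
    assert len(up) == len(fine) and len(up[0]) == len(fine[0])
    return [[u + f for u, f in zip(ru, rf)] for ru, rf in zip(up, fine)]


coarse = [[1]]              # deep layer: strong semantics, low resolution
fine = [[0, 1], [2, 3]]     # early layer: low-level detail, higher resolution
fused = fuse_skip(coarse, fine)
print(fused)
```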
Dilated Convolutions Yu & Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016 Structural change in convolutional layers for dense prediction problems (e.g. image segmentation) ● The receptive field grows exponentially as you add more layers (with the dilation doubling at each layer) → more context information in deeper layers than with regular convolutions ● Number of parameters increases linearly as you add more layers 27
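The receptive-field claim can be checked with a few lines of arithmetic. A sketch assuming stride-1 layers with 3 x 3 kernels, where each layer adds `(kernel_size - 1) * dilation` pixels to the receptive field; `receptive_field` is a hypothetical helper:

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (one side, in pixels) of stacked stride-1 conv layers."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


dilated = [receptive_field([2 ** i for i in range(n)]) for n in range(1, 5)]
regular = [receptive_field([1] * n) for n in range(1, 5)]
print(dilated)  # dilation doubles per layer
print(regular)  # dilation fixed at 1
```

Both stacks use the same number of weights per layer, so the parameter count grows linearly with depth in either case; only the spacing of the filter taps changes.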
Instance Segmentation Detect instances, give category, label pixels “simultaneous detection and segmentation” (SDS) Slide Credit: CS231n 28
Instance Segmentation More challenging than Semantic Segmentation ● Number of objects is variable ● No unique match between predicted and ground truth objects (cannot use instance IDs) Several attack lines: ● Proposal-based methods ● Recurrent Neural Networks 29
Proposal-based Slide Credit: CS231n Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014 External Segment proposals Mask out background with mean image Similar to R-CNN, but with segment proposals 30
Proposal-based Slide Credit: CS231n Hariharan et al. Hypercolumns for Object Segmentation and Fine-grained Localization. CVPR 2015 31
Proposal-based Instance Segmentation: MNC Dai et al. Instance-aware Semantic Segmentation via Multi-task Network Cascades. CVPR 2016 Won COCO 2015 challenge (with ResNet) Region proposal network (RPN) Reshape boxes to fixed size, figure / ground logistic regression Mask out background, predict object class Learn entire model end-to-end! Faster R-CNN for Pixel Level Segmentation in a multi-stage cascade strategy 32
Proposal-based Instance Segmentation: MNC Dai et al. Instance-aware Semantic Segmentation via Multi-task Network Cascades. CVPR 2016 Predictions Ground truth 33
Proposal-based Instance Segmentation: Mask R-CNN He et al. Mask R-CNN. arXiv Mar 2017 Faster R-CNN for Pixel Level Segmentation as a parallel prediction of masks and class labels 34
He et al. Mask R-CNN. arXiv Mar 2017 Mask R-CNN ● Classification & box detection losses are identical to those in Faster R-CNN ● Addition of a new loss term for mask prediction: The network outputs a K x m x m volume for mask prediction, where K is the number of categories and m is the size of the mask (square) 35
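A sketch of how a mask loss over a K x m x m output can be computed for one region, following the description in the Mask R-CNN paper: a per-pixel sigmoid with binary cross entropy, evaluated only on the mask of the ground-truth class. The function name, the nested-list layout and the toy values are assumptions for illustration:

```python
import math


def mask_loss(logits, gt_mask, gt_class):
    """Average per-pixel binary cross entropy on the ground-truth class mask.

    logits: K x m x m nested lists of raw scores (before the sigmoid).
    gt_mask: m x m nested lists of 0/1 labels.
    Only logits[gt_class] contributes; the other K - 1 masks are ignored.
    """
    m = len(gt_mask)
    total = 0.0
    for y in range(m):
        for x in range(m):
            z = logits[gt_class][y][x]
            t = gt_mask[y][x]
            # numerically stable form of -[t*log(s) + (1-t)*log(1-s)], s = sigmoid(z)
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / (m * m)


# K = 2 classes, m = 2
logits = [
    [[0.0, 0.0], [0.0, 0.0]],      # class 0: uninformative mask
    [[5.0, -5.0], [5.0, -5.0]],    # class 1: confident mask
]
gt_mask = [[1, 0], [1, 0]]
print(mask_loss(logits, gt_mask, gt_class=0))
print(mask_loss(logits, gt_mask, gt_class=1))
```

Because each class has its own mask, the masks do not compete across classes; the classification branch decides which one is used.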
He et al. Mask R-CNN. arXiv Mar 2017 Mask R-CNN: RoI Align Reminder: RoI Pool from Fast R-CNN Hi-res input image: 3 x 800 x 600 with region proposal → Convolution and Pooling → Hi-res conv features: C x H x W with region proposal → Max-pool within each grid cell → RoI conv features: C x h x w for region proposal → Fully-connected layers (expect low-res conv features: C x h x w) Problem: x/16 & rounding → misalignment! + not differentiable 36
Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Mask R-CNN: RoI Align Use bilinear interpolation instead of cropping + maxpool Mapping given by box coordinates (θ12 and θ21 = 0: translation + scale) 37
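The difference between the two approaches can be sketched in a few lines: rounding a coordinate after dividing by the feature stride shifts it by several input pixels, whereas bilinear interpolation reads the feature map at the exact fractional location. Both helper names are hypothetical, and the sample point is assumed to lie inside the feature map:

```python
import math


def misalignment_px(x_img, stride=16):
    """Shift (in input pixels) caused by rounding x_img / stride to an integer."""
    return abs(x_img - round(x_img / stride) * stride)


def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at a fractional location (y, x).

    Assumes 0 <= y <= H - 1 and 0 <= x <= W - 1.
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0][x0]
            + (1 - dy) * dx * feat[y0][x1]
            + dy * (1 - dx) * feat[y1][x0]
            + dy * dx * feat[y1][x1])


feat = [[0.0, 1.0], [2.0, 3.0]]
print(misalignment_px(100))              # 100 / 16 = 6.25 is rounded to 6
print(bilinear_sample(feat, 0.5, 0.5))   # value between the four neighbours
```

The interpolated value varies smoothly with the sampling location, which is what lets the box coordinates take part in the computation without quantization.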
He et al. Mask R-CNN. arXiv Mar 2017 Mask R-CNN Object Detection Instance Segmentation 39
Recurrent Instance Segmentation Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Sequential mask generation 40
Recurrent Instance Segmentation Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Mapping between ground truth and predicted masks? 41
Recurrent Instance Segmentation: Coverage Loss Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Slide Credit: M. Baradad, ReadCV@UPC 1-Compute the IoU for all pairs of Predicted/GT masks (Ŷt, Yt) 42
Recurrent Instance Segmentation: Coverage Loss Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Slide Credit: M. Baradad, ReadCV@UPC 1-Compute the IoU for all pairs of Predicted/GT masks IoU matrix entries: 0.9 0 0 0.1 0.8 0.1 ... 43
Recurrent Instance Segmentation: Coverage Loss Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Slide Credit: M. Baradad, ReadCV@UPC 2-Find best matching Loss: negative sum of the IoUs over the best matching 44
Recurrent Instance Segmentation: Coverage Loss Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Slide Credit: M. Baradad, ReadCV@UPC 3-Also take into account the scores s1 = 0.93 s2 = 0.73 s3 = 0.86 s4 = 0.63 s5 = 0.56, using the binary cross entropy and the Iverson bracket, which is 1 if the condition is true and 0 otherwise 45
Recurrent Instance Segmentation: Coverage Loss Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 Slide Credit: M. Baradad, ReadCV@UPC 4-Add everything together 46
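The four steps can be put together in a small sketch. Several simplifications are assumptions made here, not the authors' implementation: masks are hard 0/1 lists rather than soft predictions, the best matching is found by brute force over permutations instead of the Hungarian algorithm, the target for score s_t is taken to be the Iverson bracket [t ≤ n] with n the number of ground-truth instances, and `lam` is a hypothetical weighting term.

```python
import math
from itertools import permutations


def iou(a, b):
    """Intersection over union of two flat binary masks."""
    inter = sum(1 for p, q in zip(a, b) if p and q)
    union = sum(1 for p, q in zip(a, b) if p or q)
    return inter / union if union else 0.0


def best_matching(pred, gt):
    """Steps 1-2: score all pred/GT pairs and pick the best one-to-one matching.

    Brute force, so only suitable for a handful of instances.
    Returns (total IoU, tuple giving the predicted index for each GT mask).
    """
    assert len(pred) >= len(gt)
    best_total, best_perm = -1.0, None
    for perm in permutations(range(len(pred)), len(gt)):
        total = sum(iou(pred[p], gt[g]) for g, p in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm


def instance_loss(pred, scores, gt, lam=1.0):
    """Steps 3-4: negative matched IoU plus binary cross entropy on the scores.

    Scores must lie strictly between 0 and 1.
    """
    total_iou, _ = best_matching(pred, gt)
    n = len(gt)
    bce = 0.0
    for t, s in enumerate(scores):
        target = 1.0 if t < n else 0.0   # Iverson bracket [t <= n], 1-based t
        bce += -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))
    return -total_iou + lam * bce


pred = [[1, 1, 0, 0], [0, 0, 1, 1]]
gt = [[0, 0, 1, 1], [1, 1, 0, 0]]     # same instances, listed in another order
total, matching = best_matching(pred, gt)
loss = instance_loss(pred, [0.9, 0.8], gt)
print(total, matching, loss)
```

Because the matching is recomputed for each example, the loss does not depend on the order in which the instances are predicted.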
Recurrent Instance Segmentation Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016 47
Summary Segmentation Datasets Semantic Segmentation Methods ● Deconvolution ● Dilated Convolution ● Skip Connections Instance Segmentation Methods ● Proposal-Based ● Recurrent 48
Questions?

Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
