CS6501: Deep Learning for Visual Recognition
Object Detection:
RCNN, Fast-RCNN, Faster-RCNN
Today’s Class
• Object Detection
• The RCNN Object Detector (2014)
• The Fast RCNN Object Detector (2015)
• The Faster RCNN Object Detector (2015)
• YOLO (CVPR 2016)
• SSD (ECCV 2016)
Object Detection
deer
cat
Object Detection
Class Scores (Fully Connected: 4096 to k)
• Deer: 0.9
• Cat: 0.05
• Umbrella: 0.01
• …
Box Coordinates (Fully Connected: 4096 to 4)
• (x, y, w, h)
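A minimal sketch of the two output heads on this slide: one fully connected layer maps the 4096-d feature to k class scores, another maps it to 4 box coordinates. The softmax, the random weights, and the function name `detection_heads` are assumptions for illustration, not something the slides specify.

```python
import numpy as np

def detection_heads(feature, w_cls, b_cls, w_box, b_box):
    """Map one 4096-d feature vector to k class scores and 4 box coordinates."""
    logits = feature @ w_cls + b_cls       # fully connected: 4096 to k
    exp = np.exp(logits - logits.max())    # subtract the max for numerical stability
    class_scores = exp / exp.sum()         # softmax (assumed), scores sum to 1
    box = feature @ w_box + b_box          # fully connected: 4096 to 4, read as (x, y, w, h)
    return class_scores, box

# Hypothetical random weights, only to show the shapes involved.
rng = np.random.default_rng(0)
k = 3  # e.g. deer, cat, umbrella
feature = rng.standard_normal(4096)
class_scores, box = detection_heads(
    feature,
    rng.standard_normal((4096, k)) * 0.01, np.zeros(k),
    rng.standard_normal((4096, 4)) * 0.01, np.zeros(4),
)
```

Note that a single box head like this only fits images with one object, which is the limitation the next slides point at.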
Object Detection
4096-dimensional feature vector:
Deer: (x, y, w, h)
Cat: (x, y, w, h)
Object Detection
4096-dimensional feature vector:
Penguin: (x, y, w, h)
Penguin: (x, y, w, h)
Penguin: (x, y, w, h)
Penguin: (x, y, w, h)
…
Object Detection as Classification
CNN: deer? cat? background?
Object Detection as Classification with Sliding Window
CNN: deer? cat? background?
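The sliding-window formulation can be sketched as a generator of candidate boxes, each of which would be cropped and handed to the CNN classifier. The function name, window size, and stride are illustrative assumptions.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window position that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Toy example: an 8 x 8 image, 4 x 4 windows, stride 2.
windows = list(sliding_windows(img_w=8, img_h=8, win_w=4, win_h=4, stride=2))
```

Even this toy setting yields 9 windows for one window size. Covering several scales and aspect ratios multiplies that count, and every window costs a CNN forward pass.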
Object Detection as Classification with Box Proposals
RCNN
https://people.eecs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Rich feature hierarchies for accurate object detection and semantic segmentation.
Girshick et al. CVPR 2014.
RCNN
First stage: generate category-independent region proposals.
• 2000 region proposals for every image
Selective Search: combine the strength of both an exhaustive search and segmentation.
Uijlings et al. IJCV 2013.
RCNN
Second stage: extracts a fixed-length feature vector from each region.
• a 4096-dimensional feature vector from each region proposal
• Arbitrary rectangles? A fixed-size input? Warp each region to 227 x 227.
• CNN: 5 conv layers + 2 fully connected layers
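The warping step can be sketched as a crop followed by a resize to 227 x 227. The nearest-neighbour resize below is a stand-in chosen to keep the example dependency-free; it is an assumption, not necessarily the interpolation used in the original system, and `warp_region` is a hypothetical helper name.

```python
import numpy as np

def warp_region(image, box, size=227):
    """Crop box = (x, y, w, h) from an H x W x C image and warp it to size x size."""
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    # Nearest-neighbour sampling: map each output row/column to a source row/column.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols[None, :]]

# Toy image with distinct pixel values; the proposal is an arbitrary rectangle.
image = np.arange(100 * 200 * 3).reshape(100, 200, 3)
warped = warp_region(image, (10, 20, 50, 30))
```

The aspect ratio of the proposal is not preserved: a 50 x 30 rectangle and a 30 x 50 rectangle both end up as 227 x 227 inputs.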
RCNN
Third stage: a set of class-specific linear SVMs.
• object category and location
• linear SVM on the feature vector: people? horse? background?
• bounding box regression from the proposal location: x, y, w, h
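A small sketch of how a bounding box regressor refines a proposal. The parameterization (center shifts scaled by the proposal size, log-space width and height) follows the R-CNN paper as I understand it; the slides themselves do not spell it out, and `apply_box_deltas` is a hypothetical name.

```python
import math

def apply_box_deltas(proposal, deltas):
    """Refine a proposal (center x, center y, width, height) with predicted deltas."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (
        px + pw * dx,        # shift the center by a fraction of the proposal width
        py + ph * dy,        # shift the center by a fraction of the proposal height
        pw * math.exp(dw),   # scale the width in log space
        ph * math.exp(dh),   # scale the height in log space
    )

unchanged = apply_box_deltas((50, 50, 20, 10), (0, 0, 0, 0))
shifted = apply_box_deltas((50, 50, 20, 10), (0.5, -1, 0, 0))
```

Zero deltas leave the proposal unchanged, so the regressor only has to learn a correction relative to the proposal location.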
RCNN:
• Simple and scalable.
• Improves mAP.
• A multistage pipeline.
• Training is expensive in space and time (features are extracted from each region proposal in each image and written to disk).
• Object detection is slow.
Fast-RCNN:
• ?
Fast-RCNN
Idea: No need to recompute features for every box independently
https://arxiv.org/abs/1504.08083
Fast R-CNN. Girshick. ICCV 2015.
Fast-RCNN
Process the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
A region of interest (RoI) pooling layer extracts a fixed-length feature vector from the region feature map.
• FC + softmax on the feature vector: K + 1 categories
• FC + regressor on the feature vector: four real-valued numbers for each of the K object classes.
…
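RoI pooling can be sketched as max pooling over a grid of bins laid over the region, so regions of any size give a fixed-size output. This is a simplified version assuming the RoI is already given in integer feature-map cells; the 7 x 7 default output size and the name `roi_max_pool` are assumptions for illustration.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool the region roi = (x0, y0, x1, y1) of a (C, H, W) map to (C, out_h, out_w)."""
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    c, h, w = region.shape
    out = np.empty((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ya = (i * h) // out_h                      # bin start (floor)
        yb = ((i + 1) * h + out_h - 1) // out_h    # bin end (ceil), so no bin is empty
        for j in range(out_w):
            xa = (j * w) // out_w
            xb = ((j + 1) * w + out_w - 1) // out_w
            out[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

# Toy 1-channel 4 x 4 feature map, pooled to 2 x 2.
fm = np.arange(16).reshape(1, 4, 4)
pooled = roi_max_pool(fm, (0, 0, 4, 4), out_h=2, out_w=2)
```

Because the conv feature map is computed once per image, each additional RoI only costs this pooling step plus the fully connected layers.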
RCNN vs Fast-RCNN
Figure adapted from: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
RCNN:
• Simple and scalable.
• Improves mAP.
• A multistage pipeline.
• Training is expensive in space and time (features are extracted from each region proposal in each image and written to disk).
• Object detection is slow.
Fast-RCNN:
• Higher mAP.
• Single stage, end-to-end training.
• No disk storage is required for feature caching.
• Proposals are the computational bottleneck in detection systems.
Faster-RCNN:
• ?
Faster-RCNN
Idea: Integrate the Bounding Box Proposals as part of the CNN predictions
https://arxiv.org/abs/1506.01497
Ren et al. NIPS 2015.
Faster-RCNN
Region Proposal Networks (RPN):
• Shared conv layers (shared with Fast-RCNN) produce the feature map
• sliding window, n x n, over the feature map: n x n conv layer
• k anchor boxes at each sliding-window position
• cls layer (1x1 conv layer): 2k scores, object or not object
• reg layer (1x1 conv layer): 4k coordinates, bounding box proposal
…
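A sketch of how the k anchor boxes could be laid out over the feature map. The stride of 16 and the 3 scales x 3 aspect ratios (k = 9) are assumptions taken from common Faster R-CNN settings, not from the slides; `make_anchors` is a hypothetical helper.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (center x, center y, w, h) in pixels."""
    shapes = []
    for s in scales:
        for r in ratios:                   # r = h / w, with area kept at s * s
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))
    shapes = np.array(shapes)              # (k, 2)
    k = len(shapes)
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cxg, cyg = np.meshgrid(cx, cy)         # each (feat_h, feat_w)
    centers = np.stack([cxg, cyg], axis=-1).reshape(-1, 1, 2)
    centers = np.broadcast_to(centers, (feat_h * feat_w, k, 2))
    sizes = np.broadcast_to(shapes, (feat_h * feat_w, k, 2))
    return np.concatenate([centers, sizes], axis=-1).reshape(-1, 4)

anchors = make_anchors(feat_h=2, feat_w=3)
k = 9
cls_outputs_per_position = 2 * k   # object / not object for each anchor
reg_outputs_per_position = 4 * k   # box coordinates for each anchor
```

With k = 9, the cls layer emits 18 scores and the reg layer 36 coordinates at every sliding-window position.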
RCNN:
• Simple and scalable.
• Improves mAP.
• A multistage pipeline.
• Training is expensive in space and time (features are extracted from each region proposal in each image and written to disk).
• Object detection is slow.
Fast-RCNN:
• Higher mAP.
• Single stage, end-to-end training.
• No disk storage is required for feature caching.
• Proposals are the computational bottleneck in detection systems.
Faster-RCNN:
• Compute proposals with a deep convolutional neural network, the Region Proposal Network (RPN).
• Merge RPN and Fast R-CNN into a single network, enabling nearly cost-free region proposals.
YOLO: You Only Look Once
Idea: No bounding box proposal. A single regression problem, straight from image pixels to bounding box coordinates and class probabilities.
• extremely fast
• reason globally
• learn generalizable representations
https://arxiv.org/abs/1506.02640
Redmon et al. CVPR 2016.
YOLO: You Only Look Once
Divide the image into 7x7 cells.
Each cell acts as a detector.
• The detector predicts a class distribution for the object in its cell.
• The detector has 2 bounding-box predictors, each predicting a bounding box and a confidence score.
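The per-cell outputs can be sketched as one tensor of shape 7 x 7 x (2 * 5 + C): each of the 2 box predictors contributes 4 box values plus 1 confidence, followed by C class probabilities. C = 20 and the exact memory layout below are assumptions for illustration; real implementations may order the values differently.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size and boxes per cell from the slide; C = 20 is assumed

def split_yolo_output(pred):
    """Split an (S, S, B*5 + C) prediction into boxes, confidences, and class probabilities."""
    assert pred.shape == (S, S, B * 5 + C)
    per_box = pred[..., :B * 5].reshape(S, S, B, 5)
    boxes = per_box[..., :4]          # (S, S, B, 4) box coordinates
    confidence = per_box[..., 4]      # (S, S, B) one confidence per box
    class_probs = pred[..., B * 5:]   # (S, S, C) one class distribution per cell
    return boxes, confidence, class_probs

pred = np.zeros((S, S, B * 5 + C))
boxes, confidence, class_probs = split_yolo_output(pred)
# Class-specific score per box: box confidence times the cell's class probability.
scores = confidence[..., None] * class_probs[..., None, :]
```

Under these assumptions each cell outputs 30 numbers, and the whole image yields 7 * 7 * 2 = 98 candidate boxes in a single forward pass.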
SSD: Single Shot Detector
Idea: Similar to YOLO, but with a denser grid map and multiscale grid maps, plus data augmentation, hard negative mining, and other design choices in the network.
Liu et al. ECCV 2016.
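Hard negative mining can be sketched as keeping only the highest-loss background boxes, capped at a fixed ratio to the positives. The 3:1 cap and the function name are assumptions for illustration; the slide only names the technique.

```python
import numpy as np

def hard_negative_mask(loss, is_positive, neg_pos_ratio=3):
    """Return a boolean mask selecting the hardest negatives, at most ratio * #positives."""
    num_neg = int(min(neg_pos_ratio * is_positive.sum(), (~is_positive).sum()))
    neg_loss = np.where(is_positive, -np.inf, loss)   # positives can never be picked
    order = np.argsort(-neg_loss)                     # highest negative loss first
    mask = np.zeros_like(is_positive)
    mask[order[:num_neg]] = True
    return mask

# Toy example: one positive box and five negatives with different losses.
loss = np.array([0.1, 0.9, 0.5, 0.3, 0.7, 0.2])
is_positive = np.array([True, False, False, False, False, False])
mask = hard_negative_mask(loss, is_positive, neg_pos_ratio=2)
```

Only the selected negatives (here the two with losses 0.9 and 0.7) would contribute to the classification loss, which keeps the many easy background boxes from dominating training.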
Questions?