Object Detection
Computer Vision Tasks
[Figure: Image Classification (what?) vs. Object Detection (what + where?), with detections such as ✓ boat and ✓ person]
Comparing Boxes: Intersection over Union (IoU)
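IoU is the area of overlap between two boxes divided by the area of their union; detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

    def iou(box_a, box_b):
        # Corners of the intersection rectangle
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        # Clamp to zero when the boxes do not overlap at all
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143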
Region-based Convolutional Neural Network (RCNN)
• Instead of working on a massive number of regions, the RCNN algorithm proposes a limited set of boxes in the image and checks whether any of these boxes contains an object.
• RCNN uses selective search to extract these boxes from an image (these boxes are called regions).
Selective Search
• It first takes an image as input.
• Then, it generates initial sub-segmentations so that we have multiple regions from this image.
• Then it combines the similar regions to form a larger region, based on color similarity, texture similarity, size similarity, and shape compatibility (see the sketch after this list).
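As an illustration, OpenCV ships a selective search implementation in its contrib modules; a minimal sketch, assuming the opencv-contrib-python package is installed and a hypothetical input file image.jpg:

    import cv2  # the ximgproc module requires opencv-contrib-python

    img = cv2.imread("image.jpg")  # hypothetical input image
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()  # faster, coarser mode
    rects = ss.process()              # array of (x, y, w, h) proposals
    proposals = rects[:2000]          # RCNN keeps roughly 2,000 proposals per image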
RCNN: Problems
• Extracting 2,000 regions for each image via selective search is expensive.
• CNN features are extracted for every region of every image: with N images, that is N * 2,000 feature extractions.
• The entire process of object detection using RCNN involves three models:
• a CNN for feature extraction,
• a linear SVM classifier for identifying objects,
• a regression model for tightening the bounding boxes.
• All these steps combine to make RCNN very slow: it takes around 40-50 seconds to make predictions for each new image.
RCNN: Problem of Fixed-Size Input
• The CNN is followed by fully connected layers, which can only accept input of a fixed size.
• This makes the CNN incapable of accepting variable-size inputs, so images are first reshaped to a specific dimension before being fed into the CNN.
• This creates another issue: image warping and reduced resolution. Spatial Pyramid Pooling (SPP) was introduced as a counter to this problem (sketched below).
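As a sketch of the idea, using PyTorch's adaptive pooling (the pyramid levels 4, 2, 1 here are an assumption, not the paper's exact configuration): each level pools the feature map to a fixed grid, so the concatenated output length is the same no matter the input size.

    import torch
    import torch.nn.functional as F

    def spp(feature_map, levels=(4, 2, 1)):
        # feature_map: (batch, channels, H, W) with arbitrary H and W
        pooled = []
        for n in levels:
            p = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
            pooled.append(p.flatten(start_dim=1))  # (batch, channels * n * n)
        return torch.cat(pooled, dim=1)            # fixed length: channels * (16 + 4 + 1)

    x = torch.randn(1, 512, 13, 17)  # a variable-size conv feature map
    print(spp(x).shape)              # torch.Size([1, 10752]) regardless of H, W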
What’s wrong with SPP-net?
• Training is still slow (though better than RCNN).
• It introduces a new problem: the parameters below the SPP layer cannot be updated during training.
Fast RCNN
• Instead of running a CNN 2,000 times per image, we can run it just once per image and get all the regions of interest (regions containing some object).
• In Fast RCNN, we feed the input image to the CNN, which in turn generates the convolutional feature maps.
• Using these maps, the region proposals are extracted.
• We then use an RoI pooling layer to reshape all the proposed regions to a fixed size, so that they can be fed into a fully connected network.
• A softmax layer on top of the fully connected network outputs the classes. In parallel with the softmax layer, a linear regression layer outputs bounding box coordinates for the predicted classes (see the sketch below).
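A minimal PyTorch sketch of such a two-headed network (layer sizes here are illustrative assumptions, not the paper's exact architecture):

    import torch.nn as nn

    class FastRCNNHead(nn.Module):
        def __init__(self, in_features=3 * 3 * 512, num_classes=21):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(in_features, 4096), nn.ReLU(),
                nn.Linear(4096, 4096), nn.ReLU(),
            )
            self.cls_score = nn.Linear(4096, num_classes)      # softmax applied in the loss
            self.bbox_pred = nn.Linear(4096, num_classes * 4)  # box coordinates per class

        def forward(self, pooled_rois):
            x = self.fc(pooled_rois.flatten(start_dim=1))  # RoI-pooled features in
            return self.cls_score(x), self.bbox_pred(x)    # two parallel outputs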
Cropping Features: RoI Pool
The model takes an image input of size 512x512x3 (width x height x RGB), and VGG16 maps it into a 16x16x512 feature map.
Note that the output's width and height are exactly 32 times smaller than the input image (512/32 = 16). That's important because all RoIs must be scaled down by this factor.
• Consider one RoI: its original size is 200x145 (width x height) and its top-left corner is at (x=296, y=192). As you can probably tell, most of those numbers are not divisible by 32:
• width: 200/32 = 6.25
• height: 145/32 ≈ 4.53
• x: 296/32 = 9.25
• y: 192/32 = 6
After the RoI pooling layer there is a fully connected layer with a fixed input size. Because our RoIs have different sizes, we have to pool them into the same size (3x3x512 in our example). At this point our mapped RoI has a size of 4x6x512, and, as you can imagine, we cannot divide 4 evenly by 3, so the bin edges have to be quantized yet again.
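The slide's numbers can be reproduced with torchvision's RoI Pool operator; a minimal sketch (random feature values, since only the shapes matter here):

    import torch
    from torchvision.ops import roi_pool

    feature_map = torch.randn(1, 512, 16, 16)  # VGG16 output for a 512x512 image
    # One RoI in input-image coordinates: (batch_index, x1, y1, x2, y2);
    # top-left corner (296, 192), size 200x145 as in the example above.
    rois = torch.tensor([[0, 296.0, 192.0, 296.0 + 200.0, 192.0 + 145.0]])
    # spatial_scale = 1/32 projects image coordinates onto the 16x16 map;
    # the op quantizes the RoI (to about 6x4 cells) and pools it to 3x3.
    out = roi_pool(feature_map, rois, output_size=(3, 3), spatial_scale=1.0 / 32)
    print(out.shape)  # torch.Size([1, 512, 3, 3]) -- fixed size for the FC layer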
Problems with Fast RCNN
• Fast RCNN still has certain problem areas.
• It still uses selective search as the proposal method to find the Regions of Interest, which is a slow and time-consuming process.
• It takes around 2 seconds per image to detect objects, which is much better than RCNN.
• But when we consider large real-life datasets, even Fast RCNN doesn't look so fast anymore.
Faster RCNN
• Faster RCNN is the modified version of Fast RCNN. The major difference between them is that Fast RCNN uses selective search to generate Regions of Interest, while Faster RCNN replaces it with a learned Region Proposal Network (RPN).
• The RPN slides over the convolutional feature map: we extract a descriptor per location and predict object proposals from it.
YOLO (You Only Look Once!)
• YOLO is a real-time object detection algorithm. It was developed by Joseph
Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi at the University of
Washington (2015).
• YOLO is extremely fast because it passes the entire image at once through a CNN, rather than making predictions on many individual regions of the image.
• The key idea behind YOLO is to use a single neural network to predict the bounding boxes and class probabilities for objects in an image.
YOLO (You Only Look Once!)
• YOLO divides the input image into a grid of cells and predicts the presence of objects in
each cell.
• If an object is detected in a cell, the algorithm also predicts the bounding box and the
class for the object.
• The bounding box coordinates and class probabilities are then used to localize and
classify the objects.
Each object in a training image is assigned to the grid cell that contains that object's midpoint.
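A minimal sketch of this assignment (assuming an S x S grid and midpoints normalized to [0, 1]; the helper name is illustrative):

    def assign_cell(x_mid, y_mid, S=7):
        # The cell is found by scaling the midpoint to grid units and flooring;
        # min() keeps a midpoint of exactly 1.0 inside the last cell.
        col = min(int(x_mid * S), S - 1)
        row = min(int(y_mid * S), S - 1)
        return row, col

    print(assign_cell(0.55, 0.30))  # (2, 3) on a 7x7 grid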
YOLO: Anchor Boxes
• One of the caveats of YOLO is that it can't detect multiple objects in the same grid cell.
• Solution: anchor boxes. An anchor box is a predefined bounding box used in object detection algorithms.
• Anchor boxes define the size and aspect ratio of the detection window, and they are defined prior to training the object detection model. The model is then trained to predict the bounding box coordinates and class probabilities for objects relative to the anchor boxes.
With anchor boxes, each object in a training image is assigned to the grid cell that contains the object's midpoint, and within that cell to the anchor box with the highest IoU with the object. [Figure: per-grid target label]
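A minimal sketch of picking the anchor, reusing the iou() helper from earlier (anchors are treated here as width/height shapes centered on the object's midpoint):

    def best_anchor(gt_w, gt_h, anchor_shapes):
        # Center all boxes at the origin so only the shapes are compared.
        gt = (-gt_w / 2, -gt_h / 2, gt_w / 2, gt_h / 2)
        ious = [iou(gt, (-w / 2, -h / 2, w / 2, h / 2)) for (w, h) in anchor_shapes]
        return max(range(len(anchor_shapes)), key=lambda i: ious[i])

    # A tall 40x100 object matches the tall anchor (index 1), not the wide one:
    print(best_anchor(40, 100, [(100, 40), (40, 100)]))  # 1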
Putting it together: the YOLO algorithm
• Two anchor boxes are used per grid cell.
• The final detections are the non-max-suppressed outputs (a minimal sketch follows).
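A minimal sketch of non-max suppression, reusing the iou() helper from earlier (in practice YOLO first discards low-confidence boxes and runs this per class):

    def nms(boxes, scores, iou_threshold=0.5):
        # Process boxes from highest to lowest score.
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            # Drop every remaining box that overlaps the kept box too much.
            order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
        return keep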
Detection evaluation
The mAP formula is based on the following sub-metrics:
• Confusion Matrix,
• Intersection over Union (IoU),
• Recall,
• Precision.
Confusion Matrix
• To create a confusion matrix, we need four attributes:
• True Positives (TP): the model predicted a label that correctly matches the ground truth.
• True Negatives (TN): the model does not predict a label, and none is part of the ground truth.
• False Positives (FP): the model predicted a label, but it is not part of the ground truth.
• False Negatives (FN): the model does not predict a label, but it is part of the ground truth.
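A minimal sketch of counting these for one class of detections, reusing the iou() helper (greedy matching at an assumed IoU threshold of 0.5):

    def count_tp_fp_fn(pred_boxes, gt_boxes, iou_threshold=0.5):
        # pred_boxes ideally sorted by descending confidence
        matched = set()
        tp = 0
        for p in pred_boxes:
            # Find the best still-unmatched ground-truth box for this prediction.
            best, best_iou = None, 0.0
            for j, g in enumerate(gt_boxes):
                overlap = iou(p, g)
                if j not in matched and overlap > best_iou:
                    best, best_iou = j, overlap
            if best is not None and best_iou >= iou_threshold:
                matched.add(best)  # true positive: matches an unclaimed object
                tp += 1
        fp = len(pred_boxes) - tp          # predictions with no matching object
        fn = len(gt_boxes) - len(matched)  # objects the model missed
        return tp, fp, fn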
Detection evaluation
• Precision measures how many of the "positive" predictions made by the model were correct.
• Recall measures how many of the positive class samples present in the dataset were correctly identified by the model.
• Precision and recall involve a trade-off: improving one typically comes at the cost of the other.
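In terms of the confusion-matrix counts above:

    def precision(tp, fp):
        return tp / (tp + fp)  # fraction of the model's detections that were correct

    def recall(tp, fn):
        return tp / (tp + fn)  # fraction of the ground-truth objects that were found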
mAP
• The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes.
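A minimal sketch of AP as the area under the precision-recall curve (a simple rectangular approximation; benchmarks such as PASCAL VOC and COCO apply their own interpolation rules):

    def average_precision(recalls, precisions):
        # recalls sorted ascending, paired with their precision values
        ap, prev_r = 0.0, 0.0
        for r, p in zip(recalls, precisions):
            ap += (r - prev_r) * p
            prev_r = r
        return ap

    def mean_average_precision(per_class_ap):
        return sum(per_class_ap) / len(per_class_ap)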