YOLO: Real-Time Object Detection

Humans glance at an image and instantly know what objects are in it and where they are. The human visual system is fast and accurate; machines have a much harder time. Finding an efficient method for real-time object detection has long been an active research problem. In 2015, researchers from the Allen Institute for AI, the University of Washington, and Facebook came together and developed YOLO (You Only Look Once), one of the fastest object detection models of its time.

Object detection is a general term for a collection of related computer vision and image processing tasks that involve identifying objects in a given frame. It is widely used for face recognition, ball tracking during a football match, image annotation, and similar applications.

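YOLO takes a completely different approach to detecting objects in a given frame than traditional models. Earlier detectors could locate objects too, but they did so by repurposing classifiers and running them over many windows or proposed regions of the image, which is slow.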

YOLO is a single convolutional neural network that predicts bounding boxes around objects and their class probabilities directly from the full image in one evaluation.

In the authors' words, they frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

[Figure: YOLO model]
Let’s discuss this in detail.
Model Architecture
[Figure: YOLO model architecture]

The YOLO architecture is inspired by the GoogLeNet (Inception) image classification model, and its convolutional layers are pretrained on ImageNet data. YOLO consists of 24 convolutional layers.

Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The last convolutional layer outputs a tensor with shape (7, 7, 1024), which is then flattened.
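To get a feel for why the 1 × 1 layers help, here is a rough weight count. It is only a sketch: the channel sizes (512, 256, 512) are illustrative assumptions, not values taken from the text above, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

# Illustrative channel counts, assumed for this example only.
c_in, c_mid, c_out = 512, 256, 512

# A 3x3 convolution applied directly to all 512 input channels.
direct = conv_weights(3, c_in, c_out)

# A 1x1 convolution first squeezes 512 channels down to 256,
# then the 3x3 convolution works on the smaller feature space.
reduced = conv_weights(1, c_in, c_mid) + conv_weights(3, c_mid, c_out)

print(direct, reduced)  # 2359296 1310720
```

With these assumed sizes, the 1 × 1 reduction cuts the weight count by a little under half.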

Two fully connected layers then act as a form of linear regression and output 7 × 7 × 30 values, which are reshaped into a (7, 7, 30) tensor.
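Where does the 30 come from? The short calculation below is a sketch under two assumptions: two boxes per grid cell, and 20 object classes (as in the PASCAL VOC dataset). The 20 is not stated in this post.

```python
S = 7    # grid size (the 7 x 7 in the output shape)
B = 2    # boxes predicted per grid cell
C = 20   # number of classes (assumed: PASCAL VOC)

values_per_box = 5                  # x, y, w, h, confidence
per_cell = B * values_per_box + C   # 2 * 5 + 20
total_outputs = S * S * per_cell    # size of the final output

# Size of the flattened (7, 7, 1024) tensor fed to the fully connected layers.
flattened_features = 7 * 7 * 1024

print(per_cell, total_outputs, flattened_features)  # 30 1470 50176
```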

This network runs at 45 frames per second with no batch processing on a Titan X GPU, and a fast version runs at more than 150 fps. This means it can process streaming video in real time with less than 25 milliseconds of latency.
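The latency figure follows from the frame rate. As a quick check, assuming frames are processed one at a time so that the time per frame is simply 1 / fps:

```python
def frame_time_ms(fps):
    """Average time spent per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

print(round(frame_time_ms(45), 1))   # base network: 22.2
print(round(frame_time_ms(150), 1))  # fast version: 6.7
```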


The YOLO network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. It divides the image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
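The grid assignment can be sketched in a few lines. The function name `responsible_cell` and the 448 × 448 image size are assumptions made for this example; the idea is simply to scale the object center to grid units and round down.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object
    center (cx, cy), given in pixels."""
    # Scale the center to grid units, round down, and clamp so that a
    # center lying exactly on the right or bottom edge stays in the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

# Object centered at pixel (200, 300) in a 448 x 448 image.
print(responsible_cell(200, 300, 448, 448))  # (4, 3)
```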

Each grid cell then predicts bounding boxes around objects, along with a confidence score for each box. The confidence score tells how confident the model is that the box contains an object and how accurate it thinks the box is. The confidence score should be zero if the grid cell does not contain any object.
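To make the per-cell output concrete, here is a sketch that splits one grid cell's 30 values into its boxes and the remaining per-class values. The layout (boxes first, then classes), the helper name `split_cell_prediction`, and the numbers are all assumptions for illustration; real implementations may order the values differently.

```python
def split_cell_prediction(cell, B=2, C=20):
    """Split one grid cell's flat prediction into boxes and class values.

    Assumed layout: B groups of (x, y, w, h, confidence), then C class values.
    """
    if len(cell) != B * 5 + C:
        raise ValueError("unexpected prediction length")
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs

# Made-up prediction for one cell.
cell = [0.5, 0.5, 0.2, 0.3, 0.9,   # box 1: fairly confident
        0.4, 0.6, 0.1, 0.1, 0.0]   # box 2: confidence 0, no object
cell += [0.0] * 20
cell[10 + 7] = 1.0                 # hypothetical class index 7

boxes, class_probs = split_cell_prediction(cell)
print(boxes[0], len(class_probs))
```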

Each bounding box is described by five values: X, Y, W, H and confidence. (X, Y) is the center coordinate of the bounding box, and W and H are its width and height. The confidence prediction represents the IOU (intersection over union) between the predicted box and any ground truth box.
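IOU itself is easy to compute. The sketch below assumes boxes are given as (center x, center y, width, height) in the same units; the helper names are made up for this example.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Overlap along each axis; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

predicted = (50, 50, 100, 100)
ground_truth = (100, 100, 100, 100)
print(iou(predicted, ground_truth))  # 2500 / 17500, about 0.143
```

An IOU of 1 means the predicted box matches the ground truth exactly, and 0 means the two boxes do not overlap at all.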

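At the same time, each grid cell predicts conditional class probabilities P(class | object), that is, the probability of each class given that the cell contains an object. Multiplying these by the confidence score of each box gives a class-specific confidence score for that box, which is the final detection result.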
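The multiplication step can be sketched as follows. The numbers are made up, and the three-class list is only for readability. Since the confidence is P(object) × IOU, each product works out to P(class) × IOU for that box.

```python
def class_scores(box_confidence, class_probs):
    """Multiply one box's confidence by each conditional class probability."""
    return [box_confidence * p for p in class_probs]

# Made-up values: P(class | object) for three hypothetical classes,
# and one box with a confidence score of 0.8.
class_probs = [0.1, 0.7, 0.2]
scores = class_scores(0.8, class_probs)

# The class with the highest score is the predicted label for this box.
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])
```

A box with low confidence gets low scores for every class, so it can be discarded with a simple threshold.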

From the above discussion and figure, we can see that the model runs on the full image and detects all objects simultaneously, which makes it much faster than earlier detection models.

YOLO does have some limitations:
  • In the YOLO model, each grid cell predicts only two boxes and can have only one class, which makes it harder to detect small objects that appear in groups.
  • It struggles to generalize to objects in new or unusual aspect ratios or configurations.

I will post object detection code and a performance comparison of YOLO with other models in an upcoming blog post. Stay tuned!
