Current leading object detectors are two-stage or single-stage networks built on repurposed deep-CNN classification backbones. YOLOv3 is one such well-known state-of-the-art single-stage detector: it receives an input image and divides it into an equal-sized grid, and the grid cells that contain an object's center are responsible for detecting that object.
This article presents a new mathematical method that assigns multiple grid cells to each object for accurate, tight-fit bounding box prediction. The researchers also propose an effective offline copy-paste data augmentation technique for object detection. The newly proposed method significantly outperforms some current state-of-the-art object detectors and promises better performance.
Object detection networks aim to locate objects in images and label them accurately with tightly matching bounding boxes. There are currently two main ways to achieve this. The first group, which leads in accuracy, is two-stage object detection, best represented by the region-based convolutional neural network (R-CNN) and its derivatives [Faster R-CNN: Towards real-time object detection with region proposal networks], [Fast R-CNN]. In contrast, the second group is known for excellent detection speed and light weight, and is called single-stage networks; representative examples are [You only look once: Unified, real-time object detection], [SSD: Single shot multibox detector], and [Focal loss for dense object detection]. A two-stage network relies on a region proposal network that generates candidate regions of the image likely to contain objects of interest. In a single-stage network, detection is handled simultaneously with classification and localization in a single forward pass. Single-stage networks are therefore typically lighter, faster, and easier to implement.
This work still follows the YOLO approach, specifically YOLOv3, and proposes a simple trick that lets multiple grid cells simultaneously predict an object's coordinates, class, and objectness confidence. The rationale behind assigning multiple grid cells per object is to increase the likelihood of predicting tight-fit bounding boxes by forcing several cells to work on the same object.
Some advantages of multi-grid allocation include:
(a) The object detector obtains multiple views of an object, rather than relying solely on one grid cell to predict the object's class and coordinates;
(b) Less random, less uncertain bounding box predictions, and thus higher precision and recall, because neighboring grid cells are trained to predict the same object's class and coordinates;
(c) A reduced imbalance between grid cells that contain objects of interest and those that do not.
Furthermore, since multi-grid allocation simply reuses existing network parameters, and requires no additional keypoint pooling layer or post-processing to regroup keypoints to their corresponding objects, as CenterNet and CornerNet do, it can be seen as a more natural way to achieve what anchor-free or keypoint-based object detectors are trying to achieve. In addition to multi-grid redundant annotation, the researchers also introduce a new offline copy-paste data augmentation technique for accurate object detection.
The figure above contains three objects: a dog, a bicycle, and a car. For brevity, we explain multi-grid assignment on a single object. The first image shows the bounding boxes of the three objects, with the dog's bounding box drawn in more detail. The second image shows a zoomed-in crop of the region around the center of the dog's bounding box. The grid cell containing the center of the dog's bounding box is labeled 0, while the eight grid cells surrounding it are labeled 1 through 8.
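To make this assignment concrete, here is a minimal Python sketch, not the paper's exact rule, that computes the cell holding a box center (label 0) and its up-to-eight surrounding cells (labels 1-8) on an S×S grid; the function name and the clipping at grid borders are our assumptions.

```python
# Minimal sketch of multi-grid cell assignment (illustrative only; the
# paper's exact neighbor-selection rule may differ).

def center_and_neighbor_cells(box, grid_size, img_w, img_h):
    """Return the cell holding the box center (label 0) followed by its
    up-to-eight surrounding cells (labels 1-8), clipped to the grid.

    box: (x_min, y_min, x_max, y_max) in pixels.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = int(cx / img_w * grid_size)   # column of the center cell
    row = int(cy / img_h * grid_size)   # row of the center cell
    cells = [(row, col)]                # label 0: the center cell
    for dr in (-1, 0, 1):               # labels 1-8: the eight neighbors
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            r, c = row + dr, col + dc
            if 0 <= r < grid_size and 0 <= c < grid_size:
                cells.append((r, c))
    return cells

# Example: a dog's box on a 416x416 image with a 13x13 output grid.
print(center_and_neighbor_cells((120, 180, 310, 400), 13, 416, 416))
```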
So far, we have described how, in standard YOLO, the single grid cell containing the center of an object's bounding box is responsible for annotating that object. Relying on only one grid cell per object for the difficult job of predicting the class and a precise, tight-fit bounding box raises several issues, such as:
(a) A huge imbalance between positive and negative grid cells, i.e., cells with and without an object center;
(b) Slow convergence of bounding boxes to the ground truth;
(c) A lack of multi-perspective (multi-angle) views of the object to be predicted.
A natural question to ask here is: "Most objects clearly span more than one grid cell, so is there a simple mathematical way to assign more of these cells to predict the object's class and coordinates together with the center cell?" The advantages would be (a) reduced imbalance, (b) faster convergence during training, since multiple grid cells now target the same object simultaneously, (c) an increased chance of predicting tight-fit bounding boxes, and (d) giving grid-based detectors such as YOLOv3 a multi-view rather than single-point view of objects. The newly proposed multi-grid allocation attempts to answer this question.
Ground-truth encoding
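No encoding details survive in this excerpt, so the following is a hedged sketch of what multi-grid ground-truth encoding could look like for one output scale: every assigned cell (the center cell and its neighbors) carries the same class and box, with the center offset expressed relative to each cell. The per-cell tensor layout and function name are assumptions, not the paper's exact scheme.

```python
import numpy as np

# Hedged sketch of multi-grid ground-truth encoding for one output scale.
# The per-cell layout [tx, ty, tw, th, objectness, one-hot class] is an
# assumption, not necessarily the paper's exact scheme.

def encode_targets(boxes, labels, grid_size, num_classes, img_size):
    """boxes: list of (x_min, y_min, x_max, y_max) in pixels;
    labels: matching list of class indices."""
    S = grid_size
    target = np.zeros((S, S, 5 + num_classes), dtype=np.float32)
    for (x0, y0, x1, y1), cls in zip(boxes, labels):
        cx = (x0 + x1) / 2 / img_size * S   # center in grid units
        cy = (y0 + y1) / 2 / img_size * S
        w, h = (x1 - x0) / img_size, (y1 - y0) / img_size
        row, col = int(cy), int(cx)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                r, c = row + dr, col + dc
                if not (0 <= r < S and 0 <= c < S):
                    continue
                # Offsets are relative to THIS cell, so neighbor cells
                # get values outside [0, 1] (e.g. 1.3 or -0.3) -- which
                # a plain sigmoid output cannot express.
                target[r, c, 0] = cx - c
                target[r, c, 1] = cy - r
                target[r, c, 2:4] = (w, h)
                target[r, c, 4] = 1.0          # objectness
                target[r, c, 5 + cls] = 1.0    # one-hot class score
    return target
```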
A. The Detection Network: MultiGridDet
MultiGridDet is an object detection network made lighter and faster by removing six darknet convolution blocks from YOLOv3. A convolution block consists of Conv2D, Batch Normalization, and LeakyReLU. The removed blocks are not taken from the classification backbone, Darknet53; instead, they are removed from the three multi-scale detection output networks, or heads, two from each. Although deeper networks generally perform well, networks that are too deep also tend to overfit quickly or slow down significantly.
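For reference, the sketch below shows one such convolution block in PyTorch (the framework choice is ours; the paper's implementation may differ):

```python
import torch.nn as nn

# One darknet-style convolution block: Conv2D -> BatchNorm -> LeakyReLU.
# MultiGridDet removes six such blocks from YOLOv3's three detection
# heads (two per head), leaving the Darknet53 backbone untouched.
def conv_block(in_ch, out_ch, kernel_size=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )
```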
B. The Loss Function
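The loss itself is not reproduced in this excerpt, but the figure below indicates that the coordinate activation is parameterized by β. A plausible extended-sigmoid form consistent with that plot, offered here purely as an assumption rather than the paper's formula, widens the plain sigmoid's (0, 1) output range so that a neighboring cell can predict a box center lying outside its own boundaries:

```python
import math

def coord_activation(x, beta=1.0):
    """Assumed extended sigmoid: beta * sigmoid(x) - (beta - 1) / 2.

    beta = 1 recovers YOLOv3's plain sigmoid with range (0, 1); beta > 1
    widens the range to (-(beta - 1) / 2, 1 + (beta - 1) / 2), so a
    neighbor cell can place a box center outside its own boundaries.
    """
    return beta / (1.0 + math.exp(-x)) - (beta - 1.0) / 2.0

# With beta = 2 the output range becomes (-0.5, 1.5):
print(coord_activation(0.0, beta=2.0))    # 0.5, midpoint unchanged
print(coord_activation(-6.0, beta=2.0))   # approaches -0.5
```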
Figure: Coordinate activation function plot with different β values.

C. Data Augmentation
The offline copy-paste training image synthesis works as follows. First, a simple image-search script downloads thousands of object-free background images from Google Images using keywords such as landmark, rain, forest, etc., i.e., images that do not contain the objects of interest. Then, p objects and their bounding boxes are iteratively selected from q random images of the entire training dataset. Next, all possible combinations of the p selected bounding boxes are generated, using their indices as IDs. From this combined set, a subset of bounding boxes satisfying the following two conditions is selected:
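The two conditions themselves are not included in this excerpt, so the sketch below uses two plausible stand-ins, that pasted boxes must fit inside the background image and must not overlap one another; both checks, and all names, are our assumptions rather than the paper's conditions.

```python
import itertools

# Hedged sketch of the selection step. The paper's two conditions are
# not given in this excerpt; the checks below are stand-ins.

def boxes_overlap(a, b):
    """Axis-aligned overlap test for (x0, y0, x1, y1) boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or
                a[3] <= b[1] or b[3] <= a[1])

def select_paste_set(candidate_boxes, bg_w, bg_h, p):
    """Return the indices of the first p-box combination that satisfies
    both (assumed) conditions, or None if no combination qualifies."""
    for combo in itertools.combinations(range(len(candidate_boxes)), p):
        boxes = [candidate_boxes[i] for i in combo]
        # Assumed condition 1: every box fits inside the background.
        if any(b[2] > bg_w or b[3] > bg_h for b in boxes):
            continue
        # Assumed condition 2: no two pasted boxes overlap.
        if any(boxes_overlap(a, b)
               for a, b in itertools.combinations(boxes, 2)):
            continue
        return combo
    return None
```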
Table: Performance comparison on the COCO dataset.
Figure: The first row shows six input images; the second row shows the network's predictions before non-maximum suppression (NMS); the last row shows MultiGridDet's final bounding box predictions for the input images after NMS.