Authors: Chu Yi, Kai Heng, et al.
Meituan's Visual Intelligence Department recently developed YOLOv6, an object detection framework dedicated to industrial applications that targets both detection accuracy and inference efficiency. During development, the team kept exploring and optimizing while drawing on cutting-edge advances and research results from academia and industry. Experimental results on COCO, the authoritative object detection benchmark, show that YOLOv6 surpasses other algorithms of the same size in both detection accuracy and speed. It also supports deployment on a variety of platforms, which greatly simplifies the adaptation work during project deployment. The project has been open-sourced in the hope that it will help more developers.
YOLOv6 is an object detection framework developed by Meituan's Visual Intelligence Department and dedicated to industrial applications. The framework focuses on both detection accuracy and inference efficiency. Among the model sizes commonly used in industry, YOLOv6-nano reaches 35.0% AP on COCO with an inference speed of 1242 FPS on a T4 GPU, and YOLOv6-s reaches 43.1% AP on COCO at 520 FPS on a T4. For deployment, YOLOv6 supports multiple platforms such as GPU (TensorRT), CPU (OpenVINO) and ARM (MNN, TNN, NCNN), which greatly simplifies the adaptation work during project deployment. The project has been open-sourced on GitHub (portal: YOLOv6); readers who need it are welcome to star it for future reference.
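As a rough illustration of this deployment path (not the project's actual export tooling), the sketch below shows the typical way a PyTorch detector is exported to ONNX, the common intermediate format that TensorRT, OpenVINO and the mobile engines consume. The tiny stand-in model and the 640x640 input are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a real detector; any nn.Module with a forward() is exported the same way.
class TinyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

model = TinyDetector().eval()
dummy = torch.randn(1, 3, 640, 640)  # NCHW input expected by the detector

# Export to ONNX; the resulting file can then be fed to TensorRT, OpenVINO, etc.
torch.onnx.export(
    model,
    dummy,
    "detector.onnx",
    input_names=["images"],
    output_names=["outputs"],
    opset_version=13,
)
```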
Object detection, as a fundamental technology in computer vision, is widely used in industry, and the YOLO family of algorithms has gradually become the preferred framework for most industrial applications thanks to its good overall performance. Many YOLO detection frameworks have been derived so far, among which YOLOv5[1], YOLOX[2] and PP-YOLOE[3] are the most representative. Although they deliver strong performance, in practice we found that these frameworks still leave considerable room for improvement in both speed and accuracy. On this basis, we developed a new object detection framework, YOLOv6, by studying and drawing on existing advanced techniques in the industry. The framework supports the full chain of industrial requirements, including model training, inference and multi-platform deployment, and introduces a number of improvements and optimizations at the algorithm level, such as the network structure and training strategies. On the COCO dataset, YOLOv6 surpasses other algorithms of the same size in both accuracy and speed; the results are shown in Figure 1 below.
Figure 1-1 Performance comparison of YOLOv6 models of each size with other models
Figure 1-2 Performance comparison between YOLOv6 and other models at different resolutions

Figure 1-1 compares the detection algorithms across network sizes; the points on each curve represent the performance of that algorithm's models of different sizes (s/tiny/nano). As the figure shows, YOLOv6 surpasses other YOLO-series algorithms of the same size in both accuracy and speed. Figure 1-2 compares the detection models as the input resolution changes; the points on each curve, from left to right, represent the model's performance as the image resolution increases (384/448/512/576/640). As the figure shows, YOLOv6 maintains a large performance advantage across resolutions.
2. Introduction to key technologies of YOLOv6

YOLOv6 introduces improvements mainly in the Backbone, the Neck, the Head and the training strategies:
The Backbone and Neck used by YOLOv5/YOLOX are both built on CSPNet[5], using a multi-branch approach and residual structures. For hardware such as GPUs, this kind of structure increases latency to some extent and reduces memory-bandwidth utilization. Figure 2 below introduces the Roofline Model[8] from the field of computer architecture, which describes the relationship between compute capability and memory bandwidth on a given piece of hardware.
Figure 2 Roofline Model introduction diagram
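In short, the roofline model bounds a kernel's attainable throughput by the lesser of peak compute and memory bandwidth times arithmetic intensity, i.e. attainable performance = min(peak compute, memory bandwidth × arithmetic intensity), where arithmetic intensity is the number of operations performed per byte moved. A structure that wastes memory bandwidth or lowers arithmetic intensity therefore leaves compute capability idle, which is the intuition behind the redesign described below.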
Therefore, following the idea of hardware-aware neural network design, we redesigned and optimized the Backbone and Neck. This approach starts from the characteristics of the hardware and of the inference/compilation framework, takes hardware- and compiler-friendly structures as the design principle, and, when building the network, jointly considers compute capability, memory bandwidth, compiler optimization characteristics and network representation capacity, in order to obtain a network structure that is both fast and accurate. We call the two redesigned detection components EfficientRep Backbone and Rep-PAN Neck in YOLOv6. Their main contributions are:
The RepVGG[4]-style structure is a reparameterizable structure that uses a multi-branch topology during training and can be equivalently fused into a single 3x3 convolution for deployment (the fusion process is shown in Figure 3 below). The fused 3x3 convolution makes effective use of the compute capability of computation-intensive hardware (such as GPUs), and also benefits from the highly optimized NVIDIA cuDNN and Intel MKL libraries on GPU and CPU respectively.
Experiments show that with this strategy, YOLOv6 reduces hardware latency while significantly improving accuracy, making the detection network both faster and stronger. Taking the nano-size model as an example, compared with the network structure used by YOLOv5-nano, this approach improves speed by 21% and accuracy by 3.6% AP.
Figure 3 Fusion process of Rep operator[4]
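As a minimal sketch of the fusion in Figure 3, assuming batch normalization has already been folded into each branch (in practice BN folding is the first step) and that input and output channels match: the 1x1 kernel is zero-padded to 3x3 and the identity branch is written as a 3x3 identity kernel, so the three branches sum into one 3x3 convolution. This illustrates the RepVGG idea rather than reproducing the exact YOLOv6 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_repvgg_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Collapse a 3x3 branch, a 1x1 branch and an identity branch into a single
    3x3 conv (batch norm already folded, in_channels == out_channels)."""
    c = conv3x3.out_channels
    # Pad the 1x1 kernel to 3x3 so it can be summed with the 3x3 kernel.
    k1_padded = F.pad(conv1x1.weight, [1, 1, 1, 1])
    # Identity branch as a 3x3 kernel: a 1 at the spatial center of each channel's own filter.
    k_id = torch.zeros(c, c, 3, 3)
    for i in range(c):
        k_id[i, i, 1, 1] = 1.0
    fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv3x3.weight + k1_padded + k_id)
        fused.bias.copy_(conv3x3.bias + conv1x1.bias)
    return fused

# Quick check: outputs of the multi-branch and fused forms match.
c = 8
conv3, conv1 = nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, c, 1)
x = torch.randn(1, c, 16, 16)
y_multi = conv3(x) + conv1(x) + x
y_fused = fuse_repvgg_branches(conv3, conv1)(x)
print(torch.allclose(y_multi, y_fused, atol=1e-5))  # expected: True
```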
EfficientRep Backbone: For the Backbone, we designed an efficient backbone based on the Rep operator described above. Compared with the CSP-Backbone used by YOLOv5, it makes more efficient use of hardware compute (e.g. on GPUs) while retaining strong representation capability.
Figure 4 below shows the structure of the EfficientRep Backbone. We replace the ordinary stride-2 Conv layers in the Backbone with stride-2 RepConv layers, and redesign the original CSP-Block as a RepBlock, in which the first RepConv of each RepBlock transforms and aligns the channel dimension. In addition, we optimize the original SPPF into the more efficient SimSPPF.
Figure 4 EfficientRep Backbone structure diagram
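A simplified sketch of what one such backbone stage might look like: a stride-2 RepConv for downsampling followed by a RepBlock, i.e. a stack of RepVGG-style units shown here in their deploy-time single-3x3 form. The layer counts, channel widths and activation are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class RepConv(nn.Module):
    """Deploy-time form of a RepVGG-style unit: a single 3x3 conv + BN + ReLU.
    At training time this would carry the extra 1x1 and identity branches."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class RepBlock(nn.Module):
    """A stack of RepConv units; the first one aligns the channel dimension."""
    def __init__(self, in_ch, out_ch, n=3):
        super().__init__()
        layers = [RepConv(in_ch, out_ch)] + [RepConv(out_ch, out_ch) for _ in range(n - 1)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class EfficientRepStage(nn.Module):
    """One backbone stage: stride-2 RepConv downsampling followed by a RepBlock."""
    def __init__(self, in_ch, out_ch, n=3):
        super().__init__()
        self.down = RepConv(in_ch, out_ch, stride=2)
        self.block = RepBlock(out_ch, out_ch, n=n)

    def forward(self, x):
        return self.block(self.down(x))

stage = EfficientRepStage(64, 128)
print(stage(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])
```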
Rep-PAN: For the Neck, in order to make inference on hardware more efficient and achieve a better balance between accuracy and speed, we designed a more effective feature-fusion network structure for YOLOv6 based on the same hardware-aware neural network design idea.
Rep-PAN is based on the PAN[6] topology, replacing the CSP-Blocks used in YOLOv5 with RepBlocks and adjusting the operators in the overall Neck, with the goal of achieving efficient inference on hardware while maintaining good multi-scale feature-fusion capability (the Rep-PAN structure is shown in Figure 5 below).
Figure 5 Rep-PAN structure diagram
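A minimal sketch of the PAN-style fusion with RepBlocks in place of CSP-Blocks: a top-down pass that upsamples and concatenates, followed by a bottom-up pass that downsamples and concatenates. The rep_block helper here is a deploy-time stand-in (plain 3x3 conv + BN + ReLU stacks), and the channel layout is an assumption rather than the exact published topology.

```python
import torch
import torch.nn as nn

def rep_block(in_ch, out_ch, n=2):
    """Stand-in for the RepBlock sketched above: a stack of 3x3 conv + BN + ReLU units."""
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class RepPAN(nn.Module):
    """Illustrative PAN: top-down fusion with upsampling, then bottom-up fusion with
    stride-2 convs; RepBlocks sit where YOLOv5 would use CSP-Blocks."""
    def __init__(self, chs=(128, 256, 512)):
        super().__init__()
        c3, c4, c5 = chs
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # Top-down path
        self.reduce5 = nn.Conv2d(c5, c4, 1)
        self.td4 = rep_block(c4 + c4, c4)
        self.reduce4 = nn.Conv2d(c4, c3, 1)
        self.td3 = rep_block(c3 + c3, c3)
        # Bottom-up path
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.bu4 = rep_block(c3 + c3, c4)
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.bu5 = rep_block(c4 + c4, c5)

    def forward(self, p3, p4, p5):
        # Top-down: inject high-level semantics into finer feature maps.
        r5 = self.reduce5(p5)
        f4 = self.td4(torch.cat([self.up(r5), p4], dim=1))
        r4 = self.reduce4(f4)
        f3 = self.td3(torch.cat([self.up(r4), p3], dim=1))
        # Bottom-up: propagate localization detail back to coarser maps.
        o4 = self.bu4(torch.cat([self.down3(f3), r4], dim=1))
        o5 = self.bu5(torch.cat([self.down4(o4), r5], dim=1))
        return f3, o4, o5

neck = RepPAN()
p3, p4, p5 = (torch.randn(1, c, s, s) for c, s in [(128, 80), (256, 40), (512, 20)])
for f in neck(p3, p4, p5):
    print(f.shape)
```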
In YOLOv6, we adopt a decoupled detection head (Decoupled Head) and streamline its design. The original YOLOv5 head shares a single branch for classification and regression, while the YOLOX head decouples classification from regression and adds two extra 3x3 convolutional layers; this improves detection accuracy but also increases network latency to some extent.
Therefore, we streamlined the design of the decoupled head, balancing the representation capability of the relevant operators against their computational cost on hardware, and used a Hybrid Channels strategy to redesign a more efficient decoupled head structure. This reduces latency while maintaining accuracy and alleviates the extra latency overhead caused by the 3x3 convolutions in the decoupled head. In ablation experiments on the nano-size model, compared with a decoupled head with the same number of channels, accuracy improves by 0.2% AP and speed by 6.8%.
Figure 6 Efficient Decoupled Head structure diagram
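A rough sketch of a streamlined decoupled head for a single feature level: a shared 1x1 stem, then one 3x3 convolution in each of the classification and regression branches before their 1x1 prediction layers (where YOLOX would use two 3x3 convolutions per branch). The channel counts, the 4-value box output and the separate objectness output are assumptions for illustration, not the published head.

```python
import torch
import torch.nn as nn

class EfficientDecoupledHead(nn.Module):
    """Illustrative decoupled head for one feature level: a shared stem, then separate
    classification and regression branches, each with a single 3x3 conv before its
    1x1 prediction layer."""
    def __init__(self, in_ch, num_classes=80, mid_ch=128):
        super().__init__()
        def conv_bn_act(ci, co, k):
            return nn.Sequential(
                nn.Conv2d(ci, co, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(co), nn.SiLU(inplace=True))
        self.stem = conv_bn_act(in_ch, mid_ch, 1)       # shared channel reduction
        self.cls_conv = conv_bn_act(mid_ch, mid_ch, 3)  # one 3x3 in the cls branch
        self.reg_conv = conv_bn_act(mid_ch, mid_ch, 3)  # one 3x3 in the reg branch
        self.cls_pred = nn.Conv2d(mid_ch, num_classes, 1)
        self.reg_pred = nn.Conv2d(mid_ch, 4, 1)         # box regression outputs
        self.obj_pred = nn.Conv2d(mid_ch, 1, 1)         # objectness

    def forward(self, x):
        x = self.stem(x)
        cls = self.cls_pred(self.cls_conv(x))
        reg_feat = self.reg_conv(x)
        return cls, self.reg_pred(reg_feat), self.obj_pred(reg_feat)

head = EfficientDecoupledHead(in_ch=256)
cls, box, obj = head(torch.randn(1, 256, 40, 40))
print(cls.shape, box.shape, obj.shape)
```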
To further improve detection accuracy, we draw on advanced research from other detection frameworks in academia and industry: the anchor-free paradigm, the SimOTA label assignment strategy and the SIoU bounding-box regression loss.
Anchor-free paradigm
YOLOv6 adopts a more concise anchor-free detection approach. Anchor-based detectors require a clustering analysis before training to determine the optimal anchor set, which increases the detector's complexity to some extent; in addition, in some edge applications, carrying a large number of detection results between hardware stages introduces extra latency. The anchor-free paradigm has been widely adopted in recent years thanks to its strong generalization ability and simpler decoding logic. Our experiments with anchor-free detection show that, compared with the additional latency introduced by the complexity of an anchor-based detector, the anchor-free detector is 51% faster.
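To make the "simpler decoding logic" point concrete, the sketch below decodes anchor-free predictions in which each grid cell predicts its distances to the four box sides; only the grid coordinates and the stride are needed, with no anchor priors. The (left, top, right, bottom) parameterization is an assumption for the example, not necessarily the exact head output format.

```python
import torch

def decode_anchor_free(reg: torch.Tensor, stride: int) -> torch.Tensor:
    """Decode anchor-free box predictions.

    reg: (H, W, 4) distances (left, top, right, bottom) from each grid-cell
         center to the box sides, already in pixels.
    Returns (H*W, 4) boxes as (x1, y1, x2, y2) in image coordinates.
    """
    h, w, _ = reg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Grid-cell centers in image coordinates.
    cx = (xs.float() + 0.5) * stride
    cy = (ys.float() + 0.5) * stride
    l, t, r, b = reg.unbind(dim=-1)
    boxes = torch.stack([cx - l, cy - t, cx + r, cy + b], dim=-1)
    return boxes.reshape(-1, 4)

# A 2x2 grid at stride 32 with a constant 10-pixel extent in every direction.
reg = torch.full((2, 2, 4), 10.0)
print(decode_anchor_free(reg, stride=32))
```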
SimOTA label assignment strategy
To obtain more high-quality positive samples, YOLOv6 introduces the SimOTA[4] algorithm to dynamically assign positive samples and further improve detection accuracy. YOLOv5's label assignment strategy is based on shape matching and increases the number of positive samples through a cross-grid matching strategy so that the network converges quickly. However, this is a static assignment method and is not adjusted as training proceeds.
In recent years, many dynamic label assignment methods have emerged. These methods assign positive samples based on the network's outputs during training, thereby producing more high-quality positive samples and in turn further improving the network. For example, OTA[7] models sample matching as an optimal transport problem and obtains the best matching strategy under global information, improving accuracy; however, OTA relies on the Sinkhorn-Knopp algorithm, which lengthens training time. The SimOTA[4] algorithm instead uses a Top-K approximation to obtain the best matches, which greatly speeds up training. YOLOv6 therefore adopts the SimOTA dynamic assignment strategy and, combined with the anchor-free paradigm, increases the average detection accuracy of the nano-size model by 1.3% AP.
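A heavily simplified sketch of the dynamic-k idea behind SimOTA: build a cost from classification and IoU terms for every ground-truth/prediction pair, estimate k for each ground truth from the sum of its top IoUs, then take the k lowest-cost candidates as positives. Candidate pre-filtering by center priors and the resolution of predictions claimed by several ground truths are omitted, and the cost weights here are illustrative.

```python
import torch

def simota_assign(cls_cost, pair_iou, topk=10, iou_weight=3.0):
    """Simplified SimOTA-style dynamic label assignment.

    cls_cost: (num_gt, num_pred) classification cost per gt/prediction pair.
    pair_iou: (num_gt, num_pred) IoU per gt/prediction pair.
    Returns a (num_gt, num_pred) boolean matrix of positive assignments.
    """
    num_gt, num_pred = pair_iou.shape
    cost = cls_cost + iou_weight * (-torch.log(pair_iou + 1e-8))  # lower is better
    assigned = torch.zeros(num_gt, num_pred, dtype=torch.bool)
    for g in range(num_gt):
        # Dynamic k: sum of the top IoUs for this ground truth, clamped to at least 1.
        k = max(int(pair_iou[g].topk(min(topk, num_pred)).values.sum().item()), 1)
        idx = cost[g].topk(k, largest=False).indices  # k cheapest candidates
        assigned[g, idx] = True
    # Note: the full algorithm also resolves predictions matched to several ground
    # truths by keeping only the lowest-cost match; that step is omitted here.
    return assigned

cls_cost = torch.rand(2, 6)
pair_iou = torch.rand(2, 6)
print(simota_assign(cls_cost, pair_iou))
```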
SIoU bounding box regression loss
To further improve regression accuracy, YOLOv6 adopts the SIoU[9] bounding-box regression loss function to supervise network training. Training an object detection network generally requires at least two loss functions, a classification loss and a bounding-box regression loss, and their definitions often have a large impact on detection accuracy and training speed.
In recent years, commonly used bounding-box regression losses include the IoU, GIoU, CIoU and DIoU losses. These losses measure the gap between the predicted box and the target box by considering factors such as their overlap, center-point distance and aspect ratio, guiding the network to minimize the loss and improve regression accuracy; however, they do not take into account the orientation of the offset between the predicted box and the target box. The SIoU loss redefines the distance penalty by introducing the angle of the vector between the two boxes, effectively reducing the degrees of freedom of the regression, accelerating convergence and further improving regression accuracy. Using the SIoU loss on YOLOv6-s improves the average detection accuracy by 0.3% AP compared with the CIoU loss.
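For reference, below is a sketch of the SIoU loss following the angle/distance/shape decomposition described in the SIoU paper; the constants and numerical details here are illustrative and may differ from the loss actually implemented in YOLOv6.

```python
import torch

def siou_loss(pred, target, eps=1e-7, theta=4.0):
    """Illustrative SIoU loss for (x1, y1, x2, y2) boxes of shape (N, 4)."""
    # IoU term
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Widths, heights and centers
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # Smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Angle cost: sin(2 * arcsin(|dy| / center distance)), as in the paper
    s_cw, s_ch = cx_t - cx_p, cy_t - cy_p
    sigma = torch.sqrt(s_cw ** 2 + s_ch ** 2) + eps
    sin_alpha = (torch.abs(s_ch) / sigma).clamp(max=1 - eps)
    angle_cost = torch.sin(2 * torch.arcsin(sin_alpha))

    # Distance cost, modulated by the angle cost through gamma = 2 - angle_cost
    gamma = 2 - angle_cost
    rho_x = (s_cw / (cw + eps)) ** 2
    rho_y = (s_ch / (ch + eps)) ** 2
    dist_cost = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost: mismatch of widths and heights
    omega_w = torch.abs(w_p - w_t) / (torch.max(w_p, w_t) + eps)
    omega_h = torch.abs(h_p - h_t) / (torch.max(h_p, h_t) + eps)
    shape_cost = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist_cost + shape_cost) / 2

pred = torch.tensor([[10., 10., 50., 60.]])
target = torch.tensor([[12., 15., 48., 58.]])
print(siou_loss(pred, target))
```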
With the optimization strategies and improvements described above, YOLOv6 achieves excellent performance across models of different sizes. Table 1 below shows the ablation results for YOLOv6-nano; from these results, we can see that our self-designed detection network brings substantial gains in both accuracy and speed.
Table 1 YOLOv6-nano ablation experimental results

Table 2 below shows the results of YOLOv6 compared with other current mainstream YOLO-series algorithms.
Table 2 Comparison of the performance of YOLOv6 models of various sizes with other models
This article has introduced the Meituan Visual Intelligence Department's optimizations and practical experience with object detection frameworks. We rethought and optimized the training strategy, backbone network, multi-scale feature fusion and detection head of the YOLO-series framework and designed a new detection framework, YOLOv6, whose original motivation was to solve the practical problems encountered when deploying industrial applications.
While building the YOLOv6 framework, we explored and optimized some new methods, such as the self-developed EfficientRep Backbone, Rep-PAN Neck and Efficient Decoupled Head based on hardware-aware neural network design ideas, and we also drew on cutting-edge advances and results from academia and industry, such as the anchor-free paradigm, SimOTA and the SIoU regression loss. Experimental results on the COCO dataset show that YOLOv6 is among the best in both detection accuracy and speed.
In the future, we will continue to build and improve the YOLOv6 ecosystem. The main work includes the following aspects: