The first diffusion model for object detection, better than Faster R-CNN and DETR, detects directly from random boxes


WBOY

Apr 13, 2023, 08:34 AM


Diffusion models, the new state of the art (SOTA) among deep generative models, have surpassed the previous SOTA, such as GANs, on image generation tasks, and perform well in many application areas, including computer vision, NLP, molecular graph modeling, and time series modeling.

Recently, Luo Ping's team at the University of Hong Kong and researchers from Tencent AI Lab jointly proposed DiffusionDet, a new framework that applies the diffusion model to object detection. To the authors' knowledge, no previous work has successfully applied diffusion models to object detection, which makes this the first work to do so.

How well does DiffusionDet perform? Evaluated on the MS-COCO dataset with ResNet-50 as the backbone, DiffusionDet achieves 45.5 AP with a single sampling step, significantly better than Faster R-CNN (40.2 AP) and DETR (42.0 AP), and on par with Sparse R-CNN (45.0 AP). Increasing the number of sampling steps further improves DiffusionDet to 46.2 AP. DiffusionDet also performs well on the LVIS dataset, achieving 42.1 AP with Swin-Base as the backbone.


Paper address: https://arxiv.org/pdf/2211.09788.pdf
Project address: https://github.com/ShoufaChen/DiffusionDet

The study observes a drawback of traditional object detectors: they rely on a fixed set of learnable queries. The researchers therefore asked: is there a simple approach to object detection that does not even require learnable queries?

To answer this question, the paper proposes DiffusionDet, a framework that detects objects directly from a set of random boxes. It formulates object detection as a denoising diffusion process from noisy boxes to object boxes. This noise-to-box approach requires neither heuristic object priors nor learnable queries, which further simplifies the object candidates and advances the detection pipeline.

As shown in Figure 1 below, the study argues that the noise-to-box paradigm is analogous to the noise-to-image process in denoising diffusion models, a class of likelihood-based models that generate an image by gradually removing noise from it with a learned denoising model.

[Figure 1]

DiffusionDet solves object detection with a diffusion model: detection is treated as a generative task over the space of bounding box positions (center coordinates) and sizes (width and height) in the image. In the training phase, Gaussian noise controlled by a variance schedule is added to the ground truth boxes to obtain noisy boxes. These noisy boxes are then used to crop region-of-interest (RoI) features from the output feature maps of a backbone encoder (such as ResNet or Swin Transformer). Finally, the RoI features are sent to the detection decoder, which is trained to predict the noise-free ground truth boxes. In the inference phase, DiffusionDet generates bounding boxes by reversing the learned diffusion process, which transforms the noisy prior distribution into the learned distribution over bounding boxes.

Method Overview

Because the diffusion model generates data samples iteratively, the model f_θ must be run multiple times during inference. However, applying f_θ directly to the raw image at every iteration step is computationally expensive. The researchers therefore split the model into two parts: an image encoder and a detection decoder. The former runs only once to extract a deep feature representation from the raw input image; the latter takes this deep feature as a condition, instead of the raw image, and progressively refines the box predictions from the noisy boxes z_t.

The image encoder takes the raw image as input and extracts its high-level features for the detection decoder. The researchers implement DiffusionDet with both convolutional neural networks such as ResNet and Transformer-based models such as Swin. A feature pyramid network is used to generate multi-scale feature maps for both the ResNet and Swin backbones.

The detection decoder borrows from Sparse R-CNN: it takes a set of proposal boxes as input, crops RoI features from the feature maps generated by the image encoder, and sends them to the detection head to obtain box regression and classification results. The detection decoder consists of 6 cascaded stages.
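The split between a one-time encoder and a repeatedly applied decoder can be sketched roughly as follows. This is only an illustration of the control flow: `image_encoder`, `detection_decoder` and `detect` are hypothetical stand-ins with toy arithmetic, not the networks from the paper.

```python
import random

def image_encoder(image):
    """Hypothetical stand-in for the backbone: reduces the image to one number."""
    image_encoder.calls += 1
    return sum(image) / len(image)
image_encoder.calls = 0

def detection_decoder(features, boxes, t):
    """Hypothetical stand-in for the decoder: moves each box coordinate
    halfway toward the feature value (toy refinement, not a real detector)."""
    detection_decoder.calls += 1
    return [[0.5 * (c + features) for c in box] for box in boxes]
detection_decoder.calls = 0

def detect(image, num_boxes=4, num_steps=3, seed=0):
    rng = random.Random(seed)
    # The expensive encoder runs exactly once per image.
    features = image_encoder(image)
    # Start from random boxes (cx, cy, w, h) drawn from a Gaussian.
    boxes = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(num_boxes)]
    # The lightweight decoder runs once per sampling step on the cached features.
    for t in reversed(range(num_steps)):
        boxes = detection_decoder(features, boxes, t)
    return boxes
```

Calling `detect(image, num_steps=k)` invokes the encoder once and the decoder k times, which is why extra sampling steps cost far less than rerunning the whole model.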

Training

During training, the researchers first construct a diffusion process from ground truth boxes to noisy boxes, and then train the model to reverse this process. Algorithm 1 below provides pseudocode for the DiffusionDet training procedure.

[Algorithm 1: DiffusionDet training pseudocode]

Ground truth box padding. In modern object detection benchmarks, the number of instances of interest usually varies from image to image. The researchers therefore first pad the original ground truth boxes with extra boxes so that the total number of boxes is a fixed N_train. They explored several padding strategies, such as repeating existing ground truth boxes, concatenating random boxes, or concatenating image-sized boxes.

Box corruption. The researchers add Gaussian noise to the padded ground truth boxes. The noise scale is controlled by α_t in formula (1) below, which follows a monotonically decreasing cosine schedule over the time steps t.
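A minimal sketch of the "concatenate random boxes" strategy described above is given here. The function names, the normalized (cx, cy, w, h) box format, the Gaussian parameters and the handling of images with more than N_train boxes are all assumptions made for illustration, not details taken from the article.

```python
import random

def random_box(rng):
    """Draw one random box (cx, cy, w, h) in normalized coordinates.
    Mean 0.5 and std 1/6 are assumed values; w and h are kept positive."""
    cx, cy, w, h = (rng.gauss(0.5, 1.0 / 6.0) for _ in range(4))
    return [cx, cy, max(w, 1e-4), max(h, 1e-4)]

def pad_boxes(gt_boxes, n_train, rng=None):
    """Return exactly n_train boxes: the ground truth boxes followed by
    random boxes. If there are more than n_train ground truth boxes,
    a random subset is kept (assumption)."""
    rng = rng or random.Random()
    boxes = [list(b) for b in gt_boxes]
    if len(boxes) > n_train:
        return rng.sample(boxes, n_train)
    while len(boxes) < n_train:
        boxes.append(random_box(rng))
    return boxes

padded = pad_boxes([[0.5, 0.5, 0.2, 0.2]], n_train=5, rng=random.Random(0))
print(len(padded))  # 5
```

Padding to a fixed N_train gives every image in a batch the same number of boxes, which is what lets the decoder process them as one fixed-size set.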

[Formula (1)]
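For readers without access to the formula image, the corruption step can be sketched under the standard DDPM forward process, z_t = sqrt(ᾱ_t) · z_0 + sqrt(1 − ᾱ_t) · ε with ε drawn from a standard Gaussian. The cosine schedule below follows the commonly used form with offset s = 0.008. Both the exact form and the names `alpha_bar` and `corrupt_boxes` are assumptions here, and any extra scaling the authors may apply to the boxes is omitted.

```python
import math
import random

def alpha_bar(t, T, s=0.008):
    """Cumulative signal level under a cosine schedule: 1 at t=0,
    decreasing monotonically to about 0 at t=T."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def corrupt_boxes(boxes, t, T, rng):
    """Add Gaussian noise to every box coordinate at time step t."""
    a = alpha_bar(t, T)
    signal, noise = math.sqrt(a), math.sqrt(1 - a)
    return [[signal * c + noise * rng.gauss(0, 1) for c in box] for box in boxes]

rng = random.Random(0)
gt = [[0.5, 0.5, 0.2, 0.2]]
slightly_noisy = corrupt_boxes(gt, t=10, T=1000, rng=rng)   # close to gt
almost_noise = corrupt_boxes(gt, t=990, T=1000, rng=rng)    # nearly pure noise
```

At small t the corrupted boxes stay close to the ground truth, while near t = T they are almost indistinguishable from Gaussian noise, which matches the random boxes used to start inference.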

Training loss. The detection decoder takes the N_train corrupted boxes as input and outputs N_train predictions of class labels and box coordinates. A set prediction loss is applied to this set of N_train predictions.
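A set prediction loss first has to decide which prediction is responsible for which ground truth box. The sketch below shows the general idea with a brute-force one-to-one matching over a toy cost (L1 box distance plus one minus the predicted probability of the true class). The cost terms and the exhaustive search are illustrative assumptions; the article does not describe the matching scheme the authors actually use.

```python
from itertools import permutations

def l1(a, b):
    """L1 distance between two boxes given as coordinate lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_predictions(pred_boxes, pred_scores, gt_boxes, gt_labels):
    """Return (assignment, total_cost), where assignment[i] is the index of
    the prediction matched to ground truth i. Exhaustive search, so only
    suitable for tiny examples; needs at least as many predictions as
    ground truth boxes."""
    def cost(p, g):
        return l1(pred_boxes[p], gt_boxes[g]) + (1.0 - pred_scores[p][gt_labels[g]])

    best = None
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(cost(p, g) for g, p in enumerate(perm))
        if best is None or total < best[1]:
            best = (perm, total)
    return best

pred_boxes = [[0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.2, 0.2], [0.9, 0.9, 0.1, 0.1]]
pred_scores = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]  # per-class probabilities
gt_boxes = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1]]
gt_labels = [0, 1]
assignment, total = match_predictions(pred_boxes, pred_scores, gt_boxes, gt_labels)
print(assignment)  # (1, 0)
```

Matched predictions would then be trained toward their ground truth box and class, and unmatched ones toward "no object".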

Inference

Inference in DiffusionDet is a denoising sampling process from noise to object boxes. Starting from boxes sampled from a Gaussian distribution, the model progressively refines its predictions, as shown in Algorithm 2 below.

[Algorithm 2: DiffusionDet inference]

Sampling steps. At each sampling step, the random boxes or the estimated boxes from the previous sampling step are sent to the detection decoder to predict class labels and box coordinates. After obtaining the boxes of the current step, DDIM is used to estimate the boxes for the next step.

Box renewal. To make inference more consistent with training, the researchers propose a box renewal strategy that revives undesired boxes by replacing them with random boxes. Specifically, they first filter out the undesired boxes whose scores are below a certain threshold, and then concatenate the remaining boxes with new random boxes sampled from a Gaussian distribution.

Once-for-all. Thanks to the random box design, DiffusionDet can be evaluated with an arbitrary number of random boxes and sampling steps. By comparison, previous methods rely on the same number of boxes during training and evaluation, and their detection decoder is used only once in the forward pass.
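The DDIM update mentioned above can be sketched for a single box as follows. This is the deterministic (η = 0) form of DDIM applied to a list of coordinates; `a_t` and `a_prev` denote the cumulative signal levels ᾱ at the current and next time step, and all names are hypothetical.

```python
import math

def ddim_step(x_t, x0_pred, a_t, a_prev):
    """One deterministic DDIM step (eta = 0).

    x_t     -- current noisy box coordinates
    x0_pred -- decoder's prediction of the clean box
    a_t     -- cumulative alpha at the current step (must be < 1)
    a_prev  -- cumulative alpha at the next, less noisy step
    """
    out = []
    for xt, x0 in zip(x_t, x0_pred):
        # Noise implied by the current sample and the predicted clean box.
        eps = (xt - math.sqrt(a_t) * x0) / math.sqrt(1 - a_t)
        # Re-noise the prediction to the lower noise level of the next step.
        out.append(math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * eps)
    return out

x_next = ddim_step([1.0, -2.0, 0.3, 0.4], [0.5, 0.25, 0.2, 0.2], a_t=0.5, a_prev=0.9)
```

When `a_prev` equals 1 the step returns the predicted clean box unchanged, so the final sampling step simply outputs the decoder's prediction.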
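The filter-and-refill step described above can be sketched like this. The function name, the threshold value and the choice to keep the total number of boxes constant are assumptions for illustration.

```python
import random

def renew_boxes(boxes, scores, threshold, rng):
    """Keep boxes whose score reaches the threshold and replace the rest
    with fresh random boxes drawn from a standard Gaussian, so that the
    total number of boxes stays the same (assumption)."""
    kept = [box for box, score in zip(boxes, scores) if score >= threshold]
    num_new = len(boxes) - len(kept)
    fresh = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(num_new)]
    return kept + fresh

boxes = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.9, 0.3, 0.1], [0.7, 0.2, 0.1, 0.4]]
scores = [0.9, 0.1, 0.6]
renewed = renew_boxes(boxes, scores, threshold=0.5, rng=random.Random(0))
print(renewed[:2])  # [[0.5, 0.5, 0.2, 0.2], [0.7, 0.2, 0.1, 0.4]]
```

The two confident boxes survive, and the low-scoring one is swapped for a new random box that the next sampling step can refine from scratch.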

Experimental results

In the experiments, the researchers first demonstrate the once-for-all property of DiffusionDet, and then compare DiffusionDet with previous well-established detectors on the MS-COCO and LVIS datasets.

The main property of DiffusionDet is that it is trained once for all inference settings. Once the model is trained, the number of boxes and the number of sampling steps can be changed at inference time, as shown in Figure 4 below. DiffusionDet achieves higher accuracy by using more boxes and/or more refinement steps, at the cost of higher latency. A single DiffusionDet can therefore be deployed in multiple scenarios and reach the desired speed-accuracy trade-off without retraining the network.

[Figure 4]

The researchers compared DiffusionDet with previous detectors on the MS-COCO and LVIS datasets. They first compared object detection performance on MS-COCO, as shown in Table 1 below. The results show that DiffusionDet without refinement steps achieves 45.5 AP with the ResNet-50 backbone, surpassing established methods such as Faster R-CNN, RetinaNet, DETR and Sparse R-CNN. DiffusionDet also shows stable improvement as the backbone is scaled up.

[Table 1: comparison with previous detectors]

Table 2 below shows the results on the more challenging LVIS dataset. DiffusionDet achieves significant gains by using more refinement steps.

[Table 2: results on LVIS]

For more experimental details, please refer to the original paper.
