The AIxiv column is where this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, and has effectively promoted academic exchange and dissemination. If you have excellent work you would like to share, please feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com.
With the development of artificial intelligence, language models and generative models have achieved great success, and model parameter counts keep growing; the same trend holds for fine-grained understanding tasks. However, existing datasets face a trade-off between scale and accuracy. For example, 99.1% of the masks in the SA-1B dataset are machine-generated and carry no semantic labels, while other public datasets suffer from accuracy problems and are generally small. Recently, ByteDance proposed a new-generation fine-grained understanding dataset to meet the needs of contemporary deep learning models: the team manually annotated panoptic segmentation for 383K images, yielding 5.18M masks, making it the largest human-labeled panoptic segmentation dataset to date. The dataset is named COCONut, and the work has been accepted to CVPR 2024.
- Paper link: https://arxiv.org/abs/2404.08639
- Code and dataset link: https://xdeng7.github.io/coconut.github.io/
The video shows the mask annotations of individual COCONut images. From the statistics on mask density and semantic categories, it can be seen that the dataset's semantics are rich and its mask segmentation is fine-grained. The dataset also supports a variety of understanding tasks, such as panoptic segmentation, instance segmentation, semantic segmentation, object detection, semantically controlled generation, and open-vocabulary segmentation. On multiple tasks, significant performance improvements are achieved simply by replacing the training dataset.
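As a concrete illustration of how such panoptic annotations are typically consumed, the sketch below reads a COCO-panoptic-style release (a JSON file plus per-image PNGs whose pixel colors encode segment ids). This is a minimal sketch assuming COCONut follows the standard COCO panoptic format; the file names are placeholders, not the official release paths.

```python
# Minimal sketch: reading COCO-panoptic-style annotations.
# Assumptions: "coconut_panoptic_train.json" and the PNG paths are placeholders;
# check the COCONut release for the actual file layout.
import json
import numpy as np
from PIL import Image

def rgb2id(color: np.ndarray) -> np.ndarray:
    """Decode a COCO-panoptic PNG pixel (R, G, B) into a segment id."""
    color = color.astype(np.uint32)
    return color[..., 0] + 256 * color[..., 1] + 256 * 256 * color[..., 2]

with open("coconut_panoptic_train.json") as f:
    panoptic = json.load(f)

ann = panoptic["annotations"][0]              # one image's annotation record
png = np.array(Image.open(ann["file_name"]))  # per-pixel segment ids stored as RGB
seg_ids = rgb2id(png)

for seg in ann["segments_info"]:
    mask = seg_ids == seg["id"]               # boolean mask for this segment
    print(seg["category_id"], seg["iscrowd"], int(mask.sum()))
```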
Purely manual annotation is usually very expensive, which is an important reason why most existing public datasets cannot grow in size. Some datasets instead use labels generated directly by a model, but such generated labels often bring little improvement to model training, which this paper also verifies. The paper therefore proposes a novel annotation scheme that combines manual annotation with semi-automatic label generation. It ensures annotation accuracy while saving manual labor and speeding up the annotation process.
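The paper describes its own pipeline in detail; the snippet below is only a schematic sketch of a generic human-in-the-loop labeling loop of the kind described above, where a model proposes masks and an annotator verifies or corrects them. The functions `propose_masks` and `human_review` are hypothetical placeholders, not ByteDance's actual tooling.

```python
# Schematic human-in-the-loop annotation loop.
# Assumption: `propose_masks` and `human_review` are hypothetical placeholders,
# not the paper's real annotation tools.
from typing import Callable, Dict, List

def annotate_semi_automatically(
    images: List[str],
    propose_masks: Callable[[str], List[Dict]],              # machine proposals
    human_review: Callable[[str, List[Dict]], List[Dict]],   # annotator verification
) -> Dict[str, List[Dict]]:
    """Run machine proposal followed by human verification for each image."""
    dataset = {}
    for image_path in images:
        proposals = propose_masks(image_path)           # cheap machine-generated masks
        verified = human_review(image_path, proposals)  # faster than drawing from scratch
        dataset[image_path] = verified
    return dataset
```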
Comparison of annotation accuracy

The researchers overlaid COCONut and COCO annotations on the same images for comparison. From the comparison in the figure below, it can be seen that the annotation method proposed in the paper achieves accuracy nearly identical to purely manual annotation with Photoshop, while increasing annotation speed by more than 10x.
Compared with the existing COCO dataset, the per-category distribution is similar, but the total number of masks per image exceeds COCO's, and in particular there are many single images with more than 100 masks. This shows that COCONut's annotation is more refined and its segmentation granularity is denser.
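A minimal sketch of how one could reproduce this per-image mask-count statistic from a COCO-panoptic-style JSON is shown below; the file name is again a placeholder rather than an official path.

```python
# Sketch: tally segments per image from a COCO-panoptic JSON.
# Assumption: "coconut_panoptic_train.json" is a placeholder path.
import json
from collections import Counter

with open("coconut_panoptic_train.json") as f:
    panoptic = json.load(f)

masks_per_image = [len(a["segments_info"]) for a in panoptic["annotations"]]

hist = Counter()
for n in masks_per_image:
    lo = (n // 20) * 20
    hist["100+" if n > 100 else f"{lo}-{lo + 19}"] += 1

print("images with >100 masks:", sum(n > 100 for n in masks_per_image))
print(sorted(hist.items()))
```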
Experimental verification

In addition to proposing a better training set, the researchers found that existing validation sets cannot adequately reflect model improvements, so the paper also proposes a more challenging test set that can, named COCONut-val. As the table below shows, simply switching to the higher-accuracy training set brings large gains to the model, for example more than 4 points of PQ in panoptic segmentation. However, as the training set grows, evaluation on the existing test set no longer reflects the model's improvement, whereas COCONut-val shows that the model still improves noticeably as the amount of training data increases.
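For reference, panoptic quality (PQ) is defined as the sum of IoU over matched segment pairs divided by (TP + FP/2 + FN/2). It is commonly computed with the public panopticapi reference implementation; the sketch below shows one hedged way to score predictions that way, assuming both ground truth and predictions follow the standard COCO-panoptic layout. The paths are placeholders, not COCONut's official file names.

```python
# Sketch: computing PQ with the public panopticapi package (pip install panopticapi).
# Assumption: all four paths below are placeholders; use the real COCONut-val files.
from panopticapi.evaluation import pq_compute

results = pq_compute(
    gt_json_file="coconut_val_panoptic.json",    # ground-truth annotations
    pred_json_file="predictions_panoptic.json",  # model predictions, same format
    gt_folder="coconut_val_panoptic_pngs",       # ground-truth segment-id PNGs
    pred_folder="predictions_panoptic_pngs",     # predicted segment-id PNGs
)
print("PQ (All):", results["All"]["pq"])
```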
The figure below compares the semantic categories and mask density of the validation sets. It can be seen that the newly proposed validation set is more challenging and better reflects model improvements.
For more experimental results, please refer to the original paper. The team will make the dataset and corresponding models publicly available for download on the GitHub homepage.

ByteDance Intelligent Creation Team

The Intelligent Creation team is ByteDance's AI & multimedia technology team, covering computer vision, audio and video editing, special effects processing, and other technical fields. Leveraging the company's rich business scenarios, infrastructure resources, and collaborative technical culture, it has built a full-link closed loop from cutting-edge algorithms to engineering systems to products, aiming to provide the company's internal businesses with cutting-edge capabilities in content understanding, content creation, interactive experience, and consumption, as well as industry solutions in various forms.
Currently, the Intelligent Creation team has opened its technical capabilities and services to enterprises through Volcano Engine, ByteDance's cloud service platform. More positions related to large-model algorithms are now open.