Current multimodal, multitask base models, such as **4M** or **Unified-IO**, show promising results. However, their out-of-the-box ability to accept different inputs and perform different tasks is limited by the (usually small) number of modalities and tasks they are trained on.
To address this, researchers from the École Polytechnique Fédérale de Lausanne (EPFL) and Apple jointly developed an **advanced** any-to-any single model that is trained on dozens of **highly diverse** modalities, co-trained on large-scale multimodal datasets and text corpora.
A key step in training is performing discrete **tokenization** on the various modalities, whether they are structured data such as image-like neural network **feature maps**, vectors, instance segmentations, or human poses, or data that can be represented as text.
Paper address: https://arxiv.org/pdf/2406.09406
Project page: https://4m.epfl.ch/
Paper title: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
This study shows that a single model can be trained to complete at least **three times** as many tasks/**modalities** as existing models, without any loss of performance. In addition, the work achieves finer-grained and more controllable multimodal data generation.
The research builds on the multimodal masked pre-training scheme and improves model capabilities by training on dozens of highly diverse modalities. By encoding each modality with a modality-specific discrete tokenizer, the study is able to train a single unified model across all of them.
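To make the masked pre-training idea concrete, here is a minimal, illustrative sketch of its sampling step: every modality is already a sequence of discrete token IDs, and each training step picks a random subset of tokens as the model input and another subset as the prediction target. The function name and the flat token-pool representation are assumptions for illustration, not the paper's implementation.

```python
import random

def sample_mask_pretraining_batch(tokenized, n_input, n_target, seed=None):
    """Sample disjoint input/target token subsets across modalities.

    tokenized: dict mapping modality name -> list of discrete token IDs.
    Returns two lists of (modality, position, token_id) triples.
    """
    rng = random.Random(seed)
    # Flatten all modalities into one pool of (modality, position, token) triples.
    pool = [(m, i, t) for m, toks in tokenized.items() for i, t in enumerate(toks)]
    rng.shuffle(pool)
    # Input tokens are visible to the model; target tokens must be predicted.
    return pool[:n_input], pool[n_input:n_input + n_target]

tokens = {"rgb": [3, 14, 15], "depth": [9, 26], "caption": [5, 35]}
inp, tgt = sample_mask_pretraining_batch(tokens, n_input=4, n_target=2, seed=0)
```

Because inputs and targets are drawn from the same pool, any modality can appear on either side, which is what lets one objective cover all modality pairs.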
Simply put, the research extends the capabilities of existing models along several key dimensions:
Modalities: from the 7 modalities of the best existing any-to-any model to 21 different modalities, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance. This is the first time a single vision model can solve dozens of different tasks in an any-to-any manner without compromising performance and without any traditional multi-task learning.
Diversity: Add support for more structured data, such as human poses, SAM instances, metadata, and more.
Tokenization: study discrete tokenization of different modalities using modality-specific methods, covering, e.g., global image embeddings, human poses, and semantic instances.
Scaling: expand the model size to 3B parameters and the dataset to 0.5B samples.
Co-training: joint training on vision and language at the same time.
Method Introduction
The study uses the 4M pre-training scheme (also developed by EPFL and Apple, released last year), which has proven to be a general method that scales effectively to multiple modalities.
Specifically, the architecture and the multimodal masked training objective are kept unchanged; the performance and adaptability of the model are improved by scaling up the model and dataset sizes, increasing the type and number of modalities involved in training, and training jointly on multiple datasets.
The modalities fall into the following categories: RGB, geometry, semantics, edges, feature maps, metadata, and text, as shown in the figure below.
Tokenization
Tokenization converts the different modalities and tasks into sequences of discrete tokens, thereby unifying their representation spaces. The researchers use different tokenization methods to discretize modalities with different characteristics, as shown in Figure 3. In summary, three kinds of tokenizers are used: a ViT tokenizer, an MLP tokenizer, and a text tokenizer.
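The three-way split above can be sketched as a simple lookup that routes each modality to a tokenizer family: image-like data to the ViT tokenizer, vector-valued data such as poses or global embeddings to the MLP tokenizer, and text-representable data to the text tokenizer. The specific modality names and the mapping below are illustrative assumptions, not the paper's exact assignment.

```python
# Hypothetical routing of modalities to the three tokenizer families.
TOKENIZER_FAMILY = {
    # image-like / dense 2D data -> ViT tokenizer
    "rgb": "vit", "depth": "vit", "normals": "vit", "feature_map": "vit",
    # vector-valued data -> MLP tokenizer
    "human_pose": "mlp", "global_embedding": "mlp",
    # text-representable data -> text tokenizer
    "caption": "text", "metadata": "text", "bounding_boxes": "text",
}

def tokenizer_for(modality):
    """Return the tokenizer family responsible for a given modality."""
    try:
        return TOKENIZER_FAMILY[modality]
    except KeyError:
        raise ValueError(f"unknown modality: {modality}")
```

Whatever the exact assignment, the point is that after tokenization every modality becomes a discrete token sequence, so the Transformer sees a uniform interface.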
For the architecture, the work adopts the Transformer-based 4M encoder-decoder architecture and adds extra modality embeddings to accommodate the new modalities.
Experimental results
Next, the paper demonstrates the multimodal capabilities of 4M-21.
Multi-modal generation
By iteratively decoding tokens, 4M-21 can be used to predict any training modality. As shown in Figure 2, the model can generate all modalities in a consistent manner from any given input modality.
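A minimal sketch of iterative decoding, under the assumption that tokens for the target modality are predicted a few at a time, with each round conditioning on everything decoded so far. `predict_tokens` is a dummy stand-in for the real model, and all names here are hypothetical.

```python
import random

def predict_tokens(context, positions, rng):
    """Dummy model: returns a random token ID (0..255) per requested position."""
    return {p: rng.randrange(256) for p in positions}

def iterative_decode(context, target_len, per_step=4, seed=0):
    """Decode a target modality of `target_len` tokens, `per_step` at a time."""
    rng = random.Random(seed)
    decoded = {}
    while len(decoded) < target_len:
        remaining = [p for p in range(target_len) if p not in decoded]
        step = remaining[:per_step]
        # Each round conditions on the input context plus all decoded tokens.
        decoded.update(predict_tokens({**context, **decoded}, step, rng))
    return [decoded[p] for p in range(target_len)]

out_tokens = iterative_decode({"rgb": [1, 2, 3]}, target_len=10)
```

The decoded token sequence would then be passed through the corresponding tokenizer's decoder to reconstruct the modality (an image, a pose, a caption, etc.).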
Furthermore, since the model can conditionally and unconditionally generate any training modality from any subset of the other modalities, it supports several forms of fine-grained, multimodal generation, such as multimodal editing, as shown in Figure 4. 4M-21 also demonstrates improved text understanding, both on T5-XXL embeddings and regular captions, enabling geometrically and semantically sound generation (Figure 4, top right).
Multi-modal retrieval
As shown in Figure 5, 4M-21 unlocks retrieval capabilities that are not possible with the original DINOv2 and ImageBind models, such as retrieving RGB images or other modalities by using any other modality as the query. In addition, 4M-21 can combine multiple modalities when predicting global embeddings for better control over retrieval (Figure 5, right).
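A toy sketch of embedding-based retrieval: query and database items are represented by global embeddings (DINOv2/ImageBind-style), and retrieval ranks database items by cosine similarity. Combining multiple query modalities is shown here as simple embedding averaging, which is an illustrative assumption rather than the paper's fusion method.

```python
import numpy as np

def cosine_sim(query, database):
    """Cosine similarity between one query vector and each database row."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

def retrieve(query_embs, database, top_k=1):
    """Fuse query embeddings from multiple modalities, return top-k row indices."""
    fused = np.mean(query_embs, axis=0)
    scores = cosine_sim(fused, database)
    return np.argsort(scores)[::-1][:top_k].tolist()

db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx = retrieve([np.array([1.0, 0.1]), np.array([0.9, 0.0])], db, top_k=1)
```

Because any modality can be mapped into the same global embedding space, the same `retrieve` call works whether the query is an image, a caption, a depth map, or a combination.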
Out of the box
4M-21 can perform a range of common vision tasks out of the box, as shown in Figure 6.
Table 1 reports out-of-the-box evaluations on DIODE surface normal and depth estimation, COCO semantic and instance segmentation, 3DPW 3D human pose estimation, and more.
Transfer experiments
In addition, models of three different sizes were trained: B, L, and XL. Their encoders are then transferred to downstream tasks and evaluated in single-modal (RGB) and multimodal (RGB + depth) settings. All transfer experiments discard the decoder and instead train a task-specific head. The results are shown in Table 2:
Finally, the paper performs multimodal transfer on NYUv2 and Hypersim semantic segmentation, and on 3D object detection on ARKitScenes. As shown in Table 3, 4M-21 takes full advantage of the optional depth input and significantly improves over the baselines.