Table of Contents
Lessons from SOTA model development practice
The language model has a greater impact on overall performance
Choose the architecture type according to your needs
Lessons from the training phase
Data diversity and processing strategies

HuggingFace teaches you how to make a SOTA visual model

Jun 05, 2024, 09:39 PM

First there was OpenAI's GPT-4o, then Google's string of flagship releases: advanced multimodal large models have been exploding onto the scene one after another.

Other practitioners were stunned and began to think about how to catch up with these super models.

In a paper from HuggingFace and Sorbonne University in France, the authors summarize the key lessons of building large vision models and point out a way forward for developers.


These lessons cover model architecture selection, training methods, training data, and more. The authors distilled them from extensive comparisons; the core points include:

  • For large vision models, the choice of architecture matters a great deal.
  • The language model has a greater impact on overall performance than the vision module.
  • A staged pre-training strategy is more conducive to building up the model's capabilities.
  • Training data should span multiple types, with attention to the balance between them.

Relying on these lessons, HF built Idefics2, a vision model that is SOTA at its scale.

Idefics2 is based on Mistral-7B, has 8B parameters in total, and can accurately recognize handwritten text.


Professionals reviewed it favorably, calling it a good survey that is genuinely useful for vision model developers, while also reminding readers not to treat it as a cure-all.


Of course, some people joke that architecture and data are just passing clouds, and that having GPUs is what really matters.


There is some truth to that, but joking aside, let's look at what lessons HuggingFace has for us.

Lessons from SOTA model development practice

The lessons in the HuggingFace paper come from the development of the vision model Idefics2.

Compared with its predecessor Idefics1 and with Flamingo, the former SOTA at the same scale, Idefics2 performs well on multiple datasets, even surpassing larger 13B models.

Meanwhile, compared with MM1, which is slightly better than Idefics2 on the COCO dataset, Idefics2 consumes significantly fewer tokens per image.


From the actual development of Idefics2, HuggingFace's lessons cover at least the following aspects:

  • Backbone and architecture selection
  • Training methods and strategies
  • Data diversity and processing strategies

The language model has a greater impact on overall performance

Current large vision models are mainly built as a language model plus a vision encoder. The authors evaluated the impact of each component on overall performance separately.

The results show that the quality of the language model matters more than that of the vision module.

At the same parameter count, using a better language model (for example, replacing Llama-7B with Mistral-7B) significantly improves the performance of a large vision model on downstream tasks.

The improvement from upgrading the vision encoder is comparatively limited, so when a trade-off must be made, prioritize a stronger language model.
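
To make the swap concrete, here is a minimal sketch of exchanging the language backbone while holding the rest of the recipe fixed. It uses the Hugging Face transformers library; the checkpoint names are public releases chosen for illustration, and the wiring into a full vision-language model is omitted.

```python
# Minimal sketch: swap the language backbone, keep everything else fixed.
# Checkpoint names are illustrative public releases, not the paper's exact ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_language_backbone(name: str):
    """Load a causal LM and its tokenizer to serve as the VLM's language side."""
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model

# Same vision encoder and training recipe, stronger language backbone:
# tokenizer, lm = load_language_backbone("huggyllama/llama-7b")
tokenizer, lm = load_language_backbone("mistralai/Mistral-7B-v0.1")
```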


Of course, this does not mean that upgrading the vision encoder has no effect; if resources permit, choosing a better vision encoder also brings some performance improvement.

In addition, pay attention to matching the choice to the downstream task. On text recognition tasks, for example, use a vision encoder that supports variable resolution; if the task demands high inference speed, choose a lighter-weight model.

In practical applications, inference speed and memory usage also need to be weighed. The SigLIP-SO400M encoder chosen for Idefics2 strikes a good balance between performance and efficiency.
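
As a rough illustration of plugging in such an encoder (assuming a recent transformers release that ships SigLIP; the checkpoint name is the public SigLIP-SO400M release, not necessarily Idefics2's exact configuration):

```python
# Minimal sketch: use a SigLIP vision tower as the image encoder.
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipVisionModel

name = "google/siglip-so400m-patch14-384"
processor = AutoImageProcessor.from_pretrained(name)
encoder = SiglipVisionModel.from_pretrained(name)

# Encode one RGB image into a sequence of patch embeddings for the LM side.
image = Image.new("RGB", (640, 480))  # stand-in for a real photo
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    patch_embeddings = encoder(**inputs).last_hidden_state
print(patch_embeddings.shape)  # (1, num_patches, hidden_size)
```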

Choose the architecture type according to your needs

On the choice of architecture, the paper discusses the two common options: fully autoregressive and cross-attention.

The fully autoregressive architecture generates each output autoregressively, taking the dependencies of the entire sequence into account;

The cross-attention architecture lets the model dynamically attend to different parts of one modality while processing the other, enabling more flexible interaction between the modalities.

In practice, the authors found that which architecture performs better depends on whether the pre-trained backbones are frozen.

(Simply put, a pre-trained backbone is unfrozen if its weights are updated during the main training run, and frozen if they are not.)

If the backbones are unfrozen, the fully autoregressive architecture performs better; if they are frozen, the cross-attention architecture wins.
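
To make the distinction concrete, here is a rough PyTorch sketch of the two wiring patterns. The module shapes and hidden size are assumptions for illustration; neither class is Idefics2's actual implementation.

```python
import torch
import torch.nn as nn

DIM = 1024  # shared hidden size, chosen for illustration

class FullyAutoregressiveVLM(nn.Module):
    """Fully autoregressive wiring: project visual features into the LM's
    embedding space and prepend them to the text, so a single decoder
    models the whole mixed sequence autoregressively."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        self.connector = nn.Linear(DIM, DIM)  # modality projection

    def forward(self, pixel_values, text_embeddings):
        visual_tokens = self.connector(self.vision_encoder(pixel_values))
        mixed = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=mixed)

class CrossAttentionBlock(nn.Module):
    """Cross-attention wiring: an inserted layer in which text hidden states
    attend to image features, Flamingo-style, while the LM's own
    self-attention stack stays untouched."""

    def __init__(self, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(DIM, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(DIM)

    def forward(self, text_hidden, visual_features):
        attended, _ = self.cross_attn(text_hidden, visual_features, visual_features)
        return self.norm(text_hidden + attended)
```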


Whether the backbones should be frozen depends on what the developer needs most.

With limited resources, if you need strong performance and are highly sensitive to latency, freezing is more appropriate;

If you want the model to be more flexible and adaptable, choose unfrozen training.

For Idefics2 specifically, the authors chose not to freeze the backbones, and accordingly adopted the fully autoregressive architecture.


Lessons from the training phase

Choosing the right architecture is important, but the training process is just as essential. From the training of Idefics2, the authors summarized these lessons for reference:

First, adopt a staged pre-training strategy overall: use lower-resolution images in the initial stage, then introduce higher-resolution PDF documents. This approach builds up the model's capabilities step by step.
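
A minimal sketch of what such a staged schedule could look like; the stage names, resolution caps, dataset labels, and the load_dataloader and train_one_stage helpers are all illustrative assumptions, not values from the paper.

```python
# Illustrative two-stage pre-training: low-resolution natural images first,
# then add high-resolution OCR/PDF document data.
PRETRAINING_STAGES = [
    {"name": "stage1_low_res", "max_image_side": 384,
     "datasets": ["image_text_pairs", "interleaved_web_docs"]},
    {"name": "stage2_high_res", "max_image_side": 980,
     "datasets": ["image_text_pairs", "interleaved_web_docs", "ocr_pdf_docs"]},
]

def run_pretraining(model, load_dataloader, train_one_stage):
    """Run each stage in order with its own resolution cap and data mix."""
    for stage in PRETRAINING_STAGES:
        loader = load_dataloader(stage["datasets"], max_side=stage["max_image_side"])
        train_one_stage(model, loader, stage_name=stage["name"])
```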

Second, use learned pooling instead of feeding image features directly into the language model. This sharply cuts the number of image tokens, significantly improves training and inference efficiency, and also brings a performance gain.
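
A minimal sketch of learned pooling via cross-attention, in the spirit of a perceiver resampler; the dimensions and query count below are illustrative, not Idefics2's exact module.

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Compress a variable number of patch embeddings into a small, fixed
    set of learned query tokens via cross-attention."""

    def __init__(self, dim: int = 1152, num_queries: int = 64, num_heads: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        batch = patch_embeddings.shape[0]
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(queries, patch_embeddings, patch_embeddings)
        return self.norm(pooled)

pool = LearnedPooling()
patches = torch.randn(2, 729, 1152)  # e.g. a 27x27 grid of patch embeddings
print(pool(patches).shape)           # torch.Size([2, 64, 1152]): 64 image tokens
```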

Third, data augmentation. One approach is to split an image into multiple sub-images and feed them to the model during training. This trades compute for stronger performance at inference, and it works especially well on tasks such as text recognition; not every image needs this treatment, though.
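
A minimal sketch of this kind of splitting, using a simple 2x2 grid plus the full image; the exact cropping scheme is an assumption for illustration.

```python
from PIL import Image

def split_into_subimages(image: Image.Image, grid: int = 2) -> list[Image.Image]:
    """Split an image into a grid of crops and append the original, so the
    model sees both magnified detail and the global view."""
    width, height = image.size
    tile_w, tile_h = width // grid, height // grid
    crops = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(grid)
        for c in range(grid)
    ]
    return crops + [image]  # 4 sub-images + the full view for a 2x2 grid
```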

Fourth, using more diverse data and tasks in the instruction fine-tuning phase improves the model's generalization and robustness.

In addition, to stabilize training when the pre-trained unimodal backbones participate in training (i.e., are not frozen), the authors also use LoRA to adapt the pre-trained parameters.
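
The paper confirms the use of LoRA itself; the concrete setup below is a minimal sketch with the peft library, where the target module names and hyperparameters are common choices assumed for illustration rather than taken from the paper.

```python
# Illustrative LoRA setup: adapt the attention projections of a pre-trained
# backbone with small low-rank matrices instead of updating all weights.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

backbone = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update
    lora_alpha=32,        # scaling factor applied to the update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    lora_dropout=0.05,
)
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```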

Data diversity and processing strategies

Beyond the training process itself, the data you select also has a significant impact on model performance.

Starting from the collection stage, take care to select multiple types of data. The data used for Idefics2, for example, covers three categories: documents with aligned images and text (such as web pages), image-text pairs (such as image captions), and PDF documents with OCR annotations.

The proportions of the different data types should also be balanced appropriately according to actual needs, rather than simply split into equal parts.

As for data scale, the more the better if conditions permit, though attention should be paid to filtering out low-quality data.
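
A minimal sketch of mixing data types by weight rather than in equal parts; the category names and weights are illustrative assumptions, not the paper's actual proportions.

```python
import random

# Hypothetical mixture weights for the three data categories.
MIXTURE = {
    "interleaved_web_docs": 0.5,
    "image_text_pairs": 0.3,
    "ocr_pdf_docs": 0.2,
}

def sample_batch_sources(batch_size: int) -> list[str]:
    """Draw a data source for each example in a batch according to the weights."""
    names, weights = zip(*MIXTURE.items())
    return random.choices(names, weights=weights, k=batch_size)

print(sample_batch_sources(8))
```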

Of course, collection only gets you the raw training data; to train the model well, a certain amount of processing is also required.

Use different preprocessing and augmentation strategies for different types of data. OCR data, for example, requires higher-resolution images, while other data can use lower resolutions.

Note that images should keep their original aspect ratio and resolution during processing. This greatly reduces the computational overhead of training and inference while improving the model's adaptability.
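
A minimal sketch of aspect-ratio-preserving resizing with a per-type resolution cap; the cap values are illustrative assumptions.

```python
from PIL import Image

# Hypothetical per-type caps: document data gets more pixels than photos.
MAX_SIDE = {"ocr_pdf_docs": 980, "image_text_pairs": 384}

def resize_keep_aspect(image: Image.Image, data_type: str) -> Image.Image:
    """Downscale so the longest side fits the cap, preserving aspect ratio;
    images already within the cap pass through unchanged."""
    cap = MAX_SIDE[data_type]
    longest = max(image.size)
    if longest <= cap:
        return image
    scale = cap / longest
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.Resampling.LANCZOS)
```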

If these lessons have inspired you, read the original paper for more details, and feel free to share your own development experience in the comments.

Paper address: https://www.php.cn/link/52c8b8d56837155b4870fc2658b676f0
