
In the post-Sora era, how do CV practitioners choose models? Convolution or ViT, supervised learning or CLIP paradigm

Feb 19, 2024, 09:57 AM

ImageNet accuracy was once the main indicator for evaluating model performance, but in today's computer vision field, this metric on its own is increasingly insufficient.

As computer vision models have become more complex, the variety of available models has increased significantly, from ConvNets to Vision Transformers. Training methods have also moved beyond supervised training on ImageNet to include self-supervised learning and image-text pair training such as CLIP.

Although ImageNet accuracy is an important indicator, it is not sufficient to fully evaluate a model's performance. Different architectures, training methods, and datasets can cause models to perform differently across tasks, so relying solely on ImageNet to judge models has limitations. When a model overfits the ImageNet dataset and its accuracy saturates, its generalization ability on other tasks may be overlooked. Multiple factors therefore need to be considered to evaluate a model's performance and applicability.

Although CLIP's ImageNet accuracy is similar to that of ResNet, its visual encoder is more robust and transferable. This prompted researchers to explore unique advantages of CLIP that were not apparent when considering only ImageNet metrics, and it highlights the importance of analyzing other properties to help discover useful models.

Beyond this, traditional benchmarks cannot fully assess a model's ability to handle real-world visual challenges, such as varied camera angles, lighting conditions, or occlusions. Models trained on datasets such as ImageNet often struggle to match that performance in practical applications, because real-world conditions and scenarios are far more diverse.

These issues raise new questions for practitioners in the field: how should a vision model be measured, and how should one choose a vision model that suits a given need?

In a recent paper, researchers from MBZUAI and Meta conducted an in-depth discussion on this issue.



  • Paper title: ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
  • Paper link: https://arxiv.org/pdf/2311.09215.pdf

The research focuses on model behavior beyond ImageNet accuracy, analyzing the performance of major computer vision models, ConvNeXt and Vision Transformer (ViT), under both the supervised and CLIP training paradigms.

The selected models have a similar number of parameters and almost the same ImageNet-1K accuracy under each training paradigm, ensuring a fair comparison. The researchers explored a range of model characteristics in depth, such as prediction error types, generalization ability, invariance of the learned representations, and calibration, focusing on properties the models exhibit without additional training or fine-tuning, with the goal of providing a direct reference for practitioners who use pretrained models.


In their analysis, the researchers found significant differences in model behavior across architectures and training paradigms. For example, models trained under the CLIP paradigm produced fewer classification errors relative to their ImageNet accuracy than models trained on ImageNet. However, supervised models are better calibrated and generally perform better on ImageNet robustness benchmarks. ConvNeXt has an advantage on synthetic data but is more texture-biased than ViT. Meanwhile, supervised ConvNeXt performs well on many benchmarks, with transferability comparable to the CLIP models.

Each model thus demonstrates its own advantages in ways that cannot be captured by a single metric. The researchers emphasize that more detailed evaluation metrics are needed to accurately select models in specific contexts, and that new, ImageNet-agnostic benchmarks should be created.

Based on these observations, Meta AI chief scientist Yann LeCun reposted and endorsed the study:

[Screenshot: Yann LeCun's repost of the study]

Model selection

For the supervised models, the researchers used DeiT3-Base/16, which shares the architecture of ViT-Base/16 but uses an improved training recipe, along with ConvNeXt-Base. For the CLIP models, they used the visual encoders of ViT-Base/16 and ConvNeXt-Base from OpenCLIP.

Note that the performance of these models differs slightly from the original OpenAI models. All model checkpoints can be found on the project's GitHub homepage. See Table 1 for a detailed model comparison:

[Table 1: Detailed comparison of the selected models]
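
For readers who want to reproduce a comparable setup, the sketch below shows one way to load supervised and CLIP backbones of similar size using the timm and open_clip libraries. The specific model names and pretrained tags are assumptions for illustration and may not match the authors' exact checkpoints.

```python
# Minimal sketch: loading supervised and CLIP backbones of similar size.
# The model names / pretrained tags below are assumptions, not necessarily
# the authors' official checkpoints.
import torch
import timm
import open_clip

# Supervised ImageNet-1K classifiers.
deit3 = timm.create_model("deit3_base_patch16_224", pretrained=True).eval()
convnext_sup = timm.create_model("convnext_base", pretrained=True).eval()

# CLIP visual encoders from OpenCLIP (the text towers are unused here).
clip_vit, _, preprocess_vit = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k")
clip_convnext, _, preprocess_cnx = open_clip.create_model_and_transforms(
    "convnext_base", pretrained="laion400m_s13b_b51k")

with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)
    print(deit3(dummy).shape)                  # logits over 1000 ImageNet classes
    print(clip_vit.encode_image(dummy).shape)  # CLIP image embedding
```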

The researchers explained the model selection process in detail:

1. Since researchers use pre-trained models, they cannot control the quantity and quality of data samples seen during training.

2. To analyze ConvNets and Transformers, many previous studies have compared ResNet and ViT. This comparison is generally unfavorable to ConvNets, since ViT is typically trained with more advanced recipes and achieves higher ImageNet accuracy. ViT also has architectural design elements, such as LayerNorm, that were not part of ResNet when it was introduced years ago. Therefore, for a more balanced evaluation, the researchers compared ViT with ConvNeXt, a modern representative of ConvNets that performs on par with Transformers and shares many of their design choices.

3. In terms of training paradigm, the researchers compared the supervised paradigm and the CLIP paradigm. Supervised models continue to hold state-of-the-art performance in computer vision. CLIP models, on the other hand, generalize and transfer well and provide a way to connect visual and linguistic representations.

4. Since self-supervised models showed behavior similar to supervised models in preliminary testing, they were not included in the results. This may be because they were ultimately fine-tuned with supervision on ImageNet-1K, which affects the study of many properties.

Next, let’s take a look at how researchers analyzed different attributes.

Analysis

Model error

ImageNet-X is a dataset extending ImageNet-1K with detailed human annotation of 16 factors of variation, enabling in-depth analysis of model errors in image classification. It uses an error ratio metric (lower is better) to quantify a model's performance on specific factors relative to overall accuracy, allowing for a nuanced analysis of model errors. Results on ImageNet-X show:

1. CLIP models make fewer errors relative to their ImageNet accuracy than supervised models.

2. All models are mainly affected by complex factors such as occlusion.

3. Texture is the most challenging factor for all models.

[Figure: ImageNet-X error ratios for the compared models]
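
As a rough illustration of the error-ratio metric described above, the sketch below computes, for each annotated factor, the model's error rate on images tagged with that factor divided by its overall error rate (lower is better). The data structures here are hypothetical; the real ImageNet-X release ships its own annotation files and tooling.

```python
# Minimal sketch of an ImageNet-X-style error ratio (lower is better).
# `predictions`, `labels`, and `factor_masks` are hypothetical inputs.
import numpy as np

def error_ratios(predictions: np.ndarray,
                 labels: np.ndarray,
                 factor_masks: dict) -> dict:
    """For each factor: error rate on that factor's subset / overall error rate."""
    errors = predictions != labels
    overall_error = errors.mean()
    return {
        factor: errors[mask].mean() / overall_error
        for factor, mask in factor_masks.items()
        if mask.any()
    }

# Toy usage with random data standing in for real model outputs.
rng = np.random.default_rng(0)
preds = rng.integers(0, 10, size=1000)
labels = rng.integers(0, 10, size=1000)
masks = {"occlusion": rng.random(1000) < 0.1, "texture": rng.random(1000) < 0.1}
print(error_ratios(preds, labels, masks))
```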

Shape/Texture Bias

Shape-texture bias detects whether a model relies on brittle texture shortcuts rather than high-level shape cues. This bias can be studied with cue-conflict images whose shape and texture belong to different categories, which helps reveal to what extent a model's decisions are based on shape versus texture. The researchers evaluated shape-texture bias on the cue-conflict dataset and found that the texture bias of CLIP models is smaller than that of supervised models, while the shape bias of ViT is higher than that of ConvNets.

[Figure: Shape-texture bias on the cue-conflict dataset]
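
A common way to quantify shape-texture bias on cue-conflict images is to count, among predictions that match either the shape label or the texture label, the fraction that match the shape label. The sketch below assumes hypothetical arrays of predicted classes and per-image shape/texture labels; the real cue-conflict benchmark maps ImageNet classes to a small set of super-classes first.

```python
# Minimal sketch of a shape-bias score on cue-conflict images.
# `preds`, `shape_labels`, and `texture_labels` are hypothetical arrays of
# class indices over the benchmark's super-classes.
import numpy as np

def shape_bias(preds: np.ndarray,
               shape_labels: np.ndarray,
               texture_labels: np.ndarray) -> float:
    """Fraction of shape decisions among predictions matching shape or texture."""
    shape_hits = preds == shape_labels
    texture_hits = preds == texture_labels
    decided = shape_hits | texture_hits   # ignore predictions matching neither cue
    if decided.sum() == 0:
        return float("nan")
    return float(shape_hits[decided].mean())

rng = np.random.default_rng(0)
preds = rng.integers(0, 16, 200)
shape_lbl = rng.integers(0, 16, 200)
texture_lbl = rng.integers(0, 16, 200)
print(f"shape bias: {shape_bias(preds, shape_lbl, texture_lbl):.2f}")
```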

Model Calibration

Calibration quantifies how consistent a model's prediction confidence is with its actual accuracy, and can be assessed through metrics such as expected calibration error (ECE) and visualization tools such as reliability diagrams and confidence histograms. Calibration was evaluated on ImageNet-1K and ImageNet-R, with predictions grouped into 15 bins. During the experiments, the researchers observed the following:

[Figure: Calibration results on ImageNet-1K and ImageNet-R]

1. CLIP models are overconfident, while supervised models are slightly underconfident.

2. Supervised ConvNeXt is better calibrated than supervised ViT.
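
For reference, expected calibration error bins predictions by confidence and averages the absolute gap between each bin's confidence and accuracy, weighted by bin size. The sketch below uses 15 equal-width bins to match the setup described above; the softmax outputs are hypothetical stand-ins for a real model's predictions.

```python
# Minimal sketch of expected calibration error (ECE) with 15 equal-width bins.
# `probs` are hypothetical softmax outputs of shape (N, num_classes).
import numpy as np

def expected_calibration_error(probs: np.ndarray,
                               labels: np.ndarray,
                               n_bins: int = 15) -> float:
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, 1000)
print(f"ECE: {expected_calibration_error(probs, labels):.3f}")
```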

Robustness and transferability

A model's robustness and transferability are crucial for adapting to changes in data distribution and to new tasks. The researchers evaluated robustness using various ImageNet variants and found that, although the average performance of ViT and ConvNeXt models was comparable, supervised models generally outperformed CLIP in robustness, except on ImageNet-R and ImageNet-Sketch. In terms of transferability, evaluated on the 19-dataset VTAB benchmark, supervised ConvNeXt outperforms ViT and is almost on par with the CLIP models.

[Figure: Robustness and transferability results]
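
A straightforward way to probe robustness in this style is to run the same pretrained classifier over several distribution-shifted ImageNet variants and compare top-1 accuracy. The loop below is a sketch, not the paper's evaluation code: the dataset paths and timm model name are placeholders, and variants with a reduced label space (e.g. ImageNet-R) would additionally need class remapping, which is omitted here.

```python
# Minimal sketch: top-1 accuracy of one pretrained classifier across
# distribution-shifted ImageNet variants. Paths and model name are placeholders.
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torchvision import datasets
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("convnext_base", pretrained=True).eval().to(device)
transform = create_transform(**resolve_data_config({}, model=model))

variant_dirs = {                       # hypothetical local dataset paths
    "ImageNet-V2": "/data/imagenetv2",
    "ImageNet-Sketch": "/data/imagenet-sketch",
}

@torch.no_grad()
def top1(loader):
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

for name, root in variant_dirs.items():
    loader = DataLoader(datasets.ImageFolder(root, transform), batch_size=64)
    print(f"{name}: top-1 = {top1(loader):.3f}")
```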

Synthetic data

Synthetic datasets such as PUG-ImageNet, which can precisely control factors like camera angle and texture, are a promising research direction, so the researchers analyzed model performance on synthetic data. PUG-ImageNet contains photorealistic ImageNet images with systematic variation of factors such as pose and lighting, and performance is measured as absolute top-1 accuracy. The researchers report results for the different factors in PUG-ImageNet and find that ConvNeXt outperforms ViT on almost all of them, showing that ConvNeXt beats ViT on synthetic data. The gap is smaller for the CLIP models, whose accuracy is lower than that of the supervised models, possibly related to their lower accuracy on the original ImageNet.

[Figure: Per-factor top-1 accuracy on PUG-ImageNet]
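
Since PUG-ImageNet varies factors systematically, a per-factor breakdown amounts to grouping top-1 accuracy by the factor annotation, as in the toy sketch below. The record format is hypothetical; the actual PUG-ImageNet release has its own metadata layout.

```python
# Minimal sketch: per-factor top-1 accuracy on a PUG-style synthetic set.
# `records` is a hypothetical list of (factor_name, is_correct) pairs.
from collections import defaultdict

def per_factor_accuracy(records):
    hits, counts = defaultdict(int), defaultdict(int)
    for factor, correct in records:
        hits[factor] += int(correct)
        counts[factor] += 1
    return {factor: hits[factor] / counts[factor] for factor in counts}

records = [("camera_pitch", True), ("camera_pitch", False),
           ("lighting", True), ("texture", False), ("texture", True)]
print(per_factor_accuracy(records))
```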

Transformation invariance

Transformation invariance refers to a model's ability to produce consistent representations under semantics-preserving input transformations such as scaling or shifting. This property enables a model to generalize well across different but semantically similar inputs. The methods used include resizing images for scale invariance, shifting crops for position invariance, and adjusting the resolution of ViT models via interpolated positional embeddings.

They evaluated scale, shift, and resolution invariance on ImageNet-1K by varying crop scale/position and image resolution. ConvNeXt outperforms ViT under supervised training. Overall, the models are more robust to scale/resolution transformations than to shifts. For applications that require high robustness to scaling, shifting, and resolution changes, the results suggest that supervised ConvNeXt may be the best choice.

[Figure: Transformation invariance results on ImageNet-1K]
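
One simple way to probe invariance in this spirit is to compare a backbone's features for an image and for a transformed copy of it, e.g. via cosine similarity averaged over a set of images. The sketch below does this for a scale change and a shifted crop; the timm model name and crop parameters are assumptions, not the paper's exact protocol.

```python
# Minimal sketch: feature consistency under scale / shift transforms,
# measured as cosine similarity between two views of the same image.
# Model name and crop parameters are assumptions, not the paper's protocol.
import torch
import torch.nn.functional as F
import timm
from torchvision import transforms

# num_classes=0 makes timm return pooled features instead of logits.
model = timm.create_model("convnext_base", pretrained=True, num_classes=0).eval()

base = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224),
                           transforms.ToTensor()])
scaled = transforms.Compose([transforms.Resize(320), transforms.CenterCrop(224),
                             transforms.ToTensor()])            # zoomed-in view
shifted = transforms.Compose([transforms.Resize(256),
                              transforms.CenterCrop(256), transforms.ToTensor(),
                              lambda t: t[:, :224, :224]])      # off-center crop

@torch.no_grad()
def invariance(images, transform_a, transform_b):
    """Mean cosine similarity between features of two views of the same images."""
    sims = []
    for img in images:                 # `images` is an iterable of PIL images
        fa = model(transform_a(img).unsqueeze(0))
        fb = model(transform_b(img).unsqueeze(0))
        sims.append(F.cosine_similarity(fa, fb).item())
    return sum(sims) / len(sims)

# Usage (assuming `pil_images` is a list of PIL images from ImageNet-1K val):
# print("scale invariance:", invariance(pil_images, base, scaled))
# print("shift invariance:", invariance(pil_images, base, shifted))
```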

Summary

Overall, each model has its own unique advantages. This suggests that model selection should depend on the target use case, as standard performance metrics may ignore critical nuances of a specific task. Furthermore, many existing benchmarks are derived from ImageNet, which also biases the evaluation. Developing new benchmarks with different data distributions is crucial to evaluate models in a more real-world representative environment.

The following is a summary of the conclusions of this article:

ConvNet vs. Transformer

1. Supervised ConvNeXt outperforms supervised ViT on many benchmarks: it is better calibrated, more invariant to data transformations, and exhibits better transferability and robustness.

2. ConvNeXt performs better than ViT on synthetic data.

3. ViT has a greater shape bias.

Supervised vs. CLIP

1. Although CLIP models are superior in terms of transferability, supervised ConvNeXt performs competitively on this task. This demonstrates the potential of supervised models.

2. Supervised models perform better on robustness benchmarks, probably because these benchmarks are all ImageNet variants.

3. CLIP models have a greater shape bias and make fewer classification errors relative to their ImageNet accuracy.
