


Counterintuitive! Google's latest research: do worse-performing models measure 'similarity' more accurately?
Computing the similarity between images is an open problem in computer vision.
Now that image generation is popular all over the world, how to define "similarity" is also a key issue in evaluating the realism of generated images.
Although there are relatively direct methods for computing image similarity, such as measuring pixel-wise differences (e.g., FSIM, SSIM), the similarity differences these methods report are far from the differences perceived by the human eye.
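The gap between pixel-wise and perceived difference can be sketched with a toy case. The snippet below uses plain mean squared error as a simple stand-in for a pixel-wise measure (it is not FSIM or SSIM), and the striped test image is a made-up example: a one-pixel shift leaves the texture arguably unchanged to a viewer, yet it produces a much larger pixel error than a visible loss of contrast.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

# Hypothetical 8x8 test image: one-pixel-wide vertical stripes (values 0 and 1).
image = np.tile(np.array([0.0, 1.0]), (8, 4))

# Circular shift by one pixel: the same striped texture, moved sideways.
shifted = np.roll(image, 1, axis=1)

# Halve the contrast around mid-gray: a visibly washed-out version.
faded = 0.5 + 0.5 * (image - 0.5)

mse_shifted = mse(image, shifted)  # every pixel flips between 0 and 1
mse_faded = mse(image, faded)      # every pixel moves by 0.25

print(f"shifted: {mse_shifted:.4f}, faded: {mse_faded:.4f}")
```

The pixel-wise measure ranks the shifted image as far more different than the faded one, which is the kind of mismatch with human judgment that motivates feature-based measures.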
After the rise of deep learning, some researchers found that the intermediate representations of neural network classifiers trained on ImageNet, such as AlexNet, VGG and SqueezeNet, can be used to compute perceptual similarity.
In other words, embeddings match people's perception of the similarity between images more closely than pixels do.
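A minimal sketch of what an embedding-based distance can look like, assuming feature maps of shape (height, width, channels). The random arrays below are placeholders for the activations of a trained classifier, and the function names are hypothetical; the exact layers and weighting used in published metrics are not reproduced here.

```python
import numpy as np

def normalize_channels(feats, eps=1e-10):
    """Scale the channel vector at every spatial location to unit length."""
    norm = np.sqrt(np.sum(feats ** 2, axis=-1, keepdims=True))
    return feats / (norm + eps)

def spatial_perceptual_distance(feats_a, feats_b):
    """Squared L2 distance between unit-normalized features,
    computed per spatial location and then averaged over locations."""
    diff = normalize_channels(feats_a) - normalize_channels(feats_b)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# Placeholder features: random arrays standing in for network activations.
rng = np.random.default_rng(0)
feats_ref = rng.normal(size=(4, 4, 16))
feats_near = feats_ref + 0.05 * rng.normal(size=(4, 4, 16))  # small perturbation
feats_far = rng.normal(size=(4, 4, 16))                      # unrelated features

d_near = spatial_perceptual_distance(feats_ref, feats_near)
d_far = spatial_perceptual_distance(feats_ref, feats_far)
print(f"near: {d_near:.4f}, far: {d_far:.4f}")
```

In practice the features would come from one or more intermediate layers of the classifier for each image, and the distance would be compared against human similarity judgments.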
Of course, this is just a hypothesis.
Recently, Google published a paper that specifically studies whether better ImageNet classifiers are also better at assessing perceptual similarity.
Paper link: https://openreview.net/pdf?id=qrGKGZZvH0
Earlier work on the BAPPS dataset, released in 2018, studied perceptual scores on first-generation ImageNet classifiers. To further evaluate the correlation between accuracy and perceptual score, as well as the impact of various hyperparameters, the paper adds results for the more recent ViT models.
The higher the accuracy, the worse the perceptual similarity?
It is well known that features learned by training on ImageNet transfer well to many downstream tasks and improve performance on them, which has made ImageNet pre-training standard practice.
Additionally, higher accuracy on ImageNet often means better performance on a diverse set of downstream tasks, such as robustness to corrupted images, generalization to out-of-distribution data and transfer learning to smaller classification datasets.
But in terms of perceptual similarity calculation, everything seems to be reversed.
Models that achieve high accuracy on ImageNet have worse perceptual scores, while those with "mid-range" accuracy perform best on the perceptual similarity task.
ImageNet 64 × 64 validation accuracy (x-axis) versus perceptual score on the 64 × 64 BAPPS dataset (y-axis). Each blue dot represents an ImageNet classifier.
Up to a point, better ImageNet classifiers achieve better perceptual scores, but beyond a certain threshold, increasing accuracy lowers the perceptual score. Classifiers with moderate accuracy (20.0-40.0) obtain the best perceptual scores.
The article also studies the impact of neural network hyperparameters on perceptual scores, such as width, depth, number of training steps, weight decay, label smoothing and dropout.
For each hyperparameter there is an optimal accuracy, up to which increasing accuracy improves the perceptual score, but this optimum is quite low and is reached very early in the hyperparameter sweep.
Beyond this point, improvements in classifier accuracy lead to worse perceptual scores.
As an example, the article shows how perceptual scores vary with two hyperparameters: training steps in ResNets and width in ViTs.
Early-stopped ResNets achieve the best perceptual scores at depths of 6, 50 and 200.
The perceptual scores of ResNet-50 and ResNet-200 peak within the first few epochs of training, but after the peak, the perceptual scores of better-performing classifiers drop more sharply.
ResNets are trained with a learning rate schedule that raises accuracy in steps as training proceeds. Likewise, after the peak, the perceptual score decreases in steps that mirror these stepwise gains in accuracy.
A ViT consists of a stack of Transformer blocks applied to the input image. The width of a ViT model is the number of output neurons of a single Transformer block. Increasing the width is an effective way to improve the model's accuracy.
The researchers took two models, B/8 (the Base ViT model with patch size 8) and L/4 (the Large ViT model with patch size 4), varied their widths, and evaluated both accuracy and perceptual score.
The results are again similar to those observed for early-stopped ResNets: narrower ViTs with lower accuracy perform better than those with the default width.
However, the optimal widths of ViT-B/8 and ViT-L/4 are only 6% and 12% of their default widths, respectively. The paper also provides a more detailed list of experiments on other hyperparameters, such as width, depth, number of training steps, weight decay, label smoothing and dropout, across ResNets and ViTs.
So if you want to improve perceptual similarity, the strategy is simple: reduce the accuracy appropriately.
Perceptual score improves when ImageNet models are scaled down. Each value in the table is the improvement obtained by scaling down a model along a given hyperparameter, relative to the model with default hyperparameters.
Based on the conclusion above, the paper proposes a simple strategy for improving an architecture's perceptual score: scale the model down to reduce accuracy until the optimal perceptual score is reached.
The experimental results also show the perceptual score improvement obtained by scaling down each model along each hyperparameter. Early stopping yields the largest improvement across all architectures except ViT-L/4, and it is also the most efficient strategy, since it requires no time-consuming grid search.
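The early-stopping strategy amounts to picking the checkpoint with the highest perceptual score rather than the highest accuracy. A minimal sketch follows; the `history` records and the `best_checkpoint` helper are hypothetical, and the numbers are made up for illustration only, not results from the paper.

```python
def best_checkpoint(history):
    """Return the checkpoint record with the highest perceptual score."""
    return max(history, key=lambda entry: entry["perceptual_score"])

# Made-up illustrative values (NOT from the paper): accuracy keeps rising
# with training, while the perceptual score peaks early and then declines.
history = [
    {"epoch": 1, "accuracy": 18.0, "perceptual_score": 63.0},
    {"epoch": 5, "accuracy": 35.0, "perceptual_score": 66.5},
    {"epoch": 20, "accuracy": 55.0, "perceptual_score": 65.0},
    {"epoch": 90, "accuracy": 68.0, "perceptual_score": 62.5},
]

chosen = best_checkpoint(history)
final = history[-1]
print(f"keep epoch {chosen['epoch']} "
      f"(accuracy {chosen['accuracy']}, score {chosen['perceptual_score']})")
```

Because the checkpoints come from a single training run, this selection needs no additional training, which is why it avoids a grid search over architecture hyperparameters.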
Global perceptual function
In previous work, the perceptual similarity function was computed using Euclidean distances across the spatial dimensions of the image.
This approach assumes a direct correspondence between pixels, but that correspondence may not hold for warped, translated or rotated images.
In this paper, the researchers instead adopt two perceptual functions that rely on global representations of the image: the style loss function from neural style transfer, which captures the style similarity between two images, and a normalized mean pooling distance function.
The style loss function compares the inter-channel cross-correlation matrices of two images, while the mean pooling function compares their spatially averaged global representations.
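The two global functions can be sketched as follows, assuming feature maps of shape (height, width, channels). The function names are hypothetical, the random array stands in for real network activations, and details such as the order of normalization and pooling are assumptions rather than the paper's exact formulation. The example also computes a point-by-point spatial distance to show why global functions tolerate translation.

```python
import numpy as np

def gram_matrix(feats):
    """Inter-channel correlation matrix of an (H, W, C) feature map."""
    h, w, c = feats.shape
    flat = feats.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def style_distance(feats_a, feats_b):
    """Squared difference between the two channel correlation matrices."""
    diff = gram_matrix(feats_a) - gram_matrix(feats_b)
    return float(np.sum(diff ** 2))

def mean_pool_distance(feats_a, feats_b, eps=1e-10):
    """Squared L2 distance between unit-normalized, spatially averaged features."""
    pooled_a = feats_a.mean(axis=(0, 1))
    pooled_b = feats_b.mean(axis=(0, 1))
    pooled_a = pooled_a / (np.linalg.norm(pooled_a) + eps)
    pooled_b = pooled_b / (np.linalg.norm(pooled_b) + eps)
    return float(np.sum((pooled_a - pooled_b) ** 2))

# Placeholder features and a circularly translated copy of them.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 4))
translated = np.roll(feats, shift=3, axis=1)

# Point-by-point spatial distance assumes location-to-location correspondence.
pointwise = float(np.mean(np.sum((feats - translated) ** 2, axis=-1)))
style = style_distance(feats, translated)
pooled = mean_pool_distance(feats, translated)
print(f"pointwise: {pointwise:.3f}, style: {style:.2e}, pooled: {pooled:.2e}")
```

A circular shift only reorders spatial locations, so both global distances stay at essentially zero while the point-by-point distance is large. Real translations are not circular, so the invariance would only be approximate in practice.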
Global perceptual functions consistently improve perceptual scores, both for networks trained with default hyperparameters and for ResNet-200 as a function of training epochs.
The paper also explores several hypotheses to explain the relationship between accuracy and perceptual score and derives some additional insights.
For example, the accuracy of models without the commonly used skip connections is also inversely correlated with perceptual score, with layers closer to the output having on average lower perceptual scores than layers closer to the input.
The paper further explores distortion sensitivity, ImageNet category granularity and spatial frequency sensitivity.
In short, this paper explores whether improving classification accuracy yields better perceptual metrics. It studies the relationship between accuracy and perceptual score on ResNets and ViTs under different hyperparameters and finds an inverted U-shaped relationship: the perceptual score rises with accuracy up to a point and then declines.
Finally, the article discusses the relationship between accuracy and perceptual score in detail, covering skip connections, global similarity functions, distortion sensitivity, layer-wise perceptual scores, spatial frequency sensitivity and ImageNet category granularity.
While the exact explanation for the trade-off between ImageNet accuracy and perceptual similarity remains a mystery, this paper is a first step forward.




