Calculating the similarity between images is an open problem in computer vision.
Today, with image generation popular all over the world, how to define "similarity" is also a key issue in evaluating the realism of generated images.
Some relatively direct methods exist for calculating image similarity, such as pixel-level measures (for example FSIM and SSIM), but the similarity differences they report are far from the differences perceived by the human eye.
After the rise of deep learning, some researchers found that the intermediate representations of neural network classifiers such as AlexNet, VGG and SqueezeNet, trained on ImageNet, can be used to compute perceptual similarity.
In other words, embeddings are closer than pixels to how people perceive the similarity of images.
Of course, this is just a hypothesis.
Recently, Google published a paper studying whether better ImageNet classifiers are also better at evaluating perceptual similarity.
Paper link: https://openreview.net/pdf?id=qrGKGZZvH0
Earlier work on the BAPPS dataset, released in 2018, studied perceptual scores on first-generation ImageNet classifiers. To further evaluate the correlation between accuracy and perceptual score, as well as the impact of various hyperparameters, the paper adds results for the more recent ViT models.

The higher the accuracy, the worse the perceived similarity?

As is well known, features learned by training on ImageNet transfer well to many downstream tasks and improve their performance, which has made pre-training on ImageNet a standard practice.
Additionally, higher accuracy on ImageNet often means better performance on a diverse set of downstream tasks, such as robustness to corrupted images, generalization to out-of-distribution data, and transfer learning to smaller classification datasets.
But in terms of perceptual similarity calculation, everything seems to be reversed.
Models that achieve high accuracy on ImageNet have worse perceptual scores, while those with "mid-range" accuracy perform best on the perceptual similarity task.
Figure: ImageNet 64 × 64 validation accuracy (x-axis) versus perceptual score on the 64 × 64 BAPPS dataset (y-axis). Each blue dot represents an ImageNet classifier.
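For context, perceptual scores on BAPPS are commonly reported as a two-alternative forced choice (2AFC) agreement: for each reference image and two distorted candidates, the model's distance function is credited with the fraction of human raters who agree with its choice. The sketch below assumes that convention; the function name `two_afc_score` and the toy numbers are illustrative and do not come from the paper.

```python
def two_afc_score(d0, d1, human_pref_1):
    """Mean agreement between a distance function and human judgments.

    d0[i], d1[i]: model distances from the reference to candidates 0 and 1.
    human_pref_1[i]: fraction of human raters who judged candidate 1 closer.
    """
    total = 0.0
    for a, b, h in zip(d0, d1, human_pref_1):
        if a < b:      # model says candidate 0 is closer
            total += 1.0 - h
        elif b < a:    # model says candidate 1 is closer
            total += h
        else:          # tie: count as chance level
            total += 0.5
    return total / len(d0)


# Toy values, invented purely for illustration.
score = two_afc_score(
    d0=[0.2, 0.9, 0.5],
    d1=[0.6, 0.3, 0.5],
    human_pref_1=[0.0, 1.0, 0.4],
)
print(f"2AFC score: {score:.3f}")
```

A score of 0.5 corresponds to chance, so differences between classifiers show up as small shifts above that baseline.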
It can be seen that better ImageNet classifiers achieve better perceptual scores up to a point, but beyond a certain threshold, increasing accuracy reduces the perceptual score. Classifiers with moderate accuracy (20.0-40.0) obtain the best perceptual scores.
The article also studies the impact of neural network hyperparameters on perceptual scores, such as width, depth, number of training steps, weight decay, label smoothing and dropout.
For each hyperparameter there is an optimal accuracy up to which increasing accuracy improves the perceptual score, but this optimum is quite low and is reached very early in the hyperparameter sweep.
Beyond this point, improvements in classifier accuracy lead to worse perceptual scores.
As an example, the article shows how perceptual scores change with two hyperparameters: training steps in ResNets and width in ViTs.
Figure: Early-stopped ResNets achieve the best perceptual scores at depth settings of 6, 50 and 200.
The perceptual scores of ResNet-50 and ResNet-200 reach their highest values in the first few epochs of training, but after the peak, the perceptual score of the better-performing classifier drops more sharply.
The results show that training ResNets with a learning rate schedule improves model accuracy as the number of steps increases. Likewise, after the peak, the models show a progressive decrease in perceptual score that matches this progressively increasing accuracy.
A ViT consists of a set of Transformer blocks applied to the input image. The width of a ViT model is the number of output neurons of a single Transformer block. Increasing the width can effectively improve the accuracy of the model.
The researchers took two models, ViT-B/8 (the Base ViT model with patch size 8) and ViT-L/4 (the Large ViT model with patch size 4), and evaluated their accuracy and perceptual scores.
The results are again similar to those observed for early-stopped ResNets: narrower ViTs with lower accuracy achieve better perceptual scores than those at the default width.
The optimal widths of ViT-B/8 and ViT-L/4 are 6% and 12% of their default widths, respectively. The paper also provides a more detailed list of experiments on other hyperparameters, such as width, depth, number of training steps, weight decay, label smoothing and dropout, across ResNets and ViTs.
So if you want to improve perceptual similarity, the strategy is simple: reduce the accuracy appropriately.
Table: Improving the perceptual score by scaling down ImageNet models. Each value is the improvement obtained by scaling down a model along the given hyperparameter, relative to the model with default hyperparameters.
Based on the above conclusion, the paper proposes a simple strategy to improve the perceptual score of an architecture: shrink the model to reduce accuracy until the optimal perceptual score is reached.
The experimental results also show the perceptual score improvement obtained by scaling down each model along each hyperparameter. Early stopping yields the largest improvement across all architectures except ViT-L/4, and it is the most efficient strategy because it needs no time-consuming grid search.
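The early-stopping strategy amounts to keeping the checkpoint with the best perceptual score rather than the best accuracy. Here is a minimal sketch of that selection, assuming a hypothetical `Checkpoint` record logged once per evaluation. The numbers are made up to mimic the inverted-U shape and are not results from the paper.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    epoch: int
    accuracy: float          # ImageNet validation accuracy
    perceptual_score: float  # e.g. agreement with human judgments on BAPPS


def best_perceptual_checkpoint(history):
    """Return the checkpoint with the highest perceptual score."""
    return max(history, key=lambda c: c.perceptual_score)


# Invented values for illustration only: accuracy keeps rising,
# while the perceptual score peaks early and then declines.
history = [
    Checkpoint(epoch=1, accuracy=0.10, perceptual_score=0.60),
    Checkpoint(epoch=5, accuracy=0.30, perceptual_score=0.66),
    Checkpoint(epoch=20, accuracy=0.50, perceptual_score=0.64),
    Checkpoint(epoch=90, accuracy=0.70, perceptual_score=0.61),
]

best = best_perceptual_checkpoint(history)
print(f"keep epoch {best.epoch} (accuracy {best.accuracy:.2f})")
```

Because the selection only needs scores that are logged during a single training run, no separate sweep over widths or depths is required.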
In previous work, the perceptual similarity function was calculated using the Euclidean distance across the spatial dimensions of the image.
This approach assumes a direct correspondence between pixels, but this correspondence may not hold for warped, translated, or rotated images.
In this article, the researchers adopted two perceptual functions that rely on the global representation of the image: the style loss function from neural style transfer, which captures the style similarity between two images, and a normalized average-pooling distance function.
The style loss function compares the inter-channel cross-correlation matrices of the two images, while the average-pooling function compares their spatially averaged global representations.
Figure: The global perceptual functions consistently improve the perceptual score, both for networks trained with default hyperparameters and for ResNet-200 as a function of training epochs.
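To see why a location-by-location distance is sensitive to translation, consider this rough sketch. It assumes feature maps of shape (C, H, W) and an LPIPS-style channel normalization, which the article does not spell out. Random arrays stand in for real network activations, and `spatial_distance` is an illustrative name.

```python
import numpy as np


def spatial_distance(f0, f1, eps=1e-10):
    """Location-by-location distance between two (C, H, W) feature maps."""
    # Unit-normalize the feature vector at each spatial position (assumed convention).
    n0 = f0 / (np.linalg.norm(f0, axis=0, keepdims=True) + eps)
    n1 = f1 / (np.linalg.norm(f1, axis=0, keepdims=True) + eps)
    # Squared Euclidean distance at each position, averaged over H and W.
    return float(np.mean(np.sum((n0 - n1) ** 2, axis=0)))


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))   # stand-in for real activations
shifted = np.roll(feat, shift=1, axis=2)  # same content, moved by one position

d_same = spatial_distance(feat, feat)     # identical maps give zero distance
d_shift = spatial_distance(feat, shifted) # clearly nonzero despite identical content
print(d_same, d_shift)
```

The shifted map contains exactly the same activations, yet the distance is far from zero because each position is compared only with the same position in the other map.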
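The two global functions can be sketched as follows, again assuming (C, H, W) feature maps with random arrays standing in for real activations. The exact normalization and reduction used in the paper may differ, and the function names here are illustrative.

```python
import numpy as np


def gram(f):
    """Channel-by-channel correlation matrix of a (C, H, W) feature map."""
    flat = f.reshape(f.shape[0], -1)
    return flat @ flat.T / flat.shape[1]


def style_distance(f0, f1):
    """Mean squared difference between the two channel correlation matrices."""
    return float(np.mean((gram(f0) - gram(f1)) ** 2))


def mean_pool_distance(f0, f1, eps=1e-10):
    """Squared distance between unit-normalized, spatially averaged features."""
    p0 = f0.mean(axis=(1, 2))
    p1 = f1.mean(axis=(1, 2))
    p0 = p0 / (np.linalg.norm(p0) + eps)
    p1 = p1 / (np.linalg.norm(p1) + eps)
    return float(np.sum((p0 - p1) ** 2))


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))   # stand-in for real activations
shifted = np.roll(feat, shift=1, axis=2)  # same content, moved by one position

style_d = style_distance(feat, shifted)
pool_d = mean_pool_distance(feat, shifted)
print(style_d, pool_d)
```

Both functions summarize each map over all spatial positions before comparing, so a circular shift of the features leaves the distances at zero up to floating-point error.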
The researchers also explored several hypotheses to explain the relationship between accuracy and perceptual score, and derived some additional insights.
For example, in models without the commonly used skip connections, accuracy is also inversely related to the perceptual score, and layers closer to the output have on average lower perceptual scores than layers closer to the input.
They further explored distortion sensitivity, ImageNet category granularity and spatial frequency sensitivity.
In short, this paper explores whether improving classification accuracy produces better perceptual metrics. It studies the relationship between accuracy and perceptual scores on ResNets and ViTs under different hyperparameters, and finds an inverted U-shaped relationship: accuracy and perceptual score rise together up to a certain point, after which further accuracy gains lower the perceptual score.
Finally, the article discusses the relationship between accuracy and perceptual score in detail, including skip connections, global similarity functions, distortion sensitivity, layer-wise perceptual scores, spatial frequency sensitivity and ImageNet category granularity.
While the exact explanation for the trade-off between ImageNet accuracy and perceptual similarity remains a mystery, this paper is a first step forward.