Commonly used loss functions for optimizing semantic segmentation models include the Soft Jaccard loss, the Soft Dice loss, and the Soft Tversky loss. However, these loss functions are incompatible with soft labels and therefore cannot support important training techniques such as label smoothing, knowledge distillation, semi-supervised learning, and learning from multiple annotators. Since these techniques matter greatly for the accuracy and robustness of semantic segmentation models, further research on and refinement of the loss functions is needed to support them.
On the other hand, commonly used semantic segmentation evaluation metrics include mAcc and mIoU. However, these metrics are biased toward larger objects, which seriously compromises the evaluation of a model's safety-critical performance.
To solve these problems, researchers from KU Leuven and Tsinghua University proposed the JDT losses: the Jaccard Metric loss, the Dice Semimetric loss, and the Compatible Tversky loss. The JDT losses are variants of the original loss functions: they are equivalent to the originals when training with hard labels, and they remain fully applicable to soft labels. This improvement makes model training more accurate and more stable.
The researchers successfully applied the JDT losses in four important scenarios: label smoothing, knowledge distillation, semi-supervised learning, and multiple annotators. These applications demonstrate that the JDT losses improve both the accuracy and the calibration of models.
Paper link: https://arxiv.org/pdf/2302.05666.pdf
Paper link: https://arxiv.org/pdf/2303.16296.pdf
In addition, the researchers proposed fine-grained evaluation metrics. These metrics are less biased toward large objects, provide richer statistical information, and can offer valuable insights for model and dataset auditing.
Moreover, the researchers conducted an extensive benchmark study, which emphasized that evaluation should not rely on a single metric and revealed the important roles that neural network architecture and the JDT losses play in optimizing fine-grained metrics.
Paper link: https://arxiv.org/pdf/2310.19252.pdf
Code link: https://github.com/zifuwanggg/JDTLosses
Since the Jaccard Index and the Dice Score are defined on sets, they are not directly differentiable. Two approaches are commonly used to make them differentiable. The first exploits the relationship between a set and the Lp norm of its corresponding vector, as in the Soft Jaccard loss (SJL), the Soft Dice loss (SDL), and the Soft Tversky loss (STL).
These losses write the size of a set as the L1 norm of the corresponding vector, and the intersection of two sets as the inner product of the two corresponding vectors. The second approach exploits the submodularity of the Jaccard Index and applies the Lovász extension to the set function, as in the Lovász-Softmax loss (LSL).
$$\mathcal{L}_{SJL}(x, y) = 1 - \frac{\langle x, y \rangle}{\|x\|_1 + \|y\|_1 - \langle x, y \rangle}, \qquad \mathcal{L}_{SDL}(x, y) = 1 - \frac{2\,\langle x, y \rangle}{\|x\|_1 + \|y\|_1}$$

$$\mathcal{L}_{STL}(x, y) = 1 - \frac{\langle x, y \rangle}{\langle x, y \rangle + \alpha\,\langle x, \mathbf{1} - y \rangle + \beta\,\langle \mathbf{1} - x, y \rangle}$$
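In code, these definitions are straightforward. Below is a minimal single-class sketch in PyTorch (our illustration, not the authors' implementation), assuming x and y are flattened probability tensors and eps is a small constant added for numerical stability:

```python
import torch

def soft_jaccard_loss(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """SJL for a single class; x, y are flattened tensors with values in [0, 1]."""
    inter = (x * y).sum()                  # |A ∩ B| as an inner product
    union = x.sum() + y.sum() - inter      # |A ∪ B| via L1 norms
    return 1.0 - inter / (union + eps)

def soft_dice_loss(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """SDL for a single class."""
    inter = (x * y).sum()
    return 1.0 - 2.0 * inter / (x.sum() + y.sum() + eps)

def soft_tversky_loss(x: torch.Tensor, y: torch.Tensor,
                      alpha: float = 0.3, beta: float = 0.7,
                      eps: float = 1e-8) -> torch.Tensor:
    """STL: alpha weights false positives, beta weights false negatives."""
    tp = (x * y).sum()
    fp = (x * (1.0 - y)).sum()
    fn = ((1.0 - x) * y).sum()
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)
```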
These loss functions assume that the output x of the neural network is a continuous vector while the label y is a discrete binary vector. When the label is a soft label, that is, when y is no longer a discrete binary vector but a continuous one, these loss functions are no longer compatible.
Taking SJL as an example, consider a simple single-pixel situation:
$$f(x, y) = 1 - \frac{xy}{x + y - xy}$$
It can be seen that for any 0 < y < 1, SJL is minimized at x = 1 and maximized at x = 0. Since a loss function should attain its minimum at x = y, this is clearly unreasonable.
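This failure is easy to verify numerically with a quick, self-contained check of the single-pixel formula (y = 0.6 here is an arbitrary soft label):

```python
# Single-pixel soft Jaccard loss: f(x, y) = 1 - xy / (x + y - xy).
def sjl_1px(x: float, y: float) -> float:
    return 1.0 - x * y / (x + y - x * y)

y = 0.6  # a soft label
for x in (0.0, 0.3, 0.6, 1.0):
    print(f"x={x:.1f}  SJL={sjl_1px(x, y):.4f}")
# x=0.0 -> 1.0000, x=0.3 -> 0.7500, x=0.6 -> 0.5714, x=1.0 -> 0.4000
# The loss keeps decreasing past x = y and is minimized at x = 1.
```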
To make the original loss functions compatible with soft labels, instead of computing the intersection and union of the two sets directly, we introduce the symmetric difference between them:
$$|A \cap B| = \frac{1}{2}\left(|A| + |B| - |A \,\triangle\, B|\right), \qquad |A \cup B| = \frac{1}{2}\left(|A| + |B| + |A \,\triangle\, B|\right)$$
Note that the symmetric difference of the two sets can be written as the L1 norm of the difference between the two corresponding vectors:
$$|A \,\triangle\, B| = \|x - y\|_1$$
Putting the above together, we arrive at the JDT losses: the Jaccard Metric loss (JML), a variant of SJL; the Dice Semimetric loss (DML), a variant of SDL; and the Compatible Tversky loss (CTL), a variant of STL.
$$\mathcal{L}_{JML}(x, y) = 1 - \frac{\|x\|_1 + \|y\|_1 - \|x - y\|_1}{\|x\|_1 + \|y\|_1 + \|x - y\|_1}, \qquad \mathcal{L}_{DML}(x, y) = 1 - \frac{\|x\|_1 + \|y\|_1 - \|x - y\|_1}{\|x\|_1 + \|y\|_1}$$

CTL generalizes these by weighting the false-positive and false-negative terms with α and β.
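A minimal single-class sketch of JML and DML following these formulas (the official implementation is at the code link above; this is only an illustration):

```python
import torch

def jaccard_metric_loss(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """JML: built on the symmetric difference ||x - y||_1, so it handles soft labels.

    x, y: flattened tensors in [0, 1] for one class.
    """
    diff = (x - y).abs().sum()     # |A Δ B|
    total = x.sum() + y.sum()      # |A| + |B|
    return 1.0 - (total - diff) / (total + diff + eps)

def dice_semimetric_loss(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """DML: equals SDL on hard labels and is zero exactly when x == y."""
    diff = (x - y).abs().sum()
    total = x.sum() + y.sum()
    return 1.0 - (total - diff) / (total + eps)

# Compatibility with soft labels (Property 3 below): the loss is 0 iff x == y.
y = torch.tensor([0.2, 0.7, 0.9])
print(jaccard_metric_loss(y.clone(), y))   # ~0
print(dice_semimetric_loss(y.clone(), y))  # ~0
```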
We proved that the JDT losses have the following properties.
Property 1: JML is a metric, and DML is a semimetric.
Property 2: When y is a hard label, JML is equivalent to SJL, DML is equivalent to SDL, and CTL is equivalent to STL.
Property 3: When y is a soft label, JML, DML, and CTL are all compatible with soft labels, that is, x = y ⇔ f(x, y) = 0.
Due to Property 1, they are named the Jaccard Metric loss and the Dice Semimetric loss. Property 2 shows that in the common scenario where only hard labels are used for training, the JDT losses can directly replace the existing loss functions without changing anything.
We conducted extensive experiments and summarized several precautions for using the JDT losses.
Note 1: Select the loss according to the evaluation metric. If the metric is the Jaccard Index, use JML; if it is the Dice Score, use DML; if false positives and false negatives should be weighted differently, use CTL. Moreover, when optimizing the fine-grained evaluation metrics introduced below, the JDT losses should be adapted accordingly.
Note 2: Combine the JDT loss with a pixel-wise loss function (such as the Cross Entropy loss or the Focal loss). We found that 0.25 CE + 0.75 JDT is generally a good choice.
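A hedged sketch of what this combination might look like in PyTorch; `combined_loss` and `jdt_loss_fn` are hypothetical names of our own, and the per-class averaging mirrors the per-class formulation described in Note 5:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, jdt_loss_fn, ce_weight=0.25, jdt_weight=0.75):
    """Weighted sum of Cross Entropy and a JDT loss (0.25 / 0.75 as suggested).

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels.
    jdt_loss_fn: any single-class JDT loss over probabilities, e.g. JML above.
    """
    ce = F.cross_entropy(logits, target)
    probs = logits.softmax(dim=1)
    num_classes = logits.shape[1]
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    # Compute the JDT loss per class, then average over classes.
    jdt = torch.stack([
        jdt_loss_fn(probs[:, c].flatten(), one_hot[:, c].flatten())
        for c in range(num_classes)
    ]).mean()
    return ce_weight * ce + jdt_weight * jdt
```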
Note 3: Use a shorter training schedule. With the JDT loss added, training generally needs only half the epochs required by the Cross Entropy loss alone.
Note 4: When performing distributed training on multiple GPUs, if there is no additional communication between the GPUs, the JDT losses will unintentionally optimize the fine-grained evaluation metrics, which can degrade performance under the traditional mIoUD.
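The extra communication could, for instance, aggregate the loss statistics across workers before the division. A speculative sketch, assuming torch.distributed has been initialized:

```python
import torch
import torch.distributed as dist

def jml_with_sync(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative sketch: sum the JML statistics over all GPUs before dividing.

    Without the all_reduce, each GPU computes JML on its own shard of the
    batch, which is not the same quantity as JML on the full global batch.
    Note: dist.all_reduce does not track gradients; a gradient-aware
    collective (e.g. torch.distributed.nn.all_reduce) is needed in training.
    """
    stats = torch.stack([(x - y).abs().sum(), x.sum() + y.sum()])
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)   # sum over all workers
    diff, total = stats[0], stats[1]
    return 1.0 - (total - diff) / (total + diff + eps)
```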
Note 5: When training on a dataset with extreme class imbalance, note that the JDT losses are computed separately on each class and then averaged, which can make training unstable.
Experiments show that, compared with the Cross Entropy baseline, adding the JDT loss effectively improves model accuracy when training with hard labels. Introducing soft labels further improves both the accuracy and the calibration of the model.
By merely adding the JDT loss term during training, this work achieves state-of-the-art results in knowledge distillation, semi-supervised learning, and learning from multiple annotators for semantic segmentation.
Existing evaluation metrics

Semantic segmentation is a pixel-level classification task, so the accuracy of every pixel can be computed directly: overall pixel-wise accuracy (Acc). However, because Acc is biased toward the majority classes, PASCAL VOC 2007 adopted a metric that computes the pixel accuracy of each class separately and then averages over classes: mean pixel-wise accuracy (mAcc).
But since mAcc does not account for false positives, from PASCAL VOC 2008 onward the mean Intersection over Union (per-dataset mIoU, mIoUD) has been used as the evaluation metric. PASCAL VOC was the first dataset to introduce the semantic segmentation task, and the metrics it adopted have been widely used by subsequent datasets. Specifically, the IoU can be written as:

$$\text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}$$
To compute mIoUD, we first accumulate, for each category c, the true positives (TP), false positives (FP), and false negatives (FN) over all I photos in the dataset:

$$\text{TP}_c = \sum_{i=1}^{I} \text{TP}_{i,c}, \qquad \text{FP}_c = \sum_{i=1}^{I} \text{FP}_{i,c}, \qquad \text{FN}_c = \sum_{i=1}^{I} \text{FN}_{i,c}$$

With these per-category counts, we average over categories to remove the preference for the majority classes:

$$\text{IoU}^D_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c + \text{FN}_c}, \qquad \text{mIoU}^D = \frac{1}{C} \sum_{c=1}^{C} \text{IoU}^D_c$$
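A small sketch of this computation (our illustration), starting from per-photo, per-class counts:

```python
import numpy as np

def miou_d(tp: np.ndarray, fp: np.ndarray, fn: np.ndarray, eps: float = 1e-8) -> float:
    """mIoUD from per-photo, per-class counts of shape (I, C).

    Counts are summed over all photos first, so pixels of large objects
    dominate each class's IoU, which is the source of the size bias.
    """
    tp_c, fp_c, fn_c = tp.sum(0), fp.sum(0), fn.sum(0)   # aggregate over photos
    iou_c = tp_c / (tp_c + fp_c + fn_c + eps)            # per-class IoU
    return float(iou_c.mean())                           # average over classes
```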
Because mIoUD sums the TP, FP, and FN of all pixels in the entire dataset, it is inevitably biased toward large objects. In application scenarios with high safety requirements, such as autonomous driving and medical imaging, there are often objects that are small but must not be ignored. For example, the size of the cars in different photos can differ markedly, so mIoUD's preference for large objects seriously distorts its assessment of model safety.

Fine-grained evaluation metrics
mIoUI
For each category c, we calculate an IoU on each photo i:
$$\text{IoU}_{i,c} = \frac{\text{TP}_{i,c}}{\text{TP}_{i,c} + \text{FP}_{i,c} + \text{FN}_{i,c}}$$
Next, for each photo i, we average over all the categories that appear in that photo:

$$\text{IoU}^I_i = \frac{1}{|C_i|} \sum_{c \in C_i} \text{IoU}_{i,c}$$

where $C_i$ denotes the set of categories present in photo i. Finally, we average over all the photos:

$$\text{mIoU}^I = \frac{1}{I} \sum_{i=1}^{I} \text{IoU}^I_i$$
mIoUC
Similarly, after computing the IoU of each category c on each photo i, we can average over all the photos in which category c appears:

$$\text{IoU}^C_c = \frac{1}{|I_c|} \sum_{i \in I_c} \text{IoU}_{i,c}$$

Finally, we average over all the categories:

$$\text{mIoU}^C = \frac{1}{C} \sum_{c=1}^{C} \text{IoU}^C_c$$
Because not every category appears in every photo, some combinations of categories and photos yield NULL values, as shown in the figure below. When computing mIoUI, we average over categories first and then over photos; when computing mIoUC, we average over photos first and then over categories.
As a result, mIoUI may be biased toward categories that appear frequently (such as C1 in the figure below), which is generally undesirable. On the other hand, because mIoUI assigns an IoU value to every photo, it can help us audit and analyze the model and the dataset.
[Figure: a photo-by-category IoU matrix; NULL entries mark categories absent from a photo, and category C1 appears in most photos]
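Both fine-grained metrics can be computed from the same matrix; a sketch (our illustration) assuming the NULL entries are stored as NaN:

```python
import numpy as np

def fine_grained_mious(iou: np.ndarray) -> tuple[float, float]:
    """iou: (I, C) matrix of per-photo, per-class IoU values, with np.nan for
    the NULL entries (category c absent from photo i). Assumes every photo
    contains at least one category and every category appears somewhere."""
    iou_i = np.nanmean(iou, axis=1)   # mIoUI: average categories within a photo...
    miou_i = float(np.mean(iou_i))    # ...then average over photos
    iou_c = np.nanmean(iou, axis=0)   # mIoUC: average photos within a category...
    miou_c = float(np.mean(iou_c))    # ...then average over categories
    return miou_i, miou_c
```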
Worst-case evaluation metrics
In safety-critical applications we often care more about the worst-case segmentation quality, and one benefit of the fine-grained metrics is that the corresponding worst-case metrics can be derived from them. We take mIoUC as an example; the worst-case counterpart of mIoUI can be computed analogously.
For each category c, we first sort the IoU values of all the photos in which it appears (suppose there are Ic such photos) in ascending order. Next, we set q to a small number, such as 1 or 5. Then we use only the first Ic · q% photos after sorting, i.e., those with the lowest IoU, to compute the final value:

$$\text{IoU}^{C,q}_c = \frac{1}{K_c} \sum_{k=1}^{K_c} \text{IoU}_{(k),c}, \qquad K_c = \lceil I_c \cdot q\% \rceil$$

where $\text{IoU}_{(1),c} \le \dots \le \text{IoU}_{(I_c),c}$ are the sorted values.
Having obtained the value for each category c, we average over categories as before to obtain the worst-case counterpart of mIoUC.
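A sketch of this worst-case computation (our illustration; q is the percentage from above):

```python
import numpy as np

def worst_case_miou_c(ious_per_class, q: float = 5.0) -> float:
    """ious_per_class: list of 1-D arrays, one per class, holding the IoU of
    that class on every photo in which it appears. Averages the worst q%."""
    per_class = []
    for ious in ious_per_class:
        ious = np.sort(ious)                         # ascending: worst first
        k = max(1, int(np.ceil(len(ious) * q / 100)))
        per_class.append(ious[:k].mean())            # keep only the worst q%
    return float(np.mean(per_class))
```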
We trained 15 models on 12 datasets and observed the following phenomena.
Phenomenon 1: No model achieves the best results on all evaluation metrics. Each metric has a different focus, so multiple metrics must be considered simultaneously for a comprehensive evaluation.
Phenomenon 2: Some datasets contain photos on which almost every model attains a very low IoU. This is partly because the photos themselves are very challenging (for example, very small objects or strong contrast between light and dark), and partly because their labels are wrong. Fine-grained evaluation metrics can therefore help us audit models (finding scenarios where models fail) and audit datasets (finding wrong labels).
Phenomenon 3: The neural network architecture plays a crucial role in optimizing fine-grained metrics. On the one hand, the larger receptive field brought by structures such as ASPP (adopted by DeepLabV3 and DeepLabV3+) helps the model recognize large objects, thereby effectively improving mIoUD; on the other hand, long skip connections between the encoder and decoder (adopted by UNet and DeepLabV3+) help the model recognize small objects, thereby improving the fine-grained metrics.
Phenomenon 4: The worst-case values are far lower than the corresponding average values. The following table shows the mIoUC and the corresponding worst-case values of DeepLabV3-ResNet101 on multiple datasets. A question worth considering in the future is how to design network architectures and optimization methods that improve a model's performance under worst-case metrics.
[Table: mIoUC and worst-case mIoUC of DeepLabV3-ResNet101 on multiple datasets]
Phenomenon 5: The loss function plays a crucial role in optimizing fine-grained metrics. Compared with the Cross Entropy baseline, denoted (0, 0, 0) in the table below, using the loss function matched to the fine-grained metric greatly improves the model's performance on that metric. For example, on ADE20K, the mIoUC gap between JML and the Cross Entropy loss exceeds 7%.
[Table: fine-grained metric results for different loss functions; the Cross Entropy baseline is denoted (0, 0, 0)]
We have only considered the JDT losses as loss functions for semantic segmentation, but they can also be applied to other tasks, such as traditional classification.
Second, the JDT losses have so far been used only in label space, but we believe they can also be used to minimize the distance between any two vectors in feature space, for example, replacing the Lp norm or the cosine distance.
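As a purely speculative illustration of that idea (not from the papers), DML applied as a distance between two probability vectors:

```python
import torch

def dml_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Speculative use of DML as a distance between two non-negative vectors,
    e.g. the softmax outputs of a student and a teacher network."""
    diff = (u - v).abs().sum()
    total = u.sum() + v.sum()
    return 1.0 - (total - diff) / (total + eps)

student = torch.softmax(torch.randn(10), dim=0)
teacher = torch.softmax(torch.randn(10), dim=0)
print(dml_distance(student, teacher))  # 0 iff student == teacher
```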
References:
https://arxiv.org/pdf/2302.05666.pdf
https://arxiv.org/pdf/2303.16296.pdf
https://arxiv.org/pdf/2310.19252.pdf