


GMMSeg, a new generative paradigm for semantic segmentation, handles both closed-set and open-set recognition
Current mainstream semantic segmentation algorithms are essentially discriminative classification models built on a softmax classifier: they directly model p(class | pixel feature) and completely ignore the underlying pixel data distribution, that is, p(pixel feature | class). This limits the model's expressiveness and its generalization to OOD (out-of-distribution) data.
In a recent study, researchers from Zhejiang University, the University of Technology Sydney, and Baidu Research Institute proposed a new semantic segmentation paradigm: GMMSeg, a generative semantic segmentation model based on the Gaussian mixture model (GMM).
- Paper link: https://arxiv.org/abs/2210.02025
- Code link: https://github.com/leonnnop/GMMSeg
GMMSeg models the joint distribution of pixels and categories. It uses the EM algorithm to learn a Gaussian mixture classifier (GMM Classifier) in the pixel feature space, capturing the pixel feature distribution of each category in fine detail through a generative paradigm. Meanwhile, GMMSeg adopts a discriminative loss to optimize the deep feature extractor end-to-end. This gives GMMSeg the advantages of both discriminative and generative models.
Experimental results show that GMMSeg has achieved performance improvements on a variety of segmentation architectures and backbone networks; at the same time, without any post-processing or fine-tuning, GMMSeg can be directly applied to anomaly segmentation tasks.
To date, this is the first time that a semantic segmentation method has achieved advanced performance under both closed-set and open-world conditions with a single model instance. This is also the first time that generative classifiers have demonstrated advantages in large-scale vision tasks.
Discriminative vs. Generative Classifier
Before delving into the existing segmentation paradigm and the proposed method, here is a brief introduction to the concepts of discriminative and generative classifiers.
Suppose there is a data set D containing sample-label pairs (x, y); the ultimate goal of a classifier is to predict the classification probability p(y|x). Classification methods can be divided into two categories: discriminative classifiers and generative classifiers.
- Discriminative classifier: directly models the conditional probability p(y|x); it only learns the optimal decision boundary for classification and does not consider the distribution of the samples themselves, so it cannot reflect the characteristics of the samples.
- Generative classifier: first models the joint probability distribution p(x, y), and then derives the classification conditional probability through Bayes' theorem; it explicitly models the distribution of the data itself, often building a separate model for each category. Compared with the discriminative classifier, it fully considers the characteristic information of the samples.
Most current mainstream pixel-wise segmentation models use a deep network to extract pixel features and then classify those features with a softmax classifier. The network architecture consists of two parts:
The first part is the pixel feature extractor, typically an encoder-decoder pair; pixel features are obtained by mapping the pixel input from RGB space to a D-dimensional feature space.
The second part is the pixel classifier, the mainstream softmax classifier; it maps the input pixel features to C real-valued outputs (logits), one per class, and then normalizes the logits with the softmax function to give them a probabilistic meaning, that is, the logits are used to compute the posterior probability of the pixel's class.
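Assuming a linear softmax head whose logit for class c is w_c^T x + b_c (the symbols w_c and b_c are notation chosen here, consistent with the per-class parameter pair (w, b) discussed later), the posterior can be written as:

```latex
p(c \mid x) = \frac{\exp(w_c^\top x + b_c)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top x + b_{c'})}
```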
Ultimately, the complete model consisting of these two parts is optimized end-to-end with the cross-entropy loss.
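As a minimal sketch of this training objective, the following pure-Python snippet computes the softmax posterior and the mean cross-entropy over a handful of pixels. The logits and labels are made-up values for illustration, and the function names are hypothetical; a real segmentation model would compute this over full feature maps with a deep learning framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of C logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one pixel: -log p(label | x)."""
    return -math.log(softmax(logits)[label])

def mean_cross_entropy(all_logits, labels):
    """Average the per-pixel loss over a batch of pixels."""
    losses = [cross_entropy(z, y) for z, y in zip(all_logits, labels)]
    return sum(losses) / len(losses)

# Hypothetical logits for 3 pixels and C = 3 classes.
pixel_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [0.0, 0.0, 0.0]]
pixel_labels = [0, 2, 1]
loss = mean_cross_entropy(pixel_logits, pixel_labels)
```

Note that the loss depends on the features only through the logits; nothing in it describes how the pixel features themselves are distributed.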
In this process, the model ignores the distribution of the pixels themselves and directly estimates the conditional probability p(c|x) of the pixel's class. It can be seen that the mainstream softmax classifier is essentially a discriminative classifier.
The discriminative classifier has a simple structure, and because its optimization objective directly targets the discrimination error, it often achieves excellent discriminative performance. However, it also has some serious shortcomings that have not attracted attention in existing work and that greatly affect the classification performance and generalization of the softmax classifier:
- First, it models only the decision boundary and completely ignores the distribution of pixel features; it therefore cannot model or exploit the specific characteristics of each category, which weakens its generalization and expressive power.
- Second, it uses a single parameter pair (w, b) to model a class; in other words, the softmax classifier relies on a unimodality assumption. This extremely strong and oversimplified assumption often fails to hold in practice, which leads to sub-optimal performance.
- Finally, the output of the softmax classifier does not accurately reflect a true probability; its prediction for a class is meaningful only relative to the other categories. This is also the fundamental reason why many mainstream segmentation models struggle to detect OOD input.
In response to these problems, the authors argue that the current mainstream discriminative paradigm should be rethought, and the paper gives a corresponding solution: the generative semantic segmentation model GMMSeg.
Generative semantic segmentation model: GMMSeg
The authors reformulate the semantic segmentation process from the perspective of a generative model. Instead of directly modeling the classification probability p(c|x), the generative classifier models the joint distribution p(x, c) and then derives the classification probability using Bayes' theorem.
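For C classes, the Bayes step takes the standard form below, written in the p(c), p(x|c) notation used in this article:

```latex
p(c \mid x) = \frac{p(c)\, p(x \mid c)}{\sum_{c'=1}^{C} p(c')\, p(x \mid c')}
```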
Here, for the sake of generalization, the category prior p(c) is usually set to a uniform distribution, so how to model the class-conditional distribution p(x|c) of pixel features becomes the primary issue.
In the paper, GMMSeg uses a Gaussian mixture model to model p(x|c).
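In standard notation, a mixture with M components per class can be written as follows. The symbols M, \pi_{cm}, \mu_{cm} and \Sigma_{cm} (component count, mixture weight, mean and covariance) are notation assumed here for illustration.

```latex
p(x \mid c) = \sum_{m=1}^{M} \pi_{cm}\, \mathcal{N}(x;\, \mu_{cm}, \Sigma_{cm}),
\qquad \pi_{cm} \ge 0, \quad \sum_{m=1}^{M} \pi_{cm} = 1
```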
When the number of components is not limited, a Gaussian mixture model can in theory fit any distribution, which makes it both elegant and powerful; its mixture nature also makes it feasible to model multimodality, that is, intra-class variation. On this basis, the paper uses maximum likelihood estimation to optimize the model parameters.
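To make the multimodality point concrete, here is a small sketch that evaluates the maximum-likelihood objective (the summed log density) on one-dimensional toy features of a single class. The data values, the mixture parameters and the function names are all hypothetical; real pixel features are high-dimensional.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture, via log-sum-exp."""
    terms = [math.log(w) + log_gaussian(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def log_likelihood(data, weights, means, variances):
    """Maximum-likelihood objective: sum of log densities over the data."""
    return sum(log_gmm(x, weights, means, variances) for x in data)

# Hypothetical features of one class with two modes (intra-class variation).
features = [-2.1, -1.9, 1.9, 2.1]

# One Gaussian fitted by moments versus a two-component mixture.
mean = sum(features) / len(features)
var = sum((x - mean) ** 2 for x in features) / len(features)
ll_single = log_likelihood(features, [1.0], [mean], [var])
ll_mixture = log_likelihood(features, [0.5, 0.5], [-2.0, 2.0], [0.01, 0.01])
```

On this bimodal toy data the two-component mixture reaches a higher log-likelihood than the single Gaussian, which is the behaviour a unimodal per-class model cannot reproduce.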
The classic solution is the EM algorithm, which optimizes the F-function step by step by alternating between an E-step and an M-step.
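In the usual free-energy view of EM, the F-function and the two alternating steps are commonly written as below. Here q denotes a distribution over the component assignment m, \phi the mixture parameters and H the entropy; this notation is assumed for illustration and may differ from the paper's.

```latex
\begin{aligned}
F(q, \phi) &= \mathbb{E}_{q}\big[\log p(x, m \mid c; \phi)\big] + H(q), \\
\text{E-step:}\quad q^{(t)} &= \arg\max_{q} F\big(q, \phi^{(t-1)}\big), \\
\text{M-step:}\quad \phi^{(t)} &= \arg\max_{\phi} F\big(q^{(t)}, \phi\big).
\end{aligned}
```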
For Gaussian mixture models specifically, the E-step of the EM algorithm re-estimates the probability that each data point belongs to each sub-model (mixture component). In other words, the E-step amounts to a soft clustering of pixels; the M-step then uses the clustering result to update the model parameters.
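The two steps can be sketched for a one-dimensional mixture as follows. This is plain textbook EM on made-up data, not the authors' implementation; the function names, the initial values and the `min_var` floor are assumptions for the example.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, weights, means, variances):
    """Soft clustering: responsibility of each component for each point."""
    resp = []
    for x in data:
        joint = [w * gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(data, resp, min_var=1e-6):
    """Re-estimate weights, means and variances from the soft assignments."""
    n_comp = len(resp[0])
    weights, means, variances = [], [], []
    for k in range(n_comp):
        mass = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / mass
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / mass
        weights.append(mass / len(data))
        means.append(mean)
        variances.append(max(var, min_var))  # floor avoids a collapsed component
    return weights, means, variances

# Hypothetical 1-D features of one class, drawn around two modes.
data = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    resp = e_step(data, weights, means, variances)
    weights, means, variances = m_step(data, resp)
```

With this initialization the two component means move toward the two clusters of the toy data; a poor initialization can behave very differently, which is the sensitivity discussed next.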
However, in practice the authors found that the standard EM algorithm converged slowly and gave poor final results. They suspect that EM is too sensitive to the initial parameter values, making it hard to converge to a good local optimum. Inspired by a series of recent clustering algorithms based on optimal transport theory, the authors introduce an additional uniform prior on the mixture model distribution.
Correspondingly, the E-step of the parameter optimization becomes a constrained optimization problem.
Intuitively, this introduces an equipartition constraint into the clustering process: data points are distributed evenly, to a certain extent, across the sub-models. With this constraint, the optimization is equivalent to an optimal transport problem.
This problem can be solved quickly with the Sinkhorn-Knopp algorithm. The whole improved optimization procedure is named Sinkhorn EM; theoretical work has shown that it has the same global optimum as the standard EM algorithm and is less likely to get stuck in a local optimum.
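The balancing effect can be illustrated with a generic Sinkhorn-Knopp iteration. This is a sketch of the general algorithm, not the authors' code: the score matrix, the temperature `epsilon` and the iteration count are hypothetical, and the exact cost and regularization used in GMMSeg may differ.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.5, n_iters=50):
    """Balanced soft assignment of N points to M components.

    scores[n][m] is a similarity (for example a log-likelihood) of point n
    under component m. Returns q with rows summing to 1 and columns
    summing to approximately N / M.
    """
    n_points, n_comp = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every component receives total mass 1 / M.
        for m in range(n_comp):
            col = sum(q[n][m] for n in range(n_points))
            for n in range(n_points):
                q[n][m] /= col * n_comp
        # Row step: every point carries total mass 1 / N.
        for n in range(n_points):
            row = sum(q[n])
            q[n] = [v / (row * n_points) for v in q[n]]
    # Rescale so each row is a distribution over components.
    return [[v * n_points for v in row] for row in q]

# Hypothetical scores: all four points slightly prefer component 0.
scores = [[-0.1, -0.5], [-0.2, -0.4], [-0.1, -0.9], [-0.3, -0.35]]
assignment = sinkhorn_knopp(scores)
column_mass = [sum(row[m] for row in assignment) for m in range(2)]
```

A plain row-wise softmax would send most of the mass to component 0 here; the alternating normalization instead forces both components to receive about half of the points, which is the even-distribution behaviour described above.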
Online hybrid optimization
In the full training procedure, the paper then uses an online hybrid optimization scheme: the Gaussian mixture classifier is continually optimized by the generative Sinkhorn EM in the gradually updated feature space, while the other part of the framework, the pixel feature extractor, is optimized with a discriminative cross-entropy loss computed on the predictions of the generative classifier. The two parts are optimized alternately and stay aligned with each other, so the whole model is tightly coupled and can be trained end-to-end.
In this process, the feature extractor is optimized only through gradient backpropagation, while the generative classifier is optimized only through Sinkhorn EM. It is this alternating design that lets the whole model be compactly integrated and inherit the advantages of both discriminative and generative models.
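The alternation can be illustrated with a deliberately tiny toy, under heavy simplifying assumptions that are not from the paper: the "feature extractor" is a single learnable scale on a 1-D input, each class is one unit-variance Gaussian whose mean is refreshed by a momentum average (standing in for Sinkhorn EM), and a finite-difference gradient stands in for backpropagation. All names and values are hypothetical.

```python
import math

def log_gaussian(z, mean, var=1.0):
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)

def mean_loss(scale, class_means, inputs, labels):
    """Cross-entropy of the generative classifier's posteriors (uniform prior)."""
    total = 0.0
    for x, y in zip(inputs, labels):
        z = scale * x  # toy "feature extractor": one learnable scale
        logps = [log_gaussian(z, mu) for mu in class_means]
        top = max(logps)
        log_norm = top + math.log(sum(math.exp(lp - top) for lp in logps))
        total -= logps[y] - log_norm
    return total / len(inputs)

def update_classifier(scale, class_means, inputs, labels, momentum=0.5):
    """Gradient-free step: move each class mean toward the current feature mean."""
    new_means = []
    for c, old in enumerate(class_means):
        feats = [scale * x for x, y in zip(inputs, labels) if y == c]
        new_means.append(momentum * old + (1.0 - momentum) * sum(feats) / len(feats))
    return new_means

def update_extractor(scale, class_means, inputs, labels, lr=0.5, h=1e-5):
    """Gradient step on the extractor only; the classifier is held fixed."""
    grad = (mean_loss(scale + h, class_means, inputs, labels)
            - mean_loss(scale - h, class_means, inputs, labels)) / (2.0 * h)
    return scale - lr * grad

inputs = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
labels = [0, 0, 0, 1, 1, 1]
scale, class_means = 0.1, [0.0, 0.0]
class_means = update_classifier(scale, class_means, inputs, labels)
initial_loss = mean_loss(scale, class_means, inputs, labels)
for _ in range(50):
    class_means = update_classifier(scale, class_means, inputs, labels)  # generative step
    scale = update_extractor(scale, class_means, inputs, labels)         # discriminative step
final_loss = mean_loss(scale, class_means, inputs, labels)
```

The point of the toy is only the division of labour: the classifier parameters never receive a gradient, and the extractor never sees the classifier update rule, yet the loss falls because each step is computed against the other part's latest state.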
Ultimately, GMMSeg benefits from its generative classification architecture and online hybrid training strategy, showing advantages that the discriminative softmax classifier does not have:
- First, thanks to its general architecture, GMMSeg is compatible with most mainstream segmentation models, that is, with any model that uses softmax for classification: simply replacing the discriminative softmax classifier improves the performance of an existing model painlessly.
- Second, thanks to the hybrid training mode, GMMSeg combines the advantages of generative and discriminative classifiers and, to a certain extent, solves the problem that softmax cannot model intra-class variation, which greatly improves its discriminative performance.
- Third, GMMSeg explicitly models the distribution of pixel features, that is, p(x|c); it can directly give the probability that a sample belongs to each category, which enables it to handle unseen OOD data naturally.
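The third point can be sketched with a one-dimensional toy classifier built from per-class mixtures. The class models and inputs are made up, and using the negative maximum class log-likelihood as the anomaly score is an assumption for illustration; the source does not specify how GMMSeg scores anomalies.

```python
import math

def log_gaussian(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, components):
    """Log p(x | c) for one class; components is a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def classify(x, class_models):
    """Return (predicted class, posterior, anomaly score) under a uniform prior."""
    log_px = [log_gmm(x, comps) for comps in class_models]
    top = max(log_px)
    norm = sum(math.exp(lp - top) for lp in log_px)
    posterior = [math.exp(lp - top) / norm for lp in log_px]
    predicted = log_px.index(top)
    anomaly_score = -top  # low likelihood under every class suggests OOD input
    return predicted, posterior, anomaly_score

# Hypothetical 1-D models: class 0 has two modes, class 1 has one.
class_models = [
    [(0.5, -3.0, 0.25), (0.5, -1.0, 0.25)],
    [(1.0, 2.0, 0.25)],
]
in_dist = classify(2.1, class_models)
out_dist = classify(10.0, class_models)
```

For the far-away input the normalized posterior is still almost entirely on class 1, just as a softmax output would be, but the class-conditional likelihood is extremely low, so the anomaly score separates it clearly from the in-distribution input.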
Experimental results
The experimental results show that, whether built on a CNN or a Transformer architecture, GMMSeg achieves stable and clear performance improvements on widely used semantic segmentation data sets (ADE20K, Cityscapes, COCO-Stuff).
In addition, on the anomaly segmentation task, without any modification to the model trained on the closed-set task (regular semantic segmentation), GMMSeg surpasses other methods that require special post-processing on all common evaluation metrics.