This article is reprinted with the authorization of the AI media outlet QbitAI (public account ID: QbitAI). Please contact the source for permission to reprint.
Now the AI world is competing on hand speed.
Just days after Meta released SAM, programmers in China have stacked a fresh wave of buffs on top of it, integrating the major visual AI capabilities of object detection, segmentation, and generation into a single tool!
For example, based on Stable Diffusion and SAM, you can seamlessly replace the chair in the photo with a sofa:
Changing clothes or hair color is just as easy:
As soon as the project was released, many people exclaimed: that is some serious hand speed!
Someone else said: now there are new wedding photos of Yui Aragaki and me.
The effects above come from Grounded-SAM, a project that has already received 1.8k stars on GitHub.
Simply put, it is a zero-shot vision application: feed it an image and it automatically detects and segments the objects in it.
This research comes from IDEA Research Institute (Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute), whose founder and chairman is Shen Xiangyang.
Grounded-SAM is built mainly from two models: Grounding DINO and SAM.
SAM (Segment Anything) is a zero-shot segmentation model that Meta released just four days ago.
It can generate masks for any object in an image or video, including objects and image types it never saw during training.
SAM is trained to return a valid mask for any prompt: even when the prompt is ambiguous or could refer to multiple objects, the output should be a reasonable mask for at least one of them. This promptable-segmentation task is used both to pretrain the model and to solve general downstream segmentation tasks via prompting.
The model consists of an image encoder, a prompt encoder, and a fast mask decoder. Once the image embedding has been computed, SAM can produce a segmentation mask for any prompt in about 50 milliseconds, fast enough to run interactively in a web browser.
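To make the workflow concrete, here is a minimal sketch of promptable segmentation with Meta's open-source segment-anything package. The checkpoint path, input image, and point coordinates are placeholders for illustration, not values from the Grounded-SAM project itself.

```python
# Minimal sketch: point-prompted segmentation with segment-anything.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# "sam_vit_h.pth" is a placeholder path to a downloaded SAM checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)  # the expensive image embedding is computed once here

# A single foreground point as the prompt; SAM returns several candidate masks
# because a point prompt can be ambiguous (whole object vs. one of its parts).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
```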
Grounding DINO is the research team's own earlier work.
It is a zero-shot detection model that can produce object bounding boxes and labels from free-form text descriptions.
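A sketch of text-prompted detection, assuming the inference helpers shipped with the open-source GroundingDINO repository; the config path, checkpoint name, text query, and thresholds are illustrative placeholders.

```python
# Sketch: detect objects from a text description with GroundingDINO.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config file from the repo
    "groundingdino_swint_ogc.pth",                      # downloaded checkpoint (placeholder)
)
image_source, image_tensor = load_image("photo.jpg")

# Free-form text query; the model returns boxes plus the matched phrases.
boxes, logits, phrases = predict(
    model=model,
    image=image_tensor,
    caption="chair",
    box_threshold=0.35,
    text_threshold=0.25,
)
print(phrases, boxes.shape)  # boxes are normalized cx, cy, w, h
```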
Combining the two, you can locate any object in an image with a text description and then use SAM's powerful segmentation capability to cut out a fine-grained mask, as sketched below.
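The sketch below chains the two previous snippets: text prompt to boxes, boxes to masks. It is a simplified illustration rather than the project's exact code; the actual repository may batch the box prompts differently.

```python
# Sketch: feed Grounding DINO's boxes to SAM as box prompts.
import torch

H, W, _ = image_source.shape

# Convert normalized cx, cy, w, h to absolute x0, y0, x1, y1.
boxes_xyxy = boxes * torch.tensor([W, H, W, H])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2   # cx, cy -> x0, y0
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]       # w, h   -> x1, y1

predictor.set_image(image_source)
masks = []
for box in boxes_xyxy.numpy():
    m, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])  # one binary mask per detected object
```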
On top of these capabilities, they also plugged in Stable Diffusion, which enables the controllable image generation shown at the beginning.
It is worth mentioning that Stable Diffusion could already achieve something similar on its own: manually erase the image region you want to replace, then enter a text prompt (inpainting).
Grounded-SAM removes that manual selection step and drives the whole process through text alone.
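A sketch of the final text-driven replacement step using the diffusers inpainting pipeline, reusing one of the masks produced in the previous sketch. The model id, prompt, and 512x512 resizing are example choices, not necessarily what Grounded-SAM uses.

```python
# Sketch: regenerate only the masked region from a text prompt.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # example inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.fromarray(image_source).resize((512, 512))
mask_image = Image.fromarray((masks[0] * 255).astype(np.uint8)).resize((512, 512))

# Only the white (masked) region is repainted according to the prompt.
result = pipe(prompt="a comfortable sofa", image=init_image, mask_image=mask_image).images[0]
result.save("chair_to_sofa.png")
```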
In addition, combined with BLIP (Bootstrapping Language-Image Pre-training), it can generate an image caption, extract labels from it, and then produce object boxes and masks automatically.
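For the captioning step, a minimal sketch with BLIP via Hugging Face transformers; the generated caption could then serve as the text prompt for Grounding DINO. The model id and image path are examples.

```python
# Sketch: automatic image captioning with BLIP.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

raw_image = Image.open("photo.jpg").convert("RGB")
inputs = processor(raw_image, return_tensors="pt")
out = blip.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # e.g. "a living room with a chair and a table"
```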
More interesting features are currently under development.
For example, editing the people in a photo: changing their clothes, hair color, skin color, and so on.
Public information shows that the institute is an international research institution focused on artificial intelligence, the digital economy, and frontier technologies. Its founder and chairman is Dr. Shen Xiangyang, former chief scientist of Microsoft Research Asia and former Microsoft global executive vice president.
The team has also outlined several directions for Grounded-SAM's future work.
It is worth mentioning that many of the project's team members are active answerers on AI topics on Zhihu, and they have also been answering questions about Grounded-SAM there; interested readers can leave a message to ask~