Meta AI has just released the Segment Anything Model (SAM), the first foundation model for image segmentation.
SAM can segment any object in a photo or video with a single click, and it transfers zero-shot to other tasks.
Overall, SAM follows the foundation-model recipe:
1. A very simple yet scalable architecture that handles multimodal prompts: text, keypoints, and bounding boxes.
2. An intuitive annotation process that is closely tied to the model design.
3. A data flywheel that lets the model bootstrap itself on a large number of unlabeled images.
It is no exaggeration to say that SAM has learned a general notion of what an "object" is, and this holds even for unknown objects, unfamiliar scenes (such as underwater or under a microscope), and ambiguous cases.
In addition, SAM can also be generalized to new tasks and new fields, and practitioners no longer need to fine-tune the model themselves.
Paper address: https://ai.facebook.com/research/publications/segment-anything/
The most striking thing is that Meta has implemented a completely different CV paradigm: within a unified framework, the prompt encoder accepts a point, a bounding box, or a sentence, and the object is segmented directly with one click.
In this regard, Tencent AI algorithm expert Jin Tian said: "The prompt paradigm from NLP has begun to extend into CV. This time it may completely change the traditional prediction mindset in CV. Now you really can use one model to segment any object, and do it dynamically!"
NVIDIA AI scientist Jim Fan praised it as well: we have arrived at the "GPT-3 moment" of computer vision!
So, does CV really not exist anymore?
SAM: "Cut out" all objects in any image with one click
Segment Anything is the first foundation model dedicated to image segmentation.
Segmentation refers to identifying which image pixels belong to an object and has always been the core task of computer vision.
However, building an accurate segmentation model for a specific task usually requires highly specialized work by experts. The process calls for AI training infrastructure and a large amount of carefully annotated in-domain data, so the barrier to entry is extremely high.
To solve this problem, Meta proposed SAM, a foundation model for image segmentation. This promptable model, trained on diverse data, adapts to a variety of tasks and is operated much like prompting in NLP models.
The SAM model grasps the concept of "what is an object" and can generate a mask for any object in any image or video, even objects it has not seen during training.
SAM is general enough to cover a wide range of use cases and can be used out of the box in new image domains without additional training, whether that means underwater photos or cell microscopy. In other words, SAM is already capable of zero-shot transfer.
Meta said excitedly in the blog: It can be expected that in the future, SAM will be used in any application that needs to find and segment objects in images.
SAM can become part of a larger AI system to develop a more general multi-modal understanding of the world, for example, understanding the visual and textual content of web pages.
In AR/VR, SAM can select an object based on the user's gaze and then "lift" it into 3D.
For content creators, SAM can extract image regions for collages or video editing.
SAM can also locate and track animals or objects in videos, which is helpful for natural science and astronomy research.
In the past, there were two methods to solve the segmentation problem.
One is interactive segmentation, which can segment objects of any category, but requires a person to fine-tune the mask through iteration.
The second is automatic segmentation, which can segment specific object categories defined in advance, but training requires a large number of manually annotated objects (for example, thousands of examples to segment cats).
In short, neither of these two methods can provide a universal, fully automatic segmentation method.
And SAM can be seen as a generalization of these two methods, and it can easily perform interactive segmentation and automatic segmentation.
On the model's promptable interface, a wide range of segmentation tasks can be completed by simply designing the correct prompts (clicks, boxes, text, etc.) for the model.
Additionally, SAM is trained on a diverse, high-quality dataset of over 1 billion masks, which allows the model to generalize to new objects and images beyond what it observed during training. As a result, practitioners no longer need to collect their own segmentation data and fine-tune a model for their use case.
This flexibility to generalize to new tasks and new domains is a first in the field of image segmentation.
(1) SAM lets users segment an object with a single click or by interactively clicking multiple points, and the model can also be prompted with a bounding box.
(2) When the object to segment is ambiguous, SAM can output multiple valid masks, an essential capability for solving real-world segmentation problems.
(3) SAM can automatically find and mask all objects in an image.
(4) After precomputing the image embedding, SAM can generate a segmentation mask for any prompt in real time, allowing users to interact with the model in real time.
The SAM trained by the researchers can return a valid segmentation mask for any prompt. A prompt can be foreground/background points, a rough box or mask, free-form text, or in general any information that indicates what to segment in the image.
The requirement of a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects.
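To make the ambiguity case concrete, here is a minimal Python sketch of how an application might handle a click that could mean more than one object. It is a toy, not SAM's code: the `Candidate` class, the rectangular masks, and the confidence scores are all invented for illustration. It only shows the idea of keeping every candidate that contains the click and then picking one reasonable mask.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str          # human-readable name, for illustration only
    pixels: frozenset   # set of (row, col) pixels covered by the mask
    score: float        # hypothetical confidence attached to the mask


def rect(r0, c0, r1, c1):
    """Build a rectangular toy mask covering rows r0..r1-1 and cols c0..c1-1."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


def valid_masks(candidates, click):
    """Keep only the candidates whose mask contains the clicked pixel."""
    return [c for c in candidates if click in c.pixels]


def best_mask(candidates, click):
    """Return the highest-scoring candidate containing the click, or None."""
    matching = valid_masks(candidates, click)
    return max(matching, key=lambda c: c.score) if matching else None


# Made-up scene: a shirt inside a person, plus an unrelated dog.
shirt = Candidate("shirt", rect(4, 2, 8, 6), 0.91)
person = Candidate("person", rect(0, 1, 12, 7), 0.88)
dog = Candidate("dog", rect(9, 8, 12, 12), 0.95)

click = (5, 3)  # a point on the shirt
print([c.label for c in valid_masks([shirt, person, dog], click)])
print(best_mask([shirt, person, dog], click).label)
```

Both the shirt and the person are acceptable answers for this click; the sketch simply returns the one with the higher made-up score and ignores the dog, whose mask does not contain the point.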
The researchers observed that the pre-training task and interactive data collection impose specific constraints on the model design.
In particular, the model needs to run in real time on a CPU in a web browser so that annotators can interact with SAM efficiently while annotating.
While runtime constraints mean there is a trade-off between quality and runtime, the researchers found that in practice, simple designs can achieve good results.
SAM's image encoder produces a one-time embedding for the image, while a lightweight prompt encoder converts any prompt into an embedding vector on the fly. These two sources of information are then combined in a lightweight decoder that predicts segmentation masks.
Once the image embedding has been computed, SAM can produce a segment in just 50 milliseconds for any prompt in a web browser.
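The split between a one-time image embedding and a cheap per-prompt step can be sketched in a few lines of Python. This is a toy under stated assumptions, not SAM's implementation: the class `ToySegmenter`, its method names, and the thresholding "decoder" are hypothetical stand-ins chosen only to show the caching pattern.

```python
class ToySegmenter:
    """Toy stand-in for an embed-once, prompt-many segmenter (not SAM itself)."""

    def __init__(self):
        self.image_encoder_calls = 0  # counts how often the expensive step runs
        self._embedding = None

    def set_image(self, image):
        """Expensive step, run once per image: compute and cache the embedding."""
        self.image_encoder_calls += 1
        # Placeholder "embedding": just a float copy of the pixel grid.
        self._embedding = [[float(v) for v in row] for row in image]

    def predict(self, point, tolerance=0.2):
        """Cheap step, run once per prompt: turn a clicked (row, col) into a mask."""
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        row, col = point
        target = self._embedding[row][col]  # placeholder "prompt embedding"
        # Placeholder "decoder": mark pixels whose value is close to the clicked one.
        return [[1 if abs(v - target) <= tolerance else 0 for v in r]
                for r in self._embedding]


image = [
    [0.1, 0.1, 0.9],
    [0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9],
]
segmenter = ToySegmenter()
segmenter.set_image(image)              # the encoder runs once
background = segmenter.predict((0, 0))  # first prompt
foreground = segmenter.predict((1, 1))  # second prompt reuses the cached embedding
```

Both calls to `predict` reuse the cached embedding, so only the per-prompt step has to be fast enough for interactive use.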
The latest SAM model was trained on 256 A100 GPUs for 68 hours (just under 3 days).
Project demonstration
Prompts that specify what to segment in an image make it possible to carry out a variety of segmentation tasks without additional training.
Use interaction points and boxes as prompts
Automatically segment all elements in the image
Generate multiple valid masks for ambiguous prompts
SAM can accept input prompts from other systems.
For example, select the corresponding object based on the user's visual focus information transmitted from the AR/VR headset. Meta's development of AI that can understand the real world will pave the way for its future metaverse journey.
Alternatively, bounding box prompts from an object detector can be used to implement text-to-object segmentation.
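As a rough illustration of chaining a detector and a box-prompted segmenter, consider the Python sketch below. Everything in it is hypothetical: `DETECTIONS` stands in for an object detector, `GRID` stands in for an image (with object IDs that a real segmenter would of course not be given), and `segment_box` is a placeholder for a model that turns a box prompt into a mask.

```python
# Toy "image": 0 is background, other integers mark object pixels.
GRID = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 2],
    [0, 0, 0, 2, 2],
]

# Hypothetical detector output: text label -> box (r0, c0, r1, c1), end-exclusive.
DETECTIONS = {"cat": (1, 1, 3, 3), "dog": (2, 3, 4, 5)}


def detect(text):
    """Stand-in for an object detector: return a box for the label, or None."""
    return DETECTIONS.get(text)


def segment_box(grid, box):
    """Stand-in for a box-prompted segmenter: object pixels inside the box."""
    r0, c0, r1, c1 = box
    return {(r, c)
            for r in range(r0, r1)
            for c in range(c0, c1)
            if grid[r][c] != 0}


def segment_text(grid, text):
    """Text -> box (detector) -> mask (segmenter); empty set if nothing is found."""
    box = detect(text)
    return segment_box(grid, box) if box is not None else set()


cat_mask = segment_text(GRID, "cat")
dog_mask = segment_text(GRID, "dog")
bird_mask = segment_text(GRID, "bird")
```

The point of the composition is that the segmenter never sees the text: the detector translates the text into a box, and the box is the only prompt the segmenter needs.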
The output mask can be used as input to other AI systems.
For example, an object's mask can be tracked across a video, enable image editing applications, be lifted to 3D, or be used for creative tasks such as collage.
SAM has learned a general notion of what an object is, and this understanding enables zero-shot generalization to unfamiliar objects and images without additional training.
With the Box tool, simply draw a box and the object is recognized immediately.
After clicking Everything, all objects recognized by the system are extracted at once.
After choosing Cut-Outs, you get a triangular dumpling cut out in seconds.
SA-1B dataset: 11 million images, 1.1 billion masks
In addition to the new model, Meta also released SA-1B, the largest segmentation dataset to date.
This dataset consists of 11 million diverse, high-resolution, privacy-preserving images and 1.1 billion high-quality segmentation masks.
The overall characteristics of the dataset are as follows:
· Total number of images: 11 million
· Total number of masks: 1.1 billion
· Average masks per image: 100
· Average image resolution: 1500 × 2250 pixels
Note: the image and mask annotations do not carry class labels.
Meta specifically emphasizes that the data was collected through its data engine and that all masks were generated fully automatically by SAM.
With the SAM model, collecting new segmentation masks is faster than ever: interactively annotating a mask takes only about 14 seconds.
Per-mask annotation is only 2 times slower than annotating a bounding box, which takes about 7 seconds with the fastest annotation interfaces.
Compared with previous large-scale segmentation data collection efforts, annotation with SAM is 6.5 times faster than COCO's fully manual polygon-based mask annotation and 2 times faster than the previous largest data annotation effort, which was also model-assisted.
However, interactively annotated masks alone are not enough to build a dataset of more than 1 billion masks, so Meta built a data engine to create SA-1B.
This data engine has three "gears":
1. Model-assisted annotation.
2. A mix of fully automatic and assisted annotation, which helps increase the diversity of the collected masks.
3. Fully automatic mask creation, which allows the dataset to scale.
The final dataset includes more than 1.1 billion segmentation masks collected on approximately 11 million licensed and privacy-preserving images.
SA-1B has 400 times more masks than any existing segmentation dataset. Human evaluation studies confirm that the masks are of high quality and diversity, and in some cases are even comparable in quality to masks from previous, much smaller, fully manually annotated datasets.
The images in SA-1B were sourced through photo providers in multiple countries spanning different geographic regions and income levels. While some geographic regions are still underrepresented, SA-1B has more images and better overall representation across all regions than previous segmentation datasets.
Finally, Meta says it hopes this data can form the basis of new datasets with additional annotations, such as a text description associated with each mask.
RBG master leads the team
Ross Girshick (often called the RBG guru) is a research scientist at Facebook AI Research (FAIR), where he works on computer vision and machine learning.
In 2012, Ross Girshick received his PhD in Computer Science from the University of Chicago under the supervision of Pedro Felzenszwalb.
Before joining FAIR, Ross was a researcher at Microsoft Research and a postdoc at the University of California, Berkeley, where his advisors were Jitendra Malik and Trevor Darrell.
He received the 2017 PAMI Young Researcher Award, as well as the 2017 and 2021 PAMI Mark Everingham Awards in recognition of his contributions to open source software.
As is well known, Ross and He Kaiming jointly developed object detection algorithms in the R-CNN family. In 2017, the Mask R-CNN paper by Ross and He Kaiming won the best paper award at ICCV 2017.
Meta's foundation model for segmentation in the CV field has led many netizens to exclaim, "Now CV really doesn't exist anymore."
Meta scientist Justin Johnson said: "To me, Segment Anything's data engine and ChatGPT's RLHF represent a new era of large-scale artificial intelligence. Instead of learning everything from noisy web data, human annotation is cleverly applied in combination with big data to unlock new capabilities. Supervised learning is back!"
The only regret is that the SAM release was led mainly by Ross Girshick, while He Kaiming was absent.
Commenter "matrix Mingzi" said that this work further proves that multimodality is the future of CV and that pure CV has no tomorrow.