The Segment Anything Model, developed by Meta, aims to raise the bar for ‘object segmentation’ in computer vision—the capacity of computers to distinguish between separate objects in an image or video. By enabling a thorough grasp of the user’s surroundings, segmentation will be essential to making AR truly usable.
Object segmentation is the process of locating and separating objects in an image or video. With the aid of AI, this process can be automated, enabling real-time object identification and isolation. By making the system aware of the various objects in the user’s environment, this technology will be essential for building a more usable AR experience.
Consider a scenario where you’re wearing AR glasses and want two floating virtual monitors to the left and right of your real monitor. Unless you intend to explicitly tell the system where your real monitor is, it must be able to recognize what a monitor looks like in order to position the virtual monitors when it detects the real one.
Monitors come in a variety of shapes, sizes, and colors, though. And objects that are reflected or partially obscured are even harder for computer vision systems to recognize.
The key to unlocking a ton of AR use cases, and making the technology actually practical, will be a fast and reliable segmentation system that can detect each object in the space around you (like your monitor).
Research on computer-vision-based object segmentation has been ongoing for many years, but one of the main challenges is that helping computers understand what they’re looking at means training an AI model on a large number of images.
These models can be quite good at recognizing the objects they were trained on, but they often struggle with unfamiliar objects. As a result, one of the main obstacles to object segmentation is assembling a sufficiently large set of images for the models to learn from—and gathering enough images and properly annotating them for training is no easy feat.
SAM
The Segment Anything Model (SAM) is a brand-new project that Meta recently published work on. The company is sharing both a segmentation model and a sizable dataset of training images for others to build upon.
The goal of the initiative is to reduce the need for task-specific modeling expertise. SAM is a general segmentation model capable of recognizing any object in any image or video, even objects that weren’t present during training.
SAM supports both automatic and interactive segmentation, allowing it to recognize specific objects in a scene with minimal user involvement. Users control what SAM is trying to identify at any given moment by “prompting” the system with clicks, boxes, and other cues.
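To make the point-prompt idea concrete, here is a minimal, hypothetical sketch (not the actual SAM API): given several candidate object masks with confidence scores, a click selects the highest-scoring mask that contains the clicked pixel. The `select_mask` helper and the toy 4×4 scene are illustrative assumptions.

```python
import numpy as np

def select_mask(masks, scores, point):
    """Return the highest-scoring (mask, score) pair whose mask
    contains the prompt point. `point` is a (row, col) click."""
    y, x = point
    best = None
    for mask, score in zip(masks, scores):
        if mask[y, x] and (best is None or score > best[1]):
            best = (mask, score)
    return best

# Toy scene: two candidate "object" masks on a 4x4 image.
a = np.zeros((4, 4), dtype=bool); a[0:2, 0:2] = True   # top-left object
b = np.zeros((4, 4), dtype=bool); b[2:4, 2:4] = True   # bottom-right object

# A click at (3, 3) falls inside the bottom-right object only.
mask, score = select_mask([a, b], [0.9, 0.8], point=(3, 3))
```

In the real system the candidate masks come from the model itself, and box or text prompts constrain the search in the same spirit as the point check above.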
It’s easy to see how this point-based prompting could be especially effective when combined with eye-tracking on an AR headset. In fact, Meta has used the system to demonstrate exactly that use case:
How SAM Knows All That
SAM’s training data, which includes a whopping 11 million images and 1.1 billion segmentation masks, contributes to some of its amazing abilities. According to Meta, the dataset is significantly more extensive than existing datasets, giving SAM far more practice during learning and enabling it to segment a wide range of objects.
Meta refers to the SAM dataset as SA-1B, and the company is making the complete set available for other researchers to use.
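Mask annotations in datasets like this are typically stored in COCO-style run-length encoding (RLE) rather than as raw pixel arrays, which keeps a billion masks manageable on disk. As a rough illustration, here is a sketch of decoding an uncompressed RLE into a boolean mask; the actual SA-1B files use a compressed RLE variant (handled by tools like pycocotools), so `decode_rle` and the toy 3×3 example below are simplifying assumptions.

```python
import numpy as np

def decode_rle(counts, height, width):
    """Decode an uncompressed COCO-style run-length encoding.
    Runs alternate background/foreground, starting with background,
    laid out in column-major (Fortran) order as in COCO masks."""
    flat = np.zeros(height * width, dtype=bool)
    pos, value = 0, False
    for run in counts:
        flat[pos:pos + run] = value
        pos += run
        value = not value
    return flat.reshape((height, width), order="F")

# Toy 3x3 mask: 4 background pixels, then 5 foreground pixels.
mask = decode_rle([4, 5], height=3, width=3)
```

Storing only run lengths instead of per-pixel values is what makes distributing a dataset with over a billion masks practical.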
Both the release of this enormous training dataset and the work on promptable segmentation are intended to accelerate research into image and video understanding. Meta anticipates that the SAM model could be integrated into larger systems, opening up a variety of applications in fields including augmented reality, content creation, scientific domains, and more general AI systems.