Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation

By Dr. Mike

February 2, 2026

0

3

Residence

Desk of Contents

Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation
Configuring Your Growth Atmosphere
Setup and Imports
Loading the SAM 3 Mannequin
Downloading a Few Photos

Multi-Textual content Prompts on a Single Picture
Batched Inference Utilizing A number of Textual content Prompts Throughout A number of Photos
Single Bounding Field Immediate
A number of Bounding Field Prompts on a Single Picture (Twin Optimistic Foreground Areas)
A number of Bounding Field Prompts on a Single Picture (Optimistic Foreground and Adverse Background Management)
Combining Textual content and Visible Prompts for Selective Segmentation (Excluding the Undesired Areas)
Batched Blended-Immediate Segmentation Throughout Two Photos (Textual content and Bounding Field Steering)
Interactive Segmentation Utilizing Bounding Field Refinement (Draw to Phase)
Interactive Segmentation Utilizing Level-Based mostly Refinement (Click on to Information the Mannequin)

Abstract

Quotation Info

Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation

Welcome to Half 2 of our SAM 3 tutorial. In Half 1, we explored the theoretical foundations of SAM 3 and demonstrated primary text-based segmentation. Now, we unlock its full potential by mastering superior prompting strategies and interactive workflows.

advanced-sam-3-multi-modal-prompting-and-interactive-segmentation-featured.png

SAM 3’s true energy lies in its flexibility; it doesn’t simply settle for textual content prompts. It could actually course of a number of textual content queries concurrently, interpret bounding field coordinates, mix textual content with visible cues, and reply to interactive point-based steering. This multi-modal method allows subtle segmentation workflows that have been beforehand impractical with conventional fashions.

In Half 2, we’ll cowl:

Multi-prompt Segmentation: Question a number of ideas in a single picture
Batched Inference: Course of a number of photographs with totally different prompts effectively
Bounding Field Steering: Use spatial hints for exact localization
Optimistic and Adverse Prompts: Embody desired areas whereas excluding undesirable areas
Hybrid Prompting: Mix textual content and visible cues for selective segmentation
Interactive Refinement: Draw bounding containers and click on factors for real-time segmentation management

Every approach is demonstrated with full code examples and visible outputs, offering production-ready workflows for information annotation, video modifying, scientific analysis, and extra.

This lesson is the 2nd of a 4-part sequence on SAM 3:

SAM 3: Idea-Based mostly Visible Understanding and Segmentation
Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation (this tutorial)
Lesson 3
Lesson 4

To discover ways to carry out superior multi-modal prompting and interactive segmentation with SAM 3, simply hold studying.

In search of the supply code to this put up?

Would you want quick entry to three,457 photographs curated and labeled with hand gestures to coach, discover, and experiment with … at no cost? Head over to Roboflow and get a free account to seize these hand gesture photographs.

Configuring Your Growth Atmosphere

To comply with this information, it’s worthwhile to have the next libraries put in in your system.

!pip set up --q git+https://github.com/huggingface/transformers supervision jupyter_bbox_widget

We set up the transformers library to load the SAM 3 mannequin and processor, the supervision library for annotation, drawing, and inspection (which we use later to visualise bounding containers and segmentation outputs). Moreover, we set up jupyter_bbox_widget, an interactive widget that runs inside a pocket book, enabling us to click on on the picture so as to add factors or draw bounding containers.

We additionally go the --q flag to cover set up logs. This retains pocket book output clear.

Want Assist Configuring Your Growth Atmosphere?

Having bother configuring your improvement setting? Need entry to pre-configured Jupyter Notebooks operating on Google Colab? Make sure you be part of PyImageSearch College — you’ll be up and operating with this tutorial in a matter of minutes.

All that stated, are you:

Brief on time?
Studying in your employer’s administratively locked system?
Desirous to skip the trouble of preventing with the command line, bundle managers, and digital environments?
Able to run the code instantly in your Home windows, macOS, or Linux system?

Then be part of PyImageSearch College at this time!

Acquire entry to Jupyter Notebooks for this tutorial and different PyImageSearch guides pre-configured to run on Google Colab’s ecosystem proper in your internet browser! No set up required.

And better of all, these Jupyter Notebooks will run on Home windows, macOS, and Linux!

Setup and Imports

As soon as put in, we proceed to import the required libraries.

import io
import torch
import base64
import requests
import matplotlib
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt

from google.colab import output
from speed up import Accelerator
from IPython.show import show
from jupyter_bbox_widget import BBoxWidget
from PIL import Picture, ImageDraw, ImageFont
from transformers import Sam3Processor, Sam3Model, Sam3TrackerProcessor, Sam3TrackerModel

We import the next:

io: Python’s built-in module for dealing with in-memory picture buffers when changing PIL photographs to base64 format
torch: used to run the SAM 3 mannequin, ship tensors to the GPU, and work with mannequin outputs
base64: used to transform our photographs into base64 strings in order that the BBox widget can show them within the pocket book
requests: a library to obtain photographs straight from a URL; this retains our workflow easy and avoids handbook file uploads

We additionally import a number of helper libraries:

matplotlib.pyplot: helps us visualize masks and overlays
numpy: provides us quick array operations
ipywidgets: allows interactive parts contained in the pocket book

We import the output utility from Colab, which we later use to allow interactive widgets. With out this step, our bounding field widget won’t render. We additionally import Accelerator from Hugging Face to run the mannequin effectively on both the CPU or GPU utilizing the identical code. It additionally simplifies gadget placement.

We import the show perform to render photographs and widgets straight in pocket book cells, and BBoxWidget serves because the core interactive instrument, permitting us to click on and draw bounding containers or factors on a picture. We use this as our immediate enter system.

We additionally import 3 lessons from Pillow:

Picture: hundreds RGB photographs
ImageDraw: helps us draw shapes on photographs
ImageFont: provides us textual content rendering help for overlays

Lastly, we import our SAM 3 instruments from transformers:

Sam3Processor: prepares inputs for the segmentation mannequin
Sam3Model: performs segmentation from textual content and field prompts
Sam3TrackerProcessor: prepares inputs for point-based or monitoring prompts
Sam3TrackerModel: runs point-based segmentation and masking

Loading the SAM 3 Mannequin

gadget = "cuda" if torch.cuda.is_available() else "cpu"

processor = Sam3Processor.from_pretrained("fb/sam3")
mannequin = Sam3Model.from_pretrained("fb/sam3").to(gadget)

First, we test if a GPU is offered within the setting. If PyTorch detects CUDA (Compute Unified System Structure) help, then we use the GPU for quicker inference. In any other case, we fall again to the CPU. This test ensures our code runs effectively on any machine (Line 1).

Subsequent, we load the Sam3Processor. The processor is answerable for making ready all inputs earlier than they attain the mannequin. It handles picture preprocessing, bounding field formatting, textual content prompts, and tensor conversion. Briefly, it makes our uncooked photographs appropriate with the mannequin (Line 3).

Lastly, we load the Sam3Model from Hugging Face. This mannequin takes the processed inputs and generates segmentation masks. We instantly transfer the mannequin to the chosen gadget (GPU or CPU) for inference (Line 4).

Downloading a Few Photos

!wget -q https://media.roboflow.com/notebooks/examples/birds.jpg
!wget -q https://media.roboflow.com/notebooks/examples/traffic_jam.jpg
!wget -q https://media.roboflow.com/notebooks/examples/basketball_game.jpg
!wget -q https://media.roboflow.com/notebooks/examples/dog-2.jpeg

Right here, we obtain just a few photographs from the Roboflow media server utilizing the wget command and use the -q flag to suppress output and hold the pocket book clear.

Multi-Textual content Prompts on a Single Picture

On this instance, we apply two totally different textual content prompts to the identical picture: participant in white and participant in blue. As an alternative of operating SAM 3 as soon as, we loop over each prompts, and every textual content question produces a brand new set of occasion masks. We then merge all detections right into a single outcome and visualize them collectively.

prompts = ["player in white", "player in blue"]
IMAGE_PATH = "/content material/basketball_game.jpg"

# Load picture
picture = Picture.open(IMAGE_PATH).convert("RGB")

all_masks = []
all_boxes = []
all_scores = []

total_objects = 0

for immediate in prompts:
   inputs = processor(
       photographs=picture,
       textual content=immediate,
       return_tensors="pt"
   ).to(gadget)

   with torch.no_grad():
       outputs = mannequin(**inputs)

   outcomes = processor.post_process_instance_segmentation(
       outputs,
       threshold=0.5,
       mask_threshold=0.5,
       target_sizes=inputs["original_sizes"].tolist()
   )[0]

   num_objects = len(outcomes["masks"])
   total_objects += num_objects

   print(f"Discovered {num_objects} objects for immediate: '{immediate}'")

   all_masks.append(outcomes["masks"])
   all_boxes.append(outcomes["boxes"])
   all_scores.append(outcomes["scores"])

outcomes = {
   "masks": torch.cat(all_masks, dim=0),
   "containers": torch.cat(all_boxes, dim=0),
   "scores": torch.cat(all_scores, dim=0),
}

print(f"nTotal objects discovered throughout all prompts: {total_objects}")

First, we outline our two textual content prompts. Every describes a distinct visible idea within the picture (Line 1). We additionally set the trail to our basketball recreation picture (Line 2). We load the picture and convert it to RGB. This ensures the colours are constant earlier than sending it to the mannequin (Line 5).

Subsequent, we initialize empty lists to retailer masks, bounding containers, and confidence scores for every immediate. We additionally monitor the entire variety of detections (Traces 7-11).

We run inference with out monitoring gradients. That is extra environment friendly and makes use of much less reminiscence. After inference, we post-process the outputs. We apply thresholds, convert logits to binary masks, and resize them to match the unique picture (Traces 13-28).

We depend the variety of objects detected for the present immediate, replace the operating complete, and print the outcome. We retailer the present immediate’s masks, containers, and scores of their respective lists (Traces 30-37).

As soon as the loop is completed, we concatenate all masks, bounding containers, and scores right into a single outcomes dictionary. This permits us to visualise all objects collectively, no matter which immediate produced them. We print the entire variety of detections throughout all prompts (Traces 39-45).

Beneath are the numbers of objects detected for every immediate, in addition to the entire variety of objects detected.

Discovered 5 objects for immediate: 'participant in white'

Discovered 6 objects for immediate: 'participant in blue'

Whole objects discovered throughout all prompts: 11

Output

labels = []
for immediate, scores in zip(prompts, all_scores):
   labels.lengthen([prompt] * len(scores))

overlay_masks_boxes_scores(
   picture=picture,
   masks=outcomes["masks"],
   containers=outcomes["boxes"],
   scores=outcomes["scores"],
   labels=labels,
   score_threshold=0.5,
   alpha=0.45,
)

Now, to visualise the output, we generate a listing of textual content labels. Every label matches the immediate that produced the detection (Traces 1-3).

Lastly, we visualize the whole lot without delay utilizing overlay_masks_boxes_scores. The output picture (Determine 1) reveals masks, bounding containers, and confidence scores for gamers in white and gamers in blue — cleanly layered on prime of the unique body (Traces 5-13).

**Determine 1:** Multi-text immediate segmentation of “participant in white” and “participant in blue” on a single picture (supply: visualization by the creator)

Batched Inference Utilizing A number of Textual content Prompts Throughout A number of Photos

On this instance, we run SAM 3 on two photographs without delay and supply a separate textual content immediate for every. This offers us a clear, parallel workflow: one batch, two prompts, two photographs, two units of segmentation outcomes.

cat_url = "http://photographs.cocodataset.org/val2017/000000077595.jpg"
kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
photographs = [
   Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
   Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
]

text_prompts = ["ear", "dial"]

inputs = processor(photographs=photographs, textual content=text_prompts, return_tensors="pt").to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

# Submit-process outcomes for each photographs
outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs.get("original_sizes").tolist()
)

print(f"Picture 1: {len(outcomes[0]['masks'])} objects discovered")
print(f"Picture 2: {len(outcomes[1]['masks'])} objects discovered")

First, we outline two URLs. The primary factors to a cat picture. The second factors to a kitchen scene from COCO (Traces 1 and a pair of).

Subsequent, we obtain the 2 photographs, load them into reminiscence, and convert them to RGB. We retailer each photographs in a listing. This permits us to batch them later. Then, we outline one immediate per picture. The primary immediate searches for a cat’s ear. The second immediate seems to be for a dial within the kitchen scene (Traces 3-8).

We batch the pictures and batch the prompts right into a single enter construction. This offers SAM 3 two parallel vision-language duties, packed into one tensor (Line 10).

We disable gradient computation and run the mannequin in inference mode. The outputs comprise segmentation predictions for each photographs. We post-process the uncooked logits. SAM 3 returns outcomes as a listing: one entry per picture. Every entry accommodates occasion masks, bounding containers, and confidence scores (Traces 12-21).

We depend the variety of objects detected for every immediate. This offers us a easy, semantic abstract of mannequin efficiency (Traces 23 and 24).

Beneath is the entire variety of objects detected in every picture offered for every textual content immediate.

Picture 1: 2 objects discovered

Picture 2: 7 objects discovered

Output

for picture, outcome, immediate in zip(photographs, outcomes, text_prompts):
   labels = [prompt] * len(outcome["scores"])
   vis = overlay_masks_boxes_scores(picture, outcome["masks"], outcome["boxes"], outcome["scores"], labels)
   show(vis)

To visualise the output, we pair every picture with its corresponding immediate and outcome. For every batch entry, we do the next (Line 1):

create a label per detected object (Line 2)
visualize the masks, containers, and scores utilizing our overlay helper (Line 3)
show the annotated outcome within the pocket book (Line 4)

This method reveals how SAM 3 handles a number of textual content prompts and pictures concurrently, with out writing separate inference loops.

In Determine 2, we will see the item (ear) detected within the picture.

**Determine 2:** Batched inference outcome for Picture 1 displaying “ear” detections (supply: visualization by the creator)

In Determine 3, we will see the item (dial) detected within the picture.

**Determine 3:** Batched inference outcome for Picture 2 displaying “dial” detections (supply: visualization by the creator)

Single Bounding Field Immediate

On this instance, we carry out segmentation utilizing a bounding field as a substitute of a textual content immediate. We offer the mannequin with a spatial trace that claims: “focus right here.” SAM 3 then segments all detected cases of an idea offered by the spatial trace.

# Load picture
image_url = "http://photographs.cocodataset.org/val2017/000000077595.jpg"
picture = Picture.open(requests.get(image_url, stream=True).uncooked).convert("RGB")

# Field in xyxy format: [x1, y1, x2, y2]
box_xyxy = [100, 150, 500, 450]

input_boxes = [[box_xyxy]]        
input_boxes_labels = [[1]]          # 1 = constructive (foreground) field

def draw_input_box(picture, field, colour="pink", width=3):
   img = picture.copy().convert("RGB")
   draw = ImageDraw.Draw(img)
   x1, y1, x2, y2 = field
   draw.rectangle([(x1, y1), (x2, y2)], define=colour, width=width)
   return img

input_box_vis = draw_input_box(picture, box_xyxy)
input_box_vis

First, we load an instance COCO picture straight from a URL. We learn the uncooked bytes, open them with Pillow, and convert them to RGB (Traces 2 and three).

Subsequent, we outline a bounding field across the area to be segmented. The coordinates comply with the xyxy format (Line 6).

(x1, y1): top-left nook
(x2, y2): bottom-right nook

We put together the field for the processor.

The outer checklist signifies a batch measurement of 1. The internal checklist holds the one bounding field (Line 8).
We set the label to 1, that means this can be a constructive field, and SAM 3 ought to give attention to this area (Line 9).

Then, we outline a helper to visualise the immediate field. The perform attracts a coloured rectangle over the picture, making the immediate simple to confirm earlier than segmentation (Traces 11-16).

We show the enter field overlay. This confirms our immediate is appropriate earlier than operating the mannequin (Traces 18 and 19).

Determine 4 reveals the bounding field immediate overlaid on the enter picture.

**Determine 4:** Single bounding field immediate drawn over the enter picture (supply: visualization by the creator)

inputs = processor(
   photographs=picture,
   input_boxes=input_boxes,
   input_boxes_labels=input_boxes_labels,
   return_tensors="pt"
).to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Discovered {len(outcomes['masks'])} objects")

Now, we put together the ultimate inputs for the mannequin. As an alternative of passing textual content, we go bounding field prompts. The processor handles resizing, padding, normalization, and tensor conversion. We then transfer the whole lot to the chosen gadget (GPU or CPU) (Traces 1-6).

We run SAM 3 in inference mode. The torch.no_grad() perform disables gradient computation, lowering reminiscence utilization and enhancing velocity (Traces 8 and 9).

After inference, we reshape and threshold the expected masks. We resize them again to their authentic sizes so that they align completely. We index [0] as a result of we’re working with a single picture (Traces 11-16).

We print the variety of foreground objects that SAM 3 detected throughout the bounding field (Line 18).

Discovered 1 objects

Output

labels = ["box-prompted object"] * len(outcomes["scores"])

overlay_masks_boxes_scores(
   picture=picture,
   masks=outcomes["masks"],
   containers=outcomes["boxes"],
   scores=outcomes["scores"],
   labels=labels,
   score_threshold=0.5,
   alpha=0.45,
)

To visualise the outcomes, we create a label string "box-prompted object" for every detected occasion to maintain the overlay trying clear (Line 1).

Lastly, we name our overlay helper. It blends the segmentation masks, attracts the bounding field, and reveals confidence scores on prime of the unique picture (Traces 3-11).

Determine 5 reveals the segmented object.

**Determine 5:** Segmentation outcome guided by a single bounding field immediate (supply: visualization by the creator)

A number of Bounding Field Prompts on a Single Picture (Twin Optimistic Foreground Areas)

On this instance, we information SAM 3 utilizing two constructive bounding containers. Every field marks a small area of curiosity contained in the picture: one across the oven dial and one round a close-by button. Each containers act as foreground alerts. SAM 3 then segments all detected objects inside these marked areas.

kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Picture.open(
   requests.get(kitchen_url, stream=True).uncooked
).convert("RGB")

box1_xyxy = [59, 144, 76, 163]   # Dial
box2_xyxy = [87, 148, 104, 159] # Button

input_boxes = [[box1_xyxy, box2_xyxy]]    
input_boxes_labels = [[1, 1]]               # 1 = constructive (foreground)

def draw_input_boxes(picture, containers, colour="pink", width=3):
   img = picture.copy().convert("RGB")
   draw = ImageDraw.Draw(img)

   for field in containers:
       x1, y1, x2, y2 = field
       draw.rectangle([(x1, y1), (x2, y2)], define=colour, width=width)

   return img

input_box_vis = draw_input_boxes(
   kitchen_image,
   [box1_xyxy, box2_xyxy]
)

input_box_vis

First, we load the kitchen picture from COCO. We obtain the uncooked picture bytes, open them with Pillow, and convert the picture to RGB. Subsequent, we outline two bounding containers. Each comply with the xyxy format. The primary field highlights the oven dial. The second field highlights the oven button (Traces 1-7).

We pack each bounding containers right into a single checklist, since we’re working with a single picture. We assign a worth of 1 to each containers, indicating that each are constructive prompts. We outline a helper perform to visualise the bounding field prompts. For every field, we draw a pink rectangle overlay on a duplicate of the picture (Traces 9-20).

We draw each containers and show the outcome. This offers us a visible affirmation of our bounding field prompts earlier than operating the mannequin (Traces 22-27).

Determine 6 reveals the 2 constructive bounding containers superimposed on the enter picture.

**Determine 6:** Two constructive bounding field prompts (dial and button) superimposed on the enter picture (supply: visualization by the creator)

inputs = processor(
   photographs=kitchen_image,
   input_boxes=input_boxes,
   input_boxes_labels=input_boxes_labels,
   return_tensors="pt"
).to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Discovered {len(outcomes['masks'])} objects")

Now, we put together the picture and the bounding field prompts utilizing the processor. We then ship the tensors to the CPU or GPU. We run SAM 3 in inference mode. We disable gradient monitoring to enhance reminiscence and velocity (Traces 1-9).

Subsequent, we post-process the uncooked outputs. We resize masks again to their authentic form, and we filter low-confidence outcomes. We print the variety of detected objects that fall inside our two constructive bounding field prompts (Traces 11-18).

Beneath is the entire variety of objects detected within the picture.

Discovered 7 objects

Output

labels = ["box-prompted object"] * len(outcomes["scores"])

overlay_masks_boxes_scores(
   picture=kitchen_image,
   masks=outcomes["masks"],
   containers=outcomes["boxes"],
   scores=outcomes["scores"],
   labels=labels,
)

We generate a label for visualization. Lastly, we overlay the segmented objects on the picture utilizing the overlay_masks_boxes_scores perform (Traces 1-9).

Right here, Determine 7 shows all segmented objects.

**Determine 7:** Segmentation outcomes from twin constructive bounding field prompts (supply: visualization by the creator)

A number of Bounding Field Prompts on a Single Picture (Optimistic Foreground and Adverse Background Management)

On this instance, we information SAM 3 utilizing two bounding containers: one constructive and one unfavorable. The constructive field highlights the area we wish to section, whereas the unfavorable field tells the mannequin to disregard a close-by area. This mix provides us nice management over the segmentation outcome.

kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Picture.open(
   requests.get(kitchen_url, stream=True).uncooked
).convert("RGB")

box1_xyxy = [59, 144, 76, 163]   # Dial
box2_xyxy = [87, 148, 104, 159] # Button

input_boxes = [[box1_xyxy, box2_xyxy]]    
input_boxes_labels = [[1, 0]]              

def draw_input_boxes(picture, containers, labels, width=3):
   """
   containers  : checklist of [x1, y1, x2, y2]
   labels : checklist of ints (1 = constructive, 0 = unfavorable)
   """
   img = picture.copy().convert("RGB")
   draw = ImageDraw.Draw(img)

   for field, label in zip(containers, labels):
       x1, y1, x2, y2 = field

       # Coloration by label
       colour = "inexperienced" if label == 1 else "pink"

       draw.rectangle(
           [(x1, y1), (x2, y2)],
           define=colour,
           width=width,
       )

   return img

input_box_vis = draw_input_boxes(
   kitchen_image,
   containers=[box1_xyxy, box2_xyxy],
   labels=[1, 0],   # 1 = constructive, 0 = unfavorable
)

input_box_vis

First, we load our kitchen picture from the COCO dataset. We fetch the bytes from the URL and convert them to RGB (Traces 1-4).

Subsequent, we outline two bounding containers. Each comply with the xyxy coordinate format (Traces 6 and seven):

first field: surrounds the oven dial
second field: surrounds a close-by oven button

We pack the 2 containers right into a single checklist as a result of we’re working with a single picture. We set labels [1, 0], that means (Traces 9 and 10):

dial field: constructive (foreground to incorporate)
button field: unfavorable (space to exclude)

We outline a helper perform that attracts bounding containers in several colours. Optimistic prompts are drawn in inexperienced. Adverse prompts are drawn in pink (Traces 12-32).

We visualize the bounding field prompts overlaid on the picture. This offers us a transparent understanding of how we’re instructing SAM 3 (Traces 34-40).

Determine 8 reveals the constructive and unfavorable field prompts superimposed on the enter picture.

**Determine 8:** Optimistic (embrace) and unfavorable (exclude) bounding field prompts proven on the enter picture (supply: visualization by the creator)

inputs = processor(
   photographs=kitchen_image,
   input_boxes=input_boxes,
   input_boxes_labels=input_boxes_labels,
   return_tensors="pt"
).to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Discovered {len(outcomes['masks'])} objects")

We put together the inputs for SAM 3. The processor handles preprocessing and tensor conversion. We carry out inference. Gradients are disabled to cut back reminiscence utilization. Subsequent, we post-process the outcomes. SAM 3 returns occasion masks filtered by confidence and resized to the unique decision (Traces 1-16).

We print the variety of objects segmented utilizing this foreground-background mixture (Line 18).

Beneath is the entire variety of objects detected within the picture.

Discovered 6 objects

Output

labels = ["box-prompted object"] * len(outcomes["scores"])

overlay_masks_boxes_scores(
   picture=kitchen_image,
   masks=outcomes["masks"],
   containers=outcomes["boxes"],
   scores=outcomes["scores"],
   labels=labels,
)

We assign labels to detections to make sure the overlay shows significant textual content. Lastly, we visualize the segmentation (Traces 1-9).

In Determine 9, the constructive immediate guides SAM 3 to section the dial, whereas the unfavorable immediate suppresses the close by button.

**Determine 9:** Segmentation outcome utilizing mixed constructive/unfavorable field steering to isolate the dial whereas suppressing a close-by area (supply: visualization by the creator)

Combining Textual content and Visible Prompts for Selective Segmentation (Excluding the Undesired Areas)

On this instance, we use two totally different immediate varieties on the similar time:

textual content immediate: to seek for "deal with"
unfavorable bounding field: to exclude the oven deal with area

This supplies selective management, permitting SAM 3 to give attention to handles within the scene whereas ignoring a particular space.

kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Picture.open(
   requests.get(kitchen_url, stream=True).uncooked
).convert("RGB")

# Phase "deal with" however exclude the oven deal with utilizing a unfavorable field
textual content = "deal with"
# Adverse field overlaying oven deal with space (xyxy): [40, 183, 318, 204]
oven_handle_box = [40, 183, 318, 204]
input_boxes = [[oven_handle_box]]

def draw_negative_box(picture, field, width=3):
   img = picture.copy().convert("RGB")
   draw = ImageDraw.Draw(img)

   x1, y1, x2, y2 = field
   draw.rectangle(
       [(x1, y1), (x2, y2)],
       define="pink",   # pink = unfavorable
       width=width,
   )

   return img

neg_box_vis = draw_negative_box(
   kitchen_image,
   oven_handle_box
)

neg_box_vis

First, we load the kitchen picture from the COCO dataset. We learn the file from the URL, open it as a Pillow picture, and convert it to RGB (Traces 1-4).

Subsequent, we outline the construction of our immediate. We wish to section handles within the kitchen, however exclude the massive oven deal with. We describe the idea utilizing textual content ("deal with") and draw a bounding field over the oven deal with area (Traces 7-10).

We write a helper perform to visualise our unfavorable area. We draw a pink bounding field to indicate that this space needs to be excluded. We show the unfavorable immediate overlay. This helps verify that the area is positioned accurately (Traces 12-30).

Figure 10 reveals the bounding field immediate to exclude the oven deal with area.

**Determine 10:** Adverse bounding field overlaying the oven deal with area to exclude it from segmentation (supply: visualization by the creator)

inputs = processor(
   photographs=kitchen_image,
   textual content="deal with",
   input_boxes=[[oven_handle_box]],
   input_boxes_labels=[[0]],   # unfavorable field
   return_tensors="pt"
).to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Discovered {len(outcomes['masks'])} objects")

Right here, we put together the inputs for SAM 3. We mix textual content and bounding field prompts. We mark the bounding field with a 0 label, that means it’s a unfavorable area that the mannequin should ignore (Traces 1-7).

We run the mannequin in inference mode. This yields uncooked segmentation predictions primarily based on each immediate varieties. We post-process the outcomes by changing logits into binary masks, filtering low-confidence predictions, and resizing the masks again to the unique decision (Traces 9-17).

We report under the variety of handle-like objects remaining after excluding the oven deal with (Line 19).

Discovered 3 objects

Output

labels = ["handle (excluding oven)"] * len(outcomes["scores"])

final_vis = overlay_masks_boxes_scores(
   picture=kitchen_image,
   masks=outcomes["masks"],
   containers=outcomes["boxes"],
   scores=outcomes["scores"],
   labels=labels,
   score_threshold=0.5,
   alpha=0.45,
)

final_vis

We assign significant labels for visualization. Lastly, we draw masks, bounding containers, labels, and scores on the picture (Traces 1-13).

In Determine 11, the outcome reveals solely handles outdoors the unfavorable area.

**Determine 11:** Hybrid prompting outcome: `"deal with"` segmentation whereas excluding the oven deal with through a unfavorable field (supply: visualization by the creator)

Batched Blended-Immediate Segmentation Throughout Two Photos (Textual content and Bounding Field Steering)

On this instance, we exhibit how SAM 3 can deal with a number of immediate varieties in a single batch. The primary picture receives a textual content immediate ("laptop computer"), whereas the second picture receives a visible immediate (constructive bounding field). Each photographs are processed collectively in a single ahead go.

textual content=["laptop", None]
input_boxes=[None, [box2_xyxy]]
input_boxes_labels=[None, [1]]

def draw_input_box(picture, field, colour="inexperienced", width=3):
   img = picture.copy().convert("RGB")
   draw = ImageDraw.Draw(img)
   x1, y1, x2, y2 = field
   draw.rectangle([(x1, y1), (x2, y2)], define=colour, width=width)
   return img

input_vis_1 = photographs[0]  # textual content immediate → no field
input_vis_2 = draw_input_box(photographs[1], box2_xyxy)

First, we outline 3 parallel immediate lists:

1 for textual content
1 for bounding containers
1 for bounding field labels

We set the primary entry in every checklist to None for the primary picture as a result of we solely wish to use pure language there (laptop computer). For the second picture, we provide a bounding field and label it as constructive (1) (Traces 1-3).

We outline a small helper perform to attract a bounding field on a picture. This helps us visualize the immediate area earlier than inference. Right here, we put together two preview photographs (Traces 5-13):

first picture: reveals no field, since it’ll use textual content solely
second picture: is rendered with its bounding field immediate

input_vis_1

Determine 12 reveals no field over the picture, because it makes use of a textual content immediate for segmentation.

**Determine 12:** Batched mixed-prompt setup: Picture 1 makes use of a textual content immediate (no field overlay proven) (supply: picture by the creator)

input_vis_2

Determine 13 reveals a bounding field over the picture as a result of it makes use of a field immediate for segmentation.

**Determine 13:** Batched mixed-prompt setup: Picture 2 makes use of a constructive bounding field immediate (supply: visualization by the creator)

inputs = processor(
   photographs=photographs,
   textual content=["laptop", None],
   input_boxes=[None, [box2_xyxy]],
   input_boxes_labels=[None, [1]],
   return_tensors="pt"
).to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)

Subsequent, we assemble the whole lot right into a single batched enter. This offers SAM 3:

2 photographs
2 immediate varieties
1 ahead go

We run SAM 3 inference with out computing gradients. This produces segmentation predictions for each photographs concurrently (Traces 1-10).

We post-process the mannequin outputs for each photographs. The result’s a two-element checklist (Traces 12-17):

entry [0]: corresponds to the laptop computer question
entry [1]: corresponds to the bounding field question

Output 1: Textual content Immediate Segmentation

labels_1 = ["laptop"] * len(outcomes[0]["scores"])

overlay_masks_boxes_scores(
   picture=photographs[0],
   masks=outcomes[0]["masks"],
   containers=outcomes[0]["boxes"],
   scores=outcomes[0]["scores"],
   labels=labels_1,
   score_threshold=0.5,
)

We apply a label to every detected object within the first picture. We visualize the segmentation outcomes overlaid on the primary picture (Traces 1-10).

In Determine 14, we observe detections guided by the textual content immediate "laptop computer".

**Determine 14:** Textual content-prompt segmentation outcome for `"laptop computer"` in Picture 1 (supply: visualization by the creator)

Output 2: Bounding Field Immediate Segmentation

labels_2 = ["box-prompted object"] * len(outcomes[1]["scores"])

overlay_masks_boxes_scores(
   picture=photographs[1],
   masks=outcomes[1]["masks"],
   containers=outcomes[1]["boxes"],
   scores=outcomes[1]["scores"],
   labels=labels_2,
   score_threshold=0.5,
)

We create labels for the second picture. These detections are from the bounding field immediate. Lastly, we visualize the bounding field guided segmentation on the second picture (Traces 1-10).

In Determine 15, we will see the detections guided by the bounding field immediate.

**Determine 15:** Bounding-box-guided segmentation lead to Picture 2 (supply: visualization by the creator)

Interactive Segmentation Utilizing Bounding Field Refinement (Draw to Phase)

On this instance, we flip segmentation into a completely interactive workflow. We draw bounding containers straight over the picture utilizing a widget UI. Every drawn field turns into a immediate sign for SAM 3:

inexperienced (constructive) containers: establish areas we wish to section
pink (unfavorable) containers: exclude areas we wish the mannequin to disregard

After drawing, we convert the widget output into correct field coordinates and run SAM 3 to supply refined segmentation masks.

output.enable_custom_widget_manager()

# Load picture
url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
picture = Picture.open(requests.get(url, stream=True).uncooked).convert("RGB")

# Convert to base64
def pil_to_base64(img):
   buffer = io.BytesIO()
   img.save(buffer, format="PNG")
   return "information:picture/png;base64," + base64.b64encode(buffer.getvalue()).decode()

# Create widget
widget = BBoxWidget(
   picture=pil_to_base64(picture),
   lessons=["positive", "negative"]
)

widget

We allow customized widget help in Colab to make sure the bounding field UI renders correctly. We obtain the kitchen picture, load it into reminiscence, and convert it to RGB format (Traces 1-5).

Earlier than sending the picture into the widget, we convert it right into a base64 PNG buffer. This encoding step makes the picture displayable within the browser UI (Traces 8-11).

We create an interactive drawing widget. It shows the picture and permits the consumer so as to add labeled containers. Every field is tagged as both "constructive" or "unfavorable" (Traces 14-17).

We render the widget within the pocket book. At this level, the consumer can draw, transfer, resize, and delete bounding containers (Line 19).

In Determine 16, we will see the constructive and unfavorable bounding containers drawn by the consumer. The blue field signifies areas that belong to the item of curiosity, whereas the orange field marks background areas that needs to be ignored. These annotations function interactive steering alerts for refining the segmentation output.

**Determine 16:** Interactive field drawing UI displaying constructive and unfavorable field annotations (supply: picture by the creator)

print(widget.bboxes)

The widget.bboxes object shops metadata for each annotation drawn by the consumer on the picture. Every entry corresponds to a single field created within the interactive widget.

A typical output seems to be like this:

[{'x': 58, 'y': 147, 'width': 18, 'height': 18, 'label': 'positive'}, {'x': 88, 'y': 149, 'width': 18, 'height': 8, 'label': 'negative'}]

Every dictionary represents a single consumer annotation:

x and y: point out the top-left nook of the drawn field in pixel coordinates
width and peak: describe the dimensions of the field
label: tells us whether or not the annotation is a 'constructive' level (object) or a 'unfavorable' level (background)

def widget_to_sam_boxes(widget):
   containers = []
   labels = []

   for ann in widget.bboxes:
       x = int(ann["x"])
       y = int(ann["y"])
       w = int(ann["width"])
       h = int(ann["height"])

       x1 = x
       y1 = y
       x2 = x + w
       y2 = y + h

       label = ann.get("label") or ann.get("class")

       containers.append([x1, y1, x2, y2])
       labels.append(1 if label == "constructive" else 0)

   return containers, labels

containers, box_labels = widget_to_sam_boxes(widget)

print("Containers:", containers)
print("Labels:", box_labels)

We outline a helper perform to translate widget information into SAM-compatible xyxy coordinates. The widget provides us x/y + width/peak. We convert to SAM’s xyxy format.

We encode labels into SAM 3 format:

1: constructive area
0: unfavorable area

The perform returns legitimate field lists prepared for inference. We extract the interactive field prompts (Traces 23-45).

Beneath are the Containers and Labels within the required format.

Containers: [[58, 147, 76, 165], [88, 149, 106, 157]]

Labels: [1, 0]

inputs = processor(
   photographs=picture,
   input_boxes=[boxes],              # batch measurement = 1
   input_boxes_labels=[box_labels],
   return_tensors="pt"
).to(gadget)

with torch.no_grad():
   outputs = mannequin(**inputs)

outcomes = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Discovered {len(outcomes['masks'])} objects")

We go the picture and interactive field prompts into the processor. We run inference with out monitoring gradients. We convert logits into ultimate masks predictions. We print the variety of detected areas matching the interactive prompts (Traces 49-66).

Beneath is the variety of objects detected by the mannequin.

Discovered 6 objects

Output

labels = ["interactive object"] * len(outcomes["scores"])

overlay_masks_boxes_scores(
   picture=picture,
   masks=outcomes["masks"],
   containers=outcomes["boxes"],
   scores=outcomes["scores"],
   labels=labels,
   alpha=0.45,
)

We assign easy labels to every detected area and overlay masks, bounding containers, and scores on the unique picture (Traces 1-10).

This workflow demonstrates an efficient use case: human-guided refinement via dwell drawing instruments. With just some annotations, SAM 3 adapts the segmentation output, giving us precision management and quick visible suggestions.

In Determine 17, we will see the segmented areas based on the constructive and unfavorable bounding field prompts annotated by the consumer over the enter picture.

**Determine 17:** Interactive segmentation output produced from the user-drawn constructive/unfavorable field prompts (supply: visualization by the creator)

Interactive Segmentation Utilizing Level-Based mostly Refinement (Click on to Information the Mannequin)

On this instance, we section utilizing level prompts somewhat than textual content or bounding containers. We click on on the picture to mark constructive and unfavorable factors. The middle of every clicked level turns into a guiding coordinate, and SAM 3 makes use of these coordinates to refine segmentation. This workflow supplies fine-grained, pixel-level management, nicely fitted to interactive modifying or correction.

# Setup gadget
gadget = Accelerator().gadget

# Load mannequin and processor
print("Loading SAM3 mannequin...")
mannequin = Sam3TrackerModel.from_pretrained("fb/sam3").to(gadget)
processor = Sam3TrackerProcessor.from_pretrained("fb/sam3")
print("Mannequin loaded efficiently!")

# Load picture
IMAGE_PATH = "/content material/dog-2.jpeg"
raw_image = Picture.open(IMAGE_PATH).convert("RGB")

def pil_to_base64(img):
   """Convert PIL picture to base64 for BBoxWidget"""
   buffer = io.BytesIO()
   img.save(buffer, format="PNG")
   return "information:picture/png;base64," + base64.b64encode(buffer.getvalue()).decode()

We arrange our compute gadget utilizing the Accelerator() class. This robotically detects the GPU if accessible. We load the SAM 3 monitoring mannequin and processor. This variant helps point-based refinement and multi-mask output (Traces 2-7).

We load the canine picture into reminiscence and convert it to RGB format. The BBoxWidget expects picture information in base64 format. We write a helper perform to transform a PIL picture to base64 (Traces 11-18).

def get_points_from_widget(widget):
   """Extract level coordinates from widget bboxes"""
   positive_points = []
   negative_points = []
 
   for ann in widget.bboxes:
       x = int(ann["x"])
       y = int(ann["y"])
       w = int(ann["width"])
       h = int(ann["height"])
     
       # Get heart level of the bbox
       center_x = x + w // 2
       center_y = y + h // 2
     
       label = ann.get("label") or ann.get("class")
     
       if label == "constructive":
           positive_points.append([center_x, center_y])
       elif label == "unfavorable":
           negative_points.append([center_x, center_y])
 
   return positive_points, negative_points

We loop over bounding containers drawn on the widget and convert them into level coordinates. Every tiny bounding field turns into a middle level. We break up them into (Traces 20-42):

constructive factors: object
unfavorable factors: background

def segment_from_widget(b=None):
   """Run segmentation with factors from widget"""
   positive_points, negative_points = get_points_from_widget(widget)
 
   if not positive_points and never negative_points:
       print("⚠️ Please add a minimum of one level (draw small containers on the picture)!")
       return
 
   # Mix factors and labels
   all_points = positive_points + negative_points
   all_labels = [1] * len(positive_points) + [0] * len(negative_points)
 
   print(f"n🔄 Working segmentation...")
   print(f"  • {len(positive_points)} constructive factors: {positive_points}")
   print(f"  • {len(negative_points)} unfavorable factors: {negative_points}")
   # Put together inputs (4D for factors, 3D for labels)
   input_points = [[all_points]]  # [batch, object, points, xy]
   input_labels = [[all_labels]]   # [batch, object, labels]
 
   inputs = processor(
       photographs=raw_image,
       input_points=input_points,
       input_labels=input_labels,
       return_tensors="pt"
   ).to(gadget)
 
   # Run inference
   with torch.no_grad():
       outputs = mannequin(**inputs)
 
   # Submit-process masks
   masks = processor.post_process_masks(
       outputs.pred_masks.cpu(),
       inputs["original_sizes"]
   )[0]
 
   print(f"✅ Generated {masks.form[1]} masks with form {masks.form}")
 
   # Visualize outcomes
   visualize_results(masks, positive_points, negative_points)

This segment_from_widget perform handles (Traces 44-83):

studying constructive + unfavorable factors (Traces 46-58)
constructing SAM 3 inputs (Traces 60-68)
operating inference (Traces 71 and 72)
post-processing masks (Traces 75-78)
visualizing outcomes (Line 83)

We pack factors and labels into the proper mannequin format. The mannequin generates a number of ranked masks. Higher high quality masks seem at index 0.

def visualize_results(masks, positive_points, negative_points):
   """Show segmentation outcomes"""
   n_masks = masks.form[1]
 
   # Create determine with subplots
   fig, axes = plt.subplots(1, min(n_masks, 3), figsize=(15, 5))
   if n_masks == 1:
       axes = [axes]
 
   for idx in vary(min(n_masks, 3)):
       masks = masks[0, idx].numpy()
     
       # Overlay masks on picture
       img_array = np.array(raw_image)
       colored_mask = np.zeros_like(img_array)
       colored_mask[mask > 0] = [0, 255, 0]  # Inexperienced masks
     
       overlay = img_array.copy()
       overlay[mask > 0] = (img_array[mask > 0] * 0.5 + colored_mask[mask > 0] * 0.5).astype(np.uint8)
     
       axes[idx].imshow(overlay)
       axes[idx].set_title(f"Masks {idx + 1} (High quality Ranked)", fontsize=12, fontweight="daring")
       axes[idx].axis('off')
     
       # Plot factors on every masks
       for px, py in positive_points:
           axes[idx].plot(px, py, 'go', markersize=12, markeredgecolor="white", markeredgewidth=2.5)
       for nx, ny in negative_points:
           axes[idx].plot(nx, ny, 'ro', markersize=12, markeredgecolor="white", markeredgewidth=2.5)
 
   plt.tight_layout()
   plt.present()

We overlay segmentation masks over the unique picture. Optimistic factors are displayed as inexperienced dots. Adverse factors are proven in pink (Traces 85-116).

def reset_widget(b=None):
   """Clear all annotations"""
   widget.bboxes = []
   print("🔄 Reset! All factors cleared.")

This clears beforehand chosen factors so we will begin recent (Traces 118-121).

# Create widget for level choice
widget = BBoxWidget(
   picture=pil_to_base64(raw_image),
   lessons=["positive", "negative"]
)

Customers can click on so as to add factors wherever on the picture. The widget captures each place and label (Traces 124-127).

# Create UI buttons
segment_button = widgets.Button(
   description='🎯 Phase',
   button_style="success",
   tooltip='Run segmentation with marked factors',
   icon='test',
   format=widgets.Structure(width="150px", peak="40px")
)
segment_button.on_click(segment_from_widget)

reset_button = widgets.Button(
   description='🔄 Reset',
   button_style="warning",
   tooltip='Clear all factors',
   icon='refresh',
   format=widgets.Structure(width="150px", peak="40px")
)
reset_button.on_click(reset_widget)

We create UI buttons for:

operating segmentation (Traces 130-137)
clearing annotations (Traces 139-146)

# Show UI
print("=" * 70)
print("🎨 INTERACTIVE SAM3 SEGMENTATION WITH BOUNDING BOX WIDGET")
print("=" * 70)
print("n📋 Directions:")
print("  1. Draw SMALL containers on the picture the place you wish to mark factors")
print("  2. Label them as 'constructive' (object) or 'unfavorable' (background)")
print("  3. The CENTER of every field shall be used as some extent coordinate")
print("  4. Click on 'Phase' button to run SAM3")
print("  5. Click on 'Reset' to clear all factors and begin over")
print("n💡 Suggestions:")
print("  • Draw tiny containers - simply large enough to see")
print("  • Optimistic factors = components of the item you need")
print("  • Adverse factors = background areas to exclude")
print("n" + "=" * 70 + "n")

show(widgets.HBox([segment_button, reset_button]))
show(widget)

We render the interface side-by-side. The consumer can now:

click on constructive factors
click on unfavorable factors
run segmentation dwell
reset anytime

Output

In Determine 18, we will see the entire point-based segmentation course of.

Determine 18: Level-based interactive refinement workflow: deciding on factors and producing ranked masks (supply: GIF by the creator).

What’s subsequent? We advocate PyImageSearch College.

Course data:
86+ complete lessons • 115+ hours hours of on-demand code walkthrough movies • Final up to date: February 2026
★★★★★ 4.84 (128 Scores) • 16,000+ College students Enrolled

I strongly consider that in the event you had the suitable instructor you would grasp pc imaginative and prescient and deep studying.

Do you assume studying pc imaginative and prescient and deep studying needs to be time-consuming, overwhelming, and sophisticated? Or has to contain complicated arithmetic and equations? Or requires a level in pc science?

That’s not the case.

All it’s worthwhile to grasp pc imaginative and prescient and deep studying is for somebody to clarify issues to you in easy, intuitive phrases. And that’s precisely what I do. My mission is to vary schooling and the way complicated Synthetic Intelligence subjects are taught.

For those who’re critical about studying pc imaginative and prescient, your subsequent cease needs to be PyImageSearch College, probably the most complete pc imaginative and prescient, deep studying, and OpenCV course on-line at this time. Right here you’ll discover ways to efficiently and confidently apply pc imaginative and prescient to your work, analysis, and initiatives. Be a part of me in pc imaginative and prescient mastery.

Inside PyImageSearch College you may discover:

&test; 86+ programs on important pc imaginative and prescient, deep studying, and OpenCV subjects
&test; 86 Certificates of Completion
&test; 115+ hours hours of on-demand video
&test; Model new programs launched repeatedly, making certain you possibly can sustain with state-of-the-art strategies
&test; Pre-configured Jupyter Notebooks in Google Colab
&test; Run all code examples in your internet browser — works on Home windows, macOS, and Linux (no dev setting configuration required!)
&test; Entry to centralized code repos for all 540+ tutorials on PyImageSearch
&test; Simple one-click downloads for code, datasets, pre-trained fashions, and so forth.
&test; Entry on cell, laptop computer, desktop, and so forth.

Click on right here to hitch PyImageSearch College

Abstract

In Half 2 of this tutorial, we explored the superior capabilities of SAM 3, reworking it from a robust segmentation instrument into a versatile, interactive visible question system. We demonstrated the right way to leverage a number of immediate varieties (textual content, bounding containers, and factors) each individually and together to attain exact, context-aware segmentation outcomes.

We lined subtle workflows, together with:

Segmenting a number of ideas concurrently in the identical picture
Processing batches of photographs with totally different prompts effectively
Utilizing constructive bounding containers to give attention to areas of curiosity
Using unfavorable prompts to exclude undesirable areas
Combining textual content and visible prompts for selective, fine-grained management
Constructing absolutely interactive segmentation interfaces the place customers can draw containers or click on factors and see ends in real-time

These strategies showcase SAM 3’s versatility for real-world functions. Whether or not you’re constructing large-scale information annotation pipelines, creating clever video modifying instruments, creating AR experiences, or conducting scientific analysis, the multi-modal prompting capabilities we explored offer you pixel-perfect management over segmentation outputs.

Quotation Info

Thakur, P. “Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation,” PyImageSearch, P. Chugh, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/5c4ag

@incollection{Thakur_2026_advanced-sam-3-multi-modal-prompting-and-interactive-segmentation,
  creator = {Piyush Thakur},
  title = {{Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
  12 months = {2026},
  url = {https://pyimg.co/5c4ag},
}

To obtain the supply code to this put up (and be notified when future tutorials are printed right here on PyImageSearch), merely enter your e-mail tackle within the kind under!

Obtain the Supply Code and FREE 17-page Useful resource Information

Enter your e-mail tackle under to get a .zip of the code and a FREE 17-page Useful resource Information on Laptop Imaginative and prescient, OpenCV, and Deep Studying. Inside you may discover my hand-picked tutorials, books, programs, and libraries that can assist you grasp CV and DL!

The put up Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation appeared first on PyImageSearch.

Superior SAM 3: Multi-Modal Prompting and Interactive Segmentation

In search of the supply code to this put up?

Want Assist Configuring Your Growth Atmosphere?

Output

Output

Output

Output

Output

Output

Output 1: Textual content Immediate Segmentation

Output 2: Bounding Field Immediate Segmentation

Output

Output

What’s subsequent? We advocate PyImageSearch College.

Obtain the Supply Code and FREE 17-page Useful resource Information

Related Articles

Latest Articles