Table of Contents
- Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation
- Configuring Your Development Environment
- Setup and Imports
- Loading the SAM 3 Model
- Downloading a Few Images
- Multi-Text Prompts on a Single Image
- Batched Inference Using Multiple Text Prompts Across Multiple Images
- Single Bounding Box Prompt
- Multiple Bounding Box Prompts on a Single Image (Dual Positive Foreground Regions)
- Multiple Bounding Box Prompts on a Single Image (Positive Foreground and Negative Background Control)
- Combining Text and Visual Prompts for Selective Segmentation (Excluding Undesired Regions)
- Batched Mixed-Prompt Segmentation Across Two Images (Text and Bounding Box Guidance)
- Interactive Segmentation Using Bounding Box Refinement (Draw to Segment)
- Interactive Segmentation Using Point-Based Refinement (Click to Guide the Model)
- Summary
Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation
Welcome to Part 2 of our SAM 3 tutorial. In Part 1, we explored the theoretical foundations of SAM 3 and demonstrated basic text-based segmentation. Now, we unlock its full potential by mastering advanced prompting techniques and interactive workflows.
SAM 3's true power lies in its flexibility; it doesn't just accept text prompts. It can process multiple text queries simultaneously, interpret bounding box coordinates, combine text with visual cues, and respond to interactive point-based guidance. This multi-modal approach enables sophisticated segmentation workflows that were previously impractical with traditional models.
In Part 2, we'll cover:
- Multi-Prompt Segmentation: Query multiple concepts in a single image
- Batched Inference: Process multiple images with different prompts efficiently
- Bounding Box Guidance: Use spatial hints for precise localization
- Positive and Negative Prompts: Include desired regions while excluding unwanted areas
- Hybrid Prompting: Combine text and visual cues for selective segmentation
- Interactive Refinement: Draw bounding boxes and click points for real-time segmentation control
Each technique is demonstrated with complete code examples and visual outputs, providing production-ready workflows for data annotation, video editing, scientific research, and more.
This lesson is the 2nd of a 4-part series on SAM 3:
- SAM 3: Concept-Based Visual Understanding and Segmentation
- Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation (this tutorial)
- Lesson 3
- Lesson 4
To learn how to perform advanced multi-modal prompting and interactive segmentation with SAM 3, just keep reading.
Would you like immediate access to 3,457 images curated and labeled with hand gestures to train, explore, and experiment with … for free? Head over to Roboflow and get a free account to grab these hand gesture images.
Configuring Your Development Environment
To follow this guide, you need to have the following libraries installed on your system.
!pip install -q git+https://github.com/huggingface/transformers supervision jupyter_bbox_widget
We install the transformers library to load the SAM 3 model and processor, and the supervision library for annotation, drawing, and inspection (which we use later to visualize bounding boxes and segmentation outputs). Additionally, we install jupyter_bbox_widget, an interactive widget that runs inside a notebook, enabling us to click on the image to add points or draw bounding boxes.
We also pass the -q flag to hide installation logs. This keeps the notebook output clean.
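If you want to confirm the installation worked before moving on, a quick import check like the one below is enough (this is an optional sanity check, not part of the original walkthrough):
# Optional: verify the freshly installed packages import and report their versions.
import transformers
import supervision as sv

print("transformers:", transformers.__version__)
print("supervision:", sv.__version__)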
Need Help Configuring Your Development Environment?

All that said, are you:
- Short on time?
- Learning on your employer's administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab's ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Setup and Imports
Once installed, we proceed to import the required libraries.
import io
import torch
import base64
import requests
import matplotlib
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from google.colab import output
from accelerate import Accelerator
from IPython.display import display
from jupyter_bbox_widget import BBoxWidget
from PIL import Image, ImageDraw, ImageFont
from transformers import Sam3Processor, Sam3Model, Sam3TrackerProcessor, Sam3TrackerModel
We import the following:
- io: Python's built-in module for handling in-memory image buffers when converting PIL images to base64 format
- torch: used to run the SAM 3 model, send tensors to the GPU, and work with model outputs
- base64: used to convert our images into base64 strings so that the BBox widget can display them in the notebook
- requests: a library to download images directly from a URL; this keeps our workflow simple and avoids manual file uploads
We also import several helper libraries:
- matplotlib.pyplot: helps us visualize masks and overlays
- numpy: gives us fast array operations
- ipywidgets: enables interactive components inside the notebook
We import the output utility from Colab, which we later use to enable interactive widgets. Without this step, our bounding box widget will not render. We also import Accelerator from Hugging Face to run the model efficiently on either the CPU or GPU using the same code. It also simplifies device placement.
We import the display function to render images and widgets directly in notebook cells, and BBoxWidget serves as the core interactive tool, allowing us to click and draw bounding boxes or points on an image. We use this as our prompt input mechanism.
We also import 3 classes from Pillow:
- Image: loads RGB images
- ImageDraw: helps us draw shapes on images
- ImageFont: gives us text rendering support for overlays
Finally, we import our SAM 3 tools from transformers:
- Sam3Processor: prepares inputs for the segmentation model
- Sam3Model: performs segmentation from text and box prompts
- Sam3TrackerProcessor: prepares inputs for point-based or tracking prompts
- Sam3TrackerModel: runs point-based segmentation and masking
Loading the SAM 3 Model
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Sam3Processor.from_pretrained("facebook/sam3")
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
First, we check if a GPU is available in the environment. If PyTorch detects CUDA (Compute Unified Device Architecture) support, we use the GPU for faster inference. Otherwise, we fall back to the CPU. This check ensures our code runs efficiently on any machine (Line 1).
Next, we load the Sam3Processor. The processor is responsible for preparing all inputs before they reach the model. It handles image preprocessing, bounding box formatting, text prompts, and tensor conversion. In short, it makes our raw images compatible with the model (Line 3).
Finally, we load the Sam3Model from Hugging Face. This model takes the processed inputs and generates segmentation masks. We immediately move the model to the selected device (GPU or CPU) for inference (Line 4).
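As an optional sanity check (not part of the original notebook), you can confirm where the weights ended up and get a rough sense of the model's size:
# Optional check: report the device the weights are on and the parameter count.
n_params = sum(p.numel() for p in model.parameters())
print("Model device:", next(model.parameters()).device)
print(f"Parameters: {n_params / 1e6:.1f}M")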
Downloading a Few Images
!wget -q https://media.roboflow.com/notebooks/examples/birds.jpg
!wget -q https://media.roboflow.com/notebooks/examples/traffic_jam.jpg
!wget -q https://media.roboflow.com/notebooks/examples/basketball_game.jpg
!wget -q https://media.roboflow.com/notebooks/examples/dog-2.jpeg
Here, we download a few images from the Roboflow media server using the wget command and pass the -q flag to suppress output and keep the notebook clean.
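If you are working outside Colab or wget is unavailable, the same files can be fetched in pure Python with requests (an optional alternative, assuming the same Roboflow URLs):
# Optional alternative to wget: download the sample images with requests.
import requests

urls = [
    "https://media.roboflow.com/notebooks/examples/birds.jpg",
    "https://media.roboflow.com/notebooks/examples/traffic_jam.jpg",
    "https://media.roboflow.com/notebooks/examples/basketball_game.jpg",
    "https://media.roboflow.com/notebooks/examples/dog-2.jpeg",
]
for url in urls:
    filename = url.split("/")[-1]
    with open(filename, "wb") as f:
        f.write(requests.get(url).content)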
Multi-Text Prompts on a Single Image
In this example, we apply two different text prompts to the same image: player in white and player in blue. Instead of running SAM 3 once, we loop over both prompts, and each text query produces a new set of instance masks. We then merge all detections into a single result and visualize them together.
prompts = ["player in white", "player in blue"]
IMAGE_PATH = "/content material/basketball_game.jpg"
# Load picture
picture = Picture.open(IMAGE_PATH).convert("RGB")
all_masks = []
all_boxes = []
all_scores = []
total_objects = 0
for immediate in prompts:
inputs = processor(
photographs=picture,
textual content=immediate,
return_tensors="pt"
).to(gadget)
with torch.no_grad():
outputs = mannequin(**inputs)
outcomes = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
num_objects = len(outcomes["masks"])
total_objects += num_objects
print(f"Discovered {num_objects} objects for immediate: '{immediate}'")
all_masks.append(outcomes["masks"])
all_boxes.append(outcomes["boxes"])
all_scores.append(outcomes["scores"])
outcomes = {
"masks": torch.cat(all_masks, dim=0),
"containers": torch.cat(all_boxes, dim=0),
"scores": torch.cat(all_scores, dim=0),
}
print(f"nTotal objects discovered throughout all prompts: {total_objects}")
First, we define our two text prompts. Each describes a different visual concept in the image (Line 1). We also set the path to our basketball game image (Line 2). We load the image and convert it to RGB. This ensures the colors are consistent before sending it to the model (Line 5).
Next, we initialize empty lists to store the masks, bounding boxes, and confidence scores for each prompt. We also track the total number of detections (Lines 7-11).
We run inference without tracking gradients. This is more efficient and uses less memory. After inference, we post-process the outputs. We apply thresholds, convert logits to binary masks, and resize them to match the original image (Lines 13-28).
We count the number of objects detected for the current prompt, update the running total, and print the result. We store the current prompt's masks, boxes, and scores in their respective lists (Lines 30-37).
Once the loop is finished, we concatenate all masks, bounding boxes, and scores into a single results dictionary. This allows us to visualize all objects together, regardless of which prompt produced them. We print the total number of detections across all prompts (Lines 39-45).
Below are the numbers of objects detected for each prompt, as well as the total number of objects detected.
Found 5 objects for prompt: 'player in white'
Found 6 objects for prompt: 'player in blue'
Total objects found across all prompts: 11
Output
labels = []
for prompt, scores in zip(prompts, all_scores):
    labels.extend([prompt] * len(scores))

overlay_masks_boxes_scores(
    image=image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
    score_threshold=0.5,
    alpha=0.45,
)
Now, to visualize the output, we generate a list of text labels. Each label matches the prompt that produced the detection (Lines 1-3).
Finally, we visualize everything at once using overlay_masks_boxes_scores. The output image (Figure 1) shows masks, bounding boxes, and confidence scores for players in white and players in blue, cleanly layered on top of the original frame (Lines 5-13).
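Note that overlay_masks_boxes_scores is a custom helper carried over from Part 1 rather than a transformers or supervision function. If you don't have it handy, the minimal sketch below (my own approximation, matching only the call signature used in this tutorial) blends each mask onto the image, draws its box, and writes the label and score:
# Minimal sketch of the overlay helper used throughout this tutorial.
# Assumes masks/boxes/scores are torch tensors as returned by
# post_process_instance_segmentation.
def overlay_masks_boxes_scores(image, masks, boxes, scores, labels,
                               score_threshold=0.5, alpha=0.45):
    palette = [(255, 56, 56), (50, 205, 50), (65, 105, 225),
               (255, 215, 0), (255, 105, 180), (0, 206, 209)]
    img = np.array(image).copy()
    # Blend each mask onto the image with a per-instance color.
    for i, (mask, score) in enumerate(zip(masks, scores)):
        if float(score) < score_threshold:
            continue
        color = np.array(palette[i % len(palette)], dtype=np.uint8)
        m = mask.cpu().numpy() > 0.5
        img[m] = (img[m] * (1 - alpha) + color * alpha).astype(np.uint8)
    vis = Image.fromarray(img)
    draw = ImageDraw.Draw(vis)
    # Draw each box and its "label: score" text on the blended image.
    for box, score, label in zip(boxes, scores, labels):
        if float(score) < score_threshold:
            continue
        x1, y1, x2, y2 = [float(v) for v in box]
        draw.rectangle([(x1, y1), (x2, y2)], outline="red", width=2)
        draw.text((x1, max(0, y1 - 12)), f"{label}: {float(score):.2f}", fill="red")
    return vis  # returning the PIL image lets the notebook render it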

Batched Inference Using Multiple Text Prompts Across Multiple Images
In this example, we run SAM 3 on two images at once and provide a separate text prompt for each. This gives us a clean, parallel workflow: one batch, two prompts, two images, two sets of segmentation results.
cat_url = "http://photographs.cocodataset.org/val2017/000000077595.jpg"
kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
photographs = [
Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
]
text_prompts = ["ear", "dial"]
inputs = processor(photographs=photographs, textual content=text_prompts, return_tensors="pt").to(gadget)
with torch.no_grad():
outputs = mannequin(**inputs)
# Submit-process outcomes for each photographs
outcomes = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs.get("original_sizes").tolist()
)
print(f"Picture 1: {len(outcomes[0]['masks'])} objects discovered")
print(f"Picture 2: {len(outcomes[1]['masks'])} objects discovered")
First, we define two URLs. The first points to a cat image. The second points to a kitchen scene from COCO (Lines 1 and 2).
Next, we download the two images, load them into memory, and convert them to RGB. We store both images in a list so we can batch them later. Then, we define one prompt per image. The first prompt searches for a cat's ear. The second prompt looks for a dial in the kitchen scene (Lines 3-8).
We batch the images and the prompts into a single input structure. This gives SAM 3 two parallel vision-language tasks packed into one tensor (Line 10).
We disable gradient computation and run the model in inference mode. The outputs contain segmentation predictions for both images. We post-process the raw logits. SAM 3 returns the results as a list: one entry per image. Each entry contains instance masks, bounding boxes, and confidence scores (Lines 12-21).
We count the number of objects detected for each prompt. This gives us a simple, semantic summary of model performance (Lines 23 and 24).
Below is the total number of objects detected in each image for each text prompt.
Image 1: 2 objects found
Image 2: 7 objects found
Output
for image, result, prompt in zip(images, results, text_prompts):
    labels = [prompt] * len(result["scores"])
    vis = overlay_masks_boxes_scores(image, result["masks"], result["boxes"], result["scores"], labels)
    display(vis)
To visualize the output, we pair each image with its corresponding prompt and result. For each batch entry, we do the following (Line 1):
- create a label per detected object (Line 2)
- visualize the masks, boxes, and scores using our overlay helper (Line 3)
- display the annotated result in the notebook (Line 4)
This approach shows how SAM 3 handles multiple text prompts and images simultaneously, without writing separate inference loops.
In Figure 2, we can see the object (ear) detected in the image.

In Figure 3, we can see the object (dial) detected in the image.

Single Bounding Box Prompt
In this example, we perform segmentation using a bounding box instead of a text prompt. We provide the model with a spatial hint that says: "focus here." SAM 3 then segments all detected instances of the concept indicated by the spatial hint.
# Load image
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Box in xyxy format: [x1, y1, x2, y2]
box_xyxy = [100, 150, 500, 450]

input_boxes = [[box_xyxy]]
input_boxes_labels = [[1]]  # 1 = positive (foreground) box

def draw_input_box(image, box, color="red", width=3):
    img = image.copy().convert("RGB")
    draw = ImageDraw.Draw(img)
    x1, y1, x2, y2 = box
    draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
    return img

input_box_vis = draw_input_box(image, box_xyxy)
input_box_vis
First, we load an example COCO image directly from a URL. We read the raw bytes, open them with Pillow, and convert them to RGB (Lines 2 and 3).
Next, we define a bounding box around the region to be segmented. The coordinates follow the xyxy format (Line 6).
- (x1, y1): top-left corner
- (x2, y2): bottom-right corner
We prepare the box for the processor.
- The outer list indicates a batch size of 1. The inner list holds the single bounding box (Line 8).
- We set the label to 1, meaning this is a positive box, and SAM 3 should focus on this region (Line 9).
Then, we define a helper to visualize the prompt box. The function draws a colored rectangle over the image, making the prompt easy to verify before segmentation (Lines 11-16).
We display the input box overlay. This confirms our prompt is correct before running the model (Lines 18 and 19).
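As a side note, if your boxes come in COCO-style [x, y, width, height] format, a tiny conversion helper (hypothetical, not part of the original notebook) puts them into the xyxy layout the processor expects:
# Hypothetical helper: convert a COCO-style [x, y, w, h] box into [x1, y1, x2, y2].
def xywh_to_xyxy(box):
    x, y, w, h = box
    return [x, y, x + w, y + h]

print(xywh_to_xyxy([100, 150, 400, 300]))  # -> [100, 150, 500, 450]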
Figure 4 shows the bounding box prompt overlaid on the input image.

inputs = processor(
    images=image,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")
Now, we prepare the final inputs for the model. Instead of passing text, we pass bounding box prompts. The processor handles resizing, padding, normalization, and tensor conversion. We then move everything to the selected device (GPU or CPU) (Lines 1-6).
We run SAM 3 in inference mode. The torch.no_grad() context disables gradient computation, reducing memory usage and improving speed (Lines 8 and 9).
After inference, we reshape and threshold the predicted masks. We resize them back to the original size so they align perfectly. We index [0] because we are working with a single image (Lines 11-16).
We print the number of foreground objects that SAM 3 detected within the bounding box (Line 18).
Found 1 objects
Output
labels = ["box-prompted object"] * len(results["scores"])

overlay_masks_boxes_scores(
    image=image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
    score_threshold=0.5,
    alpha=0.45,
)
To visualize the results, we create a label string "box-prompted object" for each detected instance to keep the overlay looking clean (Line 1).
Finally, we call our overlay helper. It blends the segmentation masks, draws the bounding boxes, and shows confidence scores on top of the original image (Lines 3-11).
Figure 5 shows the segmented object.

Multiple Bounding Box Prompts on a Single Image (Dual Positive Foreground Regions)
In this example, we guide SAM 3 using two positive bounding boxes. Each box marks a small region of interest inside the image: one around the oven dial and one around a nearby button. Both boxes act as foreground signals. SAM 3 then segments all detected objects within these marked regions.
kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Picture.open(
requests.get(kitchen_url, stream=True).uncooked
).convert("RGB")
box1_xyxy = [59, 144, 76, 163] # Dial
box2_xyxy = [87, 148, 104, 159] # Button
input_boxes = [[box1_xyxy, box2_xyxy]]
input_boxes_labels = [[1, 1]] # 1 = constructive (foreground)
def draw_input_boxes(picture, containers, colour="pink", width=3):
img = picture.copy().convert("RGB")
draw = ImageDraw.Draw(img)
for field in containers:
x1, y1, x2, y2 = field
draw.rectangle([(x1, y1), (x2, y2)], define=colour, width=width)
return img
input_box_vis = draw_input_boxes(
kitchen_image,
[box1_xyxy, box2_xyxy]
)
input_box_vis
First, we load the kitchen image from COCO. We download the raw image bytes, open them with Pillow, and convert the image to RGB. Next, we define two bounding boxes. Both follow the xyxy format. The first box highlights the oven dial. The second box highlights the oven button (Lines 1-7).
We pack both bounding boxes into a single list, since we are working with a single image. We assign a value of 1 to both boxes, indicating that both are positive prompts. We define a helper function to visualize the bounding box prompts. For each box, we draw a red rectangle overlay on a copy of the image (Lines 9-20).
We draw both boxes and display the result. This gives us a visual confirmation of our bounding box prompts before running the model (Lines 22-27).
Figure 6 shows the two positive bounding boxes superimposed on the input image.

inputs = processor(
    images=kitchen_image,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")
Now, we prepare the image and the bounding box prompts using the processor. We then send the tensors to the CPU or GPU. We run SAM 3 in inference mode and disable gradient tracking to save memory and improve speed (Lines 1-9).
Next, we post-process the raw outputs. We resize the masks back to their original shape and filter out low-confidence results. We print the number of detected objects that fall within our two positive bounding box prompts (Lines 11-18).
Below is the total number of objects detected in the image.
Found 7 objects
Output
labels = ["box-prompted object"] * len(results["scores"])

overlay_masks_boxes_scores(
    image=kitchen_image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
)
We generate a label for visualization. Finally, we overlay the segmented objects on the image using the overlay_masks_boxes_scores function (Lines 1-9).
Here, Figure 7 displays all segmented objects.

Multiple Bounding Box Prompts on a Single Image (Positive Foreground and Negative Background Control)
In this example, we guide SAM 3 using two bounding boxes: one positive and one negative. The positive box highlights the region we want to segment, while the negative box tells the model to ignore a nearby region. This combination gives us fine control over the segmentation result.
kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Picture.open(
requests.get(kitchen_url, stream=True).uncooked
).convert("RGB")
box1_xyxy = [59, 144, 76, 163] # Dial
box2_xyxy = [87, 148, 104, 159] # Button
input_boxes = [[box1_xyxy, box2_xyxy]]
input_boxes_labels = [[1, 0]]
def draw_input_boxes(picture, containers, labels, width=3):
"""
containers : checklist of [x1, y1, x2, y2]
labels : checklist of ints (1 = constructive, 0 = unfavorable)
"""
img = picture.copy().convert("RGB")
draw = ImageDraw.Draw(img)
for field, label in zip(containers, labels):
x1, y1, x2, y2 = field
# Coloration by label
colour = "inexperienced" if label == 1 else "pink"
draw.rectangle(
[(x1, y1), (x2, y2)],
define=colour,
width=width,
)
return img
input_box_vis = draw_input_boxes(
kitchen_image,
containers=[box1_xyxy, box2_xyxy],
labels=[1, 0], # 1 = constructive, 0 = unfavorable
)
input_box_vis
First, we load our kitchen image from the COCO dataset. We fetch the bytes from the URL and convert them to RGB (Lines 1-4).
Next, we define two bounding boxes. Both follow the xyxy coordinate format (Lines 6 and 7):
- first box: surrounds the oven dial
- second box: surrounds a nearby oven button
We pack the two boxes into a single list because we are working with a single image. We set the labels to [1, 0], meaning (Lines 9 and 10):
- dial box: positive (foreground to include)
- button box: negative (area to exclude)
We define a helper function that draws the bounding boxes in different colors. Positive prompts are drawn in green. Negative prompts are drawn in red (Lines 12-32).
We visualize the bounding box prompts overlaid on the image. This gives us a clear picture of how we are instructing SAM 3 (Lines 34-40).
Figure 8 shows the positive and negative box prompts superimposed on the input image.

inputs = processor(
    images=kitchen_image,
    input_boxes=input_boxes,
    input_boxes_labels=input_boxes_labels,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")
We prepare the inputs for SAM 3. The processor handles preprocessing and tensor conversion. We perform inference with gradients disabled to reduce memory usage. Next, we post-process the results. SAM 3 returns instance masks filtered by confidence and resized to the original resolution (Lines 1-16).
We print the number of objects segmented using this foreground-background combination (Line 18).
Below is the total number of objects detected in the image.
Found 6 objects
Output
labels = ["box-prompted object"] * len(results["scores"])

overlay_masks_boxes_scores(
    image=kitchen_image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
)
We assign labels to the detections so that the overlay displays meaningful text. Finally, we visualize the segmentation (Lines 1-9).
In Figure 9, the positive prompt guides SAM 3 to segment the dial, while the negative prompt suppresses the nearby button.

Combining Text and Visual Prompts for Selective Segmentation (Excluding Undesired Regions)
In this example, we use two different prompt types at the same time:
- text prompt: to search for "handle"
- negative bounding box: to exclude the oven handle region
This provides selective control, allowing SAM 3 to focus on handles in the scene while ignoring a specific area.
kitchen_url = "http://photographs.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Picture.open(
requests.get(kitchen_url, stream=True).uncooked
).convert("RGB")
# Phase "deal with" however exclude the oven deal with utilizing a unfavorable field
textual content = "deal with"
# Adverse field overlaying oven deal with space (xyxy): [40, 183, 318, 204]
oven_handle_box = [40, 183, 318, 204]
input_boxes = [[oven_handle_box]]
def draw_negative_box(picture, field, width=3):
img = picture.copy().convert("RGB")
draw = ImageDraw.Draw(img)
x1, y1, x2, y2 = field
draw.rectangle(
[(x1, y1), (x2, y2)],
define="pink", # pink = unfavorable
width=width,
)
return img
neg_box_vis = draw_negative_box(
kitchen_image,
oven_handle_box
)
neg_box_vis
First, we load the kitchen image from the COCO dataset. We read the file from the URL, open it as a Pillow image, and convert it to RGB (Lines 1-4).
Next, we define the structure of our prompt. We want to segment handles in the kitchen but exclude the large oven handle. We describe the concept using text ("handle") and draw a bounding box over the oven handle region (Lines 7-10).
We write a helper function to visualize our negative region. We draw a red bounding box to indicate that this area should be excluded. We display the negative prompt overlay. This helps confirm that the region is positioned correctly (Lines 12-30).
Figure 10 shows the bounding box prompt used to exclude the oven handle region.

inputs = processor(
    images=kitchen_image,
    text="handle",
    input_boxes=[[oven_handle_box]],
    input_boxes_labels=[[0]],  # negative box
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")
Here, we prepare the inputs for SAM 3. We combine text and bounding box prompts. We mark the bounding box with a 0 label, meaning it is a negative region that the model must ignore (Lines 1-7).
We run the model in inference mode. This yields raw segmentation predictions based on both prompt types. We post-process the results by converting logits into binary masks, filtering low-confidence predictions, and resizing the masks back to the original resolution (Lines 9-17).
Below, we report the number of handle-like objects remaining after excluding the oven handle (Line 19).
Found 3 objects
Output
labels = ["handle (excluding oven)"] * len(results["scores"])

final_vis = overlay_masks_boxes_scores(
    image=kitchen_image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
    score_threshold=0.5,
    alpha=0.45,
)

final_vis
We assign meaningful labels for visualization. Finally, we draw masks, bounding boxes, labels, and scores on the image (Lines 1-13).
In Figure 11, the result shows only handles outside the negative region.

"deal with" segmentation whereas excluding the oven deal with through a unfavorable field (supply: visualization by the creator)Batched Blended-Immediate Segmentation Throughout Two Photos (Textual content and Bounding Field Steering)
On this instance, we exhibit how SAM 3 can deal with a number of immediate varieties in a single batch. The primary picture receives a textual content immediate ("laptop computer"), whereas the second picture receives a visible immediate (constructive bounding field). Each photographs are processed collectively in a single ahead go.
textual content=["laptop", None]
input_boxes=[None, [box2_xyxy]]
input_boxes_labels=[None, [1]]
def draw_input_box(picture, field, colour="inexperienced", width=3):
img = picture.copy().convert("RGB")
draw = ImageDraw.Draw(img)
x1, y1, x2, y2 = field
draw.rectangle([(x1, y1), (x2, y2)], define=colour, width=width)
return img
input_vis_1 = photographs[0] # textual content immediate → no field
input_vis_2 = draw_input_box(photographs[1], box2_xyxy)
First, we define 3 parallel prompt lists:
- 1 for text
- 1 for bounding boxes
- 1 for bounding box labels
We set the first entry in each list to None for the first image because we only want to use natural language there ("laptop"). For the second image, we supply a bounding box and label it as positive (1) (Lines 1-3).
We define a small helper function to draw a bounding box on an image. This helps us visualize the prompt region before inference. Here, we prepare two preview images (Lines 5-13):
- first image: shows no box, since it will use text only
- second image: is rendered with its bounding box prompt
input_vis_1
Figure 12 shows no box over the image, since it uses a text prompt for segmentation.

input_vis_2
Figure 13 shows a bounding box over the image because it uses a box prompt for segmentation.

inputs = processor(
    images=images,
    text=["laptop", None],
    input_boxes=[None, [box2_xyxy]],
    input_boxes_labels=[None, [1]],
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)
Next, we assemble everything into a single batched input. This gives SAM 3:
- 2 images
- 2 prompt types
- 1 forward pass
We run SAM 3 inference without computing gradients. This produces segmentation predictions for both images simultaneously (Lines 1-10).
We post-process the model outputs for both images. The result is a two-element list (Lines 12-17):
- entry [0]: corresponds to the "laptop" text query
- entry [1]: corresponds to the bounding box query
Output 1: Text Prompt Segmentation
labels_1 = ["laptop"] * len(results[0]["scores"])

overlay_masks_boxes_scores(
    image=images[0],
    masks=results[0]["masks"],
    boxes=results[0]["boxes"],
    scores=results[0]["scores"],
    labels=labels_1,
    score_threshold=0.5,
)
We apply a label to each detected object in the first image and visualize the segmentation results overlaid on it (Lines 1-10).
In Figure 14, we observe detections guided by the text prompt "laptop".

"laptop computer" in Picture 1 (supply: visualization by the creator)Output 2: Bounding Field Immediate Segmentation
labels_2 = ["box-prompted object"] * len(outcomes[1]["scores"]) overlay_masks_boxes_scores( picture=photographs[1], masks=outcomes[1]["masks"], containers=outcomes[1]["boxes"], scores=outcomes[1]["scores"], labels=labels_2, score_threshold=0.5, )
We create labels for the second picture. These detections are from the bounding field immediate. Lastly, we visualize the bounding field guided segmentation on the second picture (Traces 1-10).
In Determine 15, we will see the detections guided by the bounding field immediate.

Interactive Segmentation Using Bounding Box Refinement (Draw to Segment)
In this example, we turn segmentation into a fully interactive workflow. We draw bounding boxes directly over the image using a widget UI. Each drawn box becomes a prompt signal for SAM 3:
- green (positive) boxes: identify regions we want to segment
- red (negative) boxes: exclude regions we want the model to ignore
After drawing, we convert the widget output into proper box coordinates and run SAM 3 to produce refined segmentation masks.
output.enable_custom_widget_manager()

# Load image
url = "http://images.cocodataset.org/val2017/000000136466.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Convert to base64
def pil_to_base64(img):
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

# Create widget
widget = BBoxWidget(
    image=pil_to_base64(image),
    classes=["positive", "negative"]
)

widget
We enable custom widget support in Colab to ensure the bounding box UI renders properly. We download the kitchen image, load it into memory, and convert it to RGB format (Lines 1-5).
Before sending the image into the widget, we convert it into a base64 PNG buffer. This encoding step makes the image displayable in the browser UI (Lines 8-11).
We create an interactive drawing widget. It displays the image and lets the user add labeled boxes. Each box is tagged as either "positive" or "negative" (Lines 14-17).
We render the widget in the notebook. At this point, the user can draw, move, resize, and delete bounding boxes (Line 19).
In Figure 16, we can see the positive and negative bounding boxes drawn by the user. The blue box indicates regions that belong to the object of interest, while the orange box marks background areas that should be ignored. These annotations serve as interactive guidance signals for refining the segmentation output.

print(widget.bboxes)
The widget.bboxes object stores metadata for every annotation the user draws on the image. Each entry corresponds to a single box created in the interactive widget.
A typical output looks like this:
[{'x': 58, 'y': 147, 'width': 18, 'height': 18, 'label': 'positive'}, {'x': 88, 'y': 149, 'width': 18, 'height': 8, 'label': 'negative'}]
Each dictionary represents a single user annotation:
- x and y: indicate the top-left corner of the drawn box in pixel coordinates
- width and height: describe the size of the box
- label: tells us whether the annotation is a 'positive' point (object) or a 'negative' point (background)
def widget_to_sam_boxes(widget):
    boxes = []
    labels = []

    for ann in widget.bboxes:
        x = int(ann["x"])
        y = int(ann["y"])
        w = int(ann["width"])
        h = int(ann["height"])

        x1 = x
        y1 = y
        x2 = x + w
        y2 = y + h

        label = ann.get("label") or ann.get("class")

        boxes.append([x1, y1, x2, y2])
        labels.append(1 if label == "positive" else 0)

    return boxes, labels

boxes, box_labels = widget_to_sam_boxes(widget)
print("Boxes:", boxes)
print("Labels:", box_labels)
We define a helper function to translate the widget data into SAM-compatible xyxy coordinates. The widget gives us x/y plus width/height, which we convert to SAM's xyxy format.
We encode the labels into SAM 3 format:
- 1: positive region
- 0: negative region
The function returns valid box lists ready for inference. We extract the interactive box prompts (Lines 23-45).
Below are the boxes and labels in the required format.
Boxes: [[58, 147, 76, 165], [88, 149, 106, 157]]
Labels: [1, 0]
inputs = processor(
    images=image,
    input_boxes=[boxes],  # batch size = 1
    input_boxes_labels=[box_labels],
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")
We pass the image and the interactive box prompts into the processor. We run inference without tracking gradients, convert the logits into final mask predictions, and print the number of detected regions matching the interactive prompts (Lines 49-66).
Below is the number of objects detected by the model.
Found 6 objects
Output
labels = ["interactive object"] * len(results["scores"])

overlay_masks_boxes_scores(
    image=image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
    alpha=0.45,
)
We assign simple labels to each detected region and overlay masks, bounding boxes, and scores on the original image (Lines 1-10).
This workflow demonstrates an effective use case: human-guided refinement through live drawing tools. With just a few annotations, SAM 3 adapts the segmentation output, giving us precise control and fast visual feedback.
In Figure 17, we can see the segmented regions according to the positive and negative bounding box prompts the user annotated over the input image.

Interactive Segmentation Using Point-Based Refinement (Click to Guide the Model)
In this example, we segment using point prompts rather than text or bounding boxes. We click on the image to mark positive and negative points. The center of each clicked point becomes a guiding coordinate, and SAM 3 uses these coordinates to refine the segmentation. This workflow provides fine-grained, pixel-level control, well suited for interactive editing or correction.
# Setup device
device = Accelerator().device

# Load model and processor
print("Loading SAM3 model...")
model = Sam3TrackerModel.from_pretrained("facebook/sam3").to(device)
processor = Sam3TrackerProcessor.from_pretrained("facebook/sam3")
print("Model loaded successfully!")

# Load image
IMAGE_PATH = "/content/dog-2.jpeg"
raw_image = Image.open(IMAGE_PATH).convert("RGB")

def pil_to_base64(img):
    """Convert PIL image to base64 for BBoxWidget"""
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
We set up our compute device using the Accelerator() class. This automatically detects the GPU if one is available. We load the SAM 3 tracker model and processor. This variant supports point-based refinement and multi-mask output (Lines 2-7).
We load the dog image into memory and convert it to RGB format. The BBoxWidget expects image data in base64 format, so we write a helper function to convert a PIL image to base64 (Lines 11-18).
def get_points_from_widget(widget):
    """Extract point coordinates from widget bboxes"""
    positive_points = []
    negative_points = []

    for ann in widget.bboxes:
        x = int(ann["x"])
        y = int(ann["y"])
        w = int(ann["width"])
        h = int(ann["height"])

        # Get center point of the bbox
        center_x = x + w // 2
        center_y = y + h // 2

        label = ann.get("label") or ann.get("class")

        if label == "positive":
            positive_points.append([center_x, center_y])
        elif label == "negative":
            negative_points.append([center_x, center_y])

    return positive_points, negative_points
We loop over the bounding boxes drawn on the widget and convert them into point coordinates. Each tiny bounding box becomes a center point. We split them into (Lines 20-42):
- positive points: object
- negative points: background
def segment_from_widget(b=None):
    """Run segmentation with points from widget"""
    positive_points, negative_points = get_points_from_widget(widget)

    if not positive_points and not negative_points:
        print("⚠️ Please add at least one point (draw small boxes on the image)!")
        return

    # Combine points and labels
    all_points = positive_points + negative_points
    all_labels = [1] * len(positive_points) + [0] * len(negative_points)

    print(f"\n🔄 Running segmentation...")
    print(f"   • {len(positive_points)} positive points: {positive_points}")
    print(f"   • {len(negative_points)} negative points: {negative_points}")

    # Prepare inputs (4D for points, 3D for labels)
    input_points = [[all_points]]  # [batch, object, points, xy]
    input_labels = [[all_labels]]  # [batch, object, labels]

    inputs = processor(
        images=raw_image,
        input_points=input_points,
        input_labels=input_labels,
        return_tensors="pt"
    ).to(device)

    # Run inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Post-process masks
    masks = processor.post_process_masks(
        outputs.pred_masks.cpu(),
        inputs["original_sizes"]
    )[0]

    print(f"✅ Generated {masks.shape[1]} masks with shape {masks.shape}")

    # Visualize results
    visualize_results(masks, positive_points, negative_points)
The segment_from_widget function handles (Lines 44-83):
- reading positive and negative points (Lines 46-58)
- building SAM 3 inputs (Lines 60-68)
- running inference (Lines 71 and 72)
- post-processing masks (Lines 75-78)
- visualizing results (Line 83)
We pack the points and labels into the correct model format. The model generates multiple ranked masks. Higher-quality masks appear at index 0.
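If you only need the single best mask for downstream processing, you can keep just the first entry along the ranking dimension. The snippet below is illustrative only and assumes the [batch, num_masks, H, W] shape produced by post_process_masks above:
# Illustrative: keep only the top-ranked mask as a boolean NumPy array.
best_mask = masks[0, 0].numpy() > 0
print("Top-ranked mask covers", int(best_mask.sum()), "pixels")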
def visualize_results(masks, positive_points, negative_points):
    """Display segmentation results"""
    n_masks = masks.shape[1]

    # Create figure with subplots
    fig, axes = plt.subplots(1, min(n_masks, 3), figsize=(15, 5))
    if n_masks == 1:
        axes = [axes]

    for idx in range(min(n_masks, 3)):
        mask = masks[0, idx].numpy()

        # Overlay mask on image
        img_array = np.array(raw_image)
        colored_mask = np.zeros_like(img_array)
        colored_mask[mask > 0] = [0, 255, 0]  # Green mask

        overlay = img_array.copy()
        overlay[mask > 0] = (img_array[mask > 0] * 0.5 + colored_mask[mask > 0] * 0.5).astype(np.uint8)

        axes[idx].imshow(overlay)
        axes[idx].set_title(f"Mask {idx + 1} (Quality Ranked)", fontsize=12, fontweight="bold")
        axes[idx].axis('off')

        # Plot points on each mask
        for px, py in positive_points:
            axes[idx].plot(px, py, 'go', markersize=12, markeredgecolor="white", markeredgewidth=2.5)
        for nx, ny in negative_points:
            axes[idx].plot(nx, ny, 'ro', markersize=12, markeredgecolor="white", markeredgewidth=2.5)

    plt.tight_layout()
    plt.show()
We overlay the segmentation masks on the original image. Positive points are displayed as green dots. Negative points are shown in red (Lines 85-116).
def reset_widget(b=None):
    """Clear all annotations"""
    widget.bboxes = []
    print("🔄 Reset! All points cleared.")
This clears previously selected points so we can start fresh (Lines 118-121).
# Create widget for point selection
widget = BBoxWidget(
    image=pil_to_base64(raw_image),
    classes=["positive", "negative"]
)
Users can click to add points anywhere on the image. The widget captures both the position and the label (Lines 124-127).
# Create UI buttons
segment_button = widgets.Button(
    description='🎯 Segment',
    button_style="success",
    tooltip='Run segmentation with marked points',
    icon='check',
    layout=widgets.Layout(width="150px", height="40px")
)
segment_button.on_click(segment_from_widget)

reset_button = widgets.Button(
    description='🔄 Reset',
    button_style="warning",
    tooltip='Clear all points',
    icon='refresh',
    layout=widgets.Layout(width="150px", height="40px")
)
reset_button.on_click(reset_widget)
We create UI buttons for:
- running segmentation (Lines 130-137)
- clearing annotations (Lines 139-146)
# Display UI
print("=" * 70)
print("🎨 INTERACTIVE SAM3 SEGMENTATION WITH BOUNDING BOX WIDGET")
print("=" * 70)
print("\n📋 Instructions:")
print("   1. Draw SMALL boxes on the image where you want to mark points")
print("   2. Label them as 'positive' (object) or 'negative' (background)")
print("   3. The CENTER of each box will be used as a point coordinate")
print("   4. Click the 'Segment' button to run SAM3")
print("   5. Click 'Reset' to clear all points and start over")
print("\n💡 Tips:")
print("   • Draw tiny boxes - just big enough to see")
print("   • Positive points = parts of the object you want")
print("   • Negative points = background areas to exclude")
print("\n" + "=" * 70 + "\n")

display(widgets.HBox([segment_button, reset_button]))
display(widget)
We render the interface side by side. The user can now:
- click positive points
- click negative points
- run segmentation live
- reset at any time
Output
In Figure 18, we can see the complete point-based segmentation process.
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: February 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That's not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that's exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you'll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser (works on Windows, macOS, and Linux; no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In Part 2 of this tutorial, we explored the advanced capabilities of SAM 3, transforming it from a powerful segmentation tool into a flexible, interactive visual query system. We demonstrated how to leverage multiple prompt types (text, bounding boxes, and points), both individually and in combination, to achieve precise, context-aware segmentation results.
We covered sophisticated workflows, including:
- Segmenting multiple concepts simultaneously in the same image
- Processing batches of images with different prompts efficiently
- Using positive bounding boxes to focus on regions of interest
- Employing negative prompts to exclude unwanted areas
- Combining text and visual prompts for selective, fine-grained control
- Building fully interactive segmentation interfaces where users can draw boxes or click points and see results in real time
These techniques showcase SAM 3's versatility for real-world applications. Whether you're building large-scale data annotation pipelines, creating intelligent video editing tools, developing AR experiences, or conducting scientific research, the multi-modal prompting capabilities we explored give you pixel-perfect control over segmentation outputs.
Citation Information
Thakur, P. "Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation," PyImageSearch, P. Chugh, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/5c4ag
@incollection{Thakur_2026_advanced-sam-3-multi-modal-prompting-and-interactive-segmentation,
  author = {Piyush Thakur},
  title = {{Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
  year = {2026},
  url = {https://pyimg.co/5c4ag},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!

Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!

