Interactive Segmentation Results
Drag the slider to preview how UnSAMv2 adjusts mask granularity. The closest available result is displayed automatically.
Whole Image Segmentation Results
Move the slider to explore how UnSAMv2 outputs masks for all instances at the desired granularity level.
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually—by adding more prompts or selecting from pre-generated masks—to reach the desired level of detail. This process is ambiguous because a single prompt can map to several plausible masks, and densely annotating all granularities is prohibitively expensive, making supervised solutions infeasible.
To address this limitation, we introduce UnSAMv2, which enables segment-anything-at-any-granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM to discover abundant mask–granularity pairs and introduces a granularity control embedding that delivers precise, continuous control over segmentation scale. With only 6K unlabeled images and 0.02% additional parameters, UnSAMv2 substantially enhances SAM-2 across interactive, whole-image, and video segmentation tasks. Evaluated on more than 11 benchmarks, UnSAMv2 improves NoC90 (5.69 → 4.75), 1-IoU (58.0 → 73.1), and AR1000 (49.6 → 68.3), showing that modest amounts of unlabeled data with granularity-aware self-supervision can unlock the potential of vision foundation models.
Problem Statement
Lack of granularity control. When a single point corresponds to multiple plausible objects (e.g., a part versus the whole), SAM produces up to three discrete masks and leaves selection to the user. Without an explicit granularity variable, the model cannot smoothly traverse scales, so fine details and coarse structures stay disconnected. This constraint hampers interactive efficiency and blocks interpretable, continuous control over the desired level of detail.
Lack of hierarchical reasoning. Supervised training on human-labeled masks teaches SAM a flat representation where parts and semantic instances are isolated rather than arranged within a hierarchy. As a result, SAM lacks structural awareness, struggles with intermediate detail levels, and cannot expose the nested hierarchy of visual scenes. This limitation underscores the need for unsupervised learning that can recover hierarchical dependencies directly from image statistics instead of relying on costly annotations.
Our Solution
Unsupervised data pipeline. We extend the divide-and-conquer strategy of UnSAM to explicitly link instances to their parts while computing a granularity score for every mask. The divide stage applies CutLER to produce instance-level masks; the conquer stage iteratively merges pixels within each instance according to the cosine similarity of their DINOv3 features, revealing part-level masks. The resulting hierarchy supplies granularity scores that let UnSAMv2 guide SAM-2 toward any desired level of detail.
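To make the conquer step concrete, here is a minimal sketch of part discovery inside a single instance, assuming precomputed per-patch DINO-style features and agglomerative merging under cosine distance; the feature extraction, thresholds, and the area-based granularity proxy are illustrative assumptions rather than the exact UnSAMv2 pipeline.

# Hypothetical sketch of the "conquer" step: grouping patch features inside one
# instance mask into part-level masks at several granularities. Not the released
# UnSAMv2 code; shapes and the granularity proxy are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def parts_from_instance(patch_feats, instance_mask, thresholds=(0.2, 0.4, 0.6)):
    """patch_feats: (H, W, C) per-patch features; instance_mask: (H, W) bool."""
    ys, xs = np.nonzero(instance_mask)
    feats = patch_feats[ys, xs]                                   # features inside the instance
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize

    # Agglomerative merging with cosine distance: cutting the dendrogram at
    # different heights yields coarser or finer part decompositions.
    Z = linkage(feats, method="average", metric="cosine")

    hierarchy = []
    for t in thresholds:
        labels = fcluster(Z, t=t, criterion="distance")
        for lab in np.unique(labels):
            part = np.zeros_like(instance_mask, dtype=bool)
            part[ys[labels == lab], xs[labels == lab]] = True
            # One plausible granularity proxy: part area relative to its instance.
            granularity = part.sum() / instance_mask.sum()
            hierarchy.append((part, float(granularity)))
    return hierarchy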
Granularity training with UnSAMv2. Building on SAM-2, we design a Fourier-based granularity encoder and a granularity-aware mask token. A scalar target granularity g ∈ [0.1, 1] is lifted via Fourier features and an MLP, then fused with the point and image embeddings inside the mask decoder's two-way transformer. The dedicated granularity-aware mask token attends to all three modalities (point, image, and granularity), and the decoder produces a mask that matches the requested granularity. Notably, this adds only 0.02% additional parameters to SAM-2.
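The granularity encoder described above can be sketched as follows; the embedding width, number of frequencies, and MLP depth are assumptions for illustration, not the released UnSAMv2 configuration.

# Minimal PyTorch sketch of a Fourier-feature granularity encoder in the spirit
# of the description above. Dimensions and frequency count are assumptions.
import torch
import torch.nn as nn

class GranularityEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_freqs=16):
        super().__init__()
        # Fixed log-spaced frequencies for the Fourier lifting of the scalar g.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, g):
        # g: (B,) scalar granularity targets in [0.1, 1].
        angles = g[:, None] * self.freqs[None, :]               # (B, num_freqs)
        fourier = torch.cat([angles.sin(), angles.cos()], -1)   # (B, 2 * num_freqs)
        return self.mlp(fourier)                                # (B, embed_dim) granularity token

The resulting embedding can be appended to the sparse prompt tokens so that the two-way transformer attends jointly over point, image, and granularity information.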
Experimental Results
Interactive Segmentation. UnSAMv2 outperforms SAM-2 on both instance-level and part-level benchmarks. With only 6,000 unsupervised pseudo-labeled images, our model learns to respect the granularity scalar and segments objects at the requested level of detail. Averaged across benchmarks, UnSAMv2 surpasses SAM-2 by 15.2% in NoC80, 16.5% in NoC90, and 26.0% in 1-IoU, allowing users to obtain accurate masks with fewer prompt points.
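For reference, NoC@k counts the clicks needed before the predicted mask first reaches k% IoU; the sketch below illustrates this computation, with the click budget chosen as a common convention rather than taken from the paper.

# Illustrative helper for the click-efficiency metrics referenced above:
# NoC@k is typically the number of clicks until IoU first reaches k, capped at
# a click budget (the cap value here is an assumption, not from the paper).
def noc_at_k(ious_per_click, k=0.90, max_clicks=20):
    """ious_per_click: IoU after each successive click, e.g. [0.62, 0.81, 0.93]."""
    for clicks, iou in enumerate(ious_per_click, start=1):
        if iou >= k:
            return clicks
    return max_clicks  # target IoU never reached within the budget

# Example: reaching 90% IoU on the third click gives NoC90 = 3.
print(noc_at_k([0.62, 0.81, 0.93], k=0.90))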
Unsupervised + supervised training. When we combine our unsupervised pipeline with limited supervision (UnSAMv2+), the model achieves state-of-the-art interactive performance on multiple datasets. Compared with GraCo, UnSAMv2+ improves NoC80 by 11.7%, NoC90 by 9.36%, and 1-IoU by 9.81%, demonstrating the effectiveness of our self-supervised pipeline.
Whole-image segmentation. Beyond point-based interaction, UnSAMv2 attains state-of-the-art whole-image segmentation on datasets rich in granularity variation, surpassing SAM by 37.7% and UnSAM by 29.8% in AR1000. With a single granularity scalar as input, users can reveal instances across a wide spectrum of detail: dialing the desired granularity surfaces every candidate mask at that level, making it straightforward to integrate segmentation into downstream pipelines.
BibTeX
@article{yu2025unsamv2,
title={UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity},
author={Yu, Junwei and Darrell, Trevor and Wang, XuDong},
journal={arXiv preprint arXiv:2511.13714},
year={2025}
}