Interactive Segmentation Results
Drag the slider to preview how UnSAMv2 adjusts mask granularity. The closest available result is displayed automatically.
Whole Image Segmentation Results
Move the slider to explore how UnSAMv2 outputs masks for all instances at the desired granularity level.
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually—by adding more prompts or selecting from pre-generated masks—to reach the desired level of detail. This process is ambiguous because a single prompt can map to several plausible masks, and densely annotating all granularities is prohibitively expensive, making supervised solutions infeasible.
To address this limitation, we introduce UnSAMv2, which enables segment-anything-at-any-granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM to discover abundant mask–granularity pairs and introduces a granularity control embedding that delivers precise, continuous control over segmentation scale. With only 6K unlabeled images and 0.02% additional parameters, UnSAMv2 substantially enhances SAM-2 across interactive, whole-image, and video segmentation tasks. Evaluated on more than 11 benchmarks, UnSAMv2 improves NoC90 (5.69 → 4.75), 1-IoU (58.0 → 73.1), and AR1000 (49.6 → 68.3), showing that modest amounts of unlabeled data with granularity-aware self-supervision can unlock the potential of vision foundation models.
Problem Statement
Lack of granularity control. When a single point corresponds to multiple plausible objects (e.g., a part versus the whole), SAM produces up to three discrete masks and leaves selection to the user. Without an explicit granularity variable, the model cannot smoothly traverse scales, so fine details and coarse structures stay disconnected. This constraint hampers interactive efficiency and blocks interpretable, continuous control over the desired level of detail.
Lack of hierarchical reasoning. Supervised training on human-labeled masks teaches SAM a flat representation where parts and semantic instances are isolated rather than arranged within a hierarchy. As a result, SAM lacks structural awareness, struggles with intermediate detail levels, and cannot expose the nested hierarchy of visual scenes. This limitation underscores the need for unsupervised learning that can recover hierarchical dependencies directly from image statistics instead of relying on costly annotations.
Our Solution
Unsupervised data pipeline. We extend the divide-and-conquer strategy of UnSAM to explicitly link instances to their parts while computing a granularity score for every mask. The divide stage applies CutLER to produce instance-level masks; the conquer stage iteratively merges pixels within each instance according to the cosine similarity of their DINOv3 features, revealing part-level masks. The resulting hierarchy supplies granularity scores that let UnSAMv2 guide SAM-2 toward any desired level of detail.
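To make the conquer step concrete, here is a minimal sketch of part discovery inside a single instance, assuming precomputed per-patch DINO-style features and agglomerative merging under cosine distance; the feature extraction, thresholds, and the area-based granularity proxy are illustrative assumptions rather than the exact UnSAMv2 pipeline.

# Hypothetical sketch of the "conquer" step: grouping patch features inside one
# instance mask into part-level masks at several granularities. Not the released
# UnSAMv2 code; shapes and the granularity proxy are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def parts_from_instance(patch_feats, instance_mask, thresholds=(0.2, 0.4, 0.6)):
    """patch_feats: (H, W, C) per-patch features; instance_mask: (H, W) bool."""
    ys, xs = np.nonzero(instance_mask)
    feats = patch_feats[ys, xs]                                   # features inside the instance
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize

    # Agglomerative merging with cosine distance: cutting the dendrogram at
    # different heights yields coarser or finer part decompositions.
    Z = linkage(feats, method="average", metric="cosine")

    hierarchy = []
    for t in thresholds:
        labels = fcluster(Z, t=t, criterion="distance")
        for lab in np.unique(labels):
            part = np.zeros_like(instance_mask, dtype=bool)
            part[ys[labels == lab], xs[labels == lab]] = True
            # One plausible granularity proxy: part area relative to its instance.
            granularity = part.sum() / instance_mask.sum()
            hierarchy.append((part, float(granularity)))
    return hierarchy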
Granularity training with UnSAMv2. Building on SAM-2, we design a Fourier-based granularity encoder and a granularity-aware mask token. A scalar target granularity g ∈ [0.1, 1] is lifted via Fourier features and an MLP, then fused with the point and image embeddings inside the mask decoder's two-way transformer. The dedicated granularity-aware mask token attends to all three modalities (point, image, and granularity), and the decoder produces a mask that matches the requested granularity. Notably, this adds only 0.02% additional parameters to SAM-2.
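The granularity encoder described above can be sketched as follows; the embedding width, number of frequencies, and MLP depth are assumptions for illustration, not the released UnSAMv2 configuration.

# Minimal PyTorch sketch of a Fourier-feature granularity encoder in the spirit
# of the description above. Dimensions and frequency count are assumptions.
import torch
import torch.nn as nn

class GranularityEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_freqs=16):
        super().__init__()
        # Fixed log-spaced frequencies for the Fourier lifting of the scalar g.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, g):
        # g: (B,) scalar granularity targets in [0.1, 1].
        angles = g[:, None] * self.freqs[None, :]               # (B, num_freqs)
        fourier = torch.cat([angles.sin(), angles.cos()], -1)   # (B, 2 * num_freqs)
        return self.mlp(fourier)                                # (B, embed_dim) granularity token

The resulting embedding can be appended to the sparse prompt tokens so that the two-way transformer attends jointly over point, image, and granularity information.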
Experimental Results
Interactive Segmentation. UnSAMv2 outperforms SAM-2 on both instance-level and part-level benchmarks. With only 6,000 unsupervised pseudo-labeled images, our model learns to respect the granularity scalar and segments objects at the requested level of detail. Averaged across benchmarks, UnSAMv2 surpasses SAM-2 by 15.2% in NoC80, 16.5% in NoC90, and 26.0% in 1-IoU, allowing users to obtain accurate masks with fewer prompt points.
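For reference, NoC@k counts the clicks needed before the predicted mask first reaches k% IoU; the sketch below illustrates this computation, with the click budget chosen as a common convention rather than taken from the paper.

# Illustrative helper for the click-efficiency metrics referenced above:
# NoC@k is typically the number of clicks until IoU first reaches k, capped at
# a click budget (the cap value here is an assumption, not from the paper).
def noc_at_k(ious_per_click, k=0.90, max_clicks=20):
    """ious_per_click: IoU after each successive click, e.g. [0.62, 0.81, 0.93]."""
    for clicks, iou in enumerate(ious_per_click, start=1):
        if iou >= k:
            return clicks
    return max_clicks  # target IoU never reached within the budget

# Example: reaching 90% IoU on the third click gives NoC90 = 3.
print(noc_at_k([0.62, 0.81, 0.93], k=0.90))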
Unsupervised + supervised training. When we combine our unsupervised pipeline with limited supervision (UnSAMv2+), the model achieves state-of-the-art interactive performance on multiple datasets. Compared with GraCo, UnSAMv2+ improves NoC80 by 11.7%, NoC90 by 9.36%, and 1-IoU by 9.81%, demonstrating the effectiveness of our self-supervised pipeline.
Whole-image segmentation. Beyond point-based interaction, UnSAMv2 attains state-of-the-art whole-image segmentation on datasets rich in granularity variation, surpassing SAM by 37.7% and UnSAM by 29.8% in AR1000. With a single granularity scalar as input, users can reveal instances across a wide spectrum of detail: dialing the desired granularity surfaces every candidate mask at that level, making it straightforward to integrate segmentation into downstream pipelines.
BibTeX
@article{yu2025unsamv2,
title={UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity},
author={Yu, Junwei and Darrell, Trevor and Wang, XuDong},
journal={arXiv preprint arXiv:2511.13714},
year={2025}
}