LLM · 2021 · ICCV 2021 · CLASSIC

DINO: Emerging Properties in Self-Supervised Vision Transformers

Caron et al. (Meta FAIR / Inria)

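Self-distillation with no labels: the student predicts the softmax output of an EMA teacher, and multi-crop, centering, and sharpening alone are enough to learn visual features that can be used directly for segmentation.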

arXiv:2104.14294

#ssl #vision #vit #self-distillation

Core contributions

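01 Self-distillation as SSL: the student g_θs learns to match the softmax distribution of the teacher g_θt, with no contrastive loss and no negative samples
02 Multi-crop: 2 global crops (each covering more than 50% of the image) + V local crops; the teacher sees only the global crops, the student sees all of them
03 Two anti-collapse mechanisms: centering (subtract a running mean from the teacher logits) + sharpening (teacher temperature τt=0.04, lower than the student temperature τs=0.1)
04 Teacher = EMA of the student: θt ← λθt + (1-λ)θs, with λ rising from 0.996 to 1 over training
05 Emergent semantic segmentation: attention maps from self-supervised training can be used directly for unsupervised foreground segmentation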
06 ImageNet top-1: 80.1% linear probe with ViT-B/8 and 78.3% k-NN with a small ViT (ViT-S/8), bringing self-supervised ViTs close to supervised accuracy
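
The EMA teacher update in item 04 can be sketched in a few lines. This is a minimal illustration, assuming a cosine schedule for λ and plain Python lists standing in for parameter tensors; the names `momentum_at` and `ema_update` are hypothetical, not taken from any released code.

```python
import math

def momentum_at(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule (an assumption here) taking lambda from base to final."""
    progress = step / max(1, total_steps - 1)
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(teacher, student, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, applied per parameter."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]

# Toy parameters: the teacher drifts slowly toward the student.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
total_steps = 5
for step in range(total_steps):
    lam = momentum_at(step, total_steps)
    teacher = ema_update(teacher, student, lam)
```

Because λ reaches 1 at the end of training, the teacher effectively freezes in the final steps; no gradient ever flows into the teacher.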

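DINO moves SSL away from the contrastive approach (SimCLR / MoCo) to self-distillation: there are no negative samples and no instance discrimination, and the student is trained only to predict the teacher's output distribution.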

              multi-crop
   x ─────────────────────►  2 global (224²) + V local (96²)
                                  │
                      ┌───────────┴────────────┐
                      ▼                        ▼
              student g_θs                teacher g_θt
              (ALL views)                 (global only · EMA)
                      │                        │
                      ▼                        ▼
          softmax(z_s / τs)         softmax((z_t − C) / τt)
                      │                        │
                      └──────► CE loss ◄───────┘

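The teacher sees only the global crops while the student sees every crop, which forces "local → global" prediction: the student has to learn semantic-level features for its output on a small crop to match the teacher's output on a large crop.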
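
The pairing of views in the loss can be sketched as follows. This is a hedged, pure-Python illustration rather than the authors' implementation: `dino_loss` is a hypothetical name, logits are plain lists, and it assumes the student's view list puts the global views first in the same order as the teacher's.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Mean cross-entropy over (teacher global view, different student view) pairs.

    student_logits: one logit vector per view, global views first (assumption).
    teacher_logits: one logit vector per global view only.
    """
    total, n_terms = 0.0, 0
    for t_idx, z_t in enumerate(teacher_logits):
        centered = [z - c for z, c in zip(z_t, center)]
        # Teacher target: centered, then sharpened with the low temperature.
        p_t = [math.exp(lp) for lp in log_softmax(centered, tau_t)]
        for s_idx, z_s in enumerate(student_logits):
            if s_idx == t_idx:
                continue  # never compare a view with itself
            log_p_s = log_softmax(z_s, tau_s)
            total += -sum(p * lp for p, lp in zip(p_t, log_p_s))
            n_terms += 1
    return total / n_terms

# 2 global + 2 local views, K = 3 output dimensions.
student_views = [[2.0, 0.1, 0.0], [1.8, 0.0, 0.2], [1.5, 0.3, 0.1], [0.2, 1.0, 0.0]]
teacher_views = [[2.1, 0.0, 0.1], [1.9, 0.1, 0.0]]
loss = dino_loss(student_views, teacher_views, center=[0.0, 0.0, 0.0])
```

With 2 global and V local views, each teacher view is compared against V + 1 student views, so the local crops contribute most of the loss terms.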

Interview perspective

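Is it similar to BYOL? Very: both use a student/teacher pair with an EMA teacher and no negative samples. But DINO applies cross-entropy to softmax distributions rather than an L2 distance on representations, and its multi-crop setup is more aggressive than BYOL's two views.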
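
The difference between the two objectives can be made concrete with a small side-by-side sketch. Both functions are simplified illustrations with hypothetical names: the BYOL-style loss is written as a squared L2 distance between unit-normalized vectors, and the DINO-style loss omits centering to keep the contrast visible.

```python
import math

def byol_style_loss(online_pred, target_proj):
    """Squared L2 distance between unit vectors (equal to 2 - 2 * cosine)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    p, z = unit(online_pred), unit(target_proj)
    return sum((a - b) ** 2 for a, b in zip(p, z))

def dino_style_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between temperature-scaled softmax distributions."""
    def log_softmax(logits, tau):
        scaled = [x / tau for x in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
        return [x - log_z for x in scaled]
    log_p_s = log_softmax(student_logits, tau_s)
    p_t = [math.exp(x) for x in log_softmax(teacher_logits, tau_t)]
    return -sum(pt * lps for pt, lps in zip(p_t, log_p_s))

same_direction = byol_style_loss([1.0, 0.0], [2.0, 0.0])   # scale is ignored
orthogonal = byol_style_loss([1.0, 0.0], [0.0, 1.0])
agree = dino_style_loss([2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
disagree = dino_style_loss([0.0, 2.0, 0.0], [1.0, 0.0, 0.0])
```

The BYOL-style loss compares directions in feature space, while the DINO-style loss treats the output as a distribution over K dimensions and penalizes the student for putting mass where the sharpened teacher does not.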

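Why doesn't it collapse? Both mechanisms are needed because they fail in opposite directions: (1) centering subtracts a running mean from the teacher logits, which stops any single dimension from dominating but on its own pushes the output toward the uniform distribution; (2) sharpening uses a teacher temperature (τt=0.04) well below the student's (τs=0.1), which makes the teacher distribution peaked but on its own lets one dimension dominate for every input. Centering alone collapses to uniform outputs and sharpening alone collapses to a single mode; applied together, the two effects balance.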
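
A toy example may help show what each mechanism does to the teacher output. This is a sketch under stated assumptions: `update_center` and `teacher_probs` are hypothetical names, the center momentum m=0.9 is an illustrative default, and the batch is hand-picked so that dimension 0 dominates every sample.

```python
import math

def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, teacher_batch, m=0.9):
    """Running mean of teacher logits: c <- m * c + (1 - m) * batch_mean."""
    batch_mean = [sum(col) / len(teacher_batch) for col in zip(*teacher_batch)]
    return [m * c + (1 - m) * b for c, b in zip(center, batch_mean)]

def teacher_probs(logits, center, tau_t=0.04):
    return softmax([z - c for z, c in zip(logits, center)], tau_t)

batch = [[5.0, 0.2, 0.1], [5.0, 0.1, 0.3]]  # dimension 0 dominates every sample

center = [0.0, 0.0, 0.0]
for _ in range(200):  # let the running mean converge on this fixed batch
    center = update_center(center, batch)

argmax = lambda p: max(range(len(p)), key=p.__getitem__)
no_center = [argmax(teacher_probs(z, [0.0, 0.0, 0.0])) for z in batch]
with_center = [argmax(teacher_probs(z, center)) for z in batch]
# Centered but not sharpened (temperature 1.0): close to uniform.
flat = teacher_probs(batch[0], center, tau_t=1.0)
sharp = teacher_probs(batch[0], center, tau_t=0.04)
```

Without centering, both samples pick the shared dominant dimension; with centering, the shared component is removed and the samples separate. Centering with a high temperature leaves the output nearly flat, which is why the low teacher temperature is still needed.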

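Is emergent segmentation real? Yes. The [CLS] self-attention maps in the last layer of a DINO ViT follow foreground object boundaries without any segmentation annotation. This does not emerge nearly as clearly in supervised ViTs, which suggests that SSL learns semantic structure rather than just a classification decision boundary.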
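
One simple way to turn a [CLS] attention map into a foreground mask is to keep the highest-attention patches until they cover a fixed share of the attention mass. The sketch below assumes the attention weights from [CLS] to each patch have already been extracted as a flat list; `attention_mask` is a hypothetical helper, and the 60% default mirrors the thresholding used for the paper's visualizations but should be treated as a tunable assumption.

```python
def attention_mask(cls_attention, keep_mass=0.6):
    """0/1 mask over patches: keep the largest weights until they reach keep_mass."""
    total = sum(cls_attention)
    order = sorted(range(len(cls_attention)),
                   key=lambda i: cls_attention[i], reverse=True)
    mask = [0] * len(cls_attention)
    cumulative = 0.0
    for i in order:
        if cumulative >= keep_mass * total:
            break
        mask[i] = 1
        cumulative += cls_attention[i]
    return mask

# Toy 2x3 patch grid, flattened row by row.
attn = [0.05, 0.30, 0.25, 0.05, 0.20, 0.15]
mask = attention_mask(attn)
grid = [mask[0:3], mask[3:6]]
```

In a real ViT the mask would be reshaped to the patch grid of the input image and upsampled to pixel resolution; each attention head produces its own map.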

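What did DINOv2 change? Data (the curated LVD-142M dataset of 142M images) + training recipe (a KoLeo regularizer that spreads features apart within a batch + an iBOT-style masked image modeling loss at the patch level) + scaling to ViT-g (1.1B parameters). The architecture is essentially unchanged, but the features generalize far better and are widely used as a frozen backbone for dense tasks; Depth Anything, for example, builds on a DINOv2 encoder. (LLaVA uses a CLIP ViT as its vision tower, and the "DINO" in Grounding DINO is a DETR-style detector, not this method.)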
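
The KoLeo regularizer can be sketched as the negative mean log distance from each feature to its nearest neighbor in the batch. This is a minimal pure-Python illustration, assuming L2-normalized features and a small `eps` for numerical safety; `koleo_loss` is a hypothetical name, not a library function.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def koleo_loss(features, eps=1e-8):
    """-(1/n) * sum_i log(min_{j != i} ||x_i - x_j||) on normalized features."""
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(feats[i], feats[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
loss_spread = koleo_loss(spread)
loss_clustered = koleo_loss(clustered)
```

Tightly clustered features produce a large loss and well-spread features a small one, so minimizing this term pushes the batch toward a more uniform cover of the feature space.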

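Why do some VLMs use DINO features in the vision tower? CLIP-ViT is strong at text alignment, but its dense (per-position) features are weaker because the CLIP loss only supervises the pooled [CLS] embedding. DINO's patch-level tokens carry semantic structure of their own, which suits segmentation, grounding, and spatial VQA tasks.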