Classic papers as the foundation, with weekly updates tracking the frontier.

ssl · audio · speech · masked-prediction

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Hsu et al. (FAIR)

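Brings BERT-style masked prediction to speech in earnest: k-means cluster assignments serve as pseudo-phoneme labels, and the labels are refined iteratively across rounds of SSL training. Training is more stable than wav2vec 2.0.

LLM · 2018 · CLASSIC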

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

Luo & Mesgarani

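An early end-to-end speech separation method that works directly in the waveform domain, building on the authors' earlier TasNet. It beats the previous STFT-based approaches, including ideal time-frequency magnitude masks, showing that hand-designed signal-processing front ends are not necessarily required.

speech-separation · audio · end-to-end · waveform

audio-generation · codec · language-model · speech-synthesis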

AudioLM: A Language Modeling Approach to Audio Generation

Borsos et al. (Google)

Treats audio as a language: a SoundStream codec discretizes the waveform into tokens, and a Transformer then predicts the next token, GPT-style. This started the codec + LM paradigm.

audio-generation · music · text-to-audio · codec

MusicLM: Generating Music From Text

Agostinelli et al. (Google)

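AudioLM plus text conditioning gives text-to-music: MuLan (a CLIP-like joint text-audio embedding) handles alignment and AudioLM handles generation, producing coherent music up to 5 minutes long.

MLLM · 2023 · CLASSIC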

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Chu et al. (Alibaba)

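Connects a Whisper encoder to the Qwen LLM: a single model covers 30+ tasks, including ASR, translation, audio captioning, sound event detection (SED), and emotion recognition. In effect, an open-source take on a GPT-4o-style audio model.

audio-llm · multitask · speech · instruction-tuning

ssl · vision · vit · self-distillation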

DINO: Emerging Properties in Self-Supervised Vision Transformers

Caron et al. (Meta FAIR / Inria)

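Self-distillation with no labels: the student predicts the softmax output of an EMA teacher. Using only multi-crop augmentation plus centering and sharpening, it learns visual features that can be used directly for segmentation.

LLM · 2024 · CLASSIC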

DINOv2: Learning Robust Visual Features without Supervision

Oquab et al. (Meta FAIR)

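Scales DINO-style self-distillation, 142M curated images, and an iBOT-style patch-level loss up to ViT-g (1.1B parameters), producing general-purpose visual features that can be used directly for many downstream tasks.

ssl · vision · vit · foundation-model

ssl · contrastive · vision · foundational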

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

Chen et al. (Google)

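Uses about the simplest recipe possible, two augmented views of each image plus an InfoNCE-style contrastive loss, to push contrastive SSL to 76.5% top-1 accuracy under ImageNet linear probing, showing that strong data augmentation and a large batch are enough.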
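
The InfoNCE objective named in this summary can be sketched in a few lines. This is a simplified illustration, not SimCLR's batched implementation: it scores one anchor against one positive and an explicit list of negatives, and the function names and the temperature value of 0.5 are assumptions chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.5):
    """Negative log-softmax of the positive pair among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
# A positive that points the same way as the anchor gives a low loss.
loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)
# A "positive" that is orthogonal to the anchor gives a higher loss.
loss_misaligned = info_nce(anchor, [0.0, 1.0], negatives)
```

In the paper's setting the negatives are the other augmented views in the same batch, which is why batch size matters so much for this loss.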

LLM · 2019 · CLASSIC

MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

He et al. (FAIR)

A momentum encoder plus a queue decouples the number of negatives from the batch size, so contrastive learning also works with small batches.

ssl · contrastive · vision

ssl · vision · vit · masked-modeling

MAE: Masked Autoencoders Are Scalable Vision Learners

He et al. (FAIR)

BERT for vision: mask 75% of the patches, use an asymmetric encoder/decoder, and reconstruct at the pixel level. It made SSL for ViTs practical at production scale.

LLM · 2013 · CLASSIC

Auto-Encoding Variational Bayes (VAE)

Kingma & Welling

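Variational inference plus the reparameterization trick makes the generative model trainable with backpropagation: the cornerstone of latent-space generative models.

generative · vae · foundational · latent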
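
A minimal sketch of the reparameterization trick mentioned in the VAE summary, assuming a diagonal Gaussian posterior parameterized by a mean and a log-variance. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    All randomness lives in eps, so z is a deterministic, differentiable
    function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.0, -2.0], rng)
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])   # posterior equals prior
kl_pos = kl_to_standard_normal([0.5, -1.0], [0.0, -2.0])  # posterior differs from prior
```

The KL term is the regularizer in the evidence lower bound; the reconstruction term is omitted here because it depends on the decoder.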

ssl · audio · speech · contrastive

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Baevski et al. (FAIR)

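Brings BERT-style masking to raw audio, trained with a contrastive objective over the masked latent representations; 10 minutes of labeled data is enough to train an ASR system. This established the speech SSL paradigm.

diffusion · generative · foundational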

DDPM: Denoising Diffusion Probabilistic Models

Ho, Jain, Abbeel (Berkeley)

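Simplifies the diffusion objective to predicting the noise ε, so that training a diffusion model looks like ordinary supervised learning. This opened the diffusion era.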
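
The noise-prediction view can be illustrated with the closed-form forward process x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε and a mean-squared error on ε. The sketch below uses the linear β schedule from the paper (1e-4 to 0.02 over 1000 steps); the helper names are hypothetical, and an "oracle" that inverts the forward process stands in for the neural network.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, alpha_bar_t, eps):
    """Jump straight from x_0 to x_t using the closed-form forward process."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * e for x, e in zip(x0, eps)]

def noise_mse(eps, eps_pred):
    """Simplified training loss: MSE between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = alpha_bars(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
t = 500
x_t = q_sample(x0, abar[t], eps)

# Stand-in for the network: an oracle that knows x0 and solves for the noise.
eps_oracle = [(xt - math.sqrt(abar[t]) * x) / math.sqrt(1.0 - abar[t])
              for xt, x in zip(x_t, x0)]
loss = noise_mse(eps, eps_oracle)
```

A real model sees only x_t and t, so its loss is not zero; the oracle only confirms that the target ε is recoverable from the forward equation.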

diffusion · generative · stable-diffusion · vision

High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)

Rombach et al. (Heidelberg/Runway)

Runs diffusion in the latent space of a VAE-style autoencoder, which brings 512×512 generation within reach of consumer GPUs and kicked off the open-source AIGC wave.

diffusion · generative · conditioning

Classifier-Free Diffusion Guidance

Ho & Salimans (Google)

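Randomly drop the condition during training; at inference, extrapolate between the conditional and unconditional noise predictions. The result is strong conditioning without a separate classifier.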
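
Both halves of this recipe are short enough to sketch. The code below assumes the commonly used "guidance scale" form, ε̃ = ε_uncond + s·(ε_cond − ε_uncond), which matches the paper's formulation up to a reparameterization of the weight; the function names and the `null_token` placeholder are hypothetical.

```python
import random

def maybe_drop_condition(cond, p_uncond, rng, null_token=None):
    """Training time: replace the condition with a null token with probability p_uncond."""
    return null_token if rng.random() < p_uncond else cond

def guided_noise(eps_uncond, eps_cond, scale):
    """Inference time: extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
no_guidance = guided_noise(eps_u, eps_c, 1.0)   # scale 1 recovers the conditional prediction
strong = guided_noise(eps_u, eps_c, 3.0)        # scale > 1 pushes past it
```

A scale of 0 gives the unconditional prediction, so a single network trained with condition dropout covers the whole range.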

diffusion · transformer · generative

DiT: Scalable Diffusion Models with Transformers

Peebles & Xie (NYU/UC Berkeley)

Replaces the U-Net with a Transformer and shows that diffusion models also follow scaling laws; Sora, SD3, and FLUX all followed this direction.

vlm · multimodal · foundational

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs

Li et al. (Salesforce)

A lightweight Q-Former bridges a frozen vision encoder and a frozen LLM: the starting point of the parameter-efficient, module-reuse paradigm.

vlm · multimodal · few-shot · foundational

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac et al. (DeepMind)

Inserts cross-attention layers into a frozen LLM so that it can see images and video. Arguably the first real GPT-3 for vision, it opened up in-context VQA.

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Radford et al. (OpenAI)

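680,000 hours of weakly supervised audio plus a standard encoder-decoder Transformer push multilingual ASR robustness to a new level: the GPT moment for audio.

audio · speech · asr · multimodal

multimodal · embedding · alignment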

ImageBind: One Embedding Space To Bind Them All

Girdhar et al. (Meta)

Using only image-X paired data, aligns 6 modalities in a single embedding space, so cross-modal retrieval does not need N×N pairings.

LLM · 2017 · CLASSIC

Attention Is All You Need

Vaswani et al.

Replaces RNNs and CNNs entirely with self-attention: the founding paper of the Transformer architecture and the base of nearly every modern LLM.

transformer · attention · architecture · foundational

LLM · 2018 · CLASSIC

BERT: Pre-training of Deep Bidirectional Transformers

Devlin et al.

Masked language modeling plus a bidirectional Transformer encoder: the foundational work of the pretrain-then-fine-tune paradigm.

pretrain · encoder · foundational · nlp

gpt · pretrain · scaling · in-context-learning

Language Models are Few-Shot Learners (GPT-3)

Brown et al.

A 175B-parameter decoder-only LM that first demonstrated in-context learning: solving new tasks through the prompt alone.

Training language models to follow instructions (InstructGPT)

Ouyang et al.

The foundational RLHF work: a three-stage alignment pipeline of SFT, reward model, and PPO that led directly to ChatGPT.

rlhf · alignment · sft · ppo

peft · finetune · efficiency · foundational

LoRA: Low-Rank Adaptation of Large Language Models

Hu et al.

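Instead of full fine-tuning, freezes the pretrained weights and trains a low-rank update ΔW = B·A built from two small matrices. Trainable parameters drop to roughly 0.1%-1% of the model, and because the update can be merged into the weights, inference has zero extra overhead.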
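
The low-rank update and the "zero extra inference cost" claim can be checked on a toy layer. This is a sketch under stated assumptions: plain Python lists stand in for tensors, the helper names are hypothetical, and the scaling factor alpha / r follows the convention used in the paper.

```python
def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def merge(W, A, B, alpha, r):
    """Fold the low-rank update into W so inference uses a single matrix."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

d, k, r, alpha = 4, 4, 1, 2.0
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen weight (d x k)
A = [[0.1, 0.2, 0.3, 0.4]]                                          # r x k
B = [[1.0], [0.0], [-1.0], [0.5]]                                   # d x r
x = [1.0, 2.0, 3.0, 4.0]

y_adapter = lora_forward(W, A, B, x, alpha, r)
y_merged = matvec(merge(W, A, B, alpha, r), x)
trainable = r * (d + k)   # parameters in A and B
full = d * k              # parameters in W
```

In the paper B is initialized to zero, so the adapted layer starts out identical to the pretrained one; the nonzero B here is only to make the merged-weight check meaningful.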

position-encoding · attention · architecture

RoFormer: Rotary Position Embedding

Su et al.

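Encodes position by rotating each pair of query/key dimensions with a 2D rotation matrix, so relative position shows up naturally in the query-key dot product. The formulation is flexible with respect to sequence length, which later length-extrapolation methods build on.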
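
The relative-position property is easy to verify numerically. The sketch below assumes the usual frequency schedule θ_j = base^(−2j/d) with base 10000 and rotates adjacent dimension pairs; the function names are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # i is the even index, so -i/d equals -2j/d
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
# A different offset gives a different score in general.
score_other = dot(rope(q, 5), rope(k, 4))
```

The two scores with the same offset agree up to floating-point error, which is the sense in which the attention logit depends only on relative position.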

attention · inference · efficiency · systems

FlashAttention: Fast and Memory-Efficient Exact Attention

Dao et al.

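Tiling plus recomputation restructures O(n²) attention so that it is no longer bottlenecked by HBM reads and writes; the work is done block by block in on-chip SRAM. The result is **exact** attention, not an approximation.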
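
Why tiling can still give exact results comes down to the online-softmax rescaling, sketched below for a single query. This is a CPU illustration of the arithmetic only, with hypothetical function names; it does not model GPU memory, the backward-pass recomputation, or the actual kernel.

```python
import math

def scaled_scores(q, keys):
    """Dot-product scores scaled by 1 / sqrt(d)."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention_reference(q, keys, values):
    """Standard softmax attention over all keys at once."""
    scores = scaled_scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, block=2):
    """Process keys/values block by block, keeping a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = scaled_scores(q, k_blk)
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # corrects earlier partial results
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(probs)
        acc = [a * rescale + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [-0.3, 2.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 3.0]]
out_ref = attention_reference(q, keys, values)
out_tiled = attention_tiled(q, keys, values, block=2)
```

Because each block only needs the running max, the running sum, and the accumulator, the full n×n score matrix never has to be materialized.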

prompting · reasoning · in-context-learning

Chain-of-Thought Prompting Elicits Reasoning

Wei et al.

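Adding intermediate reasoning steps to the few-shot exemplars given to an LLM sharply improves math and commonsense reasoning performance. (The zero-shot trigger phrase "Let's think step by step" comes from follow-up work by Kojima et al.)

alignment · rlhf · preference-learning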

Direct Preference Optimization

Rafailov et al.

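Shows mathematically that the two-stage RM + PPO pipeline of RLHF can be collapsed into a single SFT-style loss: skip the explicit reward model and optimize the policy directly on preference data.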
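
The single loss in question is −log σ(β·[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the preferred response and y_l the rejected one. A minimal sketch for one preference pair follows, assuming the sequence log-probabilities are already computed; the function name and β = 0.1 are illustrative choices.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, loss is log 2.
loss_neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raised the chosen response and lowered the rejected one: lower loss.
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy moved the wrong way: higher loss.
loss_worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The reference log-probabilities come from a frozen copy of the SFT model, so no separate reward model or sampling loop is needed during training.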

RAG · 2020 · CLASSIC

Retrieval-Augmented Generation for Knowledge-Intensive NLP

Lewis et al.

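Combines parametric knowledge (stored in the LM weights) with non-parametric knowledge (an external vector index): the starting point of the RAG line of work.

rag · retrieval · foundational

MLLM · 2021 · CLASSIC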

CLIP: Learning Transferable Visual Models From Natural Language Supervision

Radford et al.

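400 million image-text pairs plus contrastive learning: zero-shot image classification that rivals a fully supervised ResNet-50 on ImageNet.

multimodal · contrastive · vision-language · foundational

MLLM · 2023 · CLASSIC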

LLaVA: Visual Instruction Tuning

Liu et al.

CLIP ViT + a LLaMA-based LLM (Vicuna) + a single linear projection: a simple but effective open-source VLM baseline.

vlm · multimodal · instruction-tuning

inference · systems · serving · kv-cache

Efficient Memory Management for LLM Serving with PagedAttention (vLLM)

Kwon et al.

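Borrows virtual memory and paging from operating systems: the KV cache is split into fixed-size blocks, which greatly reduces memory fragmentation and raises throughput by 2-4×.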
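
The block-table idea can be sketched as a small allocator. This is a toy bookkeeping model, not vLLM's API: the class and method names are hypothetical, and it tracks only which physical block each token lands in, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps logical positions to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; allocate a new block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # current block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size  # (physical block, offset inside it)

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("a")   # 5 tokens need 2 blocks
for _ in range(3):
    cache.append_token("b")   # 3 tokens need 1 block
```

Memory is wasted only in the last, partially filled block of each sequence, rather than in a contiguous region pre-reserved for the maximum possible length.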

LLM · 2025 · CLASSIC

DeepSeek R1: Incentivizing Reasoning via Reinforcement Learning

DeepSeek-AI

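Pure RL with no SFT warm-up (the DeepSeek-R1-Zero variant) elicits emergent reasoning ability, with chain-of-thought behavior growing naturally out of the reward signal; the released R1 model adds a small cold-start SFT stage before RL.

reasoning · rlhf · grpo · frontier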