Cohort 1 · Public syllabus · Camp opens 2026/5/25

7-Day MLLM Multimodal Sprint Camp · Syllabus

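6 days of themed classes + 1 day of closing mock interviews. 6 topics · 62+ required papers · 1 free AI mock interview · homework graded by the camp lead. The day-by-day schedule is below so you can judge whether the camp fits you.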

Sign up now ¥99 →

Day 0

📚

Pre-camp prep (self-study before the camp opens)

· DL fundamentals review (supervised learning / SSL / cross-entropy / embeddings / overfitting / fine-tuning)
· 12 recommended external courses: Andrew Ng / fast.ai / Karpathy Zero-to-Hero / 3Blue1Brown / CS231n / CS224n / HuggingFace NLP / and more

Day 1

✦

Transformer / LLM Fundamentals

Why is the Transformer the universal backbone of modern AI?

●Tokenization · Embedding · Positional Encoding
●Self-Attention · Multi-Head · FFN · LayerNorm
●Encoder-only / Decoder-only / Encoder-Decoder
●LLM: next-token / CLM / instruction tuning
●Fine-tuning: LoRA · PEFT · prompt tuning
●Decoding strategies: greedy / beam / top-k / top-p

📄 8 required papers · 💻 Implement scaled dot-product attention in numpy + draw a Transformer architecture diagram
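To give a sense of the scope of the Day 1 coding exercise, here is a minimal sketch of scaled dot-product attention in NumPy. It is not the camp's reference solution; the function names, the optional boolean `mask` argument, and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v).

    mask (assumed convention): boolean array broadcastable to
    (..., n_q, n_k); True means "may attend".
    """
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Masked positions get a very negative score, so softmax gives ~0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 queries, d_k = 8
k = rng.normal(size=(6, 8))    # 6 keys
v = rng.normal(size=(6, 16))   # 6 values, d_v = 16
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 16)
```

Passing `mask=np.tril(np.ones((n, n), dtype=bool))` turns this into the causal attention used by decoder-only models.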

Day 2

✦

Self-Supervised / Contrastive / Autoencoders

How do models learn from unlabeled media data?

●SSL concepts · pretext tasks
●Contrastive Learning · InfoNCE · CLIP
●Multimodal contrastive (image-text / audio-text / video-text)
●Autoencoders · VAE · latent space
●Masked Modeling: MLM / MIM / MAE

📄 10 required papers · 💻 Write the InfoNCE loss in PyTorch + draw a CLIP training pipeline diagram
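As a rough illustration of what the Day 2 exercise involves, the sketch below computes a symmetric, CLIP-style InfoNCE loss over a batch of paired embeddings. The assignment asks for PyTorch; this version uses NumPy only to stay dependency-light, and the function names and the fixed `temperature=0.07` are assumptions, not the camp's reference solution.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N paired embeddings, shape (N, D).

    Row i of img_emb and row i of txt_emb form the positive pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])
    # Cross-entropy with the diagonal as the correct class, in both directions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))
aligned = info_nce(x, x)                        # pairs match
shuffled = info_nce(x, np.roll(x, 1, axis=0))   # pairs deliberately mismatched
print(aligned < shuffled)
```

In a PyTorch version, the two directional terms would typically be written as cross-entropy over `logits` and `logits.T` with `arange(N)` as the targets.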

Day 3

✦

Generative Models: AR · Diffusion · VAE

How do models generate text, images, audio, and video?

●Autoregressive: sequence factorization · exposure bias
●Diffusion: DDPM / latent / CFG / Sora
●VAE: latent variable · KL · reparameterization
●Text→image / audio / video: end-to-end pipelines
●AR vs Diffusion vs VAE: how to choose

📄 11 required papers · 💻 Write a toy DDPM forward step + draw a Stable Diffusion pipeline diagram
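For orientation, a toy DDPM forward (noising) step can be sketched in a few lines using the closed form $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I)$. This is only an illustrative sketch, not the camp's reference solution; the function names and the linear beta schedule with 1000 steps are assumptions chosen for the example.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (an assumed choice for this toy example).
    betas = np.linspace(beta_start, beta_end, T)
    # alpha_bar_t is the cumulative product of (1 - beta_s) for s <= t.
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 in one shot, without iterating t steps."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bars[t]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    # The noise is returned because it is the usual regression target
    # for the denoising network during training.
    return xt, noise

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # stand-in for a tiny image
x_early, _ = q_sample(x0, 10, alpha_bars, rng)
x_late, _ = q_sample(x0, 999, alpha_bars, rng)
print(alpha_bars[10], alpha_bars[999])        # close to 1 early, close to 0 late
```

At small `t` the sample stays close to `x0`; at the final step `alpha_bars[t]` is near zero, so the sample is almost pure Gaussian noise.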

Day 4

✦

Vision-Language / Video-Language Models

How do models understand images & videos with language?

●Three VLM architectures (Dual / Enc-Dec / Multimodal LLM)
●Image captioning · VQA · grounding · hallucination
●Video-Language: frame sampling · temporal · long video
●Multimodal LLM (BLIP-2 / LLaVA / Flamingo)
●What makes video harder than images

📄 11 required papers · 💻 Draw a Multimodal LLM architecture diagram + write a video captioning system design

Day 5

✦

Audio AI: Speech / Sound / Music / Spatial

How do models understand and generate audio?

●Audio representation (mel-spec / codec)
●Speech AI: ASR / TTS / separation / Whisper
●Music / Sound: text-to-music / SED
●Spatial Audio (Binaural / Object-based / Ambisonics)
●Audio-Language Models (wav2vec / AudioLM / Qwen2-Audio)

📄 11 required papers · 💻 Pick 1 audio AI task and write a system design + record a 2-minute explanation

Day 6

✦

Multimodal Fusion / Reasoning / Evaluation

How to combine modalities for understanding & reasoning?

●Fusion types (early / late / cross-attention / LLM-based)
●Tri-modal alignment · ImageBind / LanguageBind
●Multimodal reasoning · CoT · causality
●Evaluation: MM-Vet / MMMU / SEED-Bench
●Hallucination detection

📄 11 required papers · 💻 Draw a tri-modal fusion architecture + list 5 frequently asked interview questions and answer them yourself

Day 7

🎓

Closing Mock · 1v1 feedback + grading by the camp lead

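· 1v1 paired mock interviews within the group (30 min)
· AI scoring + personalized written feedback from the camp lead
· Cohort 1 camp badge awarded (can go on your résumé / LinkedIn)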

Sound like a fit? Join Cohort 1 now

Cohort 1 pilot price ¥99 (includes 1 free AI mock interview) · limited to 30 people · rolling start: you enter Day 1 the day after you sign up