Llama 4 Released: I See the Shadow of DeepSeek

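Core argument

- Llama 4 does not chase the largest parameter count. Instead, Meta reorganizes the lineup around three models, reflecting a shift in architecture, multimodality, and training that signals a change of course [2][24][25]

Llama 4 model lineup

- The lineup is rebuilt around three models with distinct roles: one practical (Scout), one flagship (Maverick), and one teacher (Behemoth). It does not try to cover every task with a single model [2]

Architecture shift

- Llama 3 uses a dense architecture; Llama 4 uses a Mixture-of-Experts (MoE) architecture [4]
- Scout: 109B total parameters, 17B active, 16-expert MoE. It can be deployed on a single H100, offers a 10M-token context window, and suits tasks such as document analysis [5]
- Maverick: 400B total parameters, 17B active, 128-expert MoE, with a 1M-token context window. It is positioned against GPT-4o, with comparable performance at roughly one-tenth of the inference cost [5]
- Behemoth: 2T total parameters, 288B active, 16-expert MoE. It is neither deployed nor released and is used only during training to generate training data [5]
- MoE used to be mostly a "lab option". Since DeepSeek took off, Meta and other vendors have started using it for their flagship models. At inference time, Scout and Maverick each activate only two experts, for 17B active parameters [7]
- MoE does not suit every task and brings training difficulties such as complex expert scheduling, but how parameters are used deserves deliberate design [9]

Multimodal shift

- Llama 3 relied on a bolted-on external encoder for image input; in Llama 4, images enter directly as tokens and are modeled within the language context [10]
- The natively multimodal structure improves Maverick's results on tasks such as DocVQA, at one-tenth of GPT-4o's inference cost [12]
- Scout is a lightweight model, yet it scores higher on DocVQA and ChartQA than models of the same size and some larger ones [15]
- DeepSeek's V3/R1 still have not introduced image tokens [18]

Training shift

- Behemoth is not released. Its role is to generate training data for Scout and Maverick, demonstrate capabilities, and refine their behavior; Meta is putting more weight on the training system itself [19][22]
- This resembles OpenAI developing "Strawberry" to train a new GPT, and DeepSeek developing DeepSeek-R1-Lite to train DeepSeek V3 [23]

Benchmark comparison

Maverick vs. other models

| Category | Benchmark | Llama 4 (Maverick) | Gemini 2.0 Flash | DeepSeek v3.1 | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| Inference Cost | $ per 1M input & output tokens (3:1 blended) | $0.19-$0.49 | | | |
| Image Reasoning | MMMU | 73.4 | 71.7 | | 69.1 |
| | MathVista | 73.7 | 73.1 | | 63.8 |
| Image Understanding | ChartQA | 90.0 | 88.3 | | 85.7 |
| | DocVQA (test) | 94.4 | - | | 92.8 |
| Coding | LiveCodeBench | 43.4 | 34.5 | 45.8/49.2 | 32.3 |
| Reasoning & Knowledge | MMLU Pro | 80.5 | 77.6 | 81.2 | - |
| | GPQA Diamond | 69.8 | 60.1 | 68.4 | 53.6 |
| Multilingual | Multilingual MMLU | 84.6 | - | - | 81.5 |
| Long Context | MTOB (half book) eng → kgv / kgv → eng | 54.0/46.4 | 48.4/39.8 | | |
| Long Context | MTOB (full book) eng → kgv / kgv → eng | 50.8/46.7 | 45.5/39.6 | | |

[14]

Scout vs. other models

| Category | Benchmark | Llama 4 (Scout) | Llama 3.3 (70B) | Llama 3.1 (405B) | Gemma 3 (27B) | Mistral 3.1 (24B) | Gemini Flash-Lite |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image Reasoning | MMMU | 69.4 | No multimodal support | No multimodal support | 64.9 | 62.8 | 68.0 |
| | MathVista | 70.7 | | | 67.6 | 68.9 | 57.6 |
| Image Understanding | ChartQA | 88.8 | | | 76.3 | 86.2 | 73.0 |
| | DocVQA (test) | 94.4 | | | 90.4 | 94.1 | 91.2 |
| Coding | LiveCodeBench | 32.8 | 33.3 | 27.7 | 29.7 | - | 28.9 |
| Reasoning & Knowledge | MMLU Pro | 74.3 | 68.9 | 73.4 | 67.5 | 66.8 | 71.6 |
| | GPQA Diamond | 57.2 | 50.5 | 49.0 | 42.4 | 46.0 | 51.5 |
| Long Context | MTOB (half book) eng → kgv / kgv → eng | 42.2/36.6 | | | | | 42.3/3! |
| Long Context | MTOB (full book) eng → kgv / kgv → eng | 39.7/36.3 | | | | | 35.1/30 |

[17]

Behemoth vs. other models

| Category | Benchmark | Llama 4 Behemoth | Claude Sonnet 3.7 | Gemini 2.0 Pro | GPT-4.5 |
| --- | --- | --- | --- | --- | --- |
| Coding | LiveCodeBench | 49.4 | | 36.0 | - |
| Reasoning & Knowledge | MATH-500 | 95.0 | | 82.2 | - |
| | MMLU Pro | 82.2 | - | 79.1 | - |
| | GPQA Diamond | 73.7 | | 68.0 | 71.4 |
| Multilingual | Multilingual MMLU (OpenAI) | 85.8 | | 83.2 | 85.1 |
| Image Reasoning | MMMU | 76.1 | | 71.8 | 74.4 |

[21]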
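
The sparse activation described in the architecture notes (only a few experts run per token, so a 109B or 400B model computes with 17B active parameters) can be illustrated with a generic top-k routing sketch. This is a toy, not Meta's implementation: the function names `route_top_k`, `moe_forward`, and `active_share` are hypothetical, the "experts" are simple scaling functions, and the router scores are made up. Only the parameter counts in `MODELS` come from the summary, with Behemoth's 2T written as 2000B.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


def active_share(total_b, active_b):
    """Fraction of parameters used per token (both arguments in billions)."""
    return active_b / total_b


# (total, active) parameter counts in billions, as quoted in the summary.
MODELS = {"Scout": (109, 17), "Maverick": (400, 17), "Behemoth": (2000, 288)}

if __name__ == "__main__":
    # 16 toy experts; expert i multiplies its input by (i + 1).
    experts = [lambda x, s=s: s * x for s in range(1, 17)]
    logits = [0.0] * 16
    logits[3], logits[10] = 2.0, 1.0  # made-up router scores

    print(route_top_k(logits))            # the two selected experts and their weights
    print(moe_forward(1.0, experts, logits))
    for name, (total, active) in MODELS.items():
        print(f"{name}: {active_share(total, active):.1%} of parameters active per token")
```

With these figures, Scout uses about 15.6% of its parameters per token, Maverick about 4.3%, and Behemoth 14.4%, which is why Maverick's per-token compute is similar to Scout's despite having nearly four times the total parameters.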