医学基准评估

搜索文档
ICML 2025 | 清华、上海AI Lab提出专家级医学基准MedXpertQA,看o3、R1哪家强
机器之心· 2025-07-08 12:09
本文作者来自于清华大学和上海 AI Lab,通讯作者为清华大学丁宁助理教授和清华大学讲席教授、上海 AI Lab 主任周伯文教授。 论文已被 ICML 2025 接收,并且被 DeepMind MedGemma 采用为评估基准 。 | Metric | MedGemma 27B | Gemma 3 27B | MedGemma 4B | Gemma 3 4B | | --- | --- | --- | --- | --- | | MedQA (4-op) | 89.8 (best-of-5) 87.7 (0-shot) | 74.9 | 64.4 | 50.7 | | MedMCQA | 74.2 | 62.6 | 55.7 | 45.4 | | PubMedQA | 76.8 | 73.4 | 73.4 | 68.4 | | MMLU Med (text only) | 87.0 | 83.3 | 70.0 | 67.2 | | MedXpertQA (text only) | 26.7 | 15.7 | 14.2 | 11.6 | | AfriMed-QA | 84.0 | 72.0 | 52.0 | 4 ...