Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
This study introduces MedCheck, a framework of 46 criteria for assessing the quality of medical benchmarks for large language models (LLMs). An analysis of 53 existing benchmarks revealed systemic issues, including a disconnect from clinical practice, poor data quality control, and a lack of safety and fairness evaluations. The authors offer MedCheck as a guide for developing more robust and clinically relevant benchmarks.