Code and data from our physicians-in-the-loop curation of new ground-truth labels for MedCalc-Bench, a benchmark (NeurIPS 2024 Oral and included in MedHELM) for evaluating LLMs on medical score ...