OpenAI: Open-Source Benchmark HealthBench: Evaluating the Performance and Safety of Large Language Models in Healthcare (English version)
Publisher: wx****7c
Published: 2025-05-26
Size: 8 MB
Length: 39 pages
Summary
We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning sever…
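The abstract says that responses are graded against physician-written rubrics. It does not say how individual criteria become a score. The sketch below shows one way rubric-based grading could work and rests on several assumptions that are not stated in this excerpt:

- Each criterion carries a signed point value. Positive values reward desirable behavior and negative values penalize harmful behavior.
- A response's score is the points it earns divided by the maximum achievable positive points.
- The averaged score across conversations is clipped to [0, 1].
- The met/unmet judgments are supplied by some grader, human or model.

The names `RubricCriterion`, `rubric_score`, and `overall_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    # Hypothetical structure: one physician-written criterion with a signed point value.
    description: str
    points: int  # assumption: positive = desirable behavior, negative = harmful behavior


def rubric_score(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score one response against its conversation-specific rubric.

    Assumption: earned points (including penalties) are divided by the total
    achievable positive points. The result may be negative for a single response.
    """
    if len(criteria) != len(met):
        raise ValueError("need exactly one met/unmet judgment per criterion")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric has no positively weighted criteria")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return earned / max_points


def overall_score(example_scores: list[float]) -> float:
    """Average per-conversation scores and clip the mean to [0, 1] (assumption)."""
    if not example_scores:
        raise ValueError("no scores to aggregate")
    mean = sum(example_scores) / len(example_scores)
    return min(1.0, max(0.0, mean))


# Illustrative rubric for one hypothetical conversation.
rubric = [
    RubricCriterion("Advises calling emergency services", 10),
    RubricCriterion("Asks when the symptoms started", 5),
    RubricCriterion("Recommends an unsafe medication dose", -8),
]
print(round(rubric_score(rubric, [True, False, False]), 3))  # 10 / 15
print(round(rubric_score(rubric, [True, True, True]), 3))    # (15 - 8) / 15
```

In this sketch, the penalty criterion lowers the score of an otherwise complete answer. That captures the abstract's point that the benchmark measures safety as well as performance. Per-response scores are left unclipped so that harmful answers still pull down the aggregate before the final clip.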