Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advances in large language models (LLMs) offer the potential to automatically generate literature reviews on demand, addressing this issue. However, LLMs sometimes generate inaccurate (and potentially misleading) texts through hallucination or omission. In healthcare, this can make LLMs unusable at best and dangerous at worst. We conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of LLMs in the specific context of medical evidence reviews. Experts indicated that LLMs can assist in the writing process by drafting summaries, generating templates, distilling information, and cross-checking information. They also raised concerns regarding confidently composed but inaccurate LLM outputs and other potential downstream harms, including decreased accountability and the proliferation of low-quality reviews. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.