OBJECTIVE
This study evaluates ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, in order to assess its suitability for integration into healthcare.
METHODS
A cross-sectional study was conducted using patient questions posted to r/AskDocs, a public Reddit forum where users seek medical advice from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and the responses were reviewed by 5 board-certified otolaryngologists. Evaluation criteria included question difficulty and response accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis examined the relationship between patient question characteristics and ChatGPT response scores, and potentially dangerous responses were identified.
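The abstract does not state how questions were submitted to the model; the following is a minimal sketch of one plausible workflow, assuming programmatic access to GPT-3.5 through the OpenAI Python client. The model name, prompt handling, and client usage here are illustrative assumptions, not details reported by the study, which may have used the ChatGPT web interface directly.

```python
# Minimal sketch (assumption): submitting one r/AskDocs question to GPT-3.5
# via the OpenAI Python client and returning the reply for expert review.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_chatgpt_response(patient_question: str) -> str:
    """Send one patient question verbatim and return the model's reply."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # GPT-3.5, as named in the abstract
        messages=[{"role": "user", "content": patient_question}],
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    question = "I've had ear fullness and muffled hearing for two weeks. What should I do?"
    print(get_chatgpt_response(question))
```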
RESULTS
Patient questions averaged 224.93 words, while ChatGPT responses were longer, averaging 414.93 words. Mean scores for ChatGPT responses were 3.76/5 for accuracy, 3.59/5 for comprehensiveness, and 4.28/5 for bedside manner/empathy. Longer patient questions were not associated with higher response ratings; however, longer ChatGPT responses scored higher in bedside manner/empathy, and higher question difficulty was associated with lower comprehensiveness. Five responses were flagged as potentially dangerous.
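The abstract reports associations between text length, question difficulty, and reviewer scores without naming the statistical test. Below is a minimal sketch of how such correlations could be computed, assuming a per-question table of word counts and mean reviewer scores; the column names, file name, and the choice of Spearman's rank correlation are assumptions for illustration, not the study's stated method.

```python
# Minimal sketch (assumption): correlating question/response length and
# difficulty with mean reviewer scores using Spearman's rank correlation.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("ratings.csv")  # hypothetical file: one row per question

pairs = [
    ("question_words", "accuracy"),
    ("question_words", "comprehensiveness"),
    ("response_words", "empathy"),
    ("difficulty", "comprehensiveness"),
]

for x, y in pairs:
    rho, p = spearmanr(df[x], df[y])
    print(f"{x} vs {y}: rho={rho:.2f}, p={p:.3f}")
```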
CONCLUSION
While ChatGPT exhibits promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI in medical advice. Responsible integration of AI into healthcare necessitates thorough assessments of model performance and ethical considerations for patient safety.