A study published in Science by researchers from Harvard Medical School and Beth Israel Deaconess Medical Center found that OpenAI’s o1 and 4o large language models outperformed internal medicine physicians in diagnosing patients during emergency room triage. The study evaluated 76 real-world ER cases, comparing AI-generated diagnoses to those made by two attending internal medicine physicians, with assessments conducted by two additional physicians blinded to the source of each diagnosis.
Overview
The research team tested how well AI models could generate accurate diagnoses using only the information available in electronic medical records at the time of initial patient evaluation. No data pre-processing was performed, ensuring the models received the same inputs as human clinicians. Diagnoses from the o1 and 4o models were compared against those from two internal medicine attending physicians, with performance assessed at multiple diagnostic touchpoints.
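To make the evaluation design concrete, the following is a minimal sketch of how such a blinded comparison might be wired up; the data structures, source labels, and grading scheme are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a blinded grading setup (illustrative, not from the
# paper): each case carries candidate diagnoses from the two models and the
# two physicians; graders see them shuffled, with the source withheld.
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str      # e.g. "o1", "4o", "physician_A", "physician_B"; hidden from graders
    diagnosis: str

def blind(candidates: list[Candidate], seed: int | None = None) -> list[str]:
    """Return the diagnoses in random order with their sources withheld."""
    rng = random.Random(seed)
    shuffled = candidates[:]
    rng.shuffle(shuffled)
    return [c.diagnosis for c in shuffled]

def accuracy(grades: list[bool]) -> float:
    """Fraction of cases a grader marked exact-or-very-close."""
    return sum(grades) / len(grades)

# Example: one case's candidates, blinded before grading.
case = [
    Candidate("o1", "pulmonary embolism"),
    Candidate("4o", "acute coronary syndrome"),
    Candidate("physician_A", "pulmonary embolism"),
    Candidate("physician_B", "pneumonia"),
]
print(blind(case, seed=0))
```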
What it does
At the first diagnostic touchpoint (initial ER triage, when clinical information is most limited), the o1 model provided an exact or very close diagnosis in 67% of cases. By comparison, one physician reached the correct or near-correct diagnosis in 55% of cases and the other in 50%. The study noted that differences were most pronounced at this early stage, where rapid decision-making is critical and data are sparse. The models’ edge is attributed to probabilistic reasoning over the broad medical knowledge they encode, which can surface subtle diagnostic patterns.
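As a back-of-envelope check on what those rates mean across the 76 cases (simple arithmetic, not counts reported by the study):

```python
# Approximate case counts behind the reported exact-or-very-close rates.
n_cases = 76
rates = {"o1 model": 0.67, "physician 1": 0.55, "physician 2": 0.50}
for who, rate in rates.items():
    print(f"{who}: ~{round(rate * n_cases)} of {n_cases} cases")
# -> o1 model: ~51 of 76 cases
# -> physician 1: ~42 of 76 cases
# -> physician 2: ~38 of 76 cases
```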
The study emphasized that the models were evaluated solely on text-based inputs and did not process imaging, lab results, or other non-text data. The researchers caution that current foundation models remain limited in reasoning over non-text inputs, and the study makes no claims about AI readiness for autonomous clinical decision-making.
Tradeoffs
While the o1 model outperformed the physicians in diagnostic accuracy at triage, the study’s authors stress that AI is not positioned to replace human clinicians. Arjun Manrai, lead author and head of an AI lab at Harvard Medical School, said the model “eclipsed both prior models and our physician baselines,” but the team calls for prospective trials in real-world care settings before clinical deployment. Adam Rodman, another lead author, highlighted the lack of a formal accountability framework for AI-generated diagnoses.
Emergency physician Kristen Panthagani noted a key limitation: the comparison was made against internal medicine physicians, not emergency medicine specialists. She argued that ER doctors prioritize identifying immediately life-threatening conditions over pinpointing final diagnoses, a distinction not fully captured in the study’s evaluation criteria.
When to use it
The findings suggest AI could serve as a decision-support tool during early triage, especially in resource-constrained or high-volume settings. However, integration into clinical workflows requires rigorous validation, clear accountability protocols, and alignment with specialty-specific diagnostic goals. The study does not support autonomous deployment: its authors frame the models, at most, as adjuncts to clinician judgment pending prospective, real-world trials.
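As an illustration of that decision-support framing (not the study’s actual setup), here is a minimal sketch using the OpenAI Python client; the prompt wording, the sample note, and the choice of gpt-4o are assumptions for illustration.

```python
# Hypothetical decision-support sketch: send a de-identified triage note to a
# model and ask for a ranked differential. Prompt wording and model choice are
# assumptions; any output would require physician review.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

triage_note = (
    "58-year-old presenting with acute chest pain radiating to the left arm, "
    "diaphoresis, and shortness of breath; history of hypertension."
)

response = client.chat.completions.create(
    model="gpt-4o",  # the study also evaluated o1; model choice is illustrative
    messages=[
        {
            "role": "system",
            "content": (
                "You are a clinical decision-support aid. Given a triage note, "
                "list a ranked differential diagnosis with brief rationale, "
                "flagging immediately life-threatening possibilities first."
            ),
        },
        {"role": "user", "content": triage_note},
    ],
)
print(response.choices[0].message.content)
```

In this framing, the model’s output would function strictly as a suggestion surfaced alongside, not in place of, clinician judgment, consistent with the accountability concerns raised above.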