Researchers tested an AI model against ER doctors and found the model outperformed the humans.
shapecharge/E+/Getty Images
A patient arrives at the hospital with a pulmonary embolism, a blood clot that has reached the lungs. After initial improvement, the patient's condition begins to deteriorate, leading the medical team to suspect that the medication isn't working.
Enter artificial intelligence with its own hypothesis.
After reviewing the medical records, the AI identifies a history of lupus, an autoimmune disorder that can cause heart inflammation, as the likely underlying issue.
The AI’s diagnosis proves to be accurate.
Such a scenario might soon become commonplace, according to a study published Thursday in the journal Science.
Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center discovered that an AI reasoning model developed by OpenAI excelled in diagnosing patients and making decisions on their treatment. It often outperformed doctors and an earlier AI model, GPT-4.
The team conducted several experiments to assess the AI’s clinical expertise, including cases like the lupus patient treated at Beth Israel’s emergency department in Boston.
They evaluated the AI’s ability to deliver an accurate diagnosis at various stages, from triage in the ER to hospital admission.
Overall, using only electronic health records and the limited information available at the time, the AI surpassed two experienced physicians.
“This is the big conclusion for me — it works with the messy real-world data of the emergency department,” said Dr. Adam Rodman, a clinical researcher at Beth Israel and one of the study authors. “It works for making diagnoses in the real world.”
Other parts of the study used challenging case reports from the New England Journal of Medicine and clinical vignettes to determine if the AI could meet established “benchmarks” and tackle complex diagnostic questions.
“The model outperformed our very large physician baseline,” said Raj Manrai, assistant professor of biomedical informatics at Harvard Medical School and another author of the study.
The authors note that the research relied solely on text, whereas real-life clinicians must consider other factors such as images, sounds, and nonverbal cues in diagnosis and treatment.
Nonetheless, the study highlights the significant advancements in technology over recent years. Previous generations of large language models struggled with uncertainty and generating differential diagnoses.
“This paper is a beautiful summary of just how much things have improved,” says Dr. David Reich, chief clinical officer for Mount Sinai Health System in New York, who was not involved in the research.
“You have something which is quite accurate, possibly ready for prime time,” he says. “Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?”
According to Reich, arriving at a complex final diagnosis, where the AI excels, may not fully reflect real-world clinical medicine, where outcomes are often more nuanced and varied.
Moreover, the emergency department represents only a segment of a patient’s overall medical care. Rodman acknowledges that AI might not have performed as impressively if provided with records of a patient who had been hospitalized for a month.
The study's authors do not believe the findings suggest replacing doctors with AI, “despite what some companies are likely to say and how they’re likely to use these results,” says Manrai.
“I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine,” he adds.
However, the study underscores the importance of testing AI models rigorously, ideally through prospective trials that can provide greater certainty about the technology’s impact on clinical practice.
“It’s a very challenging process to design these trials,” says Reich, “but this study is a perfect call to action.”