The use of artificial intelligence (AI) for health information is becoming increasingly common, especially among women. However, a recent study has revealed that many commonly used AI models are failing to accurately diagnose or provide advice for women’s health queries that require urgent attention.
Thirteen large language models, from developers including OpenAI, Google, Anthropic, Mistral AI, and xAI, were tested with 345 medical queries across five specialties, including emergency medicine, gynecology, and neurology. The queries were curated by 17 women’s health experts from the US and Europe, and the same experts then evaluated the models’ responses for accuracy.
The results were concerning: approximately 60% of the questions were answered in a manner that the experts deemed insufficient for medical advice. Among the models tested, GPT-5 performed best, with a failure rate of 47%, while Ministral 8B had the highest failure rate, at 73%.
The lead researcher, Victoria-Elisabeth Gruber from Lumos AI, expressed her surprise at the high rate of failure among the AI models. She highlighted the importance of addressing the existing gender biases in medical knowledge that are inherited and amplified by AI technologies.
Experts such as Cara Tannenbaum from the University of Montreal emphasized the need for online health sources and healthcare professional societies to update their content with sex and gender-related evidence-based information to improve the accuracy of AI models in supporting women’s health.
While some critics have raised concerns about the sample size and design of the study, Gruber defended the conservative approach taken in the evaluation, stating that even minor omissions in healthcare can have significant consequences.
In response to the study, OpenAI stated that ChatGPT is designed to support, not replace, medical care, and that the company works closely with clinicians to improve its models and the accuracy of the information they provide. The other companies whose AI models were tested did not respond to requests for comment.
Overall, the study highlights the need to keep evaluating and improving AI models so that the health information they provide is accurate and reliable, particularly for women’s health. Addressing the biases and shortcomings the experts identified would help AI technologies better serve the healthcare needs of diverse populations.

