AI Has Demonstrated Incompetence in Women's Health Issues

Most widely used AI models are unable to accurately diagnose or give sound advice on many women's health issues that require immediate attention.

Thirteen large language models developed by companies such as OpenAI, Google, Anthropic, Mistral AI and xAI received 345 medical queries from five areas, including emergency medicine, gynecology and neurology. The questions were compiled by 17 women’s health researchers, pharmacists, and clinicians from the United States and Europe.

The models’ answers were reviewed by the same group of experts, and the results were published on arXiv. The questions that the models failed to answer were combined into a benchmark for evaluating the medical competence of AI, which ultimately comprised 96 questions.

Across all models, the average share of answers judged unsuitable as medical advice was about 60%. GPT-5 performed best, erring in 47% of cases, while Ministral 8B had the highest error rate at 73%.

“I see women in my community turning to AI tools for medical advice and decision-making support. This is what motivated us to create the first benchmark in this area,” explains Victoria-Elisabeth Gruber of Lumos AI, which helps other companies evaluate and improve their own AI models.

Unexpectedly weak results

The researcher admits that she was surprised by the level of errors: “We expected some gaps, but what was especially striking was the extent of the differences between the models.”

The results are hardly surprising given what AI models are trained on: data riddled with errors and inaccuracies, says Kara Tannenbaum of the University of Montreal.

“There is a clear need for online health information sources, as well as professional health communities, to update their web content to include explicit sex and gender information so that AI can accurately support women’s health,” she points out.

The 60% error rate is somewhat misleading, says Jonathan H. Chen of Stanford University.

“I wouldn’t get too hung up on the 60% number because the sample was limited and specially designed by experts,” he emphasizes. “It was not intended to be broad or representative of questions that patients or doctors typically ask.”

Additionally, some test scenarios were graded very conservatively, making failure likely. For example, if a woman reported a headache after childbirth and the model did not raise the possibility of preeclampsia, the answer was counted as an error.

AI is not a replacement for a doctor

“Our goal was not to claim that models are universally unsafe, but to define a clear, clinically sound standard for evaluation. The benchmark is intentionally conservative and rigorous in its definition of errors, because in healthcare even seemingly minor omissions can matter, depending on the context,” Gruber explained.

“ChatGPT is intended to support, not replace, medical care,” OpenAI responded. “We take the accuracy of model output seriously, and while ChatGPT can provide useful information, users should rely only on qualified physicians for their treatment decisions.”

Author: uaetodaynews
Published on: 2026-01-07 20:15:00
Source: uaetodaynews.com
