Can researchers stop AI making up citations?

EL.KZ Informational and educational portal

09.09.2025 09:17
Photo: El.kz / Grok / Dinmukhamed Beissembayev
“For most cases of hallucination, the rate has dropped to a level” that seems to be “acceptable to users”, says Tianyang Xu, an AI researcher at Purdue University in West Lafayette, Indiana. But in particularly technical fields, such as law and mathematics, GPT-5 is still likely to struggle, she says, El.kz reports, citing Nature.

OpenAI is making “small steps that are good, but I don’t think we’re anywhere near where we need to be”, says Mark Steyvers, a cognitive science and AI researcher at the University of California, Irvine. “It’s not frequent enough that GPT says ‘I don’t know’.”

A feature, not a bug

Hallucinations are a result of the fundamental way in which LLMs work. As statistical machines, the models make predictions by generalizing on the basis of learnt associations, leading them to produce answers that are plausible, but sometimes wrong. Another issue is that, much as a student scores points for guessing on a multiple-choice exam, LLMs are rewarded during training for having a go rather than for acknowledging their uncertainty, according to a preprint published by OpenAI on 4 September.
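The guessing incentive can be shown with a quick expected-value calculation. This is a toy sketch, not OpenAI's actual training objective: it assumes accuracy-only scoring, where a correct answer earns one point and both wrong answers and "I don't know" earn nothing.

```python
# Toy expected-value comparison: why accuracy-only scoring rewards guessing.
# Assumed setup (illustrative only): 1 point for a correct answer, 0 for a
# wrong answer, 0 for abstaining with "I don't know".

def expected_score(p_correct: float, reward_right: float = 1.0,
                   reward_wrong: float = 0.0) -> float:
    """Expected score for answering when correct with probability p_correct."""
    return p_correct * reward_right + (1 - p_correct) * reward_wrong

ABSTAIN_SCORE = 0.0   # saying "I don't know" earns nothing
p_guess = 1 / 4       # blind guess on a four-option question

print(expected_score(p_guess))  # 0.25 > 0.0, so guessing beats abstaining

# Penalising wrong answers flips the incentive: with -1 for a mistake,
# a blind guess has negative expected value and abstaining wins.
print(expected_score(p_guess, reward_wrong=-1.0))  # -0.5 < 0.0
```

Under the first scoring rule a model that always guesses outscores one that admits uncertainty, which is the incentive the OpenAI preprint describes.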

Improvements have come from scaling up the size of LLMs — in terms of both the richness of their internal associations and the amount of data they are trained on, says Xu. But hallucinations are particularly prevalent in topics for which the model has scant training data or its underlying information is wrong, she says. Hallucinations can also happen when an AI tries to summarize or analyse papers that are too long for that model to process.

Eliminating hallucinations entirely is likely to prove impossible, says Mushtaq Bilal, a researcher at Silvi, a Copenhagen-based firm that makes an AI app to aid the creation of systematic reviews in science. “I think if it was possible, AI labs would have done it already.”

But reducing errors and getting a model to admit that it doesn’t know an answer have been “a pretty heavy focus” for OpenAI, says Saachi Jain, who manages the firm’s AI safety team. According to technical documents released with GPT-5, OpenAI concentrated on “training our models to browse effectively for up-to-date information”, as well as cutting hallucinations. The firm focused on reducing hallucinations in lengthy, open-ended responses to queries, because this best represents real-life use of ChatGPT, says Jain.

In one literature-review benchmark known as ScholarQA-CS, GPT-5 “performs well” when it is allowed to access the web, says Akari Asai, an AI researcher at the Allen Institute for Artificial Intelligence, based in Seattle, Washington, who ran the tests for Nature. In producing answers to open-ended computer-science questions, for example, the model performed marginally better than human experts, with a correctness score of 55% (based on measures such as how well its statements are supported by citations) compared with 54% for scientists, but just behind a version of the institute’s own LLM-based system for literature review, OpenScholar, which achieved 57%.

However, GPT-5 suffered when the model was unable to get online, says Asai. The ability to cross-check with academic databases is a key feature of most AI-powered systems designed to help with literature reviews. Without Internet access, GPT-5 fabricated or muddled half as many citations as one of its predecessors, GPT-4o, did. But it still got them wrong 39% of the time, she says.

On the LongFact benchmark, which tests accuracy in long-form responses to prompts, OpenAI reported that GPT-5 hallucinated 0.8% of claims in responses about people or places when it was allowed to browse the web, compared with 5.1% for OpenAI’s reasoning model o3. Performance dropped when browsing was not permitted, with GPT-5’s error rate climbing to 1.4% compared with 7.9% for o3. Both models showed worse performance than did the non-reasoning model GPT-4o, which had an error rate of 1.1% when offline.
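Claim-level figures of this kind come from splitting a long response into individual factual claims, checking each one, and reporting the fraction judged unsupported. A minimal sketch of that arithmetic, with hypothetical verdicts standing in for LongFact's actual fact-checking pipeline:

```python
# Minimal claim-level hallucination rate, in the spirit of LongFact-style
# scoring: each claim in a response gets a verdict (True = unsupported),
# and the reported rate is unsupported / total. The verdicts below are
# hypothetical placeholders for a real fact-checking step.

def hallucination_rate(verdicts: list[bool]) -> float:
    """Fraction of claims judged unsupported (True = unsupported)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# 1 unsupported claim out of 125 checked gives the 0.8% figure quoted above.
verdicts = [False] * 124 + [True]
print(f"{hallucination_rate(verdicts):.1%}")  # 0.8%
```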

On other independent evaluations — such as the Hughes Hallucination Evaluation Model, which is run by the AI platform Vectara in Palo Alto, California, and looks at how often an LLM makes false claims when summarizing a document — rival models such as Google’s Gemini 2.0 slightly outperformed GPT-5, although both erred less than 1.5% of the time.

Qazcontent International News Agency JSC 2026

2025 SI El.kz, Qazcontent JSC was registered on June 20, 2025. Certificate No. KZ14VPY00122482


All rights to materials published on the website are protected in accordance with the Law of the Republic of Kazakhstan "On Copyright and Related Rights". Any use of the website's materials requires a hyperlink to the resource.