Google and Stanford researchers launch AI fact-checking tool

Google DeepMind and Stanford University have introduced the Search-Augmented Factuality Evaluator (SAFE), a tool designed to fact-check long responses from AI chatbots. SAFE works in several steps: it segments a response into individual facts, revises each one, and compares it with Google search results. In a focused review of 100 facts on which SAFE and human annotators disagreed, SAFE was correct 76% of the time. The tool is also more than 20 times cheaper than manual annotation.

No matter how powerful current AI chatbots are, they share a much-criticised habit: giving users answers that sound convincing but are factually inaccurate. Put simply, AI sometimes ‘goes off the rails’ in its responses, or even ‘spreads rumours’. Preventing this behaviour in large AI models is a hard technical challenge. However, according to Marktechpost, Google DeepMind and Stanford University appear to have found a workaround.

The tool is based on the Search-Augmented Factuality Evaluator (SAFE)

Researchers have introduced the Search-Augmented Factuality Evaluator (SAFE), a tool built on large language models that can fact-check long responses generated by chatbots. Their research results, along with the experimental code and datasets, have now been made public.

The system analyses, processes, and evaluates chatbot responses in four steps to verify their accuracy: it segments the answer into individual facts for verification, revises each fact so that it can be checked on its own, compares it with Google search results, and checks the relevance of each fact to the original question.

Researchers created a dataset called LongFact to assess its performance

To assess its performance, the researchers compared SAFE’s verdicts with human annotations on approximately 16,000 facts. They also created a prompt set called LongFact and used it to test 13 large language models from four families: Claude, Gemini, GPT, and PaLM-2.
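The four-step check described earlier (segment, revise, check relevance, compare with search results) can be pictured with a small sketch. This is not the researchers' code: every function name here is hypothetical, the language-model steps are replaced with crude string rules, and the search engine is passed in as a plain function so the example runs offline. Checking relevance before searching is also an assumption of this sketch, made so that irrelevant facts never trigger a query.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    fact: str
    label: str  # "supported", "not_supported", or "irrelevant"


def split_into_facts(response: str) -> List[str]:
    # Stand-in for the segmentation step: treat each sentence as one fact.
    return [s.strip() for s in response.split(".") if s.strip()]


def revise_fact(fact: str, subject: str) -> str:
    # Stand-in for the revision step: replace a leading pronoun with the
    # subject so the fact can be checked without the surrounding text.
    for pronoun in ("It ", "He ", "She ", "They "):
        if fact.startswith(pronoun):
            return subject + " " + fact[len(pronoun):]
    return fact


def is_relevant(fact: str, subject: str) -> bool:
    # Stand-in for the relevance step: the fact must mention the subject.
    return subject.lower() in fact.lower()


def evaluate(
    response: str,
    subject: str,
    search: Callable[[str], List[str]],
) -> List[Verdict]:
    # `search` is a hypothetical callable returning text snippets for a query.
    verdicts = []
    for raw in split_into_facts(response):
        fact = revise_fact(raw, subject)
        if not is_relevant(fact, subject):
            verdicts.append(Verdict(fact, "irrelevant"))
            continue
        snippets = search(fact)
        # Crude comparison: a fact counts as supported only if some
        # snippet contains it verbatim (case-insensitive).
        supported = any(fact.lower() in s.lower() for s in snippets)
        verdicts.append(
            Verdict(fact, "supported" if supported else "not_supported")
        )
    return verdicts
```

In a real system each helper would be a call to a language model and `search` would query a search engine; the sketch only shows how the steps hand results to one another and how each fact ends up with its own label.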

The results show that, in a focused analysis of 100 facts on which SAFE and human annotators disagreed, SAFE’s judgment proved correct 76% of the time on further review. The framework also has an economic advantage: it is more than 20 times cheaper than manual annotation.
