- A recent development by Google DeepMind and Stanford University introduces the Search-Augmented Factuality Evaluator (SAFE), a tool designed to fact-check long responses from AI chatbots.
- SAFE works in several steps, splitting a response into individual facts, revising them, and checking each against Google Search results; in a review of contested facts, its judgments were correct 76% of the time.
- Beyond making factuality easier to measure in AI-generated responses, the approach has an economic advantage: it is more than 20 times cheaper than manual annotation.
However powerful today's AI chatbots have become, they share a much-criticised habit: giving users answers that sound convincing but are factually wrong. Simply put, AI sometimes 'goes off the rails' in its responses, even 'spreading rumours'. Preventing this behaviour in large AI models is a genuine technical challenge. According to the outlet Marktechpost, however, Google DeepMind and Stanford University appear to have found a workaround.
Also read: OpenAI’s GPT store fails to meet expectations
Also read: US federal agencies now required to have chief AI officer
The tool is called the Search-Augmented Factuality Evaluator (SAFE)
Researchers have introduced a tool built on large language models, the Search-Augmented Factuality Evaluator (SAFE), which can fact-check long responses generated by chatbots. Their findings, along with the experimental code and datasets, have now been made publicly available.
The system analyses and evaluates chatbot responses in four steps to verify their accuracy: it segments the answer into individual facts, revises each fact so it stands on its own, compares each fact against Google Search results, and finally checks whether each fact is relevant to the original question.
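A minimal sketch of such a pipeline is shown below, assuming two callables are available: `llm(prompt) -> str` for a language model and `search(query) -> str` for Google Search snippets. The function and field names are illustrative, not the authors' published code.

```python
# Hypothetical sketch of a SAFE-style fact-checking pipeline.
# `llm` and `search` are assumed helpers, not a real published API.
from dataclasses import dataclass

@dataclass
class FactVerdict:
    fact: str          # the self-contained factual claim
    supported: bool    # does search evidence back the claim?
    relevant: bool     # does the claim bear on the original question?

def safe_check(question: str, response: str, llm, search) -> list[FactVerdict]:
    """Fact-check a long chatbot response one claim at a time."""
    # Step 1: segment the response into individual factual claims.
    facts = [f for f in llm(f"List each individual fact in:\n{response}").splitlines() if f.strip()]

    verdicts = []
    for fact in facts:
        # Step 2: revise the claim so it stands on its own
        # (resolve pronouns and implicit context from the full response).
        fact = llm(f"Rewrite this fact so it is self-contained.\nResponse: {response}\nFact: {fact}")

        # Step 3: compare the claim against Google Search results.
        evidence = search(fact)
        supported = "supported" in llm(
            f"Fact: {fact}\nEvidence: {evidence}\nReply 'supported' or 'not supported'."
        ).lower()

        # Step 4: check whether the claim is relevant to the original question.
        relevant = llm(
            f"Question: {question}\nFact: {fact}\nIs the fact relevant? Reply yes or no."
        ).lower().startswith("yes")

        verdicts.append(FactVerdict(fact, supported=supported, relevant=relevant))
    return verdicts
```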
Researchers created a dataset called LongFact to assess its performance
To assess its performance, the researchers created a dataset called LongFact containing approximately 16,000 facts and tested the system on 13 large language models across the Claude, Gemini, GPT, and PaLM-2 families. In a focused analysis of 100 contested facts, SAFE's judgments proved correct 76% of the time upon further review. The framework also has an economic advantage: it is more than 20 times cheaper than manual annotation.
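As a rough illustration of how such an evaluation might be scored, the snippet below derives an accuracy figure from per-fact correctness flags and a cost ratio between human and automated annotation; the inputs are placeholders for illustration, not the paper's data.

```python
# Illustrative scoring of a SAFE-style evaluation. The flags and the
# per-fact costs below are assumed example values, not reported figures.
def score(correct_flags: list[bool], human_cost: float, auto_cost: float):
    accuracy = sum(correct_flags) / len(correct_flags)   # share of correct judgments
    cost_ratio = human_cost / auto_cost                  # how much cheaper automation is
    return accuracy, cost_ratio

# Example: 76 of 100 reviewed judgments correct, and human annotation
# assumed to cost 20x more per fact than the automated pipeline.
acc, ratio = score([True] * 76 + [False] * 24, human_cost=2.00, auto_cost=0.10)
print(f"accuracy={acc:.0%}, human/automated cost ratio={ratio:.0f}x")
```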