As generative AI services such as OpenAI’s ChatGPT, Microsoft’s Bing Chat, and Google Bard are increasingly used as alternatives to search engines, they are also encountering resistance from individuals and companies who don’t want their website data to be used for AI model training.
Large language models are trained on a variety of data, much of which appears to have been collected without anyone’s knowledge or consent. This week, Google announced a new way for website publishers to choose whether to allow its Bard and Vertex AI services to use their content, or to opt out of having it used to train those models.
How to Ban AI From Crawling Your Site
In a recent blog post, Google’s VP of Trust, Danielle Romain, acknowledged that web publishers want greater choice and control over how their content is used for emerging generative AI use cases. To address this concern, Google is allowing web publishers to disallow the “Google-Extended” user agent in their site’s robots.txt file. This simple step prevents Google from using the publisher’s content for AI training purposes, without affecting how the site is crawled and indexed for Search.
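As a sketch of what that looks like in practice, the snippet below builds the two-line robots.txt rule described above and checks it with Python’s standard-library robots.txt parser. The site URL is a placeholder, and the assumption (consistent with how robots.txt works generally) is that a rule targeting Google-Extended leaves other crawlers untouched:

```python
# Sketch: a robots.txt that opts out of Google-Extended (Bard / Vertex AI
# training) while leaving ordinary crawling untouched. Verified here with
# Python's built-in robots.txt parser; example.com is a placeholder.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Google-Extended is blocked site-wide by the Disallow rule...
print(parser.can_fetch("Google-Extended", "https://example.com/article"))  # False
# ...while other crawlers fall through to the catch-all Allow rule.
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Because the rule names only the Google-Extended agent, a publisher can opt out of AI training without losing search visibility.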
Websites can already use robots.txt to declare which crawlers may or may not access their content, and Google argues that all AI model providers should offer this same kind of transparency and control. As AI applications multiply, however, websites will face growing complexity in managing different uses at scale. Google said it would share more details as soon as possible.
Google’s Intent in Question
While Google claims to develop its AI model in an ethical and inclusive way, there is a fundamental difference between indexing the web and using data for AI training. The data collected from web publishers is used as raw material to train machine learning models, making them more accurate and powerful over time.
It is important to recognize the role of consent in the collection of AI training data. Giving web publishers the option to contribute to AI models is a positive step. However, Google has already collected huge amounts of data to train its models without users’ explicit consent. This raises questions about the sincerity of Google’s new focus on consent and ethical data collection.
The reality is that Google had unfettered access to web data and used it to train its AI models long before seeking permission from web publishers. If ethical data collection and consent were truly a top priority for Google, this option would have been available years ago.
Ethical Data Sourcing Might Still Have a Long Way to Go
Clearly, the tech industry needs to address the ethical implications of AI training and data collection. While Google’s move to give web publishers control over their content is a step in the right direction, it covers only one company’s services; a broader, industry-wide solution is still needed.
Overall, Google’s decision to allow web publishers to control how their content is used for AI training is a positive development. However, it is important to recognize that this choice comes after Google has already collected and used vast amounts of data without explicit consent.
The entire tech industry needs to prioritize ethical data collection and consent, and work toward more comprehensive solutions that address the concerns of web publishers and users alike.