Tech giants used unauthorised data to train AI models

Some tech giants allegedly used YouTube transcripts without permission to train AI models.
The legality of training AI on unauthorised datasets remains unsettled, which could hinder future AI development.

OUR TAKE
The development of AI technology is certainly promising, but its creation and advancement are built on training data. The lack of transparency about where that data comes from is bound to cause controversy. The affected parties and the accused companies often hold conflicting views, with no definitive resolution in sight. The situation hangs over the industry like the sword of Damocles; if it is not addressed, it is likely to hinder the continued development of AI.
- Yasmine Luo, BTW reporter

What happened?

Some major tech companies are accused of using YouTube transcripts without authorization to train their AI models.

According to a new investigation by Proof News, EleutherAI, a nonprofit organisation, created a dataset containing transcripts from over 48,000 YouTube channels, including content from prominent creators like Marques Brownlee and MrBeast, as well as major publishers like The New York Times, the BBC, and ABC News. The investigation found that Apple, NVIDIA, Anthropic, and other large tech companies used this dataset to train their AI models.

Neal Mohan, CEO of YouTube, has previously said that companies using YouTube’s data to train AI models would violate the platform’s terms of service.

Marques Brownlee, a famous YouTuber, posted on social media, “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine. Apple technically avoids ‘fault’ here because they’re not the ones scraping. But this is going to be an evolving problem for a long time.”

Apple, NVIDIA, Anthropic, and EleutherAI have not yet commented on the matter.

Why it’s important

The rapid growth of AI models, while promising to shape the future, has raised numerous unresolved legal questions, and the recent accusations against tech giants add to them. AI developers have long been criticised for not disclosing what their models are trained on. If training data is not appropriately sourced, there is a risk of copyright or database right infringement.

However, it remains unclear whether the companies involved will face legal action. The Verge spoke with lawyers, analysts, and employees at AI startups and found opinions divided on the issue.

“I see people on both sides of this extremely confident in their positions, but the reality is nobody knows,” says Andy Baio, an AI observer.

Although the affected companies and individuals argue that the practice is illegal, the accused companies’ silence so far suggests their demands are unlikely to be addressed soon.

If the issue remains unresolved, it may eventually hinder the continued development of AI technology.
