To train GPT-4, OpenAI transcribed over a million hours of YouTube footage

Google says it has seen unconfirmed reports of OpenAI’s activity on YouTube and notes that its robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of content.
Meta has run short of good training data as well, and privacy changes made after the Cambridge Analytica scandal restrict how it can use consumer data. To catch up to OpenAI, the company considered paying for book licenses or buying a publisher.

The Wall Street Journal reported earlier this week that AI companies are hitting a roadblock in collecting high-quality training data. The New York Times detailed some of the ways companies have dealt with this problem.

OpenAI needs training data

Desperate for training data, OpenAI developed its Whisper audio transcription model and used it to transcribe more than a million hours of YouTube videos to train GPT-4, its most advanced large language model. According to The New York Times, the company knew this was legally questionable but believed it to be fair use. OpenAI spokesperson Lindsay Held told The Verge that the company curates “unique” datasets for each of its models to “help them understand the world” and keep its research globally competitive.

According to the Times piece, the company ran out of useful data in 2021 and discussed transcribing podcasts, audiobooks, and YouTube videos as a fallback. By that time, OpenAI had already trained its models on content from Quizlet, a database of chess games, and computer code from GitHub.

Google’s response

Google spokesman Matt Bryant told The Verge in an email that the company had “seen unconfirmed reports,” adding that “both our robots.txt file and terms of service prohibit unauthorised scraping or downloading of YouTube content,” mirroring the company’s terms of use. Bryant said Google takes “technical and legal measures” to prevent such unauthorised use “when we have a clear legal or technical basis to do so.”

Google’s legal department asked the company’s privacy team to adjust its policy language to expand what it could do with consumer data, such as data from office tools like Google Docs, the Times writes. The new policy was reportedly released on July 1 deliberately, to take advantage of the distraction of the Independence Day holiday weekend.

Meta’s response

Meta has similarly run up against the limited availability of good training data. In recordings heard by the Times, its AI team discussed using copyrighted works without permission as it tried to catch up with OpenAI. The company considered measures such as paying for book licences or even acquiring a major publisher outright. Privacy changes the company made in the wake of the Cambridge Analytica scandal have also apparently limited how it can use consumer data.

Google, OpenAI, and the rest of the AI industry are grappling with a rapidly shrinking supply of training data for their models, which improve as they absorb more data. The Journal wrote this week that companies could outpace the production of new content by 2028.

The Journal mentions possible solutions to the data shortage, including training models on synthetic data or using curriculum learning, in which models are fed high-quality data in a deliberate order. However, neither method is proven. The other option is for companies to use whatever they can find, with or without permission, but that invites litigation.
