To train GPT-4, OpenAI transcribed more than a million hours of YouTube footage
- Google says it has seen unconfirmed reports of OpenAI’s activity on YouTube, and notes that both its robots.txt file and its Terms of Service prohibit unauthorised scraping or downloading of content.
- Meta has also run short of usable training data. The company considered paying for book licences or buying a publisher outright to catch up with OpenAI, and privacy changes made after the Cambridge Analytica scandal restrict how it can use consumer data.
The Wall Street Journal reported earlier this week that AI companies are hitting a wall in collecting high-quality training data. The New York Times detailed some of the ways companies are dealing with the problem.
OpenAI needs training data
Desperate for training data, OpenAI developed its Whisper audio transcription model and used it to transcribe more than a million hours of YouTube video to train GPT-4, its most advanced large language model. According to The New York Times, the company knew the practice was legally questionable but believed it qualified as fair use. OpenAI spokesperson Lindsay Held told The Verge that the company curates “unique” datasets for each of its models to “help them understand the world” and to keep its research globally competitive.
According to the Times piece, the company exhausted its supply of useful data in 2021 and discussed transcribing podcasts, audiobooks, and YouTube videos after running through its other sources. By that time, OpenAI had already trained its models on data that included content from Quizlet, databases of chess moves, and computer code from GitHub.
Google’s response
Google spokesman Matt Bryant told The Verge in an email that the company had “seen unconfirmed reports,” adding that “both our robots.txt file and terms of service prohibit unauthorised scraping or downloading of YouTube content,” mirroring the company’s terms of use. Bryant said Google takes “technical and legal measures” to prevent such unauthorised use “when we have a clear legal or technical basis to do so.”
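For readers unfamiliar with robots.txt, the file is a plain-text list of rules that tells automated crawlers which paths on a site they may request. A minimal sketch of how a well-behaved crawler might consult such rules, using Python’s standard `urllib.robotparser` module, is shown here. The rules, the `example-bot` user agent, and the `example.com` URLs are all invented for illustration; they are not YouTube’s actual robots.txt and say nothing about how any company’s crawler really works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration only.
# They are NOT YouTube's actual file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /watch
Allow: /about
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-bot" and example.com are placeholder names (assumptions).
allowed_about = is_allowed(EXAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/about")
allowed_watch = is_allowed(EXAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/watch?v=abc")

print("about page allowed:", allowed_about)
print("watch page allowed:", allowed_watch)
```

Compliance with robots.txt is voluntary on the crawler’s side, which may be why Google points to its terms of service and to “technical and legal measures” as well.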
Google’s legal department asked the company’s privacy team to adjust its policy language to expand what it could do with consumer data, including data from office tools such as Google Docs, the Times writes. The new policy was reportedly released on July 1 deliberately, to take advantage of the distraction of the Independence Day holiday weekend.
Meta’s response
Meta has similarly run up against the limited availability of good training data. In recordings heard by The Times, its AI team discussed using copyrighted works without permission as it tried to catch up with OpenAI. The company considered measures such as paying for book licences or even acquiring a major publisher outright. Privacy changes the company made in the wake of the Cambridge Analytica scandal have also apparently limited the ways it can use consumer data.
OpenAI, Google, and the wider AI industry are contending with a rapidly shrinking supply of fresh training data, even though their models improve the more data they absorb. The Journal wrote this week that by 2028, companies could be consuming data faster than new content is produced.
The Journal describes possible solutions to the data shortage, including training models on synthetic data generated by other models, or curriculum learning, in which high-quality data is fed to a model in a deliberate order. Neither method is proven, however. The alternative is for companies to use whatever they can find, with or without permission, but that path is fraught with litigation.





