Trends
Who is selling your data to train AI?
The use of scraped data from the internet has become a contentious issue, with companies harnessing public content to train their powerful generative models. This practice has sparked legal battles, as organizations like The New York Times and Getty Images have raised concerns about the unauthorized…

Context
Scraping data from the internet has become contentious as companies harness public content to train their generative models. The practice has sparked legal battles, with organizations such as The New York Times and Getty Images objecting to the unauthorized use of their content. In one prominent case, The New York Times is suing OpenAI for allegedly using the newspaper’s archives without permission to train chatbots. In response, OpenAI has accused The Times of resorting to questionable tactics to prove its claims. Similarly, Getty Images has taken legal action against Stability AI, the company behind Stable Diffusion, alleging copyright infringement over the use of its visual content.
Analysis
The implications of AI systems leveraging the work of journalists, musicians, and photographers extend beyond legal disputes. The demand for vast amounts of training data has raised concerns about the exploitation of online content creators. Platforms such as Tumblr and WordPress.com have reportedly been in talks to sell user data to AI companies including OpenAI and Midjourney, raising questions about data privacy and ownership.
While some organizations have opted for litigation, others have chosen to forge partnerships. The Associated Press has licensed a portion of its archives to OpenAI, and Shutterstock signed a six-year deal giving OpenAI access to its library of photos, videos, and music. Reddit, known for its wealth of user-generated content, recently struck a deal with Google that grants the company access to its API for AI model training. These deals underscore both the value of user contributions to platforms and the ethical questions surrounding how that data is used.
Key Points
- Tumblr and WordPress.com are reportedly in talks to sell user data to AI firms such as OpenAI and Midjourney.
- The New York Times is suing OpenAI for allegedly using its archives without permission to train chatbots.





