- Tumblr and WordPress.com are currently in discussions to provide user data to AI firms like OpenAI and Midjourney.
- The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots.
The use of data scraped from the internet has become a contentious issue as companies harness public content to train their generative models. The practice has sparked legal battles, with organizations like The New York Times and Getty Images challenging the unauthorized use of their content.
Legal battles over data usage
One prominent case involves OpenAI, which The New York Times is suing for allegedly using the newspaper’s archives without permission to train chatbots. In response, OpenAI has accused The Times of resorting to questionable tactics to prove its claims. Similarly, Getty Images has taken legal action against Stability AI, the company behind Stable Diffusion, for copyright infringement related to the use of its visual content.
The implications of AI systems leveraging the work of journalists, musicians, and photographers extend beyond legal disputes. The quest for vast amounts of training data has led to concerns about the potential exploitation of online content creators. Platforms like Tumblr and WordPress.com have reportedly been in talks to sell user data to AI companies like OpenAI and Midjourney, raising questions about data privacy and ownership.
Partnerships in data sharing
While some entities have opted for litigation, others have chosen to forge partnerships. The Associated Press has licensed a portion of its archives to OpenAI, while Shutterstock inked a six-year deal with the AI company to provide access to its extensive library of photos, videos, and music.
Reddit, known for its wealth of user-generated content, recently struck a deal with Google, granting the tech giant access to its API for AI model training. This move underscores the value of user contributions to platforms and the ethical considerations surrounding data usage.
Widespread data training practices
The practice of training AI models on public internet data extends well beyond the specific deals described above. A recent investigation by The Washington Post uncovered a trove of scraped data from various sources, including online forums, crowdfunding platforms, and social media sites. Companies like Meta, formerly Facebook, have also leveraged public posts from their platforms to enhance AI capabilities.
The debate over data ownership and consent remains unresolved. Content creators, whether on niche blogs or popular social media platforms, face the prospect of their work being commodified for AI training purposes. The balance between innovation and ethical data practices is crucial in shaping the future of AI development and its impact on digital ecosystems.