OpenAI Data Partnerships for Global AI Training

OpenAI’s “Data Partnerships” program aims to reduce Western-centric biases in AI by creating diverse, global datasets.
The initiative focuses on incorporating varied languages and cultural data to address existing biases in AI models.
OpenAI faces criticism and legal issues for potentially using personal and creative works in AI training without authorization.

OpenAI has announced a “Data Partnerships” initiative that aims to expand the diversity of AI training data beyond the Western-centric norm. Under the program, OpenAI will partner with organizations to build public and private datasets for AI model training.

Source: OpenAI Data Partnerships (https://openai.com/blog/data-partnerships)

Addressing data bias in AI

The initiative comes in response to the prevalent issue of data bias in AI. Traditional AI models have shown a significant skew towards data from Western countries, particularly in image databases. This bias is attributed to the overrepresentation of Western imagery on the internet, resulting in AI models that inadvertently amplify these biases, potentially leading to harmful outcomes.

OpenAI’s Data Partnerships aim to rectify this by gathering extensive datasets that more accurately reflect global human society. These datasets will focus on capturing human intent through diverse formats like extensive writings or dialogues across various languages and subjects. This broader dataset will assist AI models in achieving a deeper understanding of diverse subjects, industries, cultures, and languages.

Public and private data collection

The program will work across multiple modalities, including images, audio, and video, prioritizing data that represents human intentions, such as long-form writing or conversations. To ensure the data’s integrity, OpenAI plans to use tools like Optical Character Recognition and Automatic Speech Recognition for digitization, while also being mindful of removing sensitive or personal information.

OpenAI plans to develop two types of datasets. The first type is open-source datasets, which will be freely available for AI training purposes. The second type is private datasets, tailored for organizations wanting to maintain data confidentiality while allowing OpenAI models to better understand their specific domains.

Collaborations and controversies

The company has already embarked on partnerships to enhance its AI capabilities. Collaborations with the Icelandic government and Miðeind ehf have improved the Icelandic language proficiency of GPT-4. Similarly, a partnership with the Free Law Project has enhanced the model’s understanding of legal documents.

Despite the seemingly altruistic nature of this initiative, OpenAI faces criticism for potential commercial motives. The approach of improving OpenAI’s models, potentially at the expense of others without fair compensation, has sparked controversy. Recent legal actions against OpenAI and Microsoft by creators and authors have highlighted issues regarding the unauthorized use of their works for training AI models, raising questions about ethical data use and compensation in the AI industry.