    AI Crisis caused by Data Exhaustion: How to Save an Impending Model Collapse

By Bal Marsius | July 19, 2023 (Updated: September 7, 2023)

OpenAI’s ChatGPT has gone viral in less than a year and is already reshaping work patterns and the future of the industry. Within some of the world’s leading companies, as many as half of employees already use this type of technology daily. Countless companies have invested in AI, racing to launch new products, particularly in internet services, education, gaming, and other growth industries.

It is well known that the data used to train large language models (LLMs) and the other transformer models behind products such as ChatGPT, Stable Diffusion, and Midjourney originally came from human sources. These sources include books, articles, photographs, and other works that are entirely human creations.

The parameter counts of large models keep growing, from billions to tens of billions to hundreds of billions, and the amount of data required to train them grows just as dramatically. Taking OpenAI’s GPT series as an example, from GPT-1 to GPT-3 the training dataset expanded from 4.5GB to 570GB.

Not long ago, at Databricks’ Data + AI Summit, Marc Andreessen, co-founder of a16z, argued that the massive data accumulated on the internet over the past two decades is a key reason for the rise of the new wave of AI. He sees this data as an excellent source of learning material for AI training.

However, despite the huge amount of data, useful and useless alike, that netizens have left on the web, the supply may be about to bottom out for AI training.

A paper published by Epoch, an artificial intelligence research and forecasting organization, predicts that high-quality text data will run out between 2023 and 2027.

While the research team acknowledges that its analytical methods have serious limitations and its projections carry high uncertainty, it is hard to deny that AI is consuming datasets at an alarming rate.
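Epoch’s projection can be reproduced in spirit with a toy stock-versus-consumption model. All figures below (total token stock, starting consumption, growth rate) are illustrative assumptions, not Epoch’s actual estimates:

```python
def exhaustion_year(stock, usage, growth, start_year):
    """Return the first year in which cumulative data consumption
    would exceed the available stock of high-quality text."""
    year, consumed = start_year, 0.0
    while consumed + usage <= stock:
        consumed += usage   # this year's training runs use `usage` tokens
        usage *= growth     # demand grows each year
        year += 1
    return year

# Hypothetical figures: 5e14 tokens of stock, 1e13 tokens consumed in
# 2023, with demand roughly doubling each year.
print(exhaustion_year(stock=5e14, usage=1e13, growth=2.0, start_year=2023))
```

With these made-up numbers the stock runs out in 2028; the point is not the specific year but how quickly exponentially growing demand exhausts even a very large fixed stock.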

Recently, researchers from the University of Cambridge, the University of Oxford, the University of Toronto, and other universities published an article showing that training AI on AI-generated content can cause new models to collapse.

The researchers concluded: “Learning from data generated by other models leads to model collapse – a degenerative process in which the model forgets the true underlying data distribution over time. This process is inevitable, even under near-ideal long-term training conditions.”

Why does training AI on “generated data” cause models to collapse, and is there any way to prevent it?

At this stage, AI remains a primitive imitation of human thinking; at its core it is still a statistical program. Researchers believe that training AI on AI-generated content produces “statistical approximation error”: during estimation, high-probability content is further reinforced while low-probability content is progressively discarded, and this is the main cause of model collapse.
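That reinforcement-and-discard dynamic can be demonstrated with a toy simulation, not the researchers’ actual experiment: repeatedly re-estimate a categorical distribution from its own samples and watch rare categories disappear.

```python
import random
from collections import Counter

def resample_generations(probs, n_samples, generations, rng):
    """Iteratively re-estimate a categorical distribution from its own
    samples. Any category that goes undrawn in one generation gets
    probability zero and can never return -- a toy illustration of the
    'statistical approximation error' behind model collapse."""
    current = dict(probs)
    for _ in range(generations):
        cats = list(current)
        weights = [current[c] for c in cats]
        draws = rng.choices(cats, weights=weights, k=n_samples)
        counts = Counter(draws)
        # keep only categories that were actually drawn this generation
        current = {c: counts[c] / n_samples for c in cats if counts[c]}
    return current

rng = random.Random(0)
start = {"common": 0.70, "medium": 0.25, "rare": 0.05}
end = resample_generations(start, n_samples=50, generations=30, rng=rng)
print(end)  # low-probability categories tend to vanish over generations
```

Each generation plays the role of a model trained on the previous model’s output: the estimate stays normalized, but the tails of the original distribution are steadily forgotten.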

Model collapse affects the performance, reliability, and security of models, and the researchers warn that it is a serious phenomenon requiring the attention of LLM developers and users. “We believe this problem will become one of the major challenges for the machine learning community in the next few years,” they said.

But not all hope is lost.

The first approach is data isolation. To address model collapse, the research team suggests keeping clean, human-generated data sources separate from AI-generated content (AIGC) to prevent the clean data from being contaminated.
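A minimal version of this isolation is a provenance label checked when the training corpus is assembled. The record schema below is invented for illustration:

```python
# Each training record carries a provenance label assigned at
# collection time; the schema here is a hypothetical example.
corpus = [
    {"text": "A hand-written essay on internet governance.", "source": "human"},
    {"text": "A model-generated product blurb.", "source": "ai"},
    {"text": "A digitised 1998 newspaper archive.", "source": "human"},
]

def human_only(records):
    """Keep only records whose provenance is marked human, so that
    AI-generated content never enters the clean training set."""
    return [r for r in records if r["source"] == "human"]

clean = human_only(corpus)
print(len(clean))  # 2 of the 3 records survive the isolation filter
```

The hard part in practice is assigning the `source` label reliably; the filter itself is trivial once provenance is tracked.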

The second is the use of synthetic data. In fact, data generated specifically for AI training is already widely used, and for some practitioners the concern that AI-generated data will lead to model collapse may be overblown. The key is to establish an effective system that confirms which parts of the AI-generated data are valid and feeds that judgment back based on how well the trained model performs. Training on synthetic data, as OpenAI does, has become a consensus within the AI industry.
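One way to confirm the valid part of synthetic data is a scoring gate between the generator and the training set. The quality heuristic below (length and vocabulary diversity) is a stand-in for whatever validator a real pipeline would use:

```python
def quality_score(text):
    """Toy validator: reward longer samples with a diverse vocabulary.
    A real pipeline would use a learned or task-specific validator."""
    words = text.split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)
    length_bonus = min(len(words) / 10, 1.0)
    return diversity * length_bonus

def filter_synthetic(samples, threshold=0.5):
    """Admit only synthetic samples that pass the quality gate."""
    return [s for s in samples if quality_score(s) >= threshold]

samples = [
    "the the the the the",  # repetitive: low diversity, rejected
    "synthetic data can complement human text when carefully validated",
]
kept = filter_synthetic(samples)
print(len(kept))  # only the diverse, longer sample passes the gate
```

Closing the loop means adjusting the threshold, or the validator itself, according to how models trained on the admitted samples actually perform.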

In conclusion, despite the looming depletion of human data, AI training is not without solutions. Through data isolation and the careful use of synthetic data, model collapse can be mitigated and the continued development of AI ensured.
