Sora won’t replace humans, and here’s why

  • Sora is a video generation model released by OpenAI, capable of generating realistic videos based on input text prompts, sparking widespread attention and discussion.
  • While Sora represents a significant milestone in the field of artificial intelligence, the notion of completely replacing humans or disrupting reality is still premature, with the key challenge lying in constructing accurate and generalisable world models.
  • AI is a tool, a technology created by humans to aid in problem-solving. As it stands, as long as AI lacks self-awareness, it won’t possess “originality,” merely organising existing human knowledge.

OpenAI released the text-to-video model Sora in the early hours of February 16, sending shockwaves through tech and media circles. The strikingly realistic videos generated by Sora, published on the OpenAI website, immediately circulated widely online. With Sora, one only needs to input a text prompt to obtain a video of up to 60 seconds, containing intricately detailed scenes, lively character expressions, and complex camera movements, almost indistinguishable from reality. Netizens marvelled at AI revolutionising industries like film, short video, and gaming, with some even exclaiming that ‘the real world no longer exists!’ The prospect of humans being replaced by AI seemed closer than ever.

This event made us marvel at the new wave of technological change sparked by Sora. It may soon lower the barrier for ordinary people to produce videos: complex shooting and editing work could be bypassed, and people’s imagination and creativity would become the most essential sources of competitiveness in video content. ‘One-person companies’ and very small teams could then complete films and video content that previously required enormous manpower and cost. The technological wave brings admiration and anticipation, but also fears of being replaced and displaced.


Sora does not understand the physical world and lacks a ‘world model’

However, in recent days, I have observed that what scientists and industry insiders at the technological forefront are discussing most is the question of Sora’s ‘world model.’ Videos generated by Sora are extremely lifelike and coherent, some almost indistinguishable from those created by humans. This is no small feat: it requires a machine to understand the structure, details, motion trajectories, and changes of light and shadow in the real world, without violating human cognition. Some believe this means Sora understands the physical world and possesses the embryonic form of a ‘world model.’

An AI’s world model can be seen as its mental model, reflecting the system’s understanding of and expectations about itself and the external world. Take the human world model as an example. The term ‘model’ implies that the knowledge we hold is not stored as a pile of facts but organised in a structure that reflects the world and everything it contains. We do not remember a series of facts about each item; instead, we construct countless models in our brains, such as models of ‘city gates’ and ‘hipbone axes,’ each with its own shape, arrangement, and way its parts move and work together. To recognise something, we know its appearance and texture; to achieve a goal, we understand how things in the world typically behave when interacting with us, such as what kind of bite marks an apple bears once bitten.

However, many scientists believe that Sora does not understand the physical world and lacks a ‘world model.’

Turing Award winner Yann LeCun believes that generating realistic videos based solely on prompts does not necessarily indicate a model’s understanding of the physical world; the process of video generation is entirely different from causal predictions based on a world model.

Francois Chollet, the author of the deep learning framework Keras and a Google AI researcher, suggests that models like Sora may indeed embed a ‘physical model,’ but the question is: Is this physical model accurate? Can it generalise to new situations, rather than merely interpolating training data?

Sora-generated videos do exhibit flaws: in a POV shot of ants crawling in a nest, close examination shows the ants have only four legs; in one video, a person runs on a treadmill facing the wrong direction; and in a video generated from the prompt ‘a large duck walks across the streets of Boston,’ the duck steps on a person.

Nvidia senior research scientist Jim Fan suggests two possible explanations for this issue: (1) The model may lack an understanding of physics, merely assembling image pixels randomly, or (2) The model attempts to construct an internal physics engine, but its performance is subpar.

Industry insiders believe that Sora takes a ‘brute force’ approach: leveraging vast amounts of data, large models, and considerable computational power, it builds its text-to-video capability on world models already validated in gaming, autonomous driving, and robotics, enabling it to simulate the world.

However, this is akin to learning the laws of the world through extensive ‘image reading.’ The approach is reasonable, but it cannot learn laws that must be deduced from physics, such as Newton’s laws.

Ultimately, humans did not invent aeroplanes by mimicking birds but by understanding aerodynamics. Sora indeed marks another milestone in AI, promising to greatly simplify human labour, reduce human ‘tool-like’ attributes, and assist or partially assume certain tasks. However, true human replacement or reality disruption seems premature.


Pop quiz

How long can Sora-generated videos last?

A. 60 seconds

B. 2 minutes

C. 4 minutes

D. 10 minutes

The correct answer is at the bottom of the article. 

AIGC could be a strong tool for highly original content creators

The further development of AIGC (including but not limited to Sora) will drive the reshuffling process towards a direction more favourable to diversity. We might use a highly simplified analytical model to divide the capabilities of internet-native content creators into two directions. First, hotspot sensitivity, referring to the ability to chase after hot topics and trends. Undoubtedly, at any given time, the majority of social media traffic is concentrated on very few hot topics. The ability to grasp these hot topics determines the creator’s short-term explosiveness, or in trendier terms, their ‘viral potential.’ Second, content tonality, pertaining to the uniqueness and irreplaceability of content. Some creators’ content is unforgettable, bearing distinct personal imprints that no competitor can imitate. Whether they possess enough irreplaceable tonality determines the creator’s endurance, or what we might call ‘sustainability’ or ‘fan stickiness.’

AIGC will benefit niche content creators who excel in content tonality and gradually gain popularity, while it will disadvantage those who thrive on capturing hot topics for short-lived trends. In the era of AIGC, chasing hot topics will no longer be a core competitive advantage for content creators as the threshold for doing so will decrease. Consequently, the importance of content tonality will rise further, potentially becoming the sole winning card.

The timely coverage of hot topics will primarily be the task of AI, with the main competition being the efficiency of AIGC, making it difficult for anyone to stand out. However, for content creators whose core competitiveness lies in their tonality, AIGC can become a powerful new weapon. Internet users still have a natural tendency to chase after hot topics, but what they will increasingly demand is not timely content but rather distinctive interpretations or in-depth analyses.

This mirrors football enthusiasts, who have shifted their focus from rapid, comprehensive news coverage to in-depth match analysis and interactive, entertaining programmes. High-quality niche creators can collaborate with AI: the former focus on tonality, the so-called ‘flashes of inspiration,’ while the latter handles the repetitive ‘grunt work’ of the content industry.

AI still has a long way to go

Sora is a game-changer. Knowing how Hollywood works, they will most certainly try to use it to replace jobs. But it is a tool, and some people will grab onto this and run with it, and some won’t. I have yet to see any AI full of human emotion; it’s all pretty creepy at the moment. It doesn’t become a full threat until it can make humans feel.

Lee Romaire, creative producer and Emmy-winner

AI is a tool, a technological means created by humans to solve problems. As it stands, as long as AI does not develop self-awareness, it will not possess ‘originality’ but will merely collect and organise existing human knowledge. Even advanced generative AI like ChatGPT is no exception.


OpenAI has already disclosed Sora’s technical details, revealing that its technological roadmap inherits from the previously published text-to-image model. While there are some innovations, they are not revolutionary. At least in the current environment, Sora is unlikely to produce true ‘originality,’ and its efficiency and persuasiveness in generating videos still heavily rely on individual user ‘training.’

The correct answer is A.


Chloe Chen

Chloe Chen is a junior writer at BTW Media. She graduated from the London School of Economics and Political Science (LSE) and has worked in the finance and fintech industries.
