China’s Baichuan Intelligence Unveils Open Source Language Model

Baichuan Intelligence, a startup founded by Wang Xiaochuan, the founder of Sogou, has
introduced its next-generation large language model Baichuan-13B. Wang, a computer science
prodigy from Tsinghua University, aims to establish China’s version of OpenAI. Baichuan is
considered one of China’s most promising developers in the field of large language models
(LLMs). The model, based on the Transformer architecture like OpenAI’s GPT, has 13 billion
parameters and is trained on Chinese and English data. Baichuan-13B is open source and
optimised for commercial applications.
Training Data Comparable to GPT-3.5
Baichuan-13B is trained on 1.4 trillion tokens, surpassing Meta’s LLaMA, which uses 1 trillion
tokens for its 13 billion-parameter model. Wang has said he intends to release a large-scale
model comparable to OpenAI’s GPT-3.5 by the end of this year. Baichuan has moved quickly,
expanding its team to 50 people by the end of April and launching its first LLM, Baichuan-7B,
in June.
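To put the token counts above in proportion, the short sketch below works through the arithmetic. It uses only the figures quoted in this article; the function names are illustrative and not part of any Baichuan or Meta tooling.

```python
# Parameter and training-token counts as quoted in the article.
MODELS = {
    "Baichuan-13B": {"params": 13e9, "tokens": 1.4e12},
    "LLaMA-13B": {"params": 13e9, "tokens": 1.0e12},
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def relative_increase(new, base):
    """Fractional increase of `new` over `base`."""
    return (new - base) / base


for name, m in MODELS.items():
    ratio = tokens_per_param(m["params"], m["tokens"])
    print(f"{name}: about {ratio:.0f} training tokens per parameter")

extra = relative_increase(
    MODELS["Baichuan-13B"]["tokens"], MODELS["LLaMA-13B"]["tokens"]
)
print(f"Baichuan-13B was trained on {extra:.0%} more tokens than LLaMA-13B")
```

A larger token count does not by itself imply a better model; the ratio is only a rough way to compare how much data each model saw relative to its size.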
Baichuan-13B is now available for free to approved academics and developers who wish to use
it for commercial purposes. Notably, the model offers variations that can run on consumer-
grade hardware, addressing the constraints posed by U.S. AI chip sanctions on China.
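The article does not say how the lighter variants are produced. Assuming they are lower-precision (quantized) copies of the same weights, which is a common way to fit a model onto consumer GPUs, a rough estimate of the memory needed just to hold 13 billion weights looks like this. The precision formats listed are assumptions for illustration, not details from the article.

```python
PARAMS = 13e9  # parameter count quoted in the article

# Bits per weight for common storage formats (assumed, not from the article).
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gib(n_params, bits):
    """Memory for the weights alone, in GiB.

    Ignores activations, the attention cache, and framework overhead,
    so real usage will be higher.
    """
    return n_params * bits / 8 / 2**30


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt}: roughly {weight_memory_gib(PARAMS, bits):.1f} GiB for weights")
```

Under these assumptions, halving the bits per weight halves the weight memory, which is why reduced-precision variants matter when high-end accelerators are hard to obtain.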
Baichuan-7B is an open-source, large-scale pre-trained language model developed by Baichuan
Intelligent Technology. Based on the Transformer architecture, it has 7 billion parameters and
was trained on 1.2 trillion tokens. It supports both Chinese and English.
High Performance Scores Across the Board
Baichuan-7B leads models of similar scale on standard Chinese and English benchmarks,
including C-Eval and MMLU.
The model consistently outperforms others with a similar parameter count and is presented as
the strongest natively pre-trained model for Chinese language comprehension. On AGIEval,
Baichuan-7B scores 34.4 points, well ahead of other open-source models including LLaMA-7B,
Falcon-7B, Bloom-7B, and ChatGLM-6B.
On C-Eval, Baichuan-7B scores 42.8 points, ahead of ChatGLM-6B’s 38.9 points. On the Gaokao
evaluation it scores 36.2 points, the highest among pre-trained models of comparable
parameter scale.
AGIEval, a benchmark from Microsoft Research, is designed to assess the cognitive and
problem-solving abilities of foundation models. C-Eval, created jointly by Shanghai Jiao Tong
University, Tsinghua University, and the University of Edinburgh, is a comprehensive
evaluation suite for Chinese language models covering 52 subjects across a range of
disciplines.
The Gaokao benchmark, created by a research team at Fudan University, uses questions from
the Chinese college entrance examination as its dataset to test large models’ Chinese
language comprehension and logical reasoning.
Baichuan-7B also performs well in English. On MMLU it scores 42.5 points, surpassing both the
English open-source pre-trained model LLaMA-7B and the Chinese open-source model ChatGLM-6B
by significant margins.
The training corpus is a key factor in large-scale model training. Baichuan Intelligent
Technology built a high-quality pre-training corpus drawing on Chinese data and integrating
high-quality English data. It combines Chinese and English internet data, open-source Chinese
and English datasets, and a large body of curated knowledge data.
