ChatGPT and its analogs are trained mainly on English-language data. In this hierarchy, Kazakh is secondary: the less data a language has in the training set, the worse the model understands it. Grammar, idioms, historical context, and legal terminology are frequent sources of error.
There is another point that is discussed less frequently. An AI model doesn’t just answer questions; it shapes the user’s worldview. If this worldview is built on data collected outside the country, with external priorities and interpretations of history, it is no longer just a technical issue. It is a question of whose values and meanings are being broadcast by a tool used by millions.
What Kazakhstan already has
KazLLM was developed at ISSAI at Nazarbayev University and comes in versions with 8 billion and 70 billion parameters. AlemLLM has been adapted specifically for the Kazakh and Russian languages. These are not just concepts or presentations; they are functional systems already forming the foundation for government services.
It is important to understand why Gemini fails in a Kazakh context less often than other models: it has access to all of Google, including the Kazakh-speaking segment of the web. But even this is not a solution. Google makes decisions in the interest of Google, not in the interest of Kazakhstani users.
The world is already building its own
This is not a local initiative; it is a global trend unfolding right now. Germany is launching sovereign open models. Singapore created SEA-LION for Southeast Asian languages. The UAE is promoting the Arabic Falcon series. India is investing billions into BharatGen for 22 languages. Analysts estimate that by 2027, such models will appear in at least 25 countries.
The Turkic world as a potential market for Kazakhstani linguistic developments is not a fantasy, but a real opportunity if we act now.
The risk no one talks about
OpenAI records multibillion-dollar losses annually. Anthropic, Mistral, and most leading AI companies exist on venture capital and infrastructure subsidies from Microsoft, Amazon, and Google. This is not a secret; it is in publicly reported financial figures.
When investors eventually demand profitability, and they will, the first to suffer will be the markets that do not generate sufficient revenue. These are the smaller, non-English-speaking markets without major corporate contracts. In this classification, Kazakhstan is not at the front of the line. Prices will rise, quality for peripheral languages will drop, and access may be restricted for political reasons. A sovereign model protects us not from technology, but from the commercial decisions of others.
What it means to train a model on Kazakh data
Training on Kazakh data is not simply "translating ChatGPT into Kazakh." A model trained on Kazakhstani laws, literature, medical documents, and state standards understands context differently. It won’t "hallucinate" Kazakh historical events based on a version from English Wikipedia. It won’t confuse legal norms with the regulations of another jurisdiction. For education, medicine, and public administration, this is the difference between a tool and a source of errors.
There is also a long-term argument. Every time a Kazakhstani user asks a foreign AI a question, they are "feeding" it their data. This data improves the foreign model. Having our own infrastructure means that the data of Kazakhstani users works for the Kazakhstani system rather than leaking abroad to make a competitor better.