Large language models (LLMs) have recently taken center stage on Europe’s digital sovereignty agenda with the launch of a new program called OpenEuroLLM. This initiative aims to develop a series of open source LLMs covering all European Union languages, including the 24 official EU languages and the languages of countries seeking entry into the EU market, such as Albania. The project, co-led by Jan Hajič and Peter Sarlin, involves collaboration between 20 organizations and is part of Europe’s broader push for digital sovereignty.
The OpenEuroLLM project has a budget of €37.4 million, with funding coming from the EU’s Digital Europe Programme. Partners include EuroHPC supercomputer centers in several European countries. Despite the ambitious goal of developing multilingual LLMs, some have raised concerns about the project’s feasibility due to the involvement of multiple organizations with different priorities.
Jan Hajič, who is also coordinating the High Performance Language Technologies (HPLT) project, sees OpenEuroLLM as a continuation of HPLT with a focus on generative LLMs. The project aims to release the first versions by mid-2026 and final iterations by 2028. Rather than starting from scratch on data and tools, the project builds on the existing work and expertise of its partners.
Participating organizations include academic and research institutions from Czechia, the Netherlands, Germany, Sweden, Finland, and Norway, as well as corporate entities like Silo AI, Aleph Alpha, Ellamind, Prompsit Language Engineering, and LightOn. Notably absent from the list is Mistral, a French AI company known for its open source approach. While efforts were made to involve Mistral in the project, discussions did not progress.
The project’s ultimate goal is to create foundation models for transparent AI in Europe that preserve linguistic and cultural diversity. This includes developing a core multilingual LLM for general-purpose tasks and smaller, more efficient versions for edge applications. Detailed plans for the project’s deliverables are still in development, with a focus on balancing size and quality.
The OpenEuroLLM project is striving to create a large language model that is proficient in all languages, with a particular focus on ensuring equality across the board. However, achieving this goal may be challenging, especially for languages with limited digital resources. To address this, the project is working on establishing true benchmarks that are representative of each language and its cultural nuances.
One of the key components of the project is the data it utilizes. The HPLT project has released version 2.0 of its dataset, which includes 4.5 petabytes of web crawls and over 20 billion documents. Additionally, data from Common Crawl, an open repository of web-crawled data, will be incorporated into the mix to further enhance the model’s training.
In open source AI, there is an ongoing debate about what constitutes true openness. While the Open Source Initiative has defined open source AI, opinions differ on whether training data should be included in the definition. The OpenEuroLLM project aims to be as open as possible, but certain limitations may require it to keep some training data confidential, although that data will be accessible for auditing purposes as per EU regulations.
Despite its commitment to openness, the OpenEuroLLM project has faced criticism for similarities to the EuroLLM project, which launched earlier in Europe with EU funding. The two projects share common goals of creating open source language models for European languages, but due to funding restrictions, collaborations between the two may be limited.
In terms of funding, the OpenEuroLLM project is confident that it will have sufficient resources to support its goals. Partnering with EuroHPC centers, which have invested billions in AI and compute infrastructure, is expected to cover the project’s compute needs. The focus of the project is on building foundational models rather than consumer or enterprise-grade products, which helps streamline the budget allocation.
Overall, the OpenEuroLLM project is dedicated to creating a high-quality, open source language model that can serve as foundational AI infrastructure for companies in Europe. With a strong focus on data quality, cultural representation, and collaboration within the EU, the project aims to make significant strides in the field of AI language models.
The upcoming Europa models from OpenEuroLLM are intended to support all European languages, building upon current models that already cover a handful of European languages. This is consistent with the point, emphasized by Hajič, that the project is not starting from scratch but builds on existing expertise and technology.
Critics have pointed out the complexity of OpenEuroLLM, but Hajič views this as a positive aspect. He believes that collaborative projects, leveraging both academic expertise and industry focus, can bring about innovative solutions. The goal is not to compete with Big Tech or billion-dollar AI startups, but to achieve digital sovereignty through (mostly) open foundation LLMs developed by and for Europe.
Hajič emphasizes the importance of having a European-based model, even if it may not be the top performer globally. The focus is on creating a model whose components are all built within Europe, which he regards as a positive outcome regardless of rankings. This commitment to digital sovereignty is what sets OpenEuroLLM apart.