Pleias-RAG-1B-gguf

Pleias-RAG-1B-gguf is a 1.2 billion parameters Small Reasoning Model, trained for retrieval-augmented general (RAG), search and source summarization. Along with Pleias-RAG-1B it belongs to the first generation of Pleias specialized reasoning models.

Pleias-RAG-1B outperform most SLMs (4 billion parameters and below) on standardized benchmarks for retrieval-augmented general (HotPotQA, 2wiki) and is competitive with standard 7-8b models including Qwen-2.5-7B and Llama-3.1-8B. It is the only SLM to date to maintain consistent RAG performance across leading European languages and to ensure systematic reference grounding for statements.

Due to its size, ease of deployment on constrained infrastructure (including mobile phone) and built-in support for factual and accurate information, Pleias-RAG-1B unlock a range of new use cases for generative AI.

This version is unquantized but, thanks to the small size of the original model, still runnable on most CPU infrastructure under reasonable time without any loss of performance.

Features

Pleias-RAG-1B is a specialized language model using a series of special tokens to process a structured input (query and sources) and generate a structured output (reasoning sequence and answer with sources). For easier implementation, we encourage to use the associated API library.

Citation support

Pleias-RAG-1B natively generated grounded answers on the basis of excerpts and citations extracted from the provided sources, using a custom syntax inspired by Wikipedia () It is one a handful open weights model to date to have been developed with this feature and the first one designed for actual deployment.

In contrast with Anthropic approach (Citation mode), citation are integrally generated by the model and are not the product of external chunking. As a result we can provide another desirable feature to simplify source checking: citation shortening for longer excerpts (using "(…)").

RAG reasoning

Pleias-RAG-1B generates a specific reasoning sequences incorporating several proto-agentic abilities for RAG applications. The model is able to make a series of decisions directly:

Assessing whether the query is understandable.
Assessing whether the query is trivial enough to not require a lengthy pre-analysis (adjustable reasoning)
Assessing whether the sources do contain enough input to generate a grounded answer.

The structured reasoning traces include the following steps:

Language detection of the query. The model will always strive to answer in the language of the original query.
Query analysis and associated query report. The analysis can either lead to a standard answer, a shortening reasoning trace/answer for trivial question, a reformulated query or a refusal (that could in the context of the application be transformed into user input querying).
Source analysis and associated source report. This step evaluates the coverage and depth of the provided sources in regards to the query.
Draft of the final answer.

Multilinguality

Pleias-RAG-1B is able to read and write in the main European languages: French, German, Italian, Spanish, Polish, Latin and Portuguese.

To date, it is the only SLM with negligible loss of performance in leading European languages for RAG-related tasks. On a translated set of HotPotQA we observed a significant drop of performance in most SLMs from 10% to 30-35% for sub-1B models.

We do expect the results of any standard English evaluation on Pleias RAG models should be largely transferable to the main European languages limiting the costs of evaluation and deployment in multilingual settings.

Training

Pleias-RAG-1B is trained on large synthetic dataset emulating retrieval of wide variety of multilingual open sources from Common Corpus. They provide native support for citation and grounding with literal quotes. Following on the latest trends of agentification, the models reintegrate multiple features associated with RAG workflows such as query routing, query reformulation, source reranking.

Evaluation

Pleias-RAG-1B has been evaluated on three standard RAG benchmarks, 2wiki, HotpotQA and MuSique.

All the benchmarks only assess the "trivial" mode on questions requiring some form of multi-hop reasoning over sources (answer disseminated into different sources) as well as discrimination of distractor sources.

Deployment

With 1.2B parameters, Pleias-RAG-1B can be readily deployed in many constrained infrastructures, including desktop systems on CPU RAM.

We also release an unquantized GGUF version for deployment on CPU. Our internal performance benchmarks suggest that waiting times are currently acceptable for most either even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8g RAM and below. Since the model is unquantized, quality of text generation should be identical to the original model.

Once integrated into a RAG system, Pleias-RAG-1B can also be used in a broader range of non-conversational use cases including user support or educational assistance. Through this release, we aims to make SLMs workable in production by relying systematically on an externalized memory.

PleIAs
/

Pleias-RAG-1B-gguf