lid-neural-25.1

lid-neural-25 is a language identification model covering 25 languages, fine-tuned from xlm-roberta-base as a sequence classifier. It trades fastText's small, CPU-only footprint for higher accuracy, especially on short, harder-to-classify text.

98.2% accuracy on short text (questions/queries), vs. 97.3% for the fastText sibling; the gain is concentrated in a few languages, notably Japanese and Amharic (see Benchmarks).
Fine-tuning needs a GPU; inference runs on CPU, though more slowly than the fastText sibling.
Two checkpoints for two input lengths (long-form and short text), the same split as the fastText sibling.

For a much smaller, CPU-only alternative at a small accuracy cost, see lid-lite-25 (fastText).

🗒️ Model Details

| Model | Trained on | Use for |
|---|---|---|
| `lid-neural-25.1` | Long-form text (paragraph-length) | Documents, articles, passages |
| `lid-neural-25.2` | Short text (sentence/question-length) | Search queries, chat messages, short user input |


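| Property | Value |
|---|---|
| Base model | `xlm-roberta-base` |
| Parameters | ~278M (0.3B) |
| Max sequence length | 128 tokens |
| Languages | 25 (see below) |
| Training data | `olaverse/qg-passages-multi` |
| Training | 2 epochs, batch size 16 × grad-accum 4 (64 sequences per optimizer step per device), lr 2e-5 |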
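
As a rough sketch, the training hyperparameters listed above could map onto Hugging Face `TrainingArguments` as follows. The keyword names are real `TrainingArguments` parameters, but the `output_dir` value, the single-device default, and the helper names are illustrative assumptions and are not taken from the original training script.

```python
# Hyperparameters from the Model Details table, keyed by TrainingArguments names.
TRAINING_KWARGS = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
}


def effective_batch_size(per_device: int, grad_accum: int, n_devices: int = 1) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device * grad_accum * n_devices


def build_training_args(output_dir: str = "lid-neural-25-run"):
    """Build TrainingArguments; output_dir is a hypothetical path."""
    # Imported lazily so the arithmetic above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


# 16 x 4 = 64 sequences per optimizer step, assuming a single device.
print(effective_batch_size(16, 4))
```

With more than one GPU the effective batch size scales with the device count, so the learning rate may need revisiting if the setup differs.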

Languages: af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu (ISO 639-1; ISO 639-3: afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som spa swh tur vie xho yor zul)
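
The two code lists above are given in the same order, so they can be zipped into a lookup table. The sketch below assumes the model emits ISO 639-3 labels (as the `'eng'` label in the usage example suggests) and that a downstream system wants two-letter codes; the names `ISO1_TO_ISO3`, `ISO3_TO_ISO1`, and `to_iso1` are illustrative.

```python
# Both lists copied from the Languages line, in the same order.
ISO1 = "af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu".split()
ISO3 = (
    "afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus "
    "sna som spa swh tur vie xho yor zul"
).split()

# Positional pairing: af <-> afr, am <-> amh, ..., zu <-> zul.
ISO1_TO_ISO3 = dict(zip(ISO1, ISO3))
ISO3_TO_ISO1 = {iso3: iso1 for iso1, iso3 in ISO1_TO_ISO3.items()}


def to_iso1(label: str) -> str:
    """Convert a three-letter model label to its two-letter code."""
    return ISO3_TO_ISO1[label]


print(to_iso1("jpn"))  # ja
```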

🏃 Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="olaverse/lid-neural-25.2")  # or .1 for passages
result = clf("What causes ocean tides?")
print(result)  # [{'label': 'eng', 'score': 0.999...}]
```

Use .1 for long-form text, .2 for short queries — see Model Details above.
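
For mixed workloads, one possible way to route inputs between the two checkpoints and run batched CPU inference is sketched below. The 200-character threshold is an arbitrary assumption (the card only distinguishes paragraph-length from sentence-length text), and `choose_checkpoint`, `load_classifier`, and `identify` are hypothetical helper names. Character length is used instead of word count because Japanese is not space-delimited.

```python
from functools import lru_cache

LONG_MODEL = "olaverse/lid-neural-25.1"   # passages
SHORT_MODEL = "olaverse/lid-neural-25.2"  # queries, chat messages


def choose_checkpoint(text: str, char_threshold: int = 200) -> str:
    """Pick a checkpoint by input length; the threshold is an assumption."""
    return LONG_MODEL if len(text) >= char_threshold else SHORT_MODEL


@lru_cache(maxsize=2)
def load_classifier(model_id: str):
    """Load each pipeline once; device=-1 runs on CPU."""
    from transformers import pipeline  # lazy import, downloads the model on first use

    return pipeline("text-classification", model=model_id, device=-1)


def identify(texts, model_id: str, batch_size: int = 32):
    """Classify a list of texts, truncating to the 128-token training length."""
    clf = load_classifier(model_id)
    return clf(list(texts), batch_size=batch_size, truncation=True, max_length=128)


print(choose_checkpoint("What causes ocean tides?"))  # olaverse/lid-neural-25.2
```

Calling `identify(["What causes ocean tides?"], SHORT_MODEL)` would then return one `{'label': ..., 'score': ...}` dict per input.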

📊 Benchmarks

Evaluated on a held-out validation split (5% of the dataset, withheld from training). The same split was used to evaluate lid-lite-25, so the numbers are directly comparable.

`lid-neural-25.1` (passages) — overall accuracy: 99.9% (n=2,498)

Matches the fastText sibling at this input length. All languages ≥ 0.989 F1.

`lid-neural-25.2` (questions) — overall accuracy: 98.2% (n=7,419)

| Language | lid-lite-25 F1 | lid-neural-25.2 F1 | Δ |
|---|---|---|---|
| amh | 0.972 | 0.992 | +0.020 |
| jpn | 0.911 | 0.997 | +0.086 |
| xho | 0.780 | 0.786 | +0.006 |
| zul | 0.769 | 0.789 | +0.020 |
| (remaining 21 languages) | ≥0.983 | ≥0.993 | small, consistent |

The neural variant closes most of the fastText model's short-text gaps — notably Japanese (+8.6 F1 points) and Amharic — but does not fully resolve the Zulu/Xhosa confusion, which persists at a similar magnitude in both architectures.
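
The card does not include its evaluation code, so the following is only a sketch of how per-language F1 of this kind is commonly computed (one-vs-rest, F1 = 2·TP / (2·TP + FP + FN)). The gold and predicted labels are made-up toy data, not results from either model.

```python
from collections import Counter


def per_language_f1(gold, pred):
    """One-vs-rest F1 for each language seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed
    scores = {}
    for lang in set(gold) | set(pred):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores[lang] = 2 * tp[lang] / denom if denom else 0.0
    return scores


# Toy example: one Zulu sentence misclassified as Xhosa.
gold = ["zul", "zul", "xho", "xho", "eng", "eng"]
pred = ["zul", "xho", "xho", "xho", "eng", "eng"]
for lang, f1 in sorted(per_language_f1(gold, pred).items()):
    print(lang, round(f1, 3))  # eng 1.0, xho 0.8, zul 0.667
```

A single zul→xho error lowers the F1 of both languages at once, which is why a confusable pair tends to show paired drops like the ones in the table.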

⚠️ Known limitations

Zulu/Xhosa confusion on short text persists even with a much larger, fine-tuned transformer: both languages score F1 ~0.79 on lid-neural-25.2, versus ≥0.99 for every other language. Two structurally different model families (a linear n-gram classifier and a transformer) show the same failure pattern, so this looks like a genuine linguistic difficulty rather than a fixable model weakness; Zulu and Xhosa are closely related Nguni languages with substantial shared vocabulary and orthography. Treat predictions between these two with reduced confidence on short input.
Same training-data caveats as lid-lite-25: machine/teacher-generated text, untested on human social media or code-switched input.
Requires a GPU to fine-tune further; inference is CPU-feasible but slower than lid-lite-25.
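
One way to act on the Zulu/Xhosa caveat is to post-process the classifier's scores and return an explicit "ambiguous" result when the two languages are close. This is a sketch under assumptions: the 0.5 margin is an untuned placeholder, `resolve_label` is a hypothetical helper, and the input is assumed to be the list of per-class scores that recent transformers versions return for `clf(text, top_k=None)`.

```python
CONFUSABLE = {"zul", "xho"}


def resolve_label(scores, min_margin: float = 0.5) -> str:
    """Return the top label, or 'zul/xho' when the pair is too close to call.

    scores: list of {'label': str, 'score': float} dicts for one input.
    min_margin: assumed threshold; tune it on your own short-text data.
    """
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]
    if top["label"] not in CONFUSABLE:
        return top["label"]
    # Score of the other Nguni language, or 0.0 if it is not in the list.
    other = next((s["score"] for s in ranked[1:] if s["label"] in CONFUSABLE), 0.0)
    if top["score"] - other < min_margin:
        return "zul/xho"
    return top["label"]


# Synthetic scores for illustration only.
example = [
    {"label": "zul", "score": 0.55},
    {"label": "xho", "score": 0.43},
    {"label": "eng", "score": 0.02},
]
print(resolve_label(example))  # zul/xho
```

Downstream code can then fall back to a longer context window, user locale, or manual review for the ambiguous cases instead of trusting a near coin-flip.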

Training data & licensing

Fine-tuned from xlm-roberta-base (MIT) on olaverse/qg-passages-multi (Apache-2.0). Released under Apache-2.0.

Citation

@misc{lid-neural-25,
  title  = {lid-neural-25},
  author = {Olaverse},
  year   = {2026},
  url    = {https://huggingface.co/olaverse/lid-neural-25.1}
}
