lid-neural-25.1
lid-neural-25 is a language identification model covering 25 languages,
fine-tuned from xlm-roberta-base as a sequence classifier. It gives up the
fastText sibling's small, CPU-only footprint in exchange for higher accuracy,
especially on short, harder-to-classify text.
- 98.2% accuracy on short text (questions/queries) vs. 97.3% for the fastText sibling, with the gap concentrated in a few languages, notably Japanese and Amharic (see Benchmarks).
- Needs a GPU to fine-tune; inference runs acceptably on CPU.
- Two checkpoints for two input lengths, same split as the fastText sibling.
For a much smaller, CPU-only alternative at a small accuracy cost, see
lid-lite-25 (fastText).
Model Details
| Model | Trained on | Use for |
|---|---|---|
| lid-neural-25.1 | Long-form text (paragraph-length) | Documents, articles, passages |
| lid-neural-25.2 | Short text (sentence/question-length) | Search queries, chat messages, short user input |

| Property | Value |
|---|---|
| Base model | xlm-roberta-base |
| Parameters | 125M |
| Max sequence length | 128 tokens |
| Languages | 25 (see below) |
| Training data | olaverse/qg-passages-multi |
| Training | 2 epochs, batch size 16 × grad-accum 4, lr 2e-5 |
Languages: af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu (ISO 639-1; ISO 639-3: afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som spa swh tur vie xho yor zul)
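The Usage example shows a three-letter label (`'eng'`), which suggests the classifier emits ISO 639-3 codes. Assuming that holds for all 25 labels, a small lookup built from the two code lists above converts predictions to ISO 639-1. The names `TO_ISO_639_1` and `to_two_letter` are illustrative and not part of the model.

```python
# Codes copied from the "Languages" line above, in the same order.
ISO_639_1 = (
    "af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu"
).split()
ISO_639_3 = (
    "afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus "
    "sna som spa swh tur vie xho yor zul"
).split()

# Three-letter label -> two-letter code (assumes labels are ISO 639-3).
TO_ISO_639_1 = dict(zip(ISO_639_3, ISO_639_1))


def to_two_letter(label: str) -> str:
    """Return the ISO 639-1 code for a three-letter label; KeyError if unknown."""
    return TO_ISO_639_1[label]


print(to_two_letter("eng"))  # en
```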
Usage
```python
from transformers import pipeline

clf = pipeline("text-classification", model="olaverse/lid-neural-25.2")  # or .1 for passages
result = clf("What causes ocean tides?")
print(result)  # [{'label': 'eng', 'score': 0.999...}]
```
Use .1 for long-form text, .2 for short queries (see Model Details above).
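When input length varies, the checkpoint choice can be automated. The following is a minimal sketch, not an official helper: the 200-character cutoff is a hypothetical value chosen for illustration (the card does not define where "short" ends), and character count is used instead of word count because Japanese text is not space-delimited.

```python
LONG_FORM = "olaverse/lid-neural-25.1"   # trained on paragraph-length text
SHORT_FORM = "olaverse/lid-neural-25.2"  # trained on sentence/question-length text


def choose_checkpoint(text: str, long_text_chars: int = 200) -> str:
    """Pick a checkpoint ID by input length.

    long_text_chars is a hypothetical cutoff, not a value from the model card.
    """
    return LONG_FORM if len(text.strip()) >= long_text_chars else SHORT_FORM


print(choose_checkpoint("What causes ocean tides?"))  # olaverse/lid-neural-25.2
```

The returned ID can be passed as the `model` argument of `pipeline("text-classification", ...)`. Tune the cutoff on your own data before relying on it.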
Benchmarks
Evaluated on a held-out validation split (5% of the training data, not seen
during training), the same split used to evaluate lid-lite-25, so the numbers
are directly comparable.
lid-neural-25.1 (passages): overall accuracy 99.9% (n=2,498)
Matches the fastText sibling at this input length. All languages ≥ 0.989 F1.
lid-neural-25.2 (questions): overall accuracy 98.2% (n=7,419)
| Language | lid-lite-25 F1 | lid-neural-25.2 F1 | Δ |
|---|---|---|---|
| amh | 0.972 | 0.992 | +0.020 |
| jpn | 0.911 | 0.997 | +0.086 |
| xho | 0.780 | 0.786 | +0.006 |
| zul | 0.769 | 0.789 | +0.020 |
| (remaining 21 languages) | ≥0.983 | ≥0.993 | small, consistent |
The neural variant closes most of the fastText model's short-text gaps, notably Japanese (+8.6 F1 points) and Amharic (+2.0 F1 points), but does not resolve the Zulu/Xhosa confusion, which persists at a similar magnitude in both architectures.
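To reproduce this kind of table on your own labelled data, overall accuracy and per-language F1 can be computed from gold and predicted labels. The sketch below uses the standard one-vs-rest definition F1 = 2·TP / (2·TP + FP + FN); the card does not state which implementation produced the reported numbers, so treat this as an assumption. The toy labels are made up for illustration.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_language_f1(gold, pred):
    """One-vs-rest F1 per label: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed a true g
    scores = {}
    for lang in set(gold) | set(pred):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores[lang] = 2 * tp[lang] / denom if denom else 0.0
    return scores


# Toy data (not from the benchmark): one Zulu example mistaken for Xhosa.
gold = ["zul", "zul", "xho", "xho", "eng"]
pred = ["zul", "xho", "xho", "xho", "eng"]
print(accuracy(gold, pred))
print(per_language_f1(gold, pred))
```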
Known limitations
- Zulu/Xhosa confusion on short text persists even with a much larger, fine-tuned transformer: both score F1 ~0.79 on lid-neural-25.2, versus ≥0.98 for every other language. Since two structurally different model families (linear n-gram classifier vs. transformer) show the same failure pattern, this looks like a genuine linguistic difficulty (Zulu and Xhosa are closely related Nguni languages with substantial shared vocabulary and orthography) rather than a fixable model weakness. Treat predictions between these two with reduced confidence on short input.
- Same training-data caveats as lid-lite-25: machine/teacher-generated text, untested on human social media or code-switched input.
- Requires a GPU to fine-tune further; inference is CPU-feasible but slower than lid-lite-25.
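One way to act on the Zulu/Xhosa caveat is to flag predictions where both languages rank at the top with a narrow score margin. This is a sketch under assumptions: the `zul`/`xho` label strings follow the codes listed above, the input is the list of `{'label', 'score'}` dicts that the text-classification pipeline returns for a single input when called with `top_k=None` (some transformers versions nest this list one level deeper), and the `min_margin` default of 0.5 is a hypothetical value to tune on your own data.

```python
NGUNI_PAIR = {"zul", "xho"}


def flag_nguni_ambiguity(scores, min_margin=0.5):
    """Return the top prediction plus an 'ambiguous' flag for Zulu/Xhosa ties.

    scores: list of {'label': str, 'score': float} dicts for one input.
    min_margin: hypothetical cutoff on the top-two score gap.
    """
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else None
    ambiguous = (
        runner_up is not None
        and top["label"] in NGUNI_PAIR
        and runner_up["label"] in NGUNI_PAIR
        and top["score"] - runner_up["score"] < min_margin
    )
    return {"label": top["label"], "score": top["score"], "ambiguous": ambiguous}


# Hand-written scores for illustration, not real model output.
example = [
    {"label": "xho", "score": 0.42},
    {"label": "zul", "score": 0.55},
    {"label": "eng", "score": 0.03},
]
print(flag_nguni_ambiguity(example))
```

Flagged inputs can then be routed to a fallback, such as asking for more text or returning both candidates.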
Training data & licensing
Fine-tuned from xlm-roberta-base (MIT) on
olaverse/qg-passages-multi
(Apache-2.0). Released under Apache-2.0.
Citation
```bibtex
@misc{lid-neural-25,
  title = {lid-neural-25},
  author = {Olaverse},
  year = {2026},
  url = {https://huggingface.co/olaverse/lid-neural-25.1}
}
```