lid-neural-25.2

lid-neural-25 is a language identification model covering 25 languages, fine-tuned from xlm-roberta-base as a sequence classifier. It trades the fastText sibling's small, CPU-only footprint for higher accuracy, especially on short, harder-to-classify text.

  • 98.2% accuracy on short text (questions/queries) vs. 97.3% for the fastText sibling, with the gap concentrated in a few languages, notably Japanese and Amharic (see Benchmarks).
  • Fine-tuning needs a GPU; inference runs acceptably on CPU.
  • Two checkpoints for two input lengths, same split as the fastText sibling.

For a much smaller, CPU-only alternative at a small accuracy cost, see lid-lite-25 (fastText).

🗒️ Model Details

| Model | Trained on | Use for |
| --- | --- | --- |
| lid-neural-25.1 | Long-form text (paragraph-length) | Documents, articles, passages |
| lid-neural-25.2 | Short text (sentence/question-length) | Search queries, chat messages, short user input |

| Property | Value |
| --- | --- |
| Base model | xlm-roberta-base |
| Parameters | ~0.3B (about 278M, as in xlm-roberta-base) |
| Max sequence length | 128 tokens |
| Languages | 25 (see below) |
| Training data | olaverse/qg-passages-multi |
| Training | 2 epochs, batch size 16 × grad-accum 4 (effective batch size 64), lr 2e-5 |
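
As a rough illustration of the training row, the listed hyperparameters map onto Hugging Face `TrainingArguments` as sketched below. Only the four values in `HPARAMS` and the 128-token limit come from this card; the output directory is hypothetical, and reading "batch size 16" as a per-device value on a single GPU is an assumption rather than the recorded training setup.

```python
# Values taken from the Model Details table; everything else is an assumption.
HPARAMS = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,  # assumes "batch size 16" is per device
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
}
MAX_SEQ_LENGTH = 128  # tokens; would be applied when tokenizing, not here
NUM_LABELS = 25       # one class per supported language


def effective_batch_size(hparams, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        hparams["per_device_train_batch_size"]
        * hparams["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir="lid-neural-25-run"):  # hypothetical path
    """Build TrainingArguments from HPARAMS."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


print(effective_batch_size(HPARAMS))  # 64 on a single device
```

Gradient accumulation is what lets a batch of 16 fit in GPU memory while still updating the weights on 64 examples per step.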

Languages: af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu (ISO 639-1; ISO 639-3: afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som spa swh tur vie xho yor zul)
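
The two code lists above are given in the same order, so a lookup between them can be built by zipping them together. This is a small convenience sketch, not part of the model. It assumes the classifier emits the three-letter codes, which is inferred from the `'eng'` label in the usage example; the helper name `to_two_letter` is hypothetical.

```python
# Both lists copied from the Languages line, in the same order.
ISO_639_1 = (
    "af am de en fr ha hi ig id it ja ko nl pl pt ru sn so es sw tr vi xh yo zu"
).split()
ISO_639_3 = (
    "afr amh deu eng fra hau hin ibo ind ita jpn kor nld pol por rus sna som "
    "spa swh tur vie xho yor zul"
).split()

# Guard against a code being dropped or duplicated when editing the lists.
assert len(ISO_639_1) == len(ISO_639_3) == 25

TO_639_1 = dict(zip(ISO_639_3, ISO_639_1))
TO_639_3 = dict(zip(ISO_639_1, ISO_639_3))


def to_two_letter(label):
    """Map a three-letter label to its two-letter code, or None if unknown."""
    return TO_639_1.get(label)


print(to_two_letter("jpn"))  # ja
```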

🏃 Usage

from transformers import pipeline

clf = pipeline("text-classification", model="olaverse/lid-neural-25.2")  # or .1 for passages
result = clf("What causes ocean tides?")
print(result)  # [{'label': 'eng', 'score': 0.999...}]

Use .1 for long-form text and .2 for short queries; see Model Details above.
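
When inputs of mixed length arrive through one code path, the choice between the two checkpoints can be automated. The sketch below is one possible way to do that and is not part of the released models: the 200-character cut-off and the helper names `pick_checkpoint` and `identify` are assumptions made for illustration. Characters are counted instead of words so that languages written without spaces, such as Japanese, are handled the same way.

```python
MODEL_LONG = "olaverse/lid-neural-25.1"   # passages
MODEL_SHORT = "olaverse/lid-neural-25.2"  # queries and short messages


def pick_checkpoint(text, max_short_chars=200):
    """Return the checkpoint id for this input.

    max_short_chars is an illustrative cut-off, not a documented value.
    """
    return MODEL_SHORT if len(text.strip()) <= max_short_chars else MODEL_LONG


_pipelines = {}  # cache so each checkpoint is loaded at most once


def identify(text):
    """Classify text with the checkpoint that matches its length."""
    # Lazy import: the model is downloaded on first use.
    from transformers import pipeline

    model_id = pick_checkpoint(text)
    if model_id not in _pipelines:
        _pipelines[model_id] = pipeline("text-classification", model=model_id)
    # truncation=True keeps long inputs within the model's sequence limit.
    return _pipelines[model_id](text, truncation=True)[0]


print(pick_checkpoint("What causes ocean tides?"))  # olaverse/lid-neural-25.2
```

The cut-off is worth tuning on your own traffic, since the card does not state where "short" ends and "long-form" begins.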

📊 Benchmarks

Evaluated on a held-out validation split (5% of the training data, not seen during training). The same split was used to evaluate lid-lite-25, so the two models can be compared directly.

lid-neural-25.1 (passages): overall accuracy 99.9% (n=2,498)

Matches the fastText sibling at this input length. All languages ≥ 0.989 F1.

lid-neural-25.2 (questions): overall accuracy 98.2% (n=7,419)

| Language | lid-lite-25 F1 | lid-neural-25.2 F1 | Δ |
| --- | --- | --- | --- |
| amh | 0.972 | 0.992 | +0.020 |
| jpn | 0.911 | 0.997 | +0.086 |
| xho | 0.780 | 0.786 | +0.006 |
| zul | 0.769 | 0.789 | +0.020 |
| (remaining 21 languages) | ≥0.983 | ≥0.993 | small, consistent |

The neural variant closes most of the fastText model's short-text gaps, notably Japanese (+8.6 F1 points) and Amharic, but does not fully resolve the Zulu/Xhosa confusion, which persists at a similar magnitude in both architectures.
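
For readers who want to run the same kind of per-language comparison on their own data, per-label F1 can be computed from gold and predicted labels as sketched below. This is a generic implementation of the standard F1 definition, not the evaluation script behind the numbers above, and the toy labels are made up to show how a two-way confusion pulls down both languages at once.

```python
def per_label_f1(gold, pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up labels for illustration: one zul/xho swap in each direction.
gold = ["zul", "zul", "xho", "xho", "eng", "eng"]
pred = ["zul", "xho", "xho", "zul", "eng", "eng"]

scores = per_label_f1(gold, pred)
print(scores)  # {'eng': 1.0, 'xho': 0.5, 'zul': 0.5}
```

Each swapped example counts as a false negative for one language and a false positive for the other, which is why the xho and zul scores in the table move together.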

⚠️ Known limitations

  • Zulu/Xhosa confusion on short text persists even with a much larger, fine-tuned transformer: both score F1 ~0.79 on lid-neural-25.2, versus ≥0.99 for every other language. Two structurally different model families (a linear n-gram classifier and a transformer) show the same failure pattern, so this looks like a genuine linguistic difficulty rather than a fixable model weakness. Zulu and Xhosa are closely related Nguni languages with substantial shared vocabulary and orthography. Treat predictions between these two with reduced confidence on short input.
  • Same training-data caveats as lid-lite-25: machine/teacher-generated text, untested on human social media or code-switched input.
  • Requires a GPU to fine-tune further; inference is CPU-feasible but slower than lid-lite-25.
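
One way to act on the Zulu/Xhosa caveat is to post-process predictions so that downstream code sees the ambiguity explicitly. The sketch below is a suggestion, not behaviour built into the model: the `annotate` helper, the `low_confidence` and `candidates` fields, and the 200-character definition of "short" are all hypothetical, and the label strings assume the three-letter codes shown in the usage example.

```python
CONFUSABLE = {"xho", "zul"}  # the pair called out in the limitations above


def annotate(prediction, text, short_chars=200):
    """Add a low-confidence flag to one pipeline result.

    prediction: a dict such as {"label": "zul", "score": 0.93}
    short_chars: hypothetical cut-off for "short input"; tune on your own data
    """
    is_short = len(text.strip()) <= short_chars
    ambiguous = is_short and prediction["label"] in CONFUSABLE
    return {
        **prediction,
        "low_confidence": ambiguous,
        # When flagged, report both languages instead of a single guess.
        "candidates": sorted(CONFUSABLE) if ambiguous else [prediction["label"]],
    }


# Made-up predictions, for illustration only.
print(annotate({"label": "zul", "score": 0.93}, "Unjani namhlanje?"))
print(annotate({"label": "eng", "score": 0.99}, "What causes ocean tides?"))
```

The model's own score is deliberately ignored when setting the flag, because a high softmax score on this pair does not necessarily mean the prediction is reliable.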

Training data & licensing

Fine-tuned from xlm-roberta-base (MIT) on olaverse/qg-passages-multi (Apache-2.0). Released under Apache-2.0.

Citation

@misc{lid-neural-25,
  title  = {lid-neural-25},
  author = {Olaverse},
  year   = {2026},
  url    = {https://huggingface.co/olaverse/lid-neural-25.1}
}