LatinCy Vectors
Static word vectors for Latin, trained on the LatinCy corpus. Provides Floret, FastText, Word2Vec, and GloVe embeddings trained on the same data and evaluated on the same benchmarks for direct cross-method comparison.
Available Models
All models are trained with the CBOW architecture, 300 dimensions, window size 10, min count 50, 15 epochs, and 25 negative samples on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources).
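As a rough guide for reproducing comparable settings, the hyperparameters listed above can be written against fastText's Python training API. This is a sketch under stated assumptions, not the project's training script: the argument names (`model`, `dim`, `ws`, `minCount`, `epoch`, `neg`) are fastText's own, the `train` helper and the `corpus.txt` path are hypothetical, and any setting this card does not mention (learning rate, n-gram lengths) is left at the library default.

```python
# Hyperparameters stated in this card, keyed by fastText's argument names.
FASTTEXT_PARAMS = {
    "model": "cbow",   # CBOW architecture
    "dim": 300,        # vector dimensions
    "ws": 10,          # context window size
    "minCount": 50,    # minimum word frequency
    "epoch": 15,       # training epochs
    "neg": 25,         # negative samples
}


def train(corpus_path):
    """Hypothetical helper: train vectors on a plain-text corpus file."""
    # Imported lazily so the settings can be inspected without fastText installed.
    import fasttext

    return fasttext.train_unsupervised(corpus_path, **FASTTEXT_PARAMS)


# Example (not run here; "corpus.txt" is a placeholder path):
# model = train("corpus.txt")
```

Keeping the settings in one dictionary makes it easy to apply the same values to another trainer when comparing methods.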
| Model | Type | Vocab | HF Repo |
|---|---|---|---|
| Floret (lg) | Hash-based subword | 200k buckets | latincy/la_vectors_floret_lg |
| Floret (md) | Hash-based subword | 50k buckets | latincy/la_vectors_floret_md |
| FastText CBOW-300-10 | Subword (n-gram) | 233k words | latincy/la_vectors |
| Word2Vec CBOW-300-10 | Word-level | 233k words | latincy/la_vectors |
| GloVe 300 | Word-level (co-occurrence) | 233k words | latincy/la_vectors |
Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo (latincy/la_vectors).
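The "hash-based subword" entries in the table above refer to storing vectors in a fixed number of buckets rather than one row per word. The toy sketch below illustrates the general idea only: a word is split into character n-grams, each n-gram is hashed to a bucket, and the bucket rows are averaged. The n-gram range, the MD5-based hash, the single hash per n-gram, and the random rows are all assumptions made for illustration; the real floret implementation differs in detail.

```python
import hashlib
import random

BUCKETS = 50_000  # table size, echoing the "50k buckets" of the md model
DIM = 8           # toy dimensionality (the published vectors have 300)


def char_ngrams(word, n_min=3, n_max=5):
    """Return the boundary-marked word plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def bucket(gram, buckets=BUCKETS):
    """Map an n-gram to a bucket index with a stable (illustrative) hash."""
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def row(index, dim=DIM):
    """Stand-in for a trained table row: deterministic pseudo-random values."""
    rng = random.Random(index)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def vector(word):
    """Average the bucket rows of all n-grams; works for any input string."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        total = [a + b for a, b in zip(total, row(bucket(gram)))]
    return [value / len(grams) for value in total]


# Any string gets a vector, including forms never seen in training.
print(len(vector("regibusque")))
```

Because the table size is fixed in advance, the stored model does not grow with the vocabulary, which is the trade-off that makes the bucket count the relevant "vocab" figure for the Floret rows.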
Evaluation
Evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items that no model can solve are excluded per the evaluation methodology, leaving 1,330 solvable analogies and 2,223 solvable odd-one-out items.
| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|---|---|---|---|
| FastText CBOW-300-10 | 84.5% | 96.6% | 73.6% |
| Floret v3.9 (lg) | 81.4% | 95.3% | 74.0% |
| Word2Vec CBOW-300-10 | 70.2% | 91.3% | 79.1% |
| GloVe 300 | 49.5% | 79.2% | 75.1% |
FastText leads on analogy resolution due to subword information that captures Latin morphology. Word2Vec leads on odd-one-out (semantic clustering). GloVe is weaker on analogies because it lacks subword representations. Floret is used in LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive, and supports arbitrary vocabulary via hash-based lookups.
For full evaluation details including per-category breakdowns and nearest-neighbor spot checks, see the evaluation report.
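The two benchmark tasks can be sketched with a few lines of standard-library Python. The scoring rules here are common conventions and are assumptions, not a confirmed description of the LatinCy benchmark: analogies use vector offset (`b - a + c`) ranked by cosine similarity with the query words excluded, and odd-one-out picks the word with the lowest mean similarity to the rest. The `TOY` vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(vectors, a, b, c, expected):
    """1-based rank of `expected` for the analogy a : b :: c : ?"""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1


def odd_one_out(vectors, words):
    """Return the word least similar, on average, to the other words."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


# Invented toy vectors with dimensions [human, royal, female, animal].
TOY = {
    "vir": [1, 0, 0, 0],
    "mulier": [1, 0, 1, 0],
    "rex": [1, 1, 0, 0],
    "regina": [1, 1, 1, 0],
    "canis": [0, 0, 0, 1],
}

print(analogy_rank(TOY, "vir", "mulier", "rex", "regina"))    # 1
print(odd_one_out(TOY, ["rex", "regina", "mulier", "canis"]))  # canis
```

Under this convention, an item counts toward "Rank 1" when the returned rank is 1 and toward "Rank 5" when it is 5 or lower.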
Usage
From HuggingFace Hub
    from huggingface_hub import hf_hub_download

    # FastText binary model
    path = hf_hub_download("latincy/la_vectors", "fasttext/la_fasttext_cbow_300_10.bin")

    # Word2Vec text vectors
    path = hf_hub_download("latincy/la_vectors", "word2vec/la_w2v_cbow_300_10.txt")

    # GloVe vectors
    path = hf_hub_download("latincy/la_vectors", "glove/la_glove_300.txt")
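Each `hf_hub_download` call returns a local file path. For the two text files, a dependency-free reader might look like the sketch below. It assumes the usual text conventions (one word per line followed by its values, with an optional `count dim` header line as in word2vec text output); the actual layout of the files in this repo has not been checked here, and `load_text_vectors` is a hypothetical helper name.

```python
def load_text_vectors(path, limit=None):
    """Read whitespace-separated word vectors into a dict of float lists.

    Skips a leading "count dim" header if one is present. `limit` stops
    after that many words, which is handy for a quick look at a large file.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            parts = line.split()
            # A first line with exactly two integers is treated as a header.
            if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            if len(parts) < 2:
                continue  # blank or malformed line
            vectors[parts[0]] = [float(value) for value in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


# Example (not run here): vectors = load_text_vectors(path, limit=10_000)
```

For full-size files a dedicated library is usually the better choice; this reader is meant for inspection and small experiments.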
Floret (spaCy)
    import spacy

    nlp = spacy.load("la_vectors_floret_lg")
    doc = nlp("rex populum regit")
    # Print each token, whether it has a vector, and the first five values
    for token in doc:
        print(token.text, token.has_vector, token.vector[:5])
Training Corpus
All vectors are trained on the same corpus for valid cross-method comparison.
| Source | Sentences | Tokens |
|---|---|---|
| CC100-Latin | 6,507,840 | 128,886,505 |
| Latin Wikisource | 3,933,289 | 76,736,695 |
| Latin Wikipedia | 972,336 | 15,218,700 |
| CAMENA Neo-Latin | 736,400 | 9,970,933 |
| The Latin Library | 650,082 | 12,822,687 |
| CLTK-Tesserae | 516,930 | 6,626,484 |
| Perseus Digital Library | 223,535 | 4,317,063 |
| Patrologia Latina | 125,333 | 10,399,108 |
| UD Latin treebanks (6) | 55,332 | 980,787 |
| Total | 13,721,077 | 265,958,962 |
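The figures in the table can be checked and summarized with a short script. It recomputes the totals and prints each source's share of tokens and its average sentence length; the numbers are copied from the table, and nothing else is assumed.

```python
# (sentences, tokens) per source, copied from the table above.
CORPUS = {
    "CC100-Latin": (6_507_840, 128_886_505),
    "Latin Wikisource": (3_933_289, 76_736_695),
    "Latin Wikipedia": (972_336, 15_218_700),
    "CAMENA Neo-Latin": (736_400, 9_970_933),
    "The Latin Library": (650_082, 12_822_687),
    "CLTK-Tesserae": (516_930, 6_626_484),
    "Perseus Digital Library": (223_535, 4_317_063),
    "Patrologia Latina": (125_333, 10_399_108),
    "UD Latin treebanks (6)": (55_332, 980_787),
}

total_sentences = sum(sentences for sentences, _ in CORPUS.values())
total_tokens = sum(tokens for _, tokens in CORPUS.values())

# Largest sources first, by token count.
by_tokens = sorted(CORPUS.items(), key=lambda item: item[1][1], reverse=True)
for name, (sentences, tokens) in by_tokens:
    share = 100 * tokens / total_tokens
    print(f"{name:26s} {share:5.1f}% of tokens, {tokens / sentences:6.1f} tokens/sentence")

print(f"Total: {total_sentences:,} sentences, {total_tokens:,} tokens")
```

The recomputed totals match the table's Total row, and CC100-Latin alone accounts for roughly 48% of all tokens.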
Citation
@misc{burns2023latincy,
title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
author = "Burns, Patrick J.",
year = "2023",
eprint = "2305.04365",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
url = "https://arxiv.org/abs/2305.04365"
}
References
- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.
Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.