LatinCy Vectors

Static word vectors for Latin, trained on the LatinCy corpus. Provides Floret, FastText, Word2Vec, and GloVe embeddings trained on the same data and evaluated on the same benchmarks for direct cross-method comparison.

Available Models

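The Floret, FastText, and Word2Vec models use the CBOW architecture with 300 dimensions, window size 10, min count 50, 15 epochs, and 25 negative samples. GloVe is a 300-dimensional co-occurrence model, so the CBOW and negative-sampling settings do not apply to it. All models are trained on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources).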

Model	Type	Vocab	HF Repo
Floret (lg)	Hash-based subword	200k buckets	`latincy/la_vectors_floret_lg`
Floret (md)	Hash-based subword	50k buckets	`latincy/la_vectors_floret_md`
FastText CBOW-300-10	Subword (n-gram)	233k words	`latincy/la_vectors`
Word2Vec CBOW-300-10	Word-level	233k words	`latincy/la_vectors`
GloVe 300	Word-level (co-occurrence)	233k words	`latincy/la_vectors`

Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo (`latincy/la_vectors`).

Evaluation

The models are evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items that no model can solve are excluded per the evaluation methodology, leaving 1,330 solvable analogies and 2,223 solvable odd-one-out items.
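
The scoring procedure is not spelled out in this section. As a rough illustration, the sketch below shows one common way such items are scored: analogies by vector offset (often called 3CosAdd), with a hit at Rank k when the expected word is among the k nearest candidates, and odd-one-out by picking the word with the lowest mean similarity to the others. The toy vectors, the function names, and the choice of scoring method are assumptions for illustration, not the LatinCy evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy_rank(vectors, a, b, c, expected):
    """Rank of `expected` for a : b :: c : ? using the offset b - a + c.

    Returns None if any of the four words is missing from `vectors`.
    """
    if any(w not in vectors for w in (a, b, c, expected)):
        return None
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates, as is conventional.
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1

def odd_one_out(vectors, words):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_sim)

# Toy 2-d vectors, invented for illustration (not taken from any LatinCy model).
toy = {
    "rex": [1.0, 0.0],
    "regina": [1.0, 1.0],
    "dominus": [2.0, 0.0],
    "domina": [2.0, 1.0],
    "mare": [-1.0, 3.0],
}

rank = analogy_rank(toy, "rex", "regina", "dominus", "domina")
print("rank:", rank, "| hit at Rank 5:", rank is not None and rank <= 5)
print("odd one out:", odd_one_out(toy, ["rex", "regina", "dominus", "mare"]))
```

Under this scheme, Rank 1 accuracy is the share of items whose rank equals 1, and Rank 5 accuracy is the share whose rank is at most 5.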

Model	Analogy Rank 1	Analogy Rank 5	Odd-One-Out
FastText CBOW-300-10	84.5%	96.6%	73.6%
Floret v3.9 (lg)	81.4%	95.3%	74.0%
Word2Vec CBOW-300-10	70.2%	91.3%	79.1%
GloVe 300	49.5%	79.2%	75.1%

FastText leads on analogy resolution due to subword information that captures Latin morphology. Word2Vec leads on odd-one-out (semantic clustering). Word2Vec and GloVe both lack subword representations and trail on analogies, with GloVe weakest at 49.5% Rank 1. Floret is used in LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive, and because its hash-based lookups support arbitrary vocabulary.
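
To illustrate why a hash-based table can return a vector for any string, the following sketch hashes a word and its character n-grams into a fixed number of buckets and averages the selected rows. It is a simplified stand-in, not floret's actual algorithm: the hash function (`zlib.crc32`), the n-gram range, the boundary markers, and the tiny random table are all assumptions chosen for illustration.

```python
import random
import zlib

def ngrams(word, min_n=3, max_n=5):
    """The word wrapped in boundary markers, plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(min_n, max_n + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

def bucket_ids(word, n_buckets):
    """Map each n-gram to a row index in a fixed-size table."""
    return [zlib.crc32(g.encode("utf-8")) % n_buckets for g in ngrams(word)]

def lookup(word, table):
    """Average the table rows selected by the word's n-grams."""
    ids = bucket_ids(word, len(table))
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]

# Tiny random table standing in for a trained one (hypothetical sizes).
rng = random.Random(0)
N_BUCKETS, DIM = 1000, 4
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_BUCKETS)]

# Any string gets a vector, including forms never seen in training.
for word in ["regit", "regebat", "xyzzyum"]:
    print(word, [round(x, 3) for x in lookup(word, table)])
```

Because the table size is fixed in advance, storage does not grow with the vocabulary, and related forms such as "regit" and "regebat" share some n-grams and therefore some rows.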

For full evaluation details including per-category breakdowns and nearest-neighbor spot checks, see the evaluation report.

Usage

From HuggingFace Hub

from huggingface_hub import hf_hub_download

# Each call downloads one file (or reuses the local cache) and returns its local path.

# FastText binary model
fasttext_path = hf_hub_download("latincy/la_vectors", "fasttext/la_fasttext_cbow_300_10.bin")

# Word2Vec text vectors
word2vec_path = hf_hub_download("latincy/la_vectors", "word2vec/la_w2v_cbow_300_10.txt")

# GloVe vectors
glove_path = hf_hub_download("latincy/la_vectors", "glove/la_glove_300.txt")
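
Once downloaded, the text files can be read with any tool that understands plain-text embeddings. The sketch below assumes the common layout of one word per line followed by its space-separated values, with an optional "count dimension" header line as in the word2vec text format. The actual layout of the LatinCy files is not confirmed here, so inspect the first few lines before relying on it. `read_text_vectors` is a hypothetical helper, not part of any library.

```python
import io

def read_text_vectors(handle, limit=None):
    """Read {word: [floats]} from a text-format embedding file.

    Skips a leading "<count> <dim>" header line if one is present.
    """
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # word2vec-style header
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors

# In-memory stand-in for a downloaded file; the values are invented.
sample = io.StringIO("2 3\nrex 0.1 0.2 0.3\npopulus 0.4 0.5 0.6\n")
vectors = read_text_vectors(sample)
print(sorted(vectors), len(vectors["rex"]))

# With a real file, pass a path returned by hf_hub_download
# (local_path is a placeholder name):
# with open(local_path, encoding="utf-8") as handle:
#     vectors = read_text_vectors(handle, limit=10_000)
```

The `limit` argument is useful for a quick look at the most frequent entries without loading the full vocabulary into memory.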

Floret (spaCy)

import spacy

nlp = spacy.load("la_vectors_floret_lg")
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.has_vector, token.vector[:5])

Training Corpus

All vectors are trained on the same corpus for valid cross-method comparison.

Source	Sentences	Tokens
CC100-Latin	6,507,840	128,886,505
Latin Wikisource	3,933,289	76,736,695
Latin Wikipedia	972,336	15,218,700
CAMENA Neo-Latin	736,400	9,970,933
The Latin Library	650,082	12,822,687
CLTK-Tesserae	516,930	6,626,484
Perseus Digital Library	223,535	4,317,063
Patrologia Latina	125,333	10,399,108
UD Latin treebanks (6)	55,332	980,787
Total	13,721,077	265,958,962
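
The table's figures can be checked and summarized with a few lines of arithmetic. The sketch below recomputes the totals and derives each source's share of tokens and mean sentence length. The counts are copied from the table; the derived columns are not part of the original.

```python
# (sentences, tokens) per source, copied from the table above.
corpus = {
    "CC100-Latin": (6_507_840, 128_886_505),
    "Latin Wikisource": (3_933_289, 76_736_695),
    "Latin Wikipedia": (972_336, 15_218_700),
    "CAMENA Neo-Latin": (736_400, 9_970_933),
    "The Latin Library": (650_082, 12_822_687),
    "CLTK-Tesserae": (516_930, 6_626_484),
    "Perseus Digital Library": (223_535, 4_317_063),
    "Patrologia Latina": (125_333, 10_399_108),
    "UD Latin treebanks (6)": (55_332, 980_787),
}

total_sentences = sum(s for s, _ in corpus.values())
total_tokens = sum(t for _, t in corpus.values())
print(f"Total: {total_sentences:,} sentences, {total_tokens:,} tokens")

for name, (sentences, tokens) in corpus.items():
    share = 100 * tokens / total_tokens   # percent of all tokens
    mean_len = tokens / sentences         # tokens per sentence
    print(f"{name:<26}{share:5.1f}% of tokens, {mean_len:5.1f} tokens/sentence")
```

The recomputed totals match the Total row, and the two largest sources, CC100-Latin and Latin Wikisource, together account for about 77% of all tokens.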

Citation

@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}

References

Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
