LatinCy Vectors

Static word vectors for Latin, trained on the LatinCy corpus. The collection provides Floret, FastText, Word2Vec, and GloVe embeddings, all trained on the same data and evaluated on the same benchmarks so that the methods can be compared directly.

Available Models

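The Floret, FastText, and Word2Vec models are trained with the CBOW architecture, 300 dimensions, window size 10, min count 50, 15 epochs, and 25 negative samples on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources). GloVe is also 300-dimensional and trained on the same corpus, but it is fit to co-occurrence counts, so the CBOW and negative-sampling settings do not apply to it.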
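As a rough illustration, the settings listed above could be written as keyword arguments for gensim's `Word2Vec` class. This is only a sketch: the source does not say which toolkit was used for training, so gensim and the corpus path in the comment are assumptions.

```python
# Hypothetical mapping of the stated hyperparameters onto gensim's Word2Vec
# keyword arguments. gensim is an assumption; the source names no toolkit.
CBOW_PARAMS = {
    "sg": 0,             # 0 selects CBOW (1 would select skip-gram)
    "vector_size": 300,  # embedding dimensions
    "window": 10,        # context window size
    "min_count": 50,     # ignore words seen fewer than 50 times
    "epochs": 15,        # passes over the corpus
    "negative": 25,      # negative samples per positive example
}

# With gensim installed, training would look roughly like this
# (the corpus file name is a placeholder, one sentence per line):
#   from gensim.models import Word2Vec
#   model = Word2Vec(corpus_file="corpus.txt", **CBOW_PARAMS)
```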

| Model | Type | Vocab | HF Repo |
|---|---|---|---|
| Floret (lg) | Hash-based subword | 200k buckets | latincy/la_vectors_floret_lg |
| Floret (md) | Hash-based subword | 50k buckets | latincy/la_vectors_floret_md |
| FastText CBOW-300-10 | Subword (n-gram) | 233k words | latincy/la_vectors |
| Word2Vec CBOW-300-10 | Word-level | 233k words | latincy/la_vectors |
| GloVe 300 | Word-level (co-occurrence) | 233k words | latincy/la_vectors |

Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo (latincy/la_vectors).

Evaluation

Evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items unsolvable by all models are excluded per the evaluation methodology (1,330 analogies and 2,223 odd-one-out items are solvable).
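The two task types can be sketched in a few lines of plain Python. The source does not spell out the scoring rules, so this example assumes two common choices: the vector-offset method for analogies (rank candidates by cosine similarity to `b - a + c`) and lowest mean similarity for odd-one-out. The toy vectors are invented for illustration and are not taken from the released models.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy_rank(vectors, a, b, c, expected):
    """Rank of `expected` for a : b :: c : ?, using the offset b - a + c (1 = best)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1

def accuracy_at_k(ranks, k):
    """Share of items whose expected answer is ranked k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def odd_one_out(vectors, words):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_sim)

# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "rex":     [1.0, 0.0, 0.0],
    "regina":  [1.0, 1.0, 0.0],
    "dominus": [0.0, 0.0, 1.0],
    "domina":  [0.0, 1.0, 1.0],
    "mare":    [-1.0, 0.2, -1.0],
}

rank = analogy_rank(toy, "rex", "regina", "dominus", "domina")
outlier = odd_one_out(toy, ["rex", "regina", "dominus", "mare"])
```

Under these assumptions, a Rank 1 score is `accuracy_at_k(ranks, 1)` over all solvable analogy items and a Rank 5 score is `accuracy_at_k(ranks, 5)`.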

| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|---|---|---|---|
| FastText CBOW-300-10 | 84.5% | 96.6% | 73.6% |
| Floret v3.9 (lg) | 81.4% | 95.3% | 74.0% |
| Word2Vec CBOW-300-10 | 70.2% | 91.3% | 79.1% |
| GloVe 300 | 49.5% | 79.2% | 75.1% |

FastText leads on analogy resolution (84.5% at rank 1) thanks to subword information that captures Latin morphology. Word2Vec leads on odd-one-out (79.1%), which tests semantic clustering. GloVe is the weakest on analogies (49.5% at rank 1); like Word2Vec, it has no subword representations to draw on. Floret is used in LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive (81.4% vs. 84.5% at rank 1), and because it supports arbitrary vocabulary via hash-based lookups.
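The hash-based lookup can be illustrated with a toy table. This is a simplified sketch rather than Floret's actual implementation: MD5 stands in for the real hash function, the n-gram range of 3 to 5 is an assumed setting, and the table holds random numbers instead of trained vectors.

```python
import hashlib
import random

N_BUCKETS = 1000   # toy size; the lg model above uses 200k buckets
DIM = 8            # toy size; the released vectors have 300 dimensions

# Random table standing in for trained bucket vectors.
rng = random.Random(0)
TABLE = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_BUCKETS)]

def char_ngrams(word, min_n=3, max_n=5):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def bucket(ngram):
    """Map an n-gram to a row index (MD5 is a stand-in hash)."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_BUCKETS

def word_vector(word):
    """Average the bucket rows of the whole word and its n-grams."""
    rows = [TABLE[bucket(g)] for g in [f"<{word}>"] + char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]

# Any string gets a vector, whether or not it was seen in training.
vec = word_vector("regibusque")
```

Because every n-gram lands in some bucket, there is no out-of-vocabulary case, and the table size is fixed by the bucket count rather than by the number of distinct words. Inflected forms such as `regis` and `regibus` share n-grams like `<re` and `reg`, so they share rows in the table.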

For full evaluation details including per-category breakdowns and nearest-neighbor spot checks, see the evaluation report.

Usage

From HuggingFace Hub

from huggingface_hub import hf_hub_download

# Each call downloads one file (cached locally) and returns its local path.

# FastText binary model
fasttext_path = hf_hub_download("latincy/la_vectors", "fasttext/la_fasttext_cbow_300_10.bin")

# Word2Vec text vectors
w2v_path = hf_hub_download("latincy/la_vectors", "word2vec/la_w2v_cbow_300_10.txt")

# GloVe vectors
glove_path = hf_hub_download("latincy/la_vectors", "glove/la_glove_300.txt")
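Once a text file is downloaded, it can be read without extra dependencies. The sketch below assumes the `.txt` files use the usual plain-text layout of one `word v1 v2 ...` entry per line, optionally preceded by a `<count> <dim>` header as in the word2vec text format. The actual layout of the released files is not documented here, so treat that as an assumption. `read_text_vectors` is an illustrative helper, not part of any library.

```python
import io

def read_text_vectors(lines):
    """Parse whitespace-separated `word v1 v2 ...` lines into a dict.

    A leading `<count> <dim>` header line (word2vec text format) is skipped
    if present; plain GloVe text output normally has no such header.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.split()
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line, not a vector
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# In-memory sample standing in for a downloaded file.
sample = "2 3\nrex 0.1 0.2 0.3\nregina 0.4 0.5 0.6\n"
vectors = read_text_vectors(io.StringIO(sample))
```

For a real file, pass an open handle, for example `open(local_path, encoding="utf-8")`, where `local_path` is a path returned by `hf_hub_download`. Holding 233k vectors of 300 floats in Python lists uses a lot of memory; if gensim is installed, `KeyedVectors.load_word2vec_format` is a common alternative.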

Floret (spaCy)

import spacy

nlp = spacy.load("la_vectors_floret_lg")
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.has_vector, token.vector[:5])

Training Corpus

All vectors are trained on the same corpus for valid cross-method comparison.

| Source | Sentences | Tokens |
|---|---|---|
| CC100-Latin | 6,507,840 | 128,886,505 |
| Latin Wikisource | 3,933,289 | 76,736,695 |
| Latin Wikipedia | 972,336 | 15,218,700 |
| CAMENA Neo-Latin | 736,400 | 9,970,933 |
| The Latin Library | 650,082 | 12,822,687 |
| CLTK-Tesserae | 516,930 | 6,626,484 |
| Perseus Digital Library | 223,535 | 4,317,063 |
| Patrologia Latina | 125,333 | 10,399,108 |
| UD Latin treebanks (6) | 55,332 | 980,787 |
| Total | 13,721,077 | 265,958,962 |
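The totals can be checked, and each source's share of the corpus derived, with a short script. The figures are copied from the table above; the derived columns (token share and mean sentence length) are simple arithmetic and are not reported in the source.

```python
# (sentences, tokens) per source, copied from the corpus table.
CORPUS = {
    "CC100-Latin": (6_507_840, 128_886_505),
    "Latin Wikisource": (3_933_289, 76_736_695),
    "Latin Wikipedia": (972_336, 15_218_700),
    "CAMENA Neo-Latin": (736_400, 9_970_933),
    "The Latin Library": (650_082, 12_822_687),
    "CLTK-Tesserae": (516_930, 6_626_484),
    "Perseus Digital Library": (223_535, 4_317_063),
    "Patrologia Latina": (125_333, 10_399_108),
    "UD Latin treebanks (6)": (55_332, 980_787),
}

total_sentences = sum(sents for sents, _ in CORPUS.values())
total_tokens = sum(toks for _, toks in CORPUS.values())

# Token share and mean sentence length for each source.
for name, (sents, toks) in CORPUS.items():
    share = 100 * toks / total_tokens
    print(f"{name:26s} {share:5.1f}% of tokens, {toks / sents:6.1f} tokens/sentence")
```

The sums match the Total row. CC100-Latin contributes a little under half of all tokens, so web text weighs heavily in every model trained on this corpus.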

Citation

@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}

References

  • Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
