Model description

A Latin language model, trained on classical Latin texts whose vocabulary is reasonably close to Cicero's, as described in the forthcoming paper "What Would Cicero Write?".

Intended uses & limitations

How to use

Normalize text using JV replacement and tokenize with CLTK to separate enclitics such as "-que", then:

```python
from transformers import BertForMaskedLM, AutoTokenizer, FillMaskPipeline

tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")
model = BertForMaskedLM.from_pretrained("cook/cicero-similis")
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)

# Cicero, De Re Publica, VI, 32, 2
# "animal" is found in A, Q, PhD manuscripts
# 'anima' H^1 Macr. et codd. Tusc.
results = fill_mask("inanimum est enim omne quod pulsu agitatur externo; quod autem est [MASK],")
```
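The preprocessing described above can be approximated in plain Python. This is a hedged sketch, not CLTK's implementation: the `split_enclitics` helper is a naive illustration, and CLTK's Latin word tokenizer is the intended tool (it handles exceptions such as fixed forms like "quoque" that this sketch does not).

```python
# Minimal sketch of the preprocessing step described above.
# JV replacement maps j -> i and v -> u (and J -> I, V -> U),
# collapsing orthographic variants into the alphabet the model expects.
JV_TABLE = str.maketrans("jvJV", "iuIU")

def jv_replace(text: str) -> str:
    """Normalize j/v spellings to i/u."""
    return text.translate(JV_TABLE)

# Naive enclitic splitter, for illustration only.
ENCLITICS = ("que", "ne", "ve")

def split_enclitics(token: str):
    """Split a trailing enclitic off a token, e.g. 'populusque' -> ['populus', '-que']."""
    for enc in ENCLITICS:
        if token.lower().endswith(enc) and len(token) > len(enc) + 2:
            return [token[: -len(enc)], "-" + enc]
    return [token]

print(jv_replace("Julius vivat"))     # -> "Iulius uiuat"
print(split_enclitics("populusque"))  # -> ["populus", "-que"]
```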

Limitations and bias

The model is biased towards Cicero, but that weakness is also its strength: it is not intended to be a one-size-fits-all model.

Training data

Trained on the Phi5, Tesserae, and Thomas Aquinas corpora, excluding documents that fell outside the scope of Cicero's expected unknown-vocabulary probabilities.

Training procedure

5 epochs of masked language modeling (masking probability 0.45), effective batch size 32.
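A minimal sketch of such a setup with Hugging Face Transformers follows. Only the 0.45 masking probability, the 5 epochs, and the effective batch size of 32 come from the text above; the per-device batch size and gradient-accumulation split are assumptions for illustration.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")

# Mask 45% of tokens, as stated above (the usual BERT default is 15%).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.45,
)

# Effective batch size 32 = 8 per device x 4 accumulation steps
# (the 8 x 4 split is an assumption, not from the model card).
training_args = TrainingArguments(
    output_dir="cicero-similis-mlm",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)
```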

Eval results

A novel evaluation metric is proposed in the forthcoming paper "What Would Cicero Write?"

BibTeX entry and citation info

A paper will be published in Cicero Digitalis in 2021.


