A Latin language model trained on classical Latin texts that are reasonably close to Cicero's range of vocabulary, as described in the forthcoming paper "What Would Cicero Write?".
Normalize the text using J/V replacement and tokenize it with CLTK to separate enclitics such as "-que" (a sketch of this preprocessing step follows the example below), then:
```python
from transformers import BertForMaskedLM, AutoTokenizer, FillMaskPipeline

tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")
model = BertForMaskedLM.from_pretrained("cook/cicero-similis")
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)

# Cicero, De Re Publica, VI, 32, 2
# "animal" is found in A, Q, PhD manuscripts
# 'anima' H^1 Macr. et codd. Tusc.
results = fill_mask("inanimum est enim omne quod pulsu agitatur externo; quod autem est [MASK],")
```
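The preprocessing step mentioned above might look like the following, a minimal sketch assuming the legacy CLTK 0.1.x API (later CLTK releases relocated these utilities); the input sentence is purely illustrative:

```python
# A sketch of the J/V replacement and enclitic tokenization described above,
# assuming the legacy CLTK 0.1.x API.
from cltk.stem.latin.j_v import JVReplacer
from cltk.tokenize.word import WordTokenizer

replacer = JVReplacer()                   # maps j -> i and v -> u
word_tokenizer = WordTokenizer('latin')   # splits enclitics such as "-que"

text = "Arma virumque cano"               # illustrative input
normalized = replacer.replace(text.lower())
tokens = word_tokenizer.tokenize(normalized)
# ['arma', 'uirum', '-que', 'cano']
masked_input = " ".join(tokens)           # rejoin before passing to fill_mask (assumed)
```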
The model is biased towards Cicero, but that weakness is also its strength; it is not meant to be a one-size-fits-all model.
Trained on the Phi5, Tesserae, and Thomas Aquinas corpora, excluding documents that fell outside the scope of Cicero's expected vocabulary, as measured by unknown-word probabilities.
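The exact filtering criterion is defined in the forthcoming paper; purely as a hypothetical illustration of the idea, a document might be scored by the rate of its words missing from a Ciceronian reference vocabulary:

```python
# Hypothetical illustration only: keep a document if the fraction of its
# tokens absent from a Ciceronian reference vocabulary stays below a
# threshold. Both the scoring and the threshold here are assumptions,
# not the criterion published in the paper.
def unknown_word_rate(tokens, cicero_vocab):
    unknown = sum(1 for t in tokens if t.lower() not in cicero_vocab)
    return unknown / max(len(tokens), 1)

def in_scope(tokens, cicero_vocab, threshold=0.25):  # threshold is hypothetical
    return unknown_word_rate(tokens, cicero_vocab) <= threshold
```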
Trained for 5 epochs with a masked-language-modeling probability of 0.45 and an effective batch size of 32.
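A minimal sketch of that setup with the standard Hugging Face Trainer API follows; the batch-size split and the training dataset are assumptions, not the authors' published script:

```python
# A sketch of the reported hyperparameters (5 epochs, MLM probability 0.45,
# effective batch size 32) using the standard Hugging Face Trainer API.
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# The published checkpoint is reused here purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis")
model = BertForMaskedLM.from_pretrained("cook/cicero-similis")

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.45,
)

training_args = TrainingArguments(
    output_dir="cicero-similis",
    num_train_epochs=5,
    per_device_train_batch_size=8,   # 8 x 4 accumulation steps = effective
    gradient_accumulation_steps=4,   # batch size 32 (hypothetical split)
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,     # a tokenized Latin corpus (assumed)
)
trainer.train()
```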
A novel evaluation metric is proposed in the forthcoming paper "What Would Cicero Write?", to be published in Cicero Digitalis in 2021.