--- language: - la tags: - language model license: apache-2.0 datasets: - Tesserae - Phi5 - Thomas Aquinas - Patrologia Latina --- # Cicero-Similis ## Model description A Latin Language Model, trained on Latin texts, and evaluated using the corpus of Cicero, as described in the paper _What Would Cicero Write? -- Examining Critical Textual Decisions with a Language Model_ by Todd Cook, Published in Ciceroniana On Line, Vol. V, #2. ## Intended uses & limitations #### How to use Normalize text using JV Replacement and tokenize using CLTK to separate enclitics such as "-que", then: ``` from transformers import BertForMaskedLM, AutoTokenizer, FillMaskPipeline tokenizer = AutoTokenizer.from_pretrained("cook/cicero-similis") model = BertForMaskedLM.from_pretrained("cook/cicero-similis") fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer, top_k=10_000) # Cicero, De Re Publica, VI, 32, 2 # "animal" is found in A, Q, PhD manuscripts # 'anima' H^1 Macr. et codd. Tusc. results = fill_mask("inanimum est enim omne quod pulsu agitatur externo; quod autem est [MASK],") ``` #### Limitations and bias Currently the model training data excludes modern and 19th century texts, but that weakness is the model's strength; it's not aimed to be a one-size-fits-all model. ## Training data Trained on the corpora Phi5, Tesserae, Thomas Aquinas, and Patrologes Latina. ## Training procedure 5 epochs, masked language modeling .15, effective batch size 32 ## Eval results A novel evaluation metric is proposed in the paper _What Would Cicero Write? -- Examining Critical Textual Decisions with a Language Model_ by Todd Cook, Published in Ciceroniana On Line, Vol. V, #2. ### BibTeX entry and citation info TODO _What Would Cicero Write? -- Examining Critical Textual Decisions with a Language Model_ by Todd Cook, Published in Ciceroniana On Line, Vol. V, #2.