Logion: Machine Learning for Greek Philology

(for the most recent model, see: https://huggingface.co/cabrooks/LOGION-50k_wordpiece)

Read the paper on arXiv (2305.01099) by Charlie Cowen-Breen, Creston Brooks, Johannes Haubold, and Barbara Graziosi.

Starting from the pre-trained weights and tokenizer made available by Pranaydeep Singh's Ancient Greek BERT, we train further on a corpus of over 70 million words of premodern Greek.

Further information on this project and code for beam-searching over multiple masked tokens can be found on GitHub.

We're adding more models trained with cleaner data and different tokenizations - keep an eye out!

How to use

Requirements:

pip install transformers torch

Load the model and tokenizer directly from the Hugging Face Model Hub:

from transformers import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained("cabrooks/LOGION-base")
model = BertForMaskedLM.from_pretrained("cabrooks/LOGION-base")
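
Once loaded, the model can be queried for a masked token. The following is a minimal sketch rather than the project's own inference code: the helper names `predict_masked` and `demo` and the example sentence are illustrative assumptions, and the helper handles exactly one masked token (for beam search over several masked tokens, see the code on GitHub). How the input is normalized and split into subwords depends on the tokenizer's configuration.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM


def predict_masked(text, tokenizer, model, top_k=5):
    """Return the top_k (token, probability) candidates for the single mask token in text.

    Hypothetical helper for illustration; not part of the Logion codebase.
    """
    inputs = tokenizer(text, return_tensors="pt")
    # Find the position of the mask token in the encoded input.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if len(mask_positions) != 1:
        raise ValueError("expected exactly one mask token in the input")
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    # Turn the logits at the masked position into a distribution over the vocabulary.
    probs = torch.softmax(logits[0, mask_positions[0]], dim=-1)
    top = torch.topk(probs, top_k)
    tokens = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    return list(zip(tokens, top.values.tolist()))


def demo():
    """Download the checkpoint and print candidates for an example sentence."""
    tokenizer = BertTokenizer.from_pretrained("cabrooks/LOGION-base")
    model = BertForMaskedLM.from_pretrained("cabrooks/LOGION-base")
    # Illustrative input (an assumption): the opening of the Iliad with one word masked.
    text = f"μῆνιν ἄειδε {tokenizer.mask_token} Πηληϊάδεω Ἀχιλῆος"
    for token, prob in predict_masked(text, tokenizer, model):
        print(f"{token}\t{prob:.3f}")
```

Calling `demo()` downloads the weights on first use and prints one candidate per line with its probability. Candidates may be subword pieces rather than whole words.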

Model pre-training and tokenizer

The model was initialized from Pranaydeep Singh's Ancient Greek BERT, which was itself initialized from a Modern Greek BERT. Singh's Ancient Greek BERT was trained on data pulled from the First1KGreek Project, the Perseus Digital Library, the PROIEL Treebank, and Gorman's Treebank. We train further on over 70 million words of premodern Greek, which we are happy to make available upon request. For more information, please see footnote 2 of the arXiv paper. Please also refer to the paper for details on training and evaluation.

Cite

If you use this model in your research, please cite the paper:

@misc{logion-base,
      title={Logion: Machine Learning for Greek Philology}, 
      author={Cowen-Breen, C. and Brooks, C. and Haubold, J. and Graziosi, B.},
      year={2023},
      eprint={2305.01099},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}