---
license: mit
language:
- la
pipeline_tag: fill-mask
tags:
- latin
- masked language modelling
widget:
- text: Gallia est omnis divisa in [MASK] tres .
  example_title: Commentary on Gallic Wars
- text: '[MASK] sum Caesar .'
  example_title: Who is Caesar?
- text: '[MASK] it ad forum .'
  example_title: Who is going to the forum?
- text: Ovidius paratus est ad [MASK] .
  example_title: What is Ovidius up to?
- text: '[MASK], veni!'
  example_title: Calling someone to come closer
- text: Roma in Italia [MASK] .
  example_title: Ubi est Roma?
---
# Model Card for Simple Latin BERT
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on Latin corpora from the Classical Language Toolkit.

NOT suitable for production or commercial use. This model's performance is poor, and it has not been evaluated.

This model comes with its own tokenizer, which automatically lowercases input.

Check the training notebooks folder for the preprocessing and training scripts.
Inspired by:
- This repo, which has a BERT model for Latin that is actually useful!
- This tutorial
- This tutorial
- This tutorial
## Table of Contents
- Model Card for Simple Latin BERT
- Table of Contents
- Model Details
- Uses
- Training Details
- Evaluation
## Model Details

### Model Description
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on Latin corpora from the Classical Language Toolkit.

NOT suitable for production or commercial use. This model's performance is poor, and it has not been evaluated.

This model comes with its own tokenizer, which automatically lowercases input.

Check the notebooks folder for the preprocessing and training scripts.
- Developed by: Luis Antonio VASQUEZ
- Model type: Language model
- Language(s) (NLP): la
- License: mit
## Uses

### Direct Use
This model can be used directly for Masked Language Modelling.
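For illustration, a minimal sketch using the `fill-mask` pipeline; the repository id `your-username/simple-latin-bert` is a placeholder for this model's actual id:

```python
from transformers import pipeline

# Placeholder repository id; replace it with this model's actual Hugging Face id.
fill_mask = pipeline("fill-mask", model="your-username/simple-latin-bert")

# The bundled tokenizer lowercases the input automatically.
for prediction in fill_mask("Gallia est omnis divisa in [MASK] tres ."):
    print(prediction["token_str"], prediction["score"])
```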
### Downstream Use
This model could be used as a base model for other NLP tasks, for example text classification (e.g. with transformers' `BertForSequenceClassification`).
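A minimal sketch of such a downstream setup, assuming a placeholder repository id and an arbitrary two-label classification task (the classification head is newly initialised and would still need fine-tuning):

```python
from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholder repository id; replace it with this model's actual Hugging Face id.
model_name = "your-username/simple-latin-bert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse the pretrained encoder; the classification head is freshly initialised.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Gallia est omnis divisa in partes tres .", return_tensors="pt")
logits = model(**inputs).logits
print(logits)
```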
## Training Details

### Training Data

The training data comes from the corpora freely available through the Classical Language Toolkit:
- The Latin Library
- Latin section of the Perseus Digital Library
- Latin section of the Tesserae Project
- Corpus Grammaticorum Latinorum
### Training Procedure

#### Preprocessing
For preprocessing, the raw text from each corpus was extracted by parsing, then lowercased and written to txt files. Ideally, each line in these files corresponds to one sentence.

Other data from the corpora, such as entity tags, POS tags, etc., were discarded.
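A minimal sketch of this kind of preprocessing, assuming the corpora have already been downloaded as plain-text files under a hypothetical `raw_corpora/` directory; the naive sentence splitting is illustrative, not the exact logic of the notebooks:

```python
import re
from pathlib import Path

RAW_DIR = Path("raw_corpora")          # assumed location of the downloaded corpora
OUT_FILE = Path("latin_sentences.txt")

with OUT_FILE.open("w", encoding="utf-8") as out:
    for path in RAW_DIR.glob("**/*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        # Naive split on ., ! and ? so that one line ~ one sentence.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            sentence = " ".join(sentence.split())
            if sentence:
                out.write(sentence + "\n")
```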
Training hyperparameters:
- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden Layers: 12
- Max input size: 512 tokens
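For reference, a sketch of a `BertConfig` matching the listed hyperparameters; `vocab_size` and every value not listed above are assumptions, since they depend on the custom tokenizer and the training notebooks:

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    num_attention_heads=12,        # attention heads listed above
    num_hidden_layers=12,          # hidden layers listed above
    max_position_embeddings=512,   # max input size listed above
    vocab_size=30_000,             # assumption: determined by the custom tokenizer
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```

The remaining two figures (1 epoch, batch size 64) would correspond to `num_train_epochs` and `per_device_train_batch_size` when training with `TrainingArguments`.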
### Speeds, Sizes, Times
Once the dataset was ready, training this model took around 10 hours on a 16 GB NVIDIA GPU.
## Evaluation
No evaluation has been performed on this model.