
Initial model for English-to-Latin translation; training is still in progress.

This model performs English-to-Latin translation and is built on the CCMatrix dataset, a large compilation of high-quality parallel sentences mined from the public CommonCrawl corpus, comprising over 4.5 billion sentence pairs across 576 language pairs. The model aims to harness this substantial corpus to provide an effective and precise solution for Latin translation tasks.
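
The English-Latin portion of CCMatrix can be loaded with the Hugging Face datasets library. The sketch below is illustrative: the dataset id "yhavinga/ccmatrix" and the "en-la" configuration name are assumptions about a community mirror, not necessarily the exact source used here.

    # Sketch: load English-Latin CCMatrix pairs with the `datasets` library.
    # The dataset id and configuration name are assumptions (a community
    # mirror of CCMatrix), not a confirmed source for this model.
    from datasets import load_dataset

    ccmatrix = load_dataset("yhavinga/ccmatrix", "en-la", split="train")
    print(ccmatrix[0])
    # One record looks like the Data Format example further down:
    # {'id': ..., 'score': ..., 'translation': {'en': '...', 'la': '...'}}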

However, the dataset's literary range spans many centuries, exposing the model to the significant evolution of Latin over that time. As a result, the model encounters different expressions of the same concept, including equivalent sentences in both Vulgar and Classical Latin. This is the likely cause of the model's oscillating loss.

Current state:

  • {'loss': 0.8056, 'learning_rate': 6.482837857245441e-06, 'epoch': 20.28}
  • {'loss': 1.253, 'learning_rate': 6.48092297381397e-06, 'epoch': 20.28}
  • {'loss': 1.2961, 'learning_rate': 6.4790080903824985e-06, 'epoch': 20.28}
  • {'loss': 1.3402, 'learning_rate': 6.477093206951027e-06, 'epoch': 20.28}
  • {'loss': 0.9309, 'learning_rate': 6.475178323519556e-06, 'epoch': 20.29}
  • {'loss': 0.7945, 'learning_rate': 6.473263440088085e-06, 'epoch': 20.29}
  • {'loss': 0.9205, 'learning_rate': 6.471348556656614e-06, 'epoch': 20.29}
  • {'loss': 1.4583, 'learning_rate': 6.228158360859783e-06, 'epoch': 20.66}

....training is still running.....

The model was fine-tuned from facebook/bart-base using the IPUSeq2SeqTrainer API, with the BartTokenizerFast tokenizer.
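
As a rough illustration, the setup below uses the standard transformers Seq2SeqTrainer, which shares its interface with the IPUSeq2SeqTrainer from Graphcore's optimum library but runs on ordinary hardware. The hyperparameters and output path are illustrative assumptions, not the values used for this model.

    # Sketch of the fine-tuning setup with the stock transformers
    # Seq2SeqTrainer (a stand-in for IPUSeq2SeqTrainer, which targets
    # Graphcore IPUs). All hyperparameters below are illustrative.
    from transformers import (
        BartForConditionalGeneration,
        BartTokenizerFast,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
    tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")

    args = Seq2SeqTrainingArguments(
        output_dir="bart-base-en-la",   # illustrative path
        learning_rate=2e-5,             # illustrative; the log above shows
        per_device_train_batch_size=16, # the schedule actually in use
        num_train_epochs=25,
        evaluation_strategy="epoch",
        predict_with_generate=True,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],      # produced by the preprocessing
        eval_dataset=tokenized["validation"],  # sketch under Data Format below
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        tokenizer=tokenizer,
    )
    trainer.train()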

Dataset Description

Data Format

    [
        {
            "id": 1,
            "score": 1.2498379,
            "translation": {
                "en": "No telling what sort of magic he might have.\"",
                "la": "numque magistrâtum cum iis habent.\""
            }
        },
        {
            "id": 2,
            "score": 1.1443379,
            "translation": {
                "en": "Not many, but much.\"",
                "la": "non multa sed multum.\""
            }
        }
    ]
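
A minimal preprocessing sketch for records in this shape, assuming the BartTokenizerFast from the base model; the max_length of 128 is an illustrative assumption.

    # Sketch: turn {"translation": {"en": ..., "la": ...}} records into
    # model inputs. English is the source side, Latin the target side.
    from transformers import BartTokenizerFast

    tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")

    def preprocess(batch):
        sources = [pair["en"] for pair in batch["translation"]]
        targets = [pair["la"] for pair in batch["translation"]]
        model_inputs = tokenizer(sources, max_length=128, truncation=True)
        labels = tokenizer(text_target=targets, max_length=128, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    # Applied over the DatasetDict built in the split sketch further down:
    # tokenized = splits.map(preprocess, batched=True,
    #                        remove_columns=splits["train"].column_names)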

For training, the dataset was split into the following DatasetDict:

  • train: 891,352 rows
  • validation: 111,419 rows
  • test: 111,419 rows
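
These row counts correspond to an 80/10/10 split. A sketch of reproducing it with the datasets library, starting from the loaded dataset above; the seed is an illustrative assumption.

    # Sketch: carve an 80/10/10 train/validation/test split out of the
    # full dataset. The seed is illustrative, not the one actually used.
    from datasets import DatasetDict

    first = ccmatrix.train_test_split(test_size=0.2, seed=42)
    holdout = first["test"].train_test_split(test_size=0.5, seed=42)

    splits = DatasetDict({
        "train": first["train"],         # ~891k rows
        "validation": holdout["train"],  # ~111k rows
        "test": holdout["test"],         # ~111k rows
    })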