---
language:
- "la"
- "en"
---

### Initial model for English-to-Latin translation (still in training)

This model performs English-to-Latin translation and is built on the CCMatrix dataset, a large collection of high-quality parallel sentences mined from the public CommonCrawl corpus, comprising over 4.5 billion sentence pairs across 576 language pairs. The model aims to harness this substantial corpus to provide an effective and accurate solution for Latin translation tasks.

Nevertheless, the training data spans many centuries of Latin literature, exposing the model to the language's substantial evolution over that time. As a result, the model encounters the same concept expressed in different ways, sometimes including equivalent sentences in both Vulgar and Classical Latin. This variety is the likely cause of the model's oscillating loss.

## Current state

- {'loss': 0.8056, 'learning_rate': 6.482837857245441e-06, 'epoch': 20.28}
- {'loss': 1.253, 'learning_rate': 6.48092297381397e-06, 'epoch': 20.28}
- {'loss': 1.2961, 'learning_rate': 6.4790080903824985e-06, 'epoch': 20.28}
- {'loss': 1.3402, 'learning_rate': 6.477093206951027e-06, 'epoch': 20.28}
- {'loss': 0.9309, 'learning_rate': 6.475178323519556e-06, 'epoch': 20.29}
- {'loss': 0.7945, 'learning_rate': 6.473263440088085e-06, 'epoch': 20.29}
- {'loss': 0.9205, 'learning_rate': 6.471348556656614e-06, 'epoch': 20.29}
- {'loss': 1.4583, 'learning_rate': 6.228158360859783e-06, 'epoch': 20.66}
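The logged entries above are Python dict literals; a small stdlib sketch (assuming the log lines are collected as strings, exactly as printed above) for summarizing the oscillation:

```python
import ast

# Raw trainer log lines, copied from the list above.
log_lines = [
    "{'loss': 0.8056, 'learning_rate': 6.482837857245441e-06, 'epoch': 20.28}",
    "{'loss': 1.253, 'learning_rate': 6.48092297381397e-06, 'epoch': 20.28}",
    "{'loss': 1.2961, 'learning_rate': 6.4790080903824985e-06, 'epoch': 20.28}",
    "{'loss': 1.3402, 'learning_rate': 6.477093206951027e-06, 'epoch': 20.28}",
    "{'loss': 0.9309, 'learning_rate': 6.475178323519556e-06, 'epoch': 20.29}",
    "{'loss': 0.7945, 'learning_rate': 6.473263440088085e-06, 'epoch': 20.29}",
    "{'loss': 0.9205, 'learning_rate': 6.471348556656614e-06, 'epoch': 20.29}",
    "{'loss': 1.4583, 'learning_rate': 6.228158360859783e-06, 'epoch': 20.66}",
]

# Parse each dict literal safely and pull out the loss values.
losses = [ast.literal_eval(line)["loss"] for line in log_lines]

lo, hi = min(losses), max(losses)
print(f"loss range: {lo} .. {hi}")
```

The spread between the lowest (0.7945) and highest (1.4583) logged losses within a fraction of an epoch illustrates the oscillation described above.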

Training is still running.

The model was fine-tuned with the IPUSeq2SeqTrainer API on the facebook/bart-base model, using the BartTokenizerFast tokenizer.

## Dataset Description
- Homepage: https://opus.nlpl.eu/CCMatrix.php
- Sample: https://opus.nlpl.eu/CCMatrix/v1/en-la_sample.html
- Paper: https://arxiv.org/abs/1911.04944

The Latin dataset contains:
- 1,114,190 sentence pairs
- 14.5 M words

### Data Format
```
{
  "id": 1,
  "score": 1.2498379,
  "translation": {
    "en": "No telling what sort of magic he might have.\"",
    "la": "numque magistrâtum cum iis habent."
  }
},
{
  "id": 2,
  "score": 1.1443379,
  "translation": {
    "en": "Not many, but much.\"",
    "la": "non multa sed multum."
  }
}
```
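Since each record is plain JSON, reading one entry needs only the standard library. A minimal sketch (the sample record is taken from the format above, with punctuation simplified):

```python
import json

# One record in the dataset's format shown above.
record_json = '''
{
  "id": 2,
  "score": 1.1443379,
  "translation": {
    "en": "Not many, but much.",
    "la": "non multa sed multum."
  }
}
'''

record = json.loads(record_json)

# Each record pairs an English sentence with its Latin counterpart,
# plus a CCMatrix alignment score.
src, tgt = record["translation"]["en"], record["translation"]["la"]
print(f"{src!r} -> {tgt!r} (score {record['score']})")
```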


For training, the dataset was split as follows (DatasetDict):
- train: 891,352 rows
- validation: 111,419 rows
- test: 111,419 rows
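The split above is roughly 80/10/10 of the 1,114,190 sentence pairs; a quick sketch checking the arithmetic:

```python
# CCMatrix en-la pair count and the 80/10/10 split used above.
total = 1_114_190

test_rows = total // 10                            # 10% held out for test
validation_rows = total // 10                      # 10% held out for validation
train_rows = total - validation_rows - test_rows   # remainder for training

# The three splits must cover every sentence pair exactly once.
assert train_rows + validation_rows + test_rows == total
print(train_rows, validation_rows, test_rows)  # → 891352 111419 111419
```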