---
language:
- "la"
- "en"
---

### Initial model for English-to-Latin translation (still in training)

This model performs English-to-Latin translation and is built on the CCMatrix dataset, a large collection of high-quality parallel sentences mined from the public CommonCrawl corpus, comprising over 4.5 billion sentence pairs across 576 language pairs. The model aims to harness this substantial corpus to provide an effective and accurate solution for Latin translation tasks.

Nevertheless, the training data spans many centuries of Latin literature, exposing the model to the language's substantial evolution over that time. As a result, the model encounters the same concept expressed in different ways, sometimes including equivalent sentences in both Vulgar and Classical Latin. This variety is the likely cause of the model's oscillating loss.

## Current state

- {'loss': 0.8056, 'learning_rate': 6.482837857245441e-06, 'epoch': 20.28}
- {'loss': 1.253, 'learning_rate': 6.48092297381397e-06, 'epoch': 20.28}
- {'loss': 1.2961, 'learning_rate': 6.4790080903824985e-06, 'epoch': 20.28}
- {'loss': 1.3402, 'learning_rate': 6.477093206951027e-06, 'epoch': 20.28}
- {'loss': 0.9309, 'learning_rate': 6.475178323519556e-06, 'epoch': 20.29}
- {'loss': 0.7945, 'learning_rate': 6.473263440088085e-06, 'epoch': 20.29}
- {'loss': 0.9205, 'learning_rate': 6.471348556656614e-06, 'epoch': 20.29}
- {'loss': 1.4583, 'learning_rate': 6.228158360859783e-06, 'epoch': 20.66}
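The logged entries above are Python dict literals; a small stdlib sketch (assuming the log lines are collected as strings, exactly as printed above) for summarizing the oscillation:

```python
import ast

# Raw trainer log lines, copied from the list above.
log_lines = [
    "{'loss': 0.8056, 'learning_rate': 6.482837857245441e-06, 'epoch': 20.28}",
    "{'loss': 1.253, 'learning_rate': 6.48092297381397e-06, 'epoch': 20.28}",
    "{'loss': 1.2961, 'learning_rate': 6.4790080903824985e-06, 'epoch': 20.28}",
    "{'loss': 1.3402, 'learning_rate': 6.477093206951027e-06, 'epoch': 20.28}",
    "{'loss': 0.9309, 'learning_rate': 6.475178323519556e-06, 'epoch': 20.29}",
    "{'loss': 0.7945, 'learning_rate': 6.473263440088085e-06, 'epoch': 20.29}",
    "{'loss': 0.9205, 'learning_rate': 6.471348556656614e-06, 'epoch': 20.29}",
    "{'loss': 1.4583, 'learning_rate': 6.228158360859783e-06, 'epoch': 20.66}",
]

# Parse each dict literal safely and pull out the loss values.
losses = [ast.literal_eval(line)["loss"] for line in log_lines]

lo, hi = min(losses), max(losses)
print(f"loss range: {lo} .. {hi}")
```

The spread between the lowest (0.7945) and highest (1.4583) logged losses within a fraction of an epoch illustrates the oscillation described above.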

Training is still running.

The model was fine-tuned with the IPUSeq2SeqTrainer API on the facebook/bart-base model, using the BartTokenizerFast tokenizer.

## Dataset Description
- Homepage: https://opus.nlpl.eu/CCMatrix.php
- Sample: https://opus.nlpl.eu/CCMatrix/v1/en-la_sample.html
- Paper: https://arxiv.org/abs/1911.04944

The Latin dataset contains:
- 1,114,190 sentence pairs
- 14.5 M words

### Data Format
```
{
  "id": 1,
  "score": 1.2498379,
  "translation": {
    "en": "No telling what sort of magic he might have.\"",
    "la": "numque magistrâtum cum iis habent."
  }
},
{
  "id": 2,
  "score": 1.1443379,
  "translation": {
    "en": "Not many, but much.\"",
    "la": "non multa sed multum."
  }
}
```
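Since each record is plain JSON, reading one entry needs only the standard library. A minimal sketch (the sample record is taken from the format above, with punctuation simplified):

```python
import json

# One record in the dataset's format shown above.
record_json = '''
{
  "id": 2,
  "score": 1.1443379,
  "translation": {
    "en": "Not many, but much.",
    "la": "non multa sed multum."
  }
}
'''

record = json.loads(record_json)

# Each record pairs an English sentence with its Latin counterpart,
# plus a CCMatrix alignment score.
src, tgt = record["translation"]["en"], record["translation"]["la"]
print(f"{src!r} -> {tgt!r} (score {record['score']})")
```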


For training, the dataset was split as follows (DatasetDict):
- train: 891,352 rows
- validation: 111,419 rows
- test: 111,419 rows
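The split above is roughly 80/10/10 of the 1,114,190 sentence pairs; a quick sketch checking the arithmetic:

```python
# CCMatrix en-la pair count and the 80/10/10 split used above.
total = 1_114_190

test_rows = total // 10                            # 10% held out for test
validation_rows = total // 10                      # 10% held out for validation
train_rows = total - validation_rows - test_rows   # remainder for training

# The three splits must cover every sentence pair exactly once.
assert train_rows + validation_rows + test_rows == total
print(train_rows, validation_rows, test_rows)  # → 891352 111419 111419
```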