
A model for translating Akkadian cuneiform to English, using Google's t5-small as a baseline.

  • Akkadian: π’„Ώ π’ˆΎ π’Œ— π’ƒΆ π’Œ“ 𒐉 π’†š π’€€ π’ˆΎ 𒆳 π’†Έ π’„­ 𒇻 𒁺 π’…… 𒆳 𒁀 π’€€ 𒍝 𒆳 π’Š“ π’…ˆ 𒁀 π’‡· π’€€ 𒆳 𒁲 𒁺 π’€€ π’†· π’€€ 𒁲 π’Œ· π’ˆ¨ π’Œ π’‰Œ 𒃻 π’…† 𒁲 π’€€ 𒇉 π’Š’ π’Œ‘ π’Š’ π’Š­ 𒆳 π’ˆ¨ π’„΄ π’Š‘ 𒀝 π’‹€ π’Š© π’†· π’‹’ 𒉑 𒃻 π’‹— π’ˆ¨ π’Œ π’‹— 𒉑 π’Œ‘ π’ŠΊ 𒍝 π’€€ π’€€ π’ˆΎ π’Œ· π’…€ π’€Έ π’‹© π’Œ’ π’†·
  • English (reference): in the month kislimu the fourth day i marched to the land habhu i conquered the lands bazu sarbaliu and didualu together with the cities on the banks of the river ruru of the land mehru i brought forth their booty and possessions and brought them to my city assur
  • Prediction (model output): in the mo nth tammuz iv i conquered the land s que and que i conquered the land s que and bi t yakin i conquered the cities f ro m the river i conquered and plundered the cities on the bo rd er of the land elam

Note that the training loss does not reflect the full training run: this model was trained at expanding context sizes (56 -> 512 tokens), restricted to complete sequences. It was trained on cuneiform -> English translation, transliteration, and word grouping in both directions so that each task reinforces the others. It is an instruct model, so it requires an instruction telling it how to interpret the input data.

akk-111m

This model was trained from scratch on the Akkademia dataset. It achieves the following categorical cross-entropy loss on the evaluation set (512 tokens):

  • Loss: 0.0753

Cuneiform -> English BLEU score

  • 500 tokens: 38.91
  • 100 tokens: 43.13

Transliterated -> English BLEU score

  • 500 tokens: 37.02
  • 100 tokens: 41.67

Cuneiform -> Transliteration BLEU score

  • 500 tokens: 94.31
  • 100 tokens: 94.36

Cuneiform -> Transliteration Accuracy

  • 100 tokens: 50% (note that a single missed character significantly decreases exact-match accuracy in seq2seq models; see the BLEU scores above for a measure with positional flexibility)
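
For reference, BLEU scores like those above can be computed with the sacrebleu metric from the evaluate library. This is a minimal sketch, not necessarily the exact evaluation script used here; the sentences are placeholders rather than items from the actual test set.

```python
import evaluate

bleu = evaluate.load("sacrebleu")
predictions = ["in the month tammuz i conquered the land elam"]       # model outputs (placeholder)
references = [["in the month kislimu i marched to the land habhu"]]   # gold translations (placeholder)
print(round(bleu.compute(predictions=predictions, references=references)["score"], 2))
```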

Model description

This is an instruct model, meaning it is capable of multiple tasks. It is intended primarily for translation and transliteration, but it can also be used for reverse translation (English to Akkadian). A usage sketch follows each instruction list below.

Translation Instructions:

  • "Translate Akkadian cuneiform to English" + cuneiform signs -> English
  • "Translate Akkadian simple transliteration to English" + simple transliteration -> English
  • "Translate Akkadian grouped transliteration to English" + transliteration with spacial symbols -> English
  • "Translate English to Akkadian cuneiform" + English -> Akkadian cuneiform signs
  • "Translate English to simple Akkadian transliteration" + English -> Akkadian simple transliteration with no special symbols
  • "Translate English to grouped Akkadian transliteration" + English -> Akkadian transliteration grouped into words with special symbols

Transliteration Instructions:

  • "Transliterate Akkadian cuneiform to simple Latin Characters" + cuneiform signs -> transliteration with no special symbols
  • "Transliterate Akkadian cuneiform to grouped Latin characters" + cuneiform signs -> transliteration with special symbols/subscripts
  • "Group Akkadian transliteration into likely words" + simple transliteration -> transliteration with special symbols/subscripts

Intended uses & limitations

This model is designed to facilitate the translation and transliteration of Akkadian cuneiform. It may have limited facility in the reverse direction (e.g. translating English to Akkadian cuneiform), but these use cases are untested.

Training and evaluation data

Data comes from the Akkademia project, previously published in PNAS Nexus. More information on the training data, as well as the test and validation splits, can be found in both the GitHub repository and the published methodology.

Training procedure

Because of the unequal distribution of sequence lengths (many short sequences alongside long ones), the model was trained with different padded lengths:

  • an initial few epochs with a max length of 56 tokens
  • a follow-up series of epochs at 128 tokens
  • the same for 256 tokens
  • a final 5 epochs at 512 tokens
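
A minimal sketch of that staged-length schedule, assuming a tokenized Hugging Face datasets.Dataset with an input_ids column; the per-stage epoch counts other than the final 5 are illustrative assumptions, not values stated in this card.

```python
from datasets import Dataset

# Toy stand-in for the tokenized Akkademia training split.
tokenized_train = Dataset.from_dict(
    {"input_ids": [[1] * 40, [1] * 100, [1] * 300, [1] * 480]}
)

# (max padded length, epochs); only the final 5 epochs at 512 tokens are stated above.
stages = [(56, 5), (128, 5), (256, 5), (512, 5)]
for max_len, epochs in stages:
    # restrict each stage to sequences that fit completely in the current window
    stage_data = tokenized_train.filter(lambda ex: len(ex["input_ids"]) <= max_len)
    print(f"max_len={max_len}: {len(stage_data)} sequences, {epochs} epochs")
    # ... run training on stage_data at this padded length ...
```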

The original t5-small model had its tokenizer and embedding layer expanded with the additional linguistic data. Cuneiform signs were split by spaces and fed directly into the model, following the instruction formats detailed above.
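
A sketch of that vocabulary expansion, assuming the standard add_tokens / resize_token_embeddings workflow; the sign list here is only a small illustrative subset of the full inventory.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Illustrative subset of the cuneiform sign inventory added to the vocabulary.
new_signs = ["π’„Ώ", "π’ˆΎ", "π’Œ—", "π’ƒΆ", "π’Œ“"]
num_added = tokenizer.add_tokens(new_signs)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```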

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 4e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 30
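
Expressed as Seq2SeqTrainingArguments, the listed hyperparameters would look roughly like this. This is a sketch: the Adam betas/epsilon shown match the Trainer defaults, and output_dir is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="akk-111m",  # placeholder
    learning_rate=4e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=30,
)
```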

Framework versions

  • Transformers 4.40.1
  • Pytorch 2.5.0.dev20240627
  • Datasets 2.14.0
  • Tokenizers 0.19.1