First Millennium Babylonian model for BabyLemmatizer
The data set contains ca. 1.3M words (including lacunae) and consists of all Oracc texts labeled as any variant of Babylonian or Akkadian from the first millennium BCE. Neo-Assyrian is excluded. The OOV rate is fairly low (6.63 in the evaluation below), but the data set is very varied and covers all text genres.
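As a rough illustration of what an OOV rate measures, the sketch below computes the percentage of evaluation tokens whose word form never occurs in the training data. The function name `oov_rate` and the toy transliterations are hypothetical, and BabyLemmatizer's own evaluation may define the rate differently (for example in how lacunae or normalization are handled).

```python
def oov_rate(train_tokens, test_tokens):
    """Return the percentage of test tokens whose form is unseen in training.

    Hypothetical helper for illustration only; not part of BabyLemmatizer.
    """
    vocabulary = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return 100.0 * unseen / len(test_tokens)


# Toy data (made up): one of the four test tokens is absent from training.
train = ["a-na", "LUGAL", "be-li2-ia", "a-na"]
test = ["a-na", "LUGAL", "szul-mu", "a-na"]
rate = oov_rate(train, test)
print(f"OOV input rate: {rate:.2f}")
```

The rate is counted over running tokens rather than unique word forms, which is an assumption about how the figure in the evaluation tables is meant to be read.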
For Middle Babylonian, and second millennium Babylonian in general, see the model Babylonian-2nd.
Evaluation results
Neural Net Evaluation
COMPONENT          AVG     CI      MODEL0
POS-tagger         96.84   ±0.00   96.84
Lemmatizer         95.23   ±0.00   95.23
Combined           93.91   ±0.00   93.91
POS-tagger OOV     87.41   ±0.00   87.41
Lemmatizer OOV     71.78   ±0.00   71.78
Combined OOV       69.63   ±0.00   69.63
-----------------------------------------------
OOV input rate     6.63            6.63
Post-correct Evaluation
COMPONENT          AVG     CI      MODEL0
POS-tagger         96.84   ±0.00   96.84
Lemmatizer         95.36   ±0.00   95.36
Combined           94.04   ±0.00   94.04
POS-tagger OOV     87.41   ±0.00   87.41
Lemmatizer OOV     71.78   ±0.00   71.78
Combined OOV       69.63   ±0.00   69.63
-----------------------------------------------
OOV input rate     6.63            6.63
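The figures above can be combined into two derived quantities: an estimated accuracy on in-vocabulary tokens, and the gain from post-correction. The sketch below is illustrative arithmetic only. It assumes that each overall score is a token-weighted average of the in-vocabulary and OOV scores and that the OOV input rate is a percentage of tokens; neither assumption is stated in this model card. The helper `in_vocabulary_accuracy` is hypothetical.

```python
def in_vocabulary_accuracy(overall, oov, oov_rate):
    """Solve overall = (1 - r) * iv + r * oov for iv, with r = oov_rate / 100.

    Assumes a token-weighted average; this is an assumption, not something
    the evaluation output confirms.
    """
    share = oov_rate / 100.0
    return (overall - share * oov) / (1.0 - share)


OOV_RATE = 6.63  # "OOV input rate" from the tables

# (overall, OOV) scores from the neural net evaluation.
neural = {
    "POS-tagger": (96.84, 87.41),
    "Lemmatizer": (95.23, 71.78),
    "Combined": (93.91, 69.63),
}
# Overall scores from the post-correct evaluation.
post_correct = {"POS-tagger": 96.84, "Lemmatizer": 95.36, "Combined": 94.04}

derived = {}
for name, (overall, oov) in neural.items():
    iv = in_vocabulary_accuracy(overall, oov, OOV_RATE)
    gain = post_correct[name] - overall
    derived[name] = (iv, gain)
    print(f"{name:<12} in-vocabulary ~{iv:.2f}   post-correction gain {gain:+.2f}")
```

Under these assumptions the in-vocabulary estimates come out at roughly 97.5 (POS-tagger), 96.9 (Lemmatizer) and 95.6 (Combined). Post-correction adds 0.13 points to the Lemmatizer and Combined scores and leaves the OOV rows unchanged, so the gain falls on in-vocabulary tokens.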