Aleksi Sahala

First Millennium Babylonian model for BabyLemmatizer

The total data set size is ca. 1.3M words (including lacunae). It consists of all Oracc texts labeled as any variant of Babylonian or Akkadian from the first millennium BCE; Neo-Assyrian is excluded. The OOV (out-of-vocabulary) rate is fairly low (6.63% in the evaluation below), but the data set is very varied and covers all text genres.

See the model Babylonian-2nd for Middle Babylonian (and second millennium Babylonian in general).

Evaluation results

Neural Net Evaluation
COMPONENT       AVG     CI       MODEL0
POS-tagger      96.84   ±0.00    96.84
Lemmatizer      95.23   ±0.00    95.23
Combined        93.91   ±0.00    93.91
POS-tagger OOV  87.41   ±0.00    87.41
Lemmatizer OOV  71.78   ±0.00    71.78
Combined   OOV  69.63   ±0.00    69.63
-----------------------------------------------
OOV input rate  6.63             6.63
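
The table above reports overall and OOV scores but not the score on in-vocabulary input. As a rough sketch, it can be estimated by assuming that all scores are token-level percentages and that the overall score is the weighted mean of the in-vocabulary and OOV scores, i.e. overall = (1 - r) * iv + r * oov, where r is the OOV input rate. This weighting is an assumption about how the figures were computed, not something stated by the model card, and the function name `in_vocabulary_accuracy` is illustrative only.

```python
def in_vocabulary_accuracy(overall: float, oov: float, oov_rate: float) -> float:
    """Solve overall = (1 - r) * iv + r * oov for iv.

    All arguments are percentages. Assumes (not confirmed by the model card)
    that the overall score is a token-weighted mean of the two subsets.
    """
    r = oov_rate / 100.0
    if not 0.0 <= r < 1.0:
        raise ValueError("oov_rate must be in the range [0, 100)")
    return (overall - r * oov) / (1.0 - r)


# Figures copied from the Neural Net Evaluation table: (overall, OOV).
OOV_RATE = 6.63
SCORES = {
    "POS-tagger": (96.84, 87.41),
    "Lemmatizer": (95.23, 71.78),
    "Combined": (93.91, 69.63),
}

# Estimated in-vocabulary score per component under the assumption above.
IN_VOCAB = {
    name: in_vocabulary_accuracy(overall, oov, OOV_RATE)
    for name, (overall, oov) in SCORES.items()
}

if __name__ == "__main__":
    for name, value in IN_VOCAB.items():
        print(f"{name:<12} {value:.2f}")
```

If the weighting assumption holds, the estimates show how much of the remaining error is concentrated in the OOV portion of the input.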

Post-correct Evaluation
COMPONENT       AVG     CI       MODEL0
POS-tagger      96.84   ±0.00    96.84
Lemmatizer      95.36   ±0.00    95.36
Combined        94.04   ±0.00    94.04
POS-tagger OOV  87.41   ±0.00    87.41
Lemmatizer OOV  71.78   ±0.00    71.78
Combined   OOV  69.63   ±0.00    69.63
-----------------------------------------------
OOV input rate  6.63             6.63
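
To make the effect of post-correction easier to read off, the two tables can be compared row by row. The snippet below is a minimal sketch that hard-codes the figures from both tables; the dictionary and function names are hypothetical and not part of BabyLemmatizer.

```python
# Scores copied from the two evaluation tables (AVG column).
NEURAL = {
    "POS-tagger": 96.84,
    "Lemmatizer": 95.23,
    "Combined": 93.91,
    "POS-tagger OOV": 87.41,
    "Lemmatizer OOV": 71.78,
    "Combined OOV": 69.63,
}
POST_CORRECT = {
    "POS-tagger": 96.84,
    "Lemmatizer": 95.36,
    "Combined": 94.04,
    "POS-tagger OOV": 87.41,
    "Lemmatizer OOV": 71.78,
    "Combined OOV": 69.63,
}


def gains(before: dict, after: dict) -> dict:
    """Return the per-component difference (after - before), rounded to 2 decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}


GAINS = gains(NEURAL, POST_CORRECT)

if __name__ == "__main__":
    for name, delta in GAINS.items():
        print(f"{name:<15} {delta:+.2f}")
```

In these figures, post-correction changes only the overall Lemmatizer and Combined rows; the POS-tagger and all OOV rows are identical in both tables.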