Add description, article, and examples as separate file
- article.md +33 -0
- description.txt +3 -0
- examples.tsv +7 -0
article.md
ADDED
@@ -0,0 +1,33 @@
## Description

The idea of lexicon-enhanced lemmatization is to use the output of an external resource, such as
a rule-based analyzer (the `lexicon`; here, the Vabamorf morphological analyzer), as an additional input that
improves the results of a neural lemmatization model. This additional input is a concatenation of one or more
lemma candidates provided by Vabamorf, and a second encoder processes it. See the scheme below.
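
As a rough illustration of the lexicon input described above, the sketch below joins lemma candidates into one character-level sequence of the kind a second encoder could consume. The separator symbol, the function name `build_lexicon_input`, and the hard-coded candidates are assumptions made for this example; they are not taken from the actual implementation or from Vabamorf's API.

```python
from typing import List

# Hypothetical separator placed between candidates; the real model may use a different symbol.
CANDIDATE_SEP = "<sep>"


def build_lexicon_input(candidates: List[str]) -> List[str]:
    """Concatenate lemma candidates into a single character-level sequence.

    Each candidate contributes one symbol per character, and consecutive
    candidates are divided by CANDIDATE_SEP. An empty list yields an empty sequence.
    """
    sequence: List[str] = []
    for index, candidate in enumerate(candidates):
        if index > 0:
            sequence.append(CANDIDATE_SEP)
        sequence.extend(candidate)  # a string is extended character by character
    return sequence


# Illustrative candidates for an ambiguous surface form (not actual Vabamorf output).
print(build_lexicon_input(["tee", "tegema"]))
```

With a single candidate no separator is added, so an unambiguous word simply becomes its candidate lemma spelled out character by character.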

The lexicon-enhanced lemmatizer is a modification of the lemmatizer from an older version of Stanza, a neural model
that takes morphological features and parts of speech as input, in addition to surface forms, to predict a lemma.
In this demo, the morphological features and the part of speech are provided by a more recent version of Stanza.
Native Vabamorf features can be used as well, although the results are slightly worse. The additional lexicon
input is processed by a separate encoder.

<p align="center">
<img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
</p>

The models were trained on version 2.7 of the Estonian Dependency Treebank.

Two variants of lemmatization are provided in the demo: regular lemmatization and lemmatization with
special symbols. The special symbols are `=` and `_`, which denote morphological derivation and separate the parts of
compound words, respectively. The latter variant was trained on the original data with Vabamorf set to output
these special symbols, while the former was trained with `=` and `_` removed from the data and from the
Vabamorf output. See the results on the `dev` set in the table below (models trained on Vabamorf features are
not included in the demo).
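
To make the difference between the two variants concrete, the sketch below shows one way `=` and `_` could be removed from lemmas when preparing data for the regular model. The function name `strip_special_symbols`, the sample lemmas, and the guard that keeps a lemma unchanged when it consists only of these symbols are assumptions for illustration, not the preprocessing actually used.

```python
# The two special symbols named in the text: "=" (derivation) and "_" (compound boundary).
SPECIAL_SYMBOLS = "=_"

# Translation table that deletes every special symbol.
_STRIP_TABLE = str.maketrans("", "", SPECIAL_SYMBOLS)


def strip_special_symbols(lemma: str) -> str:
    """Remove `=` and `_` from a lemma.

    Assumption: a lemma made up only of special symbols (for example the
    punctuation token "=") is returned unchanged so that it does not become empty.
    """
    stripped = lemma.translate(_STRIP_TABLE)
    return stripped if stripped else lemma


# Illustrative lemmas (hypothetical, not copied from the treebank).
for lemma in ["kunsti_hotell", "õpeta=ja", "="]:
    print(lemma, "->", strip_special_symbols(lemma))
```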

| Model                                 | Token-wise accuracy (%) |
|---------------------------------------|-------------------------|
| Stanza features                       | 98.13                   |
| Stanza features and special symbols   | 97.28                   |
| Vabamorf features                     | 97.32                   |
| Vabamorf features and special symbols | 96.34                   |
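
For readers unfamiliar with the metric, token-wise accuracy can be computed as the share of tokens whose predicted lemma exactly matches the gold lemma. The sketch below is a minimal version under that assumption; the function name `token_accuracy` is hypothetical, and details such as case sensitivity in the actual evaluation are not specified here.

```python
from typing import Sequence


def token_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of tokens whose predicted lemma equals the gold lemma.

    Assumption: lemmas are compared as exact, case-sensitive strings.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Toy example: three of four lemmas match.
print(token_accuracy(["ekspositsioon", "võima", "algama", "juba"],
                     ["ekspositsioon", "võima", "algama", "ju"]))
```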
description.txt
ADDED
@@ -0,0 +1,3 @@
This demo presents the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language,
developed by the TartuNLP research group. You can lemmatize with or without special symbols (note that
this is implemented with two separate models). For more details, see the Description below.
examples.tsv
ADDED
@@ -0,0 +1,7 @@
Ekspositsioonid võiksid alata juba kunstihotellide fuajeedest.	0
Ekspositsioonid võiksid alata juba kunstihotellide fuajeedest.	1
Kõik uuritavad võeti vastu TÜ üld- ja molekulaarpatoloogia instituudis inimesegeneetika uurimisrühmas.	1
Peamiselt viimasele toetub ka järgnev artikkel.	0
Arutletakse selle üle, mida ülearuse rahaga peale hakata.	0
Väikesele poisile tuuakse apteegist söögiisu tõstmiseks kalamaksaõli.	0
Tulevased beebid olid justkui peegeldusena pilgu beebisinas ja veel mingi ähmane lubadus.	0