adorkin committed on
Commit
2ec9a8f
1 Parent(s): 33cb8c0

Add description, article, and examples as separate file

Files changed (3)
  1. article.md +33 -0
  2. description.txt +3 -0
  3. examples.tsv +7 -0
article.md ADDED
@@ -0,0 +1,33 @@
+ ## Description
+
+ The idea of lexicon-enhanced lemmatization is to use the output of an external resource, such as
+ a rule-based analyzer (a `lexicon`; here, the Vabamorf morphological analyzer), as an additional input that
+ improves the results of a neural lemmatization model. This additional input is a concatenation of one or more
+ lemma candidates provided by Vabamorf. A second encoder is used to process this input. See the scheme below.
+
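As a rough illustration of the "concatenation of lemma candidates" idea, the sketch below joins candidates into a single character sequence that a second encoder could consume. The separator symbol, the de-duplication step, and the function name `build_lexicon_input` are assumptions made for this example; the actual model may encode candidates differently.

```python
from typing import List, Sequence

# Hypothetical separator between candidates; the symbol the real model uses
# is not stated in this description.
CANDIDATE_SEPARATOR = "|"


def build_lexicon_input(
    candidates: Sequence[str], separator: str = CANDIDATE_SEPARATOR
) -> List[str]:
    """Concatenate lemma candidates into one character sequence."""
    # Assumption: repeated candidates carry no extra information, so keep
    # only the first occurrence of each while preserving analyzer order.
    unique = list(dict.fromkeys(candidates))

    chars: List[str] = []
    for index, candidate in enumerate(unique):
        if index > 0:
            chars.append(separator)
        chars.extend(candidate)  # one symbol per character
    return chars
```

A token with no candidates yields an empty sequence, which a downstream model would have to handle explicitly (for example, with a padding symbol).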
+ The lexicon-enhanced lemmatizer itself is a modification of the lemmatizer from an older version of Stanza, a neural model that takes
+ morphological features and parts of speech as input, in addition to surface forms, to predict a lemma. In this demo,
+ morphological features and the part of speech are provided by a more recent version of Stanza, although it is possible
+ to use native Vabamorf features as well (the results, however, are slightly worse). The additional lexicon
+ input is processed by a separate encoder.
+
+ <p align="center">
+ <img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
+ </p>
+
+ The models were trained on version 2.7 of the Estonian Dependency Treebank.
+
+ Two variants of lemmatization are provided in the demo: regular lemmatization and lemmatization with
+ special symbols. The special symbols are `=` and `_`, which mark morphological derivation and separate the parts of
+ compound words, respectively. The latter variant was trained on the original data with Vabamorf set to output
+ these special symbols, while the former was trained with `=` and `_` removed from both the data and the
+ Vabamorf output. See the results on the `dev` set in the table below (models trained on Vabamorf features are
+ not included in the demo).
+
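The removal of `=` and `_` described above can be sketched as a small string operation. This is only an assumed form of the preprocessing, not the project's actual code: the function name `strip_special_symbols` is hypothetical, the sample strings are illustrative rather than verified Vabamorf output, and the guard for tokens that consist solely of special symbols is an added assumption.

```python
SPECIAL_SYMBOLS = ("=", "_")


def strip_special_symbols(lemma: str) -> str:
    """Remove derivation (`=`) and compound (`_`) markers from a lemma."""
    stripped = lemma
    for symbol in SPECIAL_SYMBOLS:
        stripped = stripped.replace(symbol, "")
    # Assumption: a token that is itself "=" or "_" should stay unchanged
    # instead of turning into an empty lemma.
    return stripped if stripped else lemma


# Illustrative strings only, not verified analyzer output.
print(strip_special_symbols("kunsti_hotell"))
print(strip_special_symbols("="))
```

Applying the same function to both the gold lemmas and the analyzer candidates keeps the two inputs consistent for the model trained without special symbols.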
+ | Model                                 | Token-wise accuracy (%) |
+ |---------------------------------------|-------------------------|
+ | Stanza features                       | 98.13                   |
+ | Stanza features and special symbols   | 97.28                   |
+ | Vabamorf features                     | 97.32                   |
+ | Vabamorf features and special symbols | 96.34                   |
+
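For reference, token-wise accuracy as reported in the table can be computed along the following lines. This is a minimal sketch that assumes exact, case-sensitive string matching between predicted and gold lemmas; the evaluation script used for the reported numbers may normalize tokens differently, and `token_accuracy` is a name chosen for this example.

```python
from typing import Sequence


def token_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty token list")

    # Assumption: a prediction counts as correct only on an exact string match.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
```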
description.txt ADDED
@@ -0,0 +1,3 @@
+ This demo presents the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language,
+ developed by the TartuNLP research group. Options to lemmatize with or without special symbols are available (note that
+ this functionality is implemented with two separate models). For more details, see Description below.
examples.tsv ADDED
@@ -0,0 +1,7 @@
+ Ekspositsioonid võiksid alata juba kunstihotellide fuajeedest. 0
+ Ekspositsioonid võiksid alata juba kunstihotellide fuajeedest. 1
+ Kõik uuritavad võeti vastu TÜ üld- ja molekulaarpatoloogia instituudis inimesegeneetika uurimisrühmas. 1
+ Peamiselt viimasele toetub ka järgnev artikkel. 0
+ Arutletakse selle üle, mida ülearuse rahaga peale hakata. 0
+ Väikesele poisile tuuakse apteegist söögiisu tõstmiseks kalamaksaõli. 0
+ Tulevased beebid olid justkui peegeldusena pilgu beebisinas ja veel mingi ähmane lubadus. 0