Spaces: tartuNLP/LexiconEnhancedLemmatization

adorkin committed on May 27, 2022

Commit 2ec9a8f • 1 Parent(s): 33cb8c0

Add description, article, and examples as separate file

Files changed (3)

article.md +33 -0
description.txt +3 -0
examples.tsv +7 -0

article.md ADDED

	@@ -0,0 +1,33 @@

+## Description
+The idea of lexicon-enhanced lemmatization is to use the output of an external resource, such as
+a rule-based analyzer (the `lexicon`; in this case, the Vabamorf morphological analyzer), as an additional input that
+improves the results of a neural lemmatization model. This additional input is a concatenation of one or more
+lemma candidates provided by Vabamorf, and it is processed by a second, separate encoder. See the scheme below.
+The lexicon-enhanced lemmatizer itself is a modification of the lemmatizer in an older version of Stanza, a neural model that takes
+morphological features and parts of speech as input, in addition to surface forms, to predict a lemma. In this demo,
+the morphological features and parts of speech are provided by a more recent version of Stanza, although it is possible
+to use native Vabamorf features as well (the results, however, are slightly worse).
+<p align="center">
+    <img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
+</p>
+The models were trained on version 2.7 of the Estonian Dependency Treebank.
+Two variants of lemmatization are provided in the demo: regular lemmatization and lemmatization with
+special symbols. The special symbols are `=` and `_`, which denote morphological derivation and separate the parts of
+compound words, respectively. The latter variant was trained on the original data with Vabamorf set to output
+these special symbols, while the former was trained with `=` and `_` removed from both the data and
+the Vabamorf output. See the results on the `dev` set in the table below (models trained on Vabamorf features are
+not included in the demo).
+| Model                                 | Token-wise accuracy |
+|---------------------------------------|---------------------|
+| Stanza features                       | 98.13               |
+| Stanza features and special symbols   | 97.28               |
+| Vabamorf features                     | 97.32               |
+| Vabamorf features and special symbols | 96.34               |
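
The ideas described in article.md (concatenating lemma candidates into a second input, stripping `=` and `_` for the regular variant, and scoring by token-wise accuracy) can be illustrated with a small standalone sketch. It is not code from this commit or from the model repository: the function names, the space separator, the order-preserving de-duplication, and the sample strings mentioned afterwards are all assumptions made for illustration.

```python
from typing import Iterable, Sequence

# `=` marks morphological derivation, `_` separates compound parts.
SPECIAL_SYMBOLS = "=_"


def strip_special_symbols(lemma: str) -> str:
    """Remove `=` and `_` from a lemma, as in the 'regular' variant."""
    return lemma.translate(str.maketrans("", "", SPECIAL_SYMBOLS))


def build_lexicon_input(candidates: Iterable[str], sep: str = " ") -> str:
    """Concatenate lemma candidates into one string for a second encoder.

    Assumption: duplicates are dropped and order is preserved; the real
    model may use a different separator or keep duplicates.
    """
    unique = dict.fromkeys(candidates)  # dicts preserve insertion order
    return sep.join(unique)


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of tokens whose predicted lemma equals the gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)
```

For example, with a hypothetical candidate string such as `kunsti_hotell`, `strip_special_symbols` returns `kunstihotell`, and `build_lexicon_input(["a", "b", "a"])` returns `"a b"`.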

description.txt ADDED

	@@ -0,0 +1,3 @@

+The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language,
+developed by the TartuNLP research group. Lemmatization is available with or without special symbols (note that
+the two options are implemented as two separate models). For more details, see the Description below.

examples.tsv ADDED

	@@ -0,0 +1,7 @@

+Ekspositsioonid võiksid alata juba kunstihotellide fuajeedest.	0
+Ekspositsioonid võiksid alata juba kunstihotellide fuajeedest.	1
+Kõik uuritavad võeti vastu TÜ üld- ja molekulaarpatoloogia instituudis inimesegeneetika uurimisrühmas.	1
+Peamiselt viimasele toetub ka järgnev artikkel.	0
+Arutletakse selle üle, mida ülearuse rahaga peale hakata.	0
+Väikesele poisile tuuakse apteegist söögiisu tõstmiseks kalamaksaõli.	0
+Tulevased beebid olid justkui peegeldusena pilgu beebisinas ja veel mingi ähmane lubadus.	0
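
Each row of examples.tsv holds a sentence and a tab-separated `0`/`1` flag. A minimal sketch of reading such rows is given below. The commit does not say what the flag means, so reading `1` as "lemmatize with special symbols" is an assumption, and `parse_examples` is a hypothetical helper rather than part of the demo code.

```python
import csv
import io
from typing import List, Tuple


def parse_examples(tsv_text: str) -> List[Tuple[str, bool]]:
    """Parse 'sentence<TAB>flag' rows into (sentence, flag) pairs.

    Assumption: flag '1' means the special-symbols model is selected.
    Raises ValueError if a non-empty row does not have exactly two columns.
    """
    reader = csv.reader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    examples = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        sentence, flag = row
        examples.append((sentence, flag.strip() == "1"))
    return examples
```

The boolean in each pair could then be used to choose between the two separately trained models mentioned in description.txt.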