---
language: de
library_name: sfst
license: gpl-2.0
tags:
- sfst
- dwdsmor
- token-classification
- lemmatisation
model-index:
- name: dwdsmor
results:
- task:
type: token-classification
name: Lemmatisation
dataset:
name: Universal Dependencies Treebank (de-hdt)
type: universal_dependencies
config: de_hdt
split: train
metrics:
- type: coverage
value: 0.8415293963067323
name: Coverage
- type: coverage
value: 1.0
name: Coverage ($()
- type: coverage
value: 1.0
name: Coverage ($,)
- type: coverage
value: 0.9999580703997988
name: Coverage ($.)
- type: coverage
value: 0.774030155216797
name: Coverage (ADJA)
- type: coverage
value: 0.7548407611333322
name: Coverage (ADJD)
- type: coverage
value: 0.9682621529723873
name: Coverage (ADV)
- type: coverage
value: 0.9989939637826962
name: Coverage (APPO)
- type: coverage
value: 0.9308645050358152
name: Coverage (APPR)
- type: coverage
value: 0.9967651071695788
name: Coverage (APPRART)
- type: coverage
value: 0.7916666666666666
name: Coverage (APZR)
- type: coverage
value: 0.9999603964317185
name: Coverage (ART)
- type: coverage
value: 0.9613524039049266
name: Coverage (CARD)
- type: coverage
value: 0.13320473120462967
name: Coverage (FM)
- type: coverage
value: 0.7142857142857143
name: Coverage (ITJ)
- type: coverage
value: 1.0
name: Coverage (KOKOM)
- type: coverage
value: 0.9995274949083504
name: Coverage (KON)
- type: coverage
value: 1.0
name: Coverage (KOUI)
- type: coverage
value: 0.9858579967925354
name: Coverage (KOUS)
- type: coverage
value: 0.0618080812117821
name: Coverage (NE)
- type: coverage
value: 0.7440482047389456
name: Coverage (NN)
- type: coverage
value: 0.9799275737196068
name: Coverage (PDAT)
- type: coverage
value: 0.9995682832062167
name: Coverage (PDS)
- type: coverage
value: 0.9879094306440976
name: Coverage (PIAT)
- type: coverage
value: 1.0
name: Coverage (PIDAT)
- type: coverage
value: 0.9951910051476565
name: Coverage (PIS)
- type: coverage
value: 0.999888876541838
name: Coverage (PPER)
- type: coverage
value: 1.0
name: Coverage (PPOSAT)
- type: coverage
value: 1.0
name: Coverage (PPOSS)
- type: coverage
value: 1.0
name: Coverage (PRELAT)
- type: coverage
value: 1.0
name: Coverage (PRELS)
- type: coverage
value: 1.0
name: Coverage (PRF)
- type: coverage
value: 0.9861938278289117
name: Coverage (PROAV)
- type: coverage
value: 0.3082133784928027
name: Coverage (PTKA)
- type: coverage
value: 1.0
name: Coverage (PTKANT)
- type: coverage
value: 1.0
name: Coverage (PTKNEG)
- type: coverage
value: 0.7705097087378641
name: Coverage (PTKVZ)
- type: coverage
value: 0.0
name: Coverage (PTKZU)
- type: coverage
value: 0.9551166965888689
name: Coverage (PWAT)
- type: coverage
value: 0.9937264742785445
name: Coverage (PWAV)
- type: coverage
value: 0.9946524064171123
name: Coverage (PWS)
- type: coverage
value: 1.0
name: Coverage (VAFIN)
- type: coverage
value: 1.0
name: Coverage (VAIMP)
- type: coverage
value: 1.0
name: Coverage (VAINF)
- type: coverage
value: 1.0
name: Coverage (VAPP)
- type: coverage
value: 1.0
name: Coverage (VMFIN)
- type: coverage
value: 1.0
name: Coverage (VMINF)
- type: coverage
value: 1.0
name: Coverage (VMPP)
- type: coverage
value: 0.886487187323461
name: Coverage (VVFIN)
- type: coverage
value: 0.9596122778675282
name: Coverage (VVIMP)
- type: coverage
value: 0.8214535019002559
name: Coverage (VVINF)
- type: coverage
value: 0.829683698296837
name: Coverage (VVIZU)
- type: coverage
value: 0.7996866513473992
name: Coverage (VVPP)
- type: coverage
value: 0.4148471615720524
name: Coverage (XY)
---
# DWDSmor
_SFST/SMOR/DWDS-based German morphology_
DWDSmor implements the lemmatisation and morphological analysis of
word forms as well as the generation of paradigms of lexical words in
written German.
## Usage
DWDSmor is available via PyPI:
``` plaintext
pip install dwdsmor
```
For lemmatisation:
``` python-console
>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"+V"}) == "testen"
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
```
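The lemmatizer call shown above can be applied to a whole sequence of tagged tokens. The following is a minimal sketch, not part of DWDSmor: `lemmatize_tokens` and `stub_lemmatizer` are hypothetical helpers, and the fallback to the word form rests on the assumption that a missing analysis yields a falsy result such as `None`.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type mirroring the calls shown above:
# lemmatizer(form, pos={"+V"}) returns a lemma string.
Lemmatizer = Callable[..., Optional[str]]


def lemmatize_tokens(
    lemmatizer: Lemmatizer,
    tagged_tokens: Iterable[Tuple[str, str]],
) -> List[str]:
    """Lemmatise (word form, POS tag) pairs, keeping the form as a fallback."""
    lemmas = []
    for form, pos in tagged_tokens:
        lemma = lemmatizer(form, pos={pos})
        # Assumption: a falsy result means that no analysis was found.
        lemmas.append(lemma if lemma else form)
    return lemmas


def stub_lemmatizer(form, pos=None):
    """Stand-in for dwdsmor.lemmatizer() so the sketch runs without DWDSmor."""
    table = {("getestet", "+V"): "testen", ("getestet", "+ADJ"): "getestet"}
    (tag,) = pos  # the sketch passes exactly one tag per call
    return table.get((form, tag))


if __name__ == "__main__":
    tokens = [("getestet", "+V"), ("getestet", "+ADJ"), ("xyz", "+NN")]
    print(lemmatize_tokens(stub_lemmatizer, tokens))
```

With DWDSmor installed, the object returned by `dwdsmor.lemmatizer()` could be passed in place of `stub_lemmatizer`. How it signals a missing analysis is not documented here, so the fallback branch may need adjusting.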
## Development
This repository provides source code for building DWDSmor lexica and transducers
as well as for using DWDSmor transducers for morphological analysis and paradigm
generation:
* `dwdsmor/` contains Python packages for using DWDSmor, including
scripts for morphological analysis and for paradigm generation by
means of DWDSmor transducers.
* `share/` contains XSLT stylesheets for extracting lexical entries in SMORLemma
format from XML sources of DWDS articles. Sample inputs and outputs can be
found in `samples/`.
* `lexicon/dwds/` contains scripts for building DWDSmor lexica by means of the
XSLT stylesheets in `share/` and DWDS sources in `lexicon/dwds/wb/`, which are
not part of this repository.
* `lexicon/sample/` contains scripts for building sample DWDSmor lexica by means
of the XSLT stylesheets in `share/` and the sample lexicon in
`lexicon/sample/wb/`.
* `grammar/` contains an FST grammar derived from SMORLemma, providing the
morphology for building DWDSmor automata from DWDSmor lexica.
* `test/` implements a test suite for the DWDSmor transducers.
DWDSmor is in active development. At its current stage, it supports most
inflection classes and some productive word-formation patterns of written
German. Note that the sample lexicon in `lexicon/sample/wb/` covers only a
small subset of the German vocabulary, and so do the DWDSmor automata compiled
from it.
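How far a given lexicon reaches on a corpus can be estimated per part-of-speech tag. The sketch below assumes that coverage means the share of tokens for which at least one analysis is found; this README does not spell out the definition, and `coverage_by_tag` and `has_analysis` are hypothetical names rather than part of DWDSmor.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def coverage_by_tag(
    tokens: Iterable[Tuple[str, str]],
    has_analysis: Callable[[str], bool],
) -> Dict[str, float]:
    """Return, per tag, the share of tokens that receive an analysis."""
    total: Counter = Counter()
    covered: Counter = Counter()
    for form, tag in tokens:
        total[tag] += 1
        if has_analysis(form):
            covered[tag] += 1
    # Every tag in `total` was seen at least once, so n is never zero.
    return {tag: covered[tag] / n for tag, n in total.items()}


if __name__ == "__main__":
    # Toy stand-in for a transducer lookup: a set of known word forms.
    toy_lexicon = {"Haus", "geht"}
    corpus = [("Haus", "NN"), ("Xyzzy", "NN"), ("geht", "VVFIN")]
    print(coverage_by_tag(corpus, toy_lexicon.__contains__))
```

In practice, `has_analysis` would wrap a lookup against one of the compiled transducers instead of a toy set.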
## Prerequisites
[GNU/Linux](https://www.debian.org/)
: Development, builds and tests of DWDSmor are performed
on [Debian GNU/Linux](https://debian.org/). While other UNIX-like operating
systems such as macOS should work, too, they are not actively supported.
[Python >= v3.9](https://www.python.org/)
: DWDSmor targets Python as its primary runtime environment. The DWDSmor
transducers can be used via SFST's command-line tools, queried in Python
applications via language-specific
[bindings](https://github.com/gremid/sfst-transduce), or used by the Python
scripts `dwdsmor.py` and `paradigm.py` for morphological analysis and for
paradigm generation.
[Saxon-HE](https://www.saxonica.com/)
: The extraction of lexical entries from XML sources of DWDS articles is
implemented in XSLT 2, for which Saxon-HE is used as the runtime environment.
[Java (JDK) >= v8](https://openjdk.java.net/)
: Saxon requires a Java runtime.
[SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/)
: A C++ library and toolbox for finite-state transducers (FSTs); see its
homepage for installation and usage instructions.
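When the transducers are queried through SFST's command-line tools, the textual output has to be parsed by the caller. The following sketch assumes the output layout commonly produced by `fst-infl`: a `> word` line, followed by one analysis per line, or a `no result for word` line. That layout is an assumption, and the analysis strings in the sample are placeholders, not real DWDSmor output.

```python
from typing import Dict, List, Optional


def parse_fst_infl(output: str) -> Dict[str, List[str]]:
    """Group analysis lines by word form from fst-infl-style output."""
    analyses: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("> "):
            # Assumed marker line introducing the next word form.
            current = line[2:]
            analyses.setdefault(current, [])
        elif line.startswith("no result for "):
            # Assumed marker for an unanalysed form; its list stays empty.
            continue
        elif current is not None:
            analyses[current].append(line)
    return analyses


if __name__ == "__main__":
    sample = (
        "> getestet\n"
        "analysis-1\n"
        "analysis-2\n"
        "> xyzzy\n"
        "no result for xyzzy\n"
    )
    print(parse_fst_infl(sample))
```

The input string could come from running one of the SFST tools via `subprocess.run(..., capture_output=True, text=True)`; the exact tool invocation is described on the SFST homepage.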
On a Debian-based distribution, install the following packages:
```sh
apt install python3 default-jdk libsaxonhe-java sfst
```
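After installing the packages, a quick check can confirm that the prerequisites are reachable. This is a minimal sketch: the executable names `java`, `fst-compiler-utf8` and `fst-infl2` are assumptions about what the packages put on `PATH`, and the helper functions are hypothetical.

```python
import shutil
import sys
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version=sys.version_info, minimum=(3, 9)) -> bool:
    """Check the interpreter version against the minimum stated above."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    # Assumed executable names; adjust them to the local installation.
    required = ["java", "fst-compiler-utf8", "fst-infl2"]
    print("Python >= 3.9:", python_is_supported())
    print("Missing tools:", missing_tools(required))
```

The `which` parameter exists only so that the lookup can be replaced in tests; by default, `shutil.which` is used.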
Set up a virtual environment for project builds, for example via Python's `venv`:
```sh
python3 -m venv .venv
source .venv/bin/activate
```
Then run the DWDSmor setup routine in order to install Python dependencies:
```sh
# Quoting keeps shells such as zsh from expanding the square brackets.
pip install -e '.[dev]'
```
## Building DWDSmor lexica and transducers
For building DWDSmor lexica and transducers, run:
```sh
make all
```
Alternatively, you can run:
```sh
make dwds && make dwds-install && make dwdsmor
```
Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are
not part of this repository.
Alternatively, you can build sample DWDSmor lexica and transducers from the
sample lexicon in `lexicon/sample/wb/` by running:
```sh
make sample && make sample-install && make dwdsmor
```
After building DWDSmor transducers, install them into `lib/`, where the
Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default:
```sh
make install
```
The installed DWDSmor transducers are:
* `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation
components, for lemmatisation and morphological analysis of word forms in
terms of grammatical categories
* `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation
components, for the generation of morphologically segmented word forms
* `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a
finite word-formation component, for testing purposes
* `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation
components, for lexical analysis of word forms in terms of root lemmas (i.e.,
lemmas of ultimate word-formation bases), word-formation process,
word-formation means, and grammatical categories in terms of the
Pattern-and-Restriction Theory of word formation (Nolda 2022)
* `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only and
with DWDS homographic lemma indices, for paradigm generation
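Whether `make install` produced all of the files listed above can be checked with a few lines of Python. The sketch expands the `{a,ca}` notation into two suffixes per transducer; `expected_transducers` and `missing_transducers` are hypothetical helpers, not part of the DWDSmor package.

```python
from pathlib import Path
from typing import List

# Transducer names and suffixes taken from the list above.
NAMES = [
    "dwdsmor",
    "dwdsmor-morph",
    "dwdsmor-finite",
    "dwdsmor-root",
    "dwdsmor-index",
]
SUFFIXES = [".a", ".ca"]


def expected_transducers(lib_dir: Path) -> List[Path]:
    """List every transducer file expected below lib_dir."""
    return [lib_dir / f"{name}{suffix}" for name in NAMES for suffix in SUFFIXES]


def missing_transducers(lib_dir: Path) -> List[Path]:
    """Return the expected transducer files that do not exist."""
    return [path for path in expected_transducers(lib_dir) if not path.is_file()]


if __name__ == "__main__":
    for path in missing_transducers(Path("lib")):
        print("missing:", path)
```

Running the script from the repository root prints nothing when all ten files are in place.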
## Testing DWDSmor
Run `pytest` in order to test basic transducer usage and to check for
potential regressions.
## Contact
Feel free to contact [Andreas Nolda](mailto:andreas.nolda@bbaw.de) for
questions regarding the lexicon or the grammar and
[Gregor Middell](mailto:gregor.middell@bbaw.de) for questions related
to the integration of DWDSmor into your corpus-annotation pipeline.
## License
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is
licensed under the GNU General Public License v2.0. The same applies
to the rest of this project.
## Credits
DWDSmor is based on the following software and datasets:
1. [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/), a C++ library
and toolbox for finite-state transducers (FSTs) (Schmid 2006)
2. [SMORLemma](https://github.com/rsennrich/SMORLemma) (Sennrich and Kunz 2014),
a modified version of the Stuttgart Morphology
([SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/)) (Schmid, Fitschen, and
Heid 2004) with an alternative lemmatisation component
3. the [DWDS dictionary](https://www.dwds.de/) (BBAW n.d.) replacing the
[IMSLex](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/imslex/)
(Fitschen 2004) as the lexical data source for German words, their grammatical
categories, and their morphological properties.
## Bibliography
* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
deutschen Sprache in Geschichte und Gegenwart.
https://www.dwds.de
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
System. Ph.D. thesis, Universität Stuttgart.
[PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf).
* Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on
compounding and blending in German. In *Headedness and/or Grammatical
Anarchy?*, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically
Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press,
343–376.
[PDF](https://zenodo.org/record/7142720/files/336-FreywaldSimonMüller-2022-11.pdf).
* Schmid, Helmut (2006). A programming language for finite state transducers. In
*Finite-State Methods and Natural Language Processing: 5th International
Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005*, ed. by Anssi
Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial
Intelligence 4002, Berlin: Springer, 308–309.
[PDF](https://www.cis.uni-muenchen.de/~schmid/papers/SFST-PL.pdf).
* Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German
computational morphology covering derivation, composition, and inflection. In
*LREC 2004: Fourth International Conference on Language Resources and
Evaluation*, ed. by Maria T. Lino *et al.*, European Language Resources
Association, 1263–1266.
[PDF](http://www.lrec-conf.org/proceedings/lrec2004/pdf/468.pdf).
* Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon
extracted from Wiktionary. In *LREC 2014: Ninth International Conference on
Language Resources and Evaluation*, ed. by Nicoletta Calzolari *et al.*,
European Language Resources Association, 1063–1067.
[PDF](http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf).