---
tags:
- spacy
- token-classification
language:
- de
model-index:
- name: de_STTS2_folk
  results:
  - task:
      name: TAG
      type: token-classification
    metrics:
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9191333537
license: cc-by-4.0
library_name: spacy
---
## de_STTS2_folk tagger

This is a spaCy pipeline for German that tags tokens with the Stuttgart-Tübingen Tagset version 2.0 (STTS 2.0), a tagset designed for transcripts of conversational speech.
The model may be useful for tagging ASR transcripts such as those collected in the [CoGS](https://cc.oulu.fi/~scoats/CoGS.html) corpus.

The model was trained on the tag annotations from the FOLK corpus (https://agd.ids-mannheim.de/folk-gold.shtml), using an 80/20 training/test split. Tokens in the training data were converted to lower case prior to training to match the format of automatic speech recognition transcripts on YouTube as of early 2023.

Usage example:
```python
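# Notebook syntax (Jupyter/Colab); from a shell, run the command without the leading "!".
!pip install https://huggingface.co/stcoats/de_STTS2_folk/resolve/main/de_STTS2_folk-any-py3-none-any.whl
import spacy
import de_STTS2_folk

# Load the installed pipeline package (tok2vec + tagger).
nlp = de_STTS2_folk.load()
# Lowercase input, matching the lowercased training tokens.
doc = nlp("ach so meinst du wir sollen es jetzt tun")
# Print each token with its STTS 2.0 tag.
for token in doc:
    print(token.text, token.tag_)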
```
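Because the training tokens were lowercased, cased input will probably be tagged less reliably than lowercase input. The following is a minimal sketch of a preprocessing step, not part of the released package: `normalize_for_tagger` and `tag_text` are hypothetical helper names, and collapsing whitespace is an assumption added for convenience.

```python
def normalize_for_tagger(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace.

    Lowercasing mirrors the lowercased training tokens; whitespace
    collapsing is an extra assumption, not something the model requires.
    """
    return " ".join(text.lower().split())


def tag_text(nlp, text: str):
    """Return (token, tag) pairs for `text`.

    `nlp` is any loaded spaCy pipeline, e.g. the result of
    de_STTS2_folk.load() from the usage example.
    """
    doc = nlp(normalize_for_tagger(text))
    return [(token.text, token.tag_) for token in doc]
```

With the model installed, `tag_text(de_STTS2_folk.load(), "Ach so  meinst du")` would tag the string `ach so meinst du`.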
### References

Coats, Steven. (2023). A new corpus of geolocated ASR transcripts from Germany. <i>Language Resources and Evaluation</i>. https://doi.org/10.1007/s10579-023-09686-9

Westpfahl, Swantje and Thomas Schmidt. (2016). [FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German](https://aclanthology.org/L16-1237). In <i>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16)</i>, Portorož, Slovenia.

| Feature | Description |
| --- | --- |
| **Name** | `de_STTS2_folk` |
| **Version** | `0.0.1` |
| **spaCy** | `>=3.5.1,<3.6.0` |
| **Default Pipeline** | `tok2vec`, `tagger` |
| **Components** | `tok2vec`, `tagger` |
| **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
| **Sources** | Swantje Westpfahl and Thomas Schmidt, FOLK-Gold, https://agd.ids-mannheim.de/folk-gold.shtml |
| **License** | CC-BY 4.0 |
| **Author** | Steven Coats |

### Label Scheme

<details>

<summary>View label scheme (62 labels for 1 component)</summary>

| Component | Labels |
| --- | --- |
| **`tagger`** | `$.`, `AB`, `ADJA`, `ADJD`, `ADV`, `APPO`, `APPR`, `APPRART`, `APZR`, `ART`, `CARD`, `FM`, `KOKOM`, `KON`, `KOUI`, `KOUS`, `NE`, `NGAKW`, `NGHES`, `NGIRR`, `NGONO`, `NN`, `ORD`, `PDAT`, `PDS`, `PIAT`, `PIDAT`, `PIDS`, `PIS`, `PPER`, `PPOSAT`, `PPOSS`, `PRELAT`, `PRELS`, `PRF`, `PTKA`, `PTKIFG`, `PTKMA`, `PTKMWL`, `PTKNEG`, `PTKVZ`, `PTKZU`, `PWAT`, `PWAV`, `PWS`, `SEDM`, `SEQU`, `SPELL`, `TRUNC`, `UI`, `VAFIN`, `VAIMP`, `VAINF`, `VAPP`, `VMFIN`, `VMINF`, `VVFIN`, `VVIMP`, `VVINF`, `VVIZU`, `VVPP`, `XY` |

</details>

### Accuracy

| Type | Score |
| --- | --- |
| `TAG_ACC` | 91.91 |
| `TOK2VEC_LOSS` | 478891.28 |
| `TAGGER_LOSS` | 402526.03 |
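
`TAG_ACC` is a token-level score: the share of tokens whose predicted tag equals the gold tag, reported as a percentage. A minimal sketch of that calculation, assuming gold and predicted tags are already aligned token by token (`tag_accuracy` is a hypothetical helper, not a spaCy function):

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag sequences must be aligned")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Illustrative tags only, not output from the model.
gold = ["PTKIFG", "ADV", "VVFIN", "PPER"]
pred = ["PTKIFG", "ADV", "VVFIN", "NN"]
score = tag_accuracy(gold, pred)  # 3 of 4 tags match
```

Multiplying the fraction by 100 gives a value on the same scale as the table above.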