---
tags:
- word2vec
language: de
license: mit
datasets:
- wikipedia
---

## Description
German word embedding model trained by Andreas Müller with the following configuration:
- a corpus as large as possible (and as diverse as possible without being informal)
- filtering of punctuation and stopwords
- forming bigram tokens
- skip-gram as the training algorithm, with hierarchical softmax
- window size between 5 and 10
- feature vectors with 300 or more dimensions
- negative sampling with 10 samples
- ignoring all words with a total frequency lower than 50
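
To make the configuration above more concrete, here is a minimal sketch of the preprocessing steps (punctuation and stopword filtering, bigram merging) in plain Python, followed by one plausible mapping of the listed settings onto gensim 4 `Word2Vec` keyword arguments. The stopword set, the hard-coded phrase set, the helper names, and the concrete values chosen for `window` and `vector_size` are assumptions for illustration only; the original training scripts may differ.

```python
import string

# Hypothetical, tiny stopword set; the list actually used for training is not given here.
STOPWORDS = {"der", "die", "das", "und", "ist", "in"}


def preprocess(sentence):
    """Strip ASCII punctuation, split on whitespace, drop stopwords (case-insensitive)."""
    table = str.maketrans("", "", string.punctuation)
    tokens = sentence.translate(table).split()
    return [t for t in tokens if t.lower() not in STOPWORDS]


def merge_bigrams(tokens, phrases):
    """Join adjacent tokens that form a known phrase into one token, e.g. New_York."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Assumed mapping of the listed settings onto gensim 4 Word2Vec keyword arguments.
WORD2VEC_PARAMS = {
    "sg": 1,             # skip-gram
    "hs": 1,             # hierarchical softmax
    "negative": 10,      # negative sampling with 10 samples
    "window": 5,         # any value from 5 to 10 fits the description
    "vector_size": 300,  # 300 or more dimensions
    "min_count": 50,     # ignore words seen fewer than 50 times
}

# The phrase set is hard-coded here; a real pipeline would learn it from the corpus.
tokens = merge_bigrams(
    preprocess("Der Flug geht nach New York, und die Reise ist lang."),
    {("New", "York")},
)
print(tokens)
```

With gensim installed, the dictionary could be passed as `Word2Vec(sentences, **WORD2VEC_PARAMS)`, where `sentences` is an iterable of token lists like the one built above.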

For more information, see [https://devmount.github.io/GermanWordEmbeddings/](https://devmount.github.io/GermanWordEmbeddings/)

## How to use

```
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the binary word2vec file from the Hub, then load it as KeyedVectors
model_path = hf_hub_download(repo_id="Word2vec/german_model", filename="german.model")
model = KeyedVectors.load_word2vec_format(model_path, binary=True, unicode_errors="ignore")
```

## Citation

```
@thesis{mueller2015,
  author = {{Müller}, Andreas},
  title  = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
  school = {Technische Universität Berlin},
  year   = 2015,
  month  = jun,
  type   = {Bachelor's Thesis},
  url    = {https://devmount.github.io/GermanWordEmbeddings}
}
```