lbourdois committed on
Commit
37137f3
1 Parent(s): cd14ece

Create README.md

  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ tags:
+ - word2vec
+ language: de
+ license: mit
+ datasets:
+ - wikipedia
+ ---
+
+ ## Description
+ German word embedding model trained by Müller with the following parameter configuration:
+ - a corpus as big as possible (and as diverse as possible without being informal)
+ - filtering of punctuation and stopwords
+ - forming bigram tokens
+ - using skip-gram as training algorithm with hierarchical softmax
+ - window size between 5 and 10
+ - dimensionality of feature vectors of 300 or more
+ - using negative sampling with 10 samples
+ - ignoring all words with total frequency lower than 50
+
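As a rough illustration, the settings listed above could be mapped onto gensim's `Word2Vec` keyword arguments as sketched below. This is not the author's training script: the helper names `word2vec_kwargs` and `train` are hypothetical, the parameter names assume gensim 4.x, and the concrete values picked inside the stated ranges (window 5, 300 dimensions) are assumptions.

```python
# A minimal sketch, assuming gensim 4.x parameter names. The helper names are
# hypothetical and the default values inside the stated ranges are assumptions.

def word2vec_kwargs(window=5, vector_size=300):
    """Translate the listed configuration into Word2Vec keyword arguments."""
    if not 5 <= window <= 10:
        raise ValueError("the description lists a window size between 5 and 10")
    if vector_size < 300:
        raise ValueError("the description lists 300 or more dimensions")
    return {
        "sg": 1,                     # skip-gram instead of CBOW
        "hs": 1,                     # hierarchical softmax
        "negative": 10,              # negative sampling with 10 samples
        "window": window,
        "vector_size": vector_size,
        "min_count": 50,             # ignore words seen fewer than 50 times
    }


def train(sentences, **overrides):
    """Train on an iterable of token lists that were already stripped of
    punctuation and stopwords. gensim is imported lazily so the configuration
    helper can be inspected without gensim installed."""
    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases

    sentences = list(sentences)
    bigrams = Phrases(sentences)     # bigram detection with gensim's default thresholds
    corpus = [bigrams[sentence] for sentence in sentences]
    return Word2Vec(corpus, **word2vec_kwargs(**overrides))
```

With `min_count=50`, a small test corpus will usually leave the vocabulary empty, so `train` is only meaningful on a large corpus such as a Wikipedia dump.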
+ For more information, see [https://devmount.github.io/GermanWordEmbeddings/](https://devmount.github.io/GermanWordEmbeddings/)
+
+ ## How to use?
+
+ ```
+ from gensim.models import KeyedVectors
+ from huggingface_hub import hf_hub_download
+
+ # Download the binary word2vec file from the Hub (cached after the first call)
+ path = hf_hub_download(repo_id="Word2vec/german_model", filename="german.model")
+ model = KeyedVectors.load_word2vec_format(path, binary=True, unicode_errors="ignore")
+ # Query with a German word; a word missing from the vocabulary raises KeyError
+ model.most_similar("Beispiel")
+ ```
+
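Because `most_similar` raises `KeyError` for a word that is not in the vocabulary, it can help to check membership first. The helper below is a small sketch, and `similar_or_none` is a hypothetical name. It relies only on the `key_to_index` mapping and the `most_similar` method that gensim 4.x `KeyedVectors` objects expose.

```python
def similar_or_none(vectors, word, topn=5):
    """Return the nearest neighbours of `word`, or None if it is out of vocabulary.

    `vectors` is expected to behave like a gensim 4.x KeyedVectors object.
    """
    if word not in vectors.key_to_index:
        return None
    return vectors.most_similar(word, topn=topn)
```

For example, `similar_or_none(model, "Beispiel")` returns a list of `(word, similarity)` pairs when the word is known and `None` otherwise.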
+ ## Citation
+
+ ```
+ @thesis{mueller2015,
+ author = {{Müller}, Andreas},
+ title = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
+ school = {Technische Universität Berlin},
+ year = 2015,
+ month = jun,
+ type = {Bachelor's Thesis},
+ url = {https://devmount.github.io/GermanWordEmbeddings}
+ }
+ ```