Word2vec/german_model


lbourdois committed on May 28, 2023

Commit 37137f3 • 1 Parent(s): cd14ece

Create README.md

Files changed (1)

README.md ADDED +42 -0

	@@ -0,0 +1,42 @@

+---
+tags:
+- word2vec
+language: de
+license: mit
+datasets:
+- wikipedia
+---
+## Description
+German word embedding model trained by Müller with the following parameter configuration:
+- a corpus as big as possible (and as diverse as possible without being informal)
+- filtering of punctuation and stopwords
+- forming bigram tokens
+- using skip-gram as training algorithm with hierarchical softmax
+- window size between 5 and 10
+- dimensionality of feature vectors of 300 or more
+- using negative sampling with 10 samples
+- ignoring all words with total frequency lower than 50
+
+For more information, see [https://devmount.github.io/GermanWordEmbeddings/](https://devmount.github.io/GermanWordEmbeddings/)
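The configuration listed above could be expressed roughly as follows. This is a minimal sketch, not the author's training script: the stopword set, the helper names `clean` and `train`, and the concrete choices `window=5` and `vector_size=300` (the lower ends of the stated ranges) are assumptions made for illustration. The keyword names follow gensim 4's `Word2Vec`, and bigram detection is delegated to gensim's `Phrases` with its default settings.

```python
import string

# Hypothetical stopword set for illustration; a real pipeline would use a full German list.
STOPWORDS = {"der", "die", "das", "und", "ist"}

# Keyword arguments for gensim's Word2Vec (gensim >= 4 names), mirroring the list above.
W2V_PARAMS = {
    "sg": 1,             # skip-gram as training algorithm
    "hs": 1,             # hierarchical softmax
    "negative": 10,      # negative sampling with 10 samples
    "window": 5,         # assumption: lower end of "between 5 and 10"
    "vector_size": 300,  # assumption: lower end of "300 or more"
    "min_count": 50,     # ignore words with total frequency lower than 50
}


def clean(sentence):
    """Split on whitespace, strip ASCII punctuation, drop stopwords and empty tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = (tok.translate(table) for tok in sentence.split())
    return [tok for tok in tokens if tok and tok.lower() not in STOPWORDS]


def train(sentences):
    """Train on a list of token lists. Requires gensim; imported lazily on purpose."""
    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases

    bigrams = Phrases(sentences)  # learns frequent word pairs and joins them with "_"
    return Word2Vec(sentences=[bigrams[s] for s in sentences], **W2V_PARAMS)
```

With `min_count=50`, `train` only produces a usable vocabulary on a large corpus, which matches the first point in the list.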
+## How to use?
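+```
+from gensim.models import KeyedVectors
+from huggingface_hub import hf_hub_download
+
+# Download the binary word2vec file from the Hub (cached locally after the first call)
+path = hf_hub_download(repo_id="Word2vec/german_model", filename="german.model")
+model = KeyedVectors.load_word2vec_format(path, binary=True, unicode_errors="ignore")
+# Query with a word from the model's vocabulary; unknown words raise a KeyError
+model.most_similar("Beispiel")
+```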
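To make the result of `most_similar` easier to interpret: it ranks vocabulary entries by cosine similarity to the query vector. The sketch below reproduces that ranking in plain Python on a toy vocabulary. The three-dimensional vectors are invented for illustration and are not taken from the model, and the standalone `most_similar` function is a simplified stand-in for the gensim method, not its implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=2):
    """Return the topn (word, similarity) pairs closest to `word`, excluding itself."""
    query = vectors[word]  # raises KeyError for out-of-vocabulary words
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors, chosen so that "Hund" and "Katze" point in similar directions.
toy = {
    "Hund": [1.0, 0.0, 0.0],
    "Katze": [0.9, 0.1, 0.0],
    "Auto": [0.0, 1.0, 0.0],
}

print(most_similar(toy, "Hund"))
```

As in the toy version, a query word that is missing from the vocabulary raises a `KeyError`, so checking `word in model.key_to_index` before querying is a reasonable guard.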
+## Citation
+```
+@thesis{mueller2015,
+  author = {{Müller}, Andreas},
+  title  = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
+  school = {Technische Universität Berlin},
+  year   = 2015,
+  month  = jun,
+  type   = {Bachelor's Thesis},
+  url    = {https://devmount.github.io/GermanWordEmbeddings}
+}
+```