---
tags:
- word2vec
language: de
license: mit
datasets:
- wikipedia
---

## Description

German word embedding model trained by Müller with the following parameter configuration:

- a corpus as big as possible (and as diverse as possible without being informal)
- filtering of punctuation and stopwords
- forming bigram tokens
- using skip-gram as the training algorithm with hierarchical softmax
- window size between 5 and 10
- dimensionality of feature vectors of 300 or more
- using negative sampling with 10 samples
- ignoring all words with a total frequency lower than 50

For more information, see [https://devmount.github.io/GermanWordEmbeddings/](https://devmount.github.io/GermanWordEmbeddings/)

## How to use?

```
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the binary word2vec file from the Hub (cached locally after the first call)
model_path = hf_hub_download(repo_id="Word2vec/german_model", filename="german.model")

# Load the vectors; undecodable bytes in the vocabulary are skipped
model = KeyedVectors.load_word2vec_format(model_path, binary=True, unicode_errors="ignore")
```

## Citation

```
@thesis{mueller2015,
  author = {{Müller}, Andreas},
  title  = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
  school = {Technische Universität Berlin},
  year   = 2015,
  month  = jun,
  type   = {Bachelor's Thesis},
  url    = {https://devmount.github.io/GermanWordEmbeddings}
}
```
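
## Training configuration sketch

The parameter configuration listed in the description can be expressed, roughly, as keyword arguments for gensim's `Word2Vec` class. This is a minimal sketch and not the original training script: it assumes gensim 4.x parameter names, and the concrete values for `window` (5) and `vector_size` (300) are picked from the stated ranges as assumptions. `W2V_PARAMS` and `train` are hypothetical names used only for illustration. Corpus filtering and bigram forming are not covered here.

```python
# Hypothetical mapping of the listed configuration onto gensim 4.x
# Word2Vec keyword arguments. The values for "window" and "vector_size"
# are assumptions chosen from the ranges given in the description.
W2V_PARAMS = {
    "sg": 1,             # skip-gram as the training algorithm
    "hs": 1,             # hierarchical softmax
    "negative": 10,      # negative sampling with 10 samples
    "window": 5,         # "between 5 and 10"; 5 is assumed here
    "vector_size": 300,  # "300 or more"; 300 is assumed here
    "min_count": 50,     # ignore words with total frequency lower than 50
}


def train(sentences, **overrides):
    """Train a Word2Vec model on an iterable of token lists.

    Keyword overrides replace the defaults in W2V_PARAMS.
    """
    # Imported lazily so the parameter mapping can be inspected without gensim.
    from gensim.models import Word2Vec

    params = {**W2V_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

With `min_count=50`, a small test corpus leaves no words in the vocabulary, so for a quick experiment it may help to override it, for example `train(sentences, min_count=1)`.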