---
language: pl
tags:
- fastText
datasets:
- kgr10
---

# KGR10 FastText Polish word embeddings

Distributional language models (both textual and binary) for Polish word embeddings, trained on the KGR10 corpus (over 4 billion words) using fastText, in the following variants (all possible combinations):

- dimension: 100, 300
- method: skipgram, cbow
- tool: FastText, Magnitude
- source text: plain, plain.lower, plain.lemma, plain.lemma.lower

## Models

The repository contains 4 selected models that were examined in the paper (see Citation). The best-performing model is the default model/config (see `default_config.json`).

## Usage

To use these embedding models, install the [embeddings](https://github.com/CLARIN-PL/embeddings) library:

```bash
pip install clarinpl-embeddings
```

### Utilising the default model (the easiest way)

Word embedding:

```python
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
```

Document embedding (averaged over words):

```python
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

print(sentence.embedding)
```

### Customisable way

Word embedding:

```python
from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence

config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticWordEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
```

Document embedding (averaged over words):

```python
from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence

config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticDocumentEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

print(sentence.embedding)
```

## Citation

All variants of the embeddings are available in a NextCloud directory. If you use them, please cite the following article:

```
@article{kocon2018embeddings,
  author  = {Koco\'{n}, Jan and Gawor, Micha{\l}},
  title   = {Evaluating {KGR10} {P}olish word embeddings in the recognition of temporal expressions using {BiLSTM-CRF}},
  journal = {Schedae Informaticae},
  volume  = {27},
  year    = {2018},
  url     = {http://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13931/},
  doi     = {10.4467/20838476SI.18.008.10413}
}
```
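
## Sanity check: comparing token vectors

The snippet below is an illustrative sketch, not part of the original card: it reuses the default-model example above and compares two embedded tokens with cosine similarity via PyTorch (which Flair already depends on). The expected behaviour of `token.embedding` as a 1-D tensor follows the Flair API used throughout this card.

```python
import torch
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

# After embedding, each token carries a dense vector in token.embedding.
# Compare the first two tokens with cosine similarity (1.0 = same direction).
first, second = sentence[0].embedding, sentence[1].embedding
print(torch.nn.functional.cosine_similarity(first, second, dim=0))
```

Because these are fastText vectors built from character n-grams, even out-of-vocabulary tokens receive non-zero embeddings, so this check works on arbitrary Polish text.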