---
language:
- multilingual
- ar
- bg
- ca
- cs
- da
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- hu
- hy
- id
- it
- ja
- ka
- ko
- ku
- lt
- lv
- mk
- mn
- mr
- ms
- my
- nb
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- th
- tr
- uk
- ur
- vi
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language_bcp47:
- fr-ca
- pt-br
- zh-cn
- zh-tw
pipeline_tag: sentence-similarity
---

# lang-uk/ukr-paraphrase-multilingual-mpnet-base

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned for the Ukrainian language. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The original model used for fine-tuning is `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`. See our paper [Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation](https://aclanthology.org/2023.unlp-1.2/) for details.

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('lang-uk/ukr-paraphrase-multilingual-mpnet-base')
embeddings = model.encode(sentences)
print(embeddings)
```
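For semantic search or clustering, the embeddings can be compared with cosine similarity. Below is a minimal sketch using the `util.cos_sim` helper from sentence-transformers; the Ukrainian example sentences are illustrative only and do not come from the paper:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative sentences: a paraphrase pair and one unrelated sentence
sentences = [
    "Кіт спить на дивані.",   # "The cat is sleeping on the sofa."
    "На дивані дрімає кіт.",  # "A cat is dozing on the sofa."
    "Завтра буде дощ.",       # "It will rain tomorrow."
]

model = SentenceTransformer('lang-uk/ukr-paraphrase-multilingual-mpnet-base')
embeddings = model.encode(sentences)

# Pairwise cosine similarities; the paraphrase pair should score highest
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```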
## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('lang-uk/ukr-paraphrase-multilingual-mpnet-base')
model = AutoModel.from_pretrained('lang-uk/ukr-paraphrase-multilingual-mpnet-base')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, average pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

## Citing & Authors

If you find this model helpful, feel free to cite our publication [Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation](https://aclanthology.org/2023.unlp-1.2/):

```bibtex
@inproceedings{laba-etal-2023-contextual,
    title = "Contextual Embeddings for {U}krainian: A Large Language Model Approach to Word Sense Disambiguation",
    author = "Laba, Yurii  and
      Mudryi, Volodymyr  and
      Chaplynskyi, Dmytro  and
      Romanyshyn, Mariana  and
      Dobosevych, Oles",
    editor = "Romanyshyn, Mariana",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.2",
    doi = "10.18653/v1/2023.unlp-1.2",
    pages = "11--19",
    abstract = "This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Ukrainian language based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on the dataset generated in an unsupervised way to obtain better contextual embeddings for words with multiple senses. The paper presents a method for generating a new dataset for WSD evaluation in the Ukrainian language based on the SUM dictionary. We developed a comprehensive framework that facilitates the generation of WSD evaluation datasets, enables the use of different prediction strategies, LLMs, and pooling strategies, and generates multiple performance reports. Our approach shows 77,9{\%} accuracy for lexical meaning prediction for homonyms.",
}
```

Copyright: Yurii Laba, Volodymyr Mudryi, Dmytro Chaplynskyi, Mariana Romanyshyn, Oles Dobosevych, [Ukrainian Catholic University](https://ucu.edu.ua/en/), [lang-uk project](https://lang.org.ua/en/), 2023

The original model used for fine-tuning was trained by [sentence-transformers](https://www.sbert.net/).