---
language:
- eng
- hin
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
widget:
- source_sentence: प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए
  sentences:
  - प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए
  - Pranav studied law and became a politician at the age of 30.
  - Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye.
  - Pranav ne law ki padhai kari aur 30 ki umar mein politics se jud gaye.
  - प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था
  - Pranav was born in a family of politicians
  - Pranav ka janm rajneetigyon ke parivar mein hua tha
  - Pranav ka janm politicians ki family mein hua tha
- source_sentence: Is baar diwali par main 15 din ke liye ghar ja rahi hoon
  sentences:
  - october mein deepawali ki chhutiyan hai sabhi ke.
  - Mere parivaar mein tyoharon pe devi puja ki parampara hai
  - Pavan ne kanoon ki padhai ki aur 30 ki umar mein rajneeti se jud gaye
pipeline_tag: sentence-similarity
license: apache-2.0
---

## Bhasha embed v0 model

This is an embedding model that can embed texts in Hindi (Devanagari script), English and Romanised Hindi. Many multilingual embedding models handle Hindi and English texts well individually, but they lack the following capabilities:

1. **Romanised Hindi support**: This is the first embedding model to support Romanised Hindi (transliterated Hindi / hin_Latn).
2. **Cross-lingual alignment**: The model outputs language-agnostic embeddings, which makes it possible to query a multilingual candidate pool containing a mix of Hindi, English and Romanised Hindi texts.

## Model Details

- **Supported Languages:** Hindi, English, Romanised Hindi
- **Base model:** [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- **Training GPUs:** 1x RTX 4090
- **Training methodology:** Distillation from an English embedding model, followed by fine-tuning on triplet data (see the sketch after this list)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine similarity
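The training scripts are not published with this card, so the snippet below is only a minimal sketch of what the triplet fine-tuning stage could look like with the Sentence Transformers `fit` API. The triplet shown, the `TripletLoss` choice and all hyperparameters are illustrative assumptions, not the actual training setup; the distillation stage is not shown.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the base encoder; a mean-pooling head is added automatically.
model = SentenceTransformer("google/muril-base-cased")

# Illustrative (anchor, positive, negative) triplet spanning scripts.
train_examples = [
    InputExample(texts=[
        "प्रणव ने कानून की पढ़ाई की",                       # anchor: Hindi
        "Pranav ne kanoon ki padhai ki",                    # positive: Romanised Hindi
        "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",  # negative
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# One short pass, purely to show the API shape.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```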
### Model Sources

- **Repository:** [github_link](https://github.com/akshita-sukhlecha/bhasha-embed)
- **Developer:** [Akshita Sukhlecha](https://www.linkedin.com/in/akshita-sukhlecha/)

---

## Results

The evaluation covers four task groups (detailed numbers are linked below):

- **English-Hindi cross-lingual alignment:** tasks whose corpus contains texts in Hindi as well as English
- **Romanised Hindi tasks:** tasks with texts in Romanised Hindi
- **Retrieval with a multilingual corpus:** retrieval tasks whose corpus contains texts in Hindi, English as well as Romanised Hindi
- **Hindi tasks:** tasks with texts in Hindi (Devanagari script)

### Additional information

- Some task dataset links: [Belebele](https://huggingface.co/datasets/facebook/belebele), [MLQA](https://huggingface.co/datasets/facebook/mlqa), [XQuAD](https://huggingface.co/datasets/google/xquad), [SemRel24](https://huggingface.co/datasets/SemRel/SemRel2024)
- hin_Latn tasks: most hin_Latn tasks were created by transliterating Hindi texts with the [indic-trans library](https://github.com/libindic/indic-trans)
- Detailed results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/results/eval_results.csv)
- Script to reproduce the results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/evaluator.py)

---

## Sample outputs

The sample output images (Examples 1-4) are omitted here; see the [repository](https://github.com/akshita-sukhlecha/bhasha-embed) for them.

---

## Usage

Below are examples of encoding queries and passages and computing similarity scores using Sentence Transformers and 🤗 Transformers.

### Using Sentence Transformers

First install the Sentence Transformers library (`pip install -U sentence-transformers`), then run the following code:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AkshitaS/bhasha-embed-v0")

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha",
]

query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)

# Embeddings are L2-normalised, so the dot product equals cosine similarity.
similarity_matrix = query_embeddings @ document_embeddings.T
print(similarity_matrix.shape)
# (3, 6)
print(np.round(similarity_matrix, 2))
# [[1.00 0.97 0.97 0.92 0.90 0.91]
#  [0.97 1.00 0.96 0.90 0.91 0.91]
#  [0.97 0.96 1.00 0.89 0.90 0.92]]
```
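Since the embeddings are aligned across languages and scripts, a single query can also be run against a mixed-script candidate pool. Below is a minimal retrieval sketch using `sentence_transformers.util.semantic_search`; the corpus and the Romanised Hindi query are illustrative, not taken from the evaluation data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AkshitaS/bhasha-embed-v0")

# A mixed candidate pool: Hindi (Devanagari), English and Romanised Hindi together.
corpus = [
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha",
    "october mein deepawali ki chhutiyan hai sabhi ke.",
]
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

query = "Pranav ki family mein sab politicians the"  # illustrative query
query_embedding = model.encode(query, normalize_embeddings=True)

# Top-k cosine-similarity search over the whole mixed corpus.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {corpus[hit['corpus_id']]}")
```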
### Using 🤗 Transformers

```python
import numpy as np
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average token embeddings over the sequence.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


model_id = "AkshitaS/bhasha-embed-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha",
]

input_texts = queries + documents
batch_dict = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

similarity_matrix = (embeddings[:len(queries)] @ embeddings[len(queries):].T).detach().numpy()
print(similarity_matrix.shape)
# (3, 6)
print(np.round(similarity_matrix, 2))
# [[1.00 0.97 0.97 0.92 0.90 0.91]
#  [0.97 1.00 0.96 0.90 0.91 0.91]
#  [0.97 0.96 1.00 0.89 0.90 0.92]]
```

### Citation

To cite this model:

```bibtex
@misc{sukhlecha_2024_bhasha_embed_v0,
  author       = {Sukhlecha, Akshita},
  title        = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month        = {June},
  year         = {2024},
  url          = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
```