SentenceTransformer based on HooshvareLab/bert-base-parsbert-uncased

This sentence-transformers model is finetuned from HooshvareLab/bert-base-parsbert-uncased with a focus on enhancing Retrieval-Augmented Generation (RAG) systems. It maps sentences and paragraphs to a 768-dimensional dense vector space, making it highly effective for retrieving contextually relevant information to generate accurate and coherent responses in various applications such as QA systems, chatbots, and content generation.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: HooshvareLab/bert-base-parsbert-uncased
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 tokens
Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("myrkur/sentence-transformer-parsbert-fa")
# Run inference
sentences = [
    'پرتغالی، در وطن اصلی خود، پرتغال، تقریباً توسط ۱۰ میلیون نفر جمعیت صحبت می\u200cشود. پرتغالی همچنین به عنوان زبان رسمی برزیل، بیش از ۲۰۰ میلیون نفر در آن کشور و همچنین کشورهای همسایه، در شرق پاراگوئه و در شمال اروگوئه، سخنگو دارد، که کمی بیش از نیمی از جمعیت آمریکای جنوبی را تشکیل می\u200cدهند؛ بنابراین پرتغالی پرسخنگوترین زبان رسمی رومی در یک کشور واحد است. این زبان در شش کشور آفریقایی زبان رسمی است (آنگولا، دماغه سبز، گینه بیسائو، موزامبیک، گینه استوایی و سائوتومه و پرنسیپ) و توسط ۳۰ میلیون نفر از ساکنان آن قاره به عنوان زبان نخست گویش می\u200cشود. در آسیا، پرتغالی با سایر زبان\u200cها در تیمور شرقی و ماکائو رسمی است، در حالی که بیشتر پرتغالی\u200cزبانان در آسیا - حدود ۴۰۰٫۰۰۰ نفر - به دلیل بازگشت مهاجرت ژاپنی\u200cهای برزیل ساکن ژاپن هستند. در آمریکای شمالی ۱٫۰۰۰٫۰۰۰ نفر به پرتغالی به عنوان زبان نخست خود صحبت می\u200cکنند. پرتغالی در اقیانوسیه به دلیل شمار سخنگویانش در تیمور شرقی، پس از فرانسوی، دومین زبان رومی است که بیش از همه گویش می\u200cشود. نزدیکترین خویشاوند آن، گالیسی، دارای وضعیت رسمی در جامعه خودمختار گالیسیا در اسپانیا، همراه با اسپانیایی است.',
    'در حدود اواخر کدام قرن پیش از میلاد سکاهای کوچ\u200cنشین در مرزهای شرقی اشکانیان پیشروی کردند؟',
    'عباس جدیدی که بود؟',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Usage in Retrieval-Augmented Generation (RAG) Systems

Retrieval-Augmented Generation (RAG) systems leverage a combination of retrieval and generation techniques to enhance the quality and accuracy of generated responses. This model can be effectively used to retrieve relevant information from a large corpus, which can then be used to generate more informed and contextually accurate responses. Here's how you can integrate this model into a RAG system:

Install Necessary Libraries: Ensure you have the required libraries:

pip install -U sentence-transformers transformers

from sentence_transformers import SentenceTransformer, util
import torch

# Load the model
model = SentenceTransformer("myrkur/sentence-transformer-parsbert-fa")

# Example corpus
corpus = [
    'پرتغالی، در وطن اصلی خود، پرتغال، تقریباً توسط ۱۰ میلیون نفر جمعیت صحبت می‌شود...',
    'اشکانیان حدود دو قرن بر ایران حکومت کردند...',
    'عباس جدیدی، کشتی‌گیر سابق ایرانی است...',
    # ... (more documents)
]

# Encode the corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

Retrieve Relevant Information: Given a user query, retrieve the most relevant documents from the corpus:

# User query
query = "عباس جدیدی که بود؟"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-k most similar documents
top_k = 5
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
hits = hits[0]

# Print the retrieved documents
for hit in hits:
    print(f"Score: {hit['score']:.4f}")
    print(corpus[hit['corpus_id']])

Conclusion

This sentence-transformer model is a powerful tool for various NLP applications, particularly in retrieval-augmented generation systems, enabling more accurate and contextually relevant information retrieval and generation.

Contact

For questions or further information, please contact:

Amir Masoud Ahmadi: amirmasoud.ahkol@gmail.com

myrkur
/

sentence-transformer-parsbert-fa