---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MultipleNegativesRankingLoss
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
- ar
base_model:
- aubmindlab/bert-base-arabertv02
---

### 🚀 Arabic-Retrieval-v1.0

This is a high-performance Arabic information retrieval model built on the robust **sentence-transformers** framework. It delivers **state-of-the-art performance** and is tailored to the richness and complexity of the Arabic language.

---

## 🔑 Key Features

- **🔥 Outstanding Performance**: Matches the accuracy of top-tier multilingual models like `multilingual-e5-large`. See [evaluation](https://huggingface.co/omarelshehy/Arabic-retrieval-v1.0#evaluation).
- **💡 Arabic-Focused**: Designed specifically for the nuances and dialects of Arabic, ensuring more accurate and context-aware results.
- **📉 Lightweight Efficiency**: Requires **25%–50% less memory** than comparable multilingual models, making it ideal for environments with limited resources or edge deployments.

---

## 🌍 Why This Model?

Multilingual models are powerful, but they are often bulky and not optimized for specific languages. This model bridges that gap, offering Arabic-native capabilities without sacrificing performance or efficiency. Whether you are working on search engines, chatbots, or large-scale NLP pipelines, it provides a **fast, accurate, and resource-efficient solution**.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load the model and run inference. It is important to prepend the prefix `query:` to your queries and `passage:` to your passages when retrieving, as in the following example:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

# Query (prefixed with "query: ")
query = "query: كيف يمكن للذكاء الاصطناعي تحسين طرق التدريس التقليدية؟"

# Passages (each prefixed with "passage: ")
passages = [
    "passage: طرق التدريس التقليدية تستفيد من الذكاء الاصطناعي عبر تحسين عملية المتابعة وتخصيص التجربة التعليمية. يقوم الذكاء الاصطناعي بتحليل بيانات الطلاب وتقديم توصيات فعالة للمعلمين حول طرق التدريس الأفضل.",
    "passage: تطوير التعليم الشخصي يعتمد بشكل كبير على الذكاء الاصطناعي، الذي يقوم بمتابعة تقدم الطلاب بشكل فردي. يقدم الذكاء الاصطناعي حلولاً تعليمية مخصصة لكل طالب بناءً على مستواه وأدائه.",
    "passage: الدقة في تقييم الطلاب تتزايد بفضل الذكاء الاصطناعي الذي يقارن النتائج مع معايير متقدمة. بالرغم من التحديات التقليدية، الذكاء الاصطناعي يوفر أدوات تحليل تتيح تقييماً أدق لأداء الطلاب.",
]

# Encode the query and the passages
embeddings_query = model.encode(query)
embeddings_passages = model.encode(passages)

# Compute similarity scores between the query and each passage
similarities = model.similarity(embeddings_query, embeddings_passages)

# Get the best matching passage for the query
best_match = passages[similarities.argmax().item()]
print(f"Best matching passage is {best_match}")
```
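For retrieval over a larger corpus, the same prefix convention applies. Below is a minimal sketch (not part of the original card) that wraps the prefixing in small helper functions and ranks passages with `sentence_transformers.util.semantic_search`; the helper names, the toy corpus, and the `top_k` value are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

def embed_queries(queries):
    # Hypothetical helper: prepend the "query: " prefix before encoding
    return model.encode([f"query: {q}" for q in queries], convert_to_tensor=True)

def embed_passages(passages):
    # Hypothetical helper: prepend the "passage: " prefix before encoding
    return model.encode([f"passage: {p}" for p in passages], convert_to_tensor=True)

# Toy corpus (raw passages, prefixes are added by the helpers)
corpus = [
    "الذكاء الاصطناعي يساعد المعلمين في تخصيص التجربة التعليمية.",
    "تاريخ الخط العربي يمتد لقرون من التطور الفني.",
    "تحليل بيانات الطلاب يتيح توصيات أدق حول طرق التدريس.",
]
corpus_embeddings = embed_passages(corpus)
query_embeddings = embed_queries(["كيف يحسن الذكاء الاصطناعي التدريس؟"])

# Cosine-similarity search: returns one ranked hit list per query, each
# hit a dict with "corpus_id" (index into corpus) and "score"
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])
```

In a production setting the corpus embeddings would typically be computed once and cached or stored in a vector index rather than re-encoded per query.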
## Evaluation

The model was evaluated on 3 different datasets using the NDCG@10 metric:

- Dataset 1: [castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi)
- Dataset 2: [Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset)
- Dataset 3: [sadeem-ai/sadeem-ar-eval-retrieval-questions](https://huggingface.co/datasets/sadeem-ai/sadeem-ar-eval-retrieval-questions)

It is compared against other high-performing models:

| **Model**                           | **Dataset 1** | **Dataset 2** | **Dataset 3** |
|-------------------------------------|---------------|---------------|---------------|
| Arabic-Retrieval-v1.0               | 0.875         | **0.72**      | 0.679         |
| intfloat/multilingual-e5-large      | **0.89**      | 0.719         | **0.698**     |
| intfloat/multilingual-e5-base       | 0.87          | 0.69          | 0.686         |

## Citation

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
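As a closing note on the evaluation setup: NDCG@10 scores like those in the table can be computed with the library's built-in `InformationRetrievalEvaluator`. The sketch below is not the original evaluation script; the toy queries, corpus, and relevance mapping (and the prefixing) are illustrative assumptions, and a real run would build these dictionaries from the datasets listed above.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

# Toy stand-ins for one of the evaluation datasets (assumption: a real
# run would populate these dicts from e.g. castorini/mr-tydi)
queries = {"q1": "query: كيف يحسن الذكاء الاصطناعي التدريس؟"}
corpus = {
    "d1": "passage: الذكاء الاصطناعي يخصص التجربة التعليمية لكل طالب.",
    "d2": "passage: تاريخ الخط العربي يمتد لقرون من التطور الفني.",
}
relevant_docs = {"q1": {"d1"}}  # query id -> set of relevant corpus ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    name="toy-arabic-retrieval",
)

# In recent sentence-transformers versions this returns a dict of
# metrics, including NDCG@10
results = evaluator(model)
print(results)
```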