{
"model_id": "AITeamVN/Vietnamese_Embedding",
"downloads": 13362,
"tags": [
"sentence-transformers",
"safetensors",
"xlm-roberta",
"Embedding",
"sentence-similarity",
"vi",
"base_model:BAAI/bge-m3",
"base_model:finetune:BAAI/bge-m3",
"license:apache-2.0",
"autotrain_compatible",
"text-embeddings-inference",
"endpoints_compatible",
"region:us"
],
"description": "--- license: apache-2.0 language: - vi base_model: - BAAI/bge-m3 pipeline_tag: sentence-similarity library_name: sentence-transformers tags: - Embedding --- ## Model Card: Vietnamese_Embedding Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model ( to enhance retrieval capabilities for Vietnamese. * The model was trained on approximately 300,000 triplets of queries, positive documents, and negative documents for Vietnamese. * The model was trained with a maximum sequence length of 2048. ## Model Details ### Model Description - **Model Type:** Sentence Transformer - **Base model:** BAAI/bge-m3 - **Maximum Sequence Length:** 2048 tokens - **Output Dimensionality:** 1024 dimensions - **Similarity Function:** Dot product Similarity - **Language:** Vietnamese - **Licence:** Apache 2.0 ## Usage ### Evaluation: - Dataset: Entire training dataset of Legal Zalo 2021. Our model was not trained on this dataset. | Model | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 | MRR@10 | |----------------------|------------|------------|------------|-------------|--------------| | Vietnamese_Reranker | 0.7944 | 0.9324 | 0.9537 | 0.9740 | 0.8672 | | Vietnamese_Embedding_v2 | 0.7262 | 0.8927 | 0.9268 | 0.9578 | 0.8149 | | Vietnamese_Embedding (public) | 0.7274 | 0.8992 | 0.9305 | 0.9568 | 0.8181 | | Vietnamese-bi-encoder (BKAI) | 0.7109 | 0.8680 | 0.9014 | 0.9299 | 0.7951 | | BGE-M3 | 0.5682 | 0.7728 | 0.8382 | 0.8921 | 0.6822 | Vietnamese_Reranker and Vietnamese_Embedding_v2 was trained on 1100000 triplets. Although the score on the legal domain drops a bit on Vietnamese_Embedding_v2, since this phase data is much larger, it is very good for other domains. You can access 2 model via link: Vietnamese_Embedding_v2, Vietnamese_Reranker You can reproduce the evaluation result by running code python evaluation_model.py (data downloaded from Kaggle). 
## Contact Email: nguyennhotrung3004@gmail.com **Developer** Member: Nguyễn Nho Trung, Nguyễn Nhật Quang ## Citation",
"model_explanation_gemini": "\"Fine-tuned from BGE-M3 to generate Vietnamese sentence embeddings for improved retrieval tasks, with a 1024-dimensional output and 2048-token sequence length.\"\n\n### Model Features: \n- **Base Model:** BAAI/bge-m3 \n- **Task:** Sentence similarity (dot product) \n- **Language:** Vietnamese \n- **Sequence Length:** 2048 tokens \n- **Output Dimension:** 1024 \n- **Training Data:** ~300K Vietnamese query-d",
"release_year": null,
"parameter_count": null,
"is_fine_tuned": true,
"category": "Embedding",
"model_family": "BERT",
"api_enhanced": true
}