| { | |
| "model_id": "AITeamVN/Vietnamese_Embedding", | |
| "downloads": 13362, | |
| "tags": [ | |
| "sentence-transformers", | |
| "safetensors", | |
| "xlm-roberta", | |
| "Embedding", | |
| "sentence-similarity", | |
| "vi", | |
| "base_model:BAAI/bge-m3", | |
| "base_model:finetune:BAAI/bge-m3", | |
| "license:apache-2.0", | |
| "autotrain_compatible", | |
| "text-embeddings-inference", | |
| "endpoints_compatible", | |
| "region:us" | |
| ], | |
| "description": "--- license: apache-2.0 language: - vi base_model: - BAAI/bge-m3 pipeline_tag: sentence-similarity library_name: sentence-transformers tags: - Embedding --- ## Model Card: Vietnamese_Embedding Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model to enhance retrieval capabilities for Vietnamese. * The model was trained on approximately 300,000 triplets of queries, positive documents, and negative documents for Vietnamese. * The model was trained with a maximum sequence length of 2048. ## Model Details ### Model Description - **Model Type:** Sentence Transformer - **Base model:** BAAI/bge-m3 - **Maximum Sequence Length:** 2048 tokens - **Output Dimensionality:** 1024 dimensions - **Similarity Function:** Dot product similarity - **Language:** Vietnamese - **License:** Apache 2.0 ## Usage ### Evaluation: - Dataset: Entire training dataset of Legal Zalo 2021. Our model was not trained on this dataset. | Model | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 | MRR@10 | |----------------------|------------|------------|------------|-------------|--------------| | Vietnamese_Reranker | 0.7944 | 0.9324 | 0.9537 | 0.9740 | 0.8672 | | Vietnamese_Embedding_v2 | 0.7262 | 0.8927 | 0.9268 | 0.9578 | 0.8149 | | Vietnamese_Embedding (public) | 0.7274 | 0.8992 | 0.9305 | 0.9568 | 0.8181 | | Vietnamese-bi-encoder (BKAI) | 0.7109 | 0.8680 | 0.9014 | 0.9299 | 0.7951 | | BGE-M3 | 0.5682 | 0.7728 | 0.8382 | 0.8921 | 0.6822 | Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets. Although Vietnamese_Embedding_v2 scores slightly lower on the legal domain, its training data for this phase is much larger, so it performs very well on other domains. You can access both models via their links: Vietnamese_Embedding_v2, Vietnamese_Reranker. You can reproduce the evaluation results by running `python evaluation_model.py` (data downloaded from Kaggle). 
## Contact Email: nguyennhotrung3004@gmail.com **Developer** Member: Nguyễn Nho Trung, Nguyễn Nhật Quang ## Citation", | |
| "model_explanation_gemini": "\"Fine-tuned from BGE-M3 to generate Vietnamese sentence embeddings for improved retrieval tasks, with a 1024-dimensional output and 2048-token sequence length.\"\n\n### Model Features: \n- **Base Model:** BAAI/bge-m3 \n- **Task:** Sentence similarity (dot product) \n- **Language:** Vietnamese \n- **Sequence Length:** 2048 tokens \n- **Output Dimension:** 1024 \n- **Training Data:** ~300K Vietnamese query-d", | |
| "release_year": null, | |
| "parameter_count": null, | |
| "is_fine_tuned": true, | |
| "category": "Embedding", | |
| "model_family": "BERT", | |
| "api_enhanced": true | |
| } |
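
The evaluation table in the description above reports Accuracy@k and MRR@10 for retrieval with dot-product similarity. As a rough illustration of how such metrics are typically computed, here is a minimal sketch in plain Python. It is not the authors' `evaluation_model.py`; the helper names (`rank_documents`, `accuracy_at_k`, `mrr_at_k`), the two-dimensional toy vectors, and the assumption of exactly one relevant document per query are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity, the similarity function named in the model card."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document ids sorted by descending dot product with the query."""
    return sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)


def accuracy_at_k(rankings: List[List[str]], relevant: List[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top k.

    Assumption: each query has exactly one relevant document.
    """
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel in ranked[:k])
    return hits / len(rankings)


def mrr_at_k(rankings: List[List[str]], relevant: List[str], k: int = 10) -> float:
    """Mean reciprocal rank, counting only hits within the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        top = ranked[:k]
        if rel in top:
            total += 1.0 / (top.index(rel) + 1)
    return total / len(rankings)


# Hypothetical toy embeddings; a real model would produce 1024-dimensional vectors.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
queries = [[0.9, 0.1], [0.2, 0.8]]
relevant = ["d1", "d3"]  # one relevant document id per query (assumption)

rankings = [rank_documents(q, docs) for q in queries]
print("Accuracy@1:", accuracy_at_k(rankings, relevant, 1))
print("Accuracy@3:", accuracy_at_k(rankings, relevant, 3))
print("MRR@10:", mrr_at_k(rankings, relevant, 10))
```

In this toy setup the first query ranks its relevant document first and the second query ranks it second, so Accuracy@1 is 0.5, Accuracy@3 is 1.0, and MRR@10 is 0.75. In practice the query and document vectors would come from the embedding model rather than being written by hand.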