back_rag_huggingface / model_data_json /AITeamVN_Vietnamese_Embedding.json

	{
	"model_id": "AITeamVN/Vietnamese_Embedding",
	"downloads": 13362,
	"tags": [
	"sentence-transformers",
	"safetensors",
	"xlm-roberta",
	"Embedding",
	"sentence-similarity",
	"vi",
	"base_model:BAAI/bge-m3",
	"base_model:finetune:BAAI/bge-m3",
	"license:apache-2.0",
	"autotrain_compatible",
	"text-embeddings-inference",
	"endpoints_compatible",
	"region:us"
	],
	"description": "--- license: apache-2.0 language: - vi base_model: - BAAI/bge-m3 pipeline_tag: sentence-similarity library_name: sentence-transformers tags: - Embedding --- ## Model Card: Vietnamese_Embedding Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model ( to enhance retrieval capabilities for Vietnamese. * The model was trained on approximately 300,000 triplets of queries, positive documents, and negative documents for Vietnamese. * The model was trained with a maximum sequence length of 2048. ## Model Details ### Model Description - Model Type: Sentence Transformer - Base model: BAAI/bge-m3 - Maximum Sequence Length: 2048 tokens - Output Dimensionality: 1024 dimensions - Similarity Function: Dot product Similarity - Language: Vietnamese - Licence: Apache 2.0 ## Usage ### Evaluation: - Dataset: Entire training dataset of Legal Zalo 2021. Our model was not trained on this dataset. \| Model \| Accuracy@1 \| Accuracy@3 \| Accuracy@5 \| Accuracy@10 \| MRR@10 \| \|----------------------\|------------\|------------\|------------\|-------------\|--------------\| \| Vietnamese_Reranker \| 0.7944 \| 0.9324 \| 0.9537 \| 0.9740 \| 0.8672 \| \| Vietnamese_Embedding_v2 \| 0.7262 \| 0.8927 \| 0.9268 \| 0.9578 \| 0.8149 \| \| Vietnamese_Embedding (public) \| 0.7274 \| 0.8992 \| 0.9305 \| 0.9568 \| 0.8181 \| \| Vietnamese-bi-encoder (BKAI) \| 0.7109 \| 0.8680 \| 0.9014 \| 0.9299 \| 0.7951 \| \| BGE-M3 \| 0.5682 \| 0.7728 \| 0.8382 \| 0.8921 \| 0.6822 \| Vietnamese_Reranker and Vietnamese_Embedding_v2 was trained on 1100000 triplets. Although the score on the legal domain drops a bit on Vietnamese_Embedding_v2, since this phase data is much larger, it is very good for other domains. You can access 2 model via link: Vietnamese_Embedding_v2, Vietnamese_Reranker You can reproduce the evaluation result by running code python evaluation_model.py (data downloaded from Kaggle). 
## Contact Email: nguyennhotrung3004@gmail.com Developer Member: Nguyễn Nho Trung, Nguyễn Nhật Quang ## Citation",
	"model_explanation_gemini": "\"Fine-tuned from BGE-M3 to generate Vietnamese sentence embeddings for improved retrieval tasks, with a 1024-dimensional output and 2048-token sequence length.\"\n\n### Model Features: \n- Base Model: BAAI/bge-m3 \n- Task: Sentence similarity (dot product) \n- Language: Vietnamese \n- Sequence Length: 2048 tokens \n- Output Dimension: 1024 \n- Training Data: ~300K Vietnamese query-d",
	"release_year": null,
	"parameter_count": null,
	"is_fine_tuned": true,
	"category": "Embedding",
	"model_family": "BERT",
	"api_enhanced": true
	}
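
The description above reports Accuracy@k and MRR@10 and names dot product as the similarity function. As a rough illustration of how such numbers could be computed, here is a minimal sketch in Python. It is not the authors' `evaluation_model.py`: the helper names (`rank_documents`, `accuracy_at_k`, `mrr_at_k`), the toy 3-dimensional vectors, and the assumption of exactly one relevant document per query are all hypothetical. Real embeddings from this model would have 1024 dimensions.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document ids sorted by descending dot product with the query."""
    return sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)


def accuracy_at_k(rankings: List[List[str]], relevant: List[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top k."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel in ranked[:k])
    return hits / len(rankings)


def mrr_at_k(rankings: List[List[str]], relevant: List[str], k: int) -> float:
    """Mean reciprocal rank of the relevant document, counting only the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        top = ranked[:k]
        if rel in top:
            total += 1.0 / (top.index(rel) + 1)
    return total / len(rankings)


# Toy stand-ins for embeddings (assumption: real vectors come from the model).
docs = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 0.0], "d3": [0.0, 0.0, 1.0]}
queries = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3]]
relevant = ["d1", "d3"]  # assumed: one relevant document per query

rankings = [rank_documents(q, docs) for q in queries]
print(accuracy_at_k(rankings, relevant, 1))
print(accuracy_at_k(rankings, relevant, 3))
print(mrr_at_k(rankings, relevant, 10))
```

In this toy setup the first query ranks its relevant document first and the second ranks it second, so Accuracy@1 is 0.5 and MRR@10 is 0.75. The exact metric definitions used for the reported table may differ, for example if a query has several relevant documents.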