How to use Adc05102002/bge-m3-vi-base with Transformers:

    # Load model directly
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("Adc05102002/bge-m3-vi-base")
    model = AutoModel.from_pretrained("Adc05102002/bge-m3-vi-base")

BGE-M3 Custom Tokenizer (8.5K Vocab)
A customized version of BAAI/bge-m3 with a newly trained tokenizer optimized for domain-specific multilingual retrieval workloads.
This model replaces the original XLM-R tokenizer vocabulary with a compact 8.5K-token vocabulary trained on a custom corpus.
Highlights
- Based on BAAI/bge-m3
- Custom tokenizer trained from scratch
- Reduced vocabulary size: 8500
- Long-context support: 8192 tokens
- Multilingual retrieval and embedding model
- Optimized for:
  - semantic search
  - RAG pipelines
  - dense retrieval
  - domain-specific embeddings
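
The retrieval use cases listed above all come down to embedding texts and ranking them by similarity. The following is a minimal sketch, not a documented recipe for this model: it assumes the dense embedding is the L2-normalized hidden state of the first (CLS) token, as in the original BAAI/bge-m3 dense mode, and that cosine similarity is the scoring function. The helper names `embed`, `cosine`, and `rank` are hypothetical, and the vectors in the usage lines are toy values rather than model output.

```python
import math


def embed(texts, model_name="Adc05102002/bge-m3-vi-base"):
    """Return one normalized vector per text (assumption: CLS pooling)."""
    # Imported lazily so the ranking helpers below work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=8192, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    # Take the first token of each sequence and L2-normalize it.
    cls = torch.nn.functional.normalize(hidden[:, 0], dim=-1)
    return cls.tolist()


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
print(rank(toy_query, toy_docs))  # → [1, 2, 0]
```

In a real pipeline, `embed` would be called once for the document collection and once per query, with the resulting vectors passed to `rank` or to a vector index.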
Model Details
Base Model
- Architecture: XLM-RoBERTa
- Original model: BAAI/bge-m3
- Embedding dimension: 1024
- Transformer encoder model
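
To give a sense of what the smaller vocabulary means for model size, the arithmetic below compares the token-embedding matrix at the two vocabulary sizes. It is a rough sketch under two assumptions that the card does not state: that the original XLM-R vocabulary has 250,002 entries, and that the input embedding matrix of this model was resized to match the new 8,500-token vocabulary.

```python
HIDDEN_SIZE = 1024          # embedding dimension from the model details
ORIGINAL_VOCAB = 250_002    # assumption: XLM-R vocabulary size used by BAAI/bge-m3
NEW_VOCAB = 8_500           # vocabulary size of the custom tokenizer


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


original = embedding_params(ORIGINAL_VOCAB)
compact = embedding_params(NEW_VOCAB)
saved = original - compact

print(f"original embedding matrix: {original:,} parameters")
print(f"compact embedding matrix:  {compact:,} parameters")
print(f"difference:                {saved:,} parameters")
```

Under these assumptions the embedding matrix shrinks from about 256M to about 8.7M parameters; the transformer encoder layers themselves are unaffected by the vocabulary change.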
Tokenizer
The original tokenizer was replaced with a newly trained tokenizer using:
    tokenizer = base_tokenizer.train_new_from_iterator(
        batch_iterator(),
        vocab_size=8500,
        min_frequency=2,
    )
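
The snippet above refers to `base_tokenizer` and `batch_iterator` without defining them. One way they could look is sketched below, assuming the base tokenizer is loaded from BAAI/bge-m3 and the training corpus is an in-memory list of strings. Both are assumptions, as is the `retrain_tokenizer` helper name. Unlike the call above, this `batch_iterator` takes the corpus as an argument.

```python
def batch_iterator(corpus, batch_size=1000):
    """Yield the corpus in lists of at most batch_size texts."""
    for start in range(0, len(corpus), batch_size):
        yield corpus[start:start + batch_size]


def retrain_tokenizer(corpus, vocab_size=8500):
    """Train a new tokenizer with the same pipeline as the base tokenizer."""
    # Imported lazily so batch_iterator can be used without transformers installed.
    from transformers import AutoTokenizer

    # Assumption: the base tokenizer is the fast tokenizer shipped with BAAI/bge-m3.
    base_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
    return base_tokenizer.train_new_from_iterator(
        batch_iterator(corpus),
        vocab_size=vocab_size,
    )


# Toy corpus, only to show how the batching behaves.
toy_corpus = ["first text", "second text", "third text"]
for batch in batch_iterator(toy_corpus, batch_size=2):
    print(batch)
```

`train_new_from_iterator` is only available on fast tokenizers. Extra keyword arguments, such as the `min_frequency` used above, are forwarded to the underlying trainer.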