BGE-M3 Custom Tokenizer (8.5K Vocab)

A customized version of BAAI/bge-m3 with a newly trained tokenizer optimized for domain-specific multilingual retrieval workloads.

This model replaces the original XLM-R tokenizer with a compact tokenizer (8,500-token vocabulary) trained on a custom corpus.

Highlights

  • Based on BAAI/bge-m3
  • Custom tokenizer trained from scratch
  • Reduced vocabulary size: 8500
  • Long-context support: 8192 tokens
  • Multilingual retrieval and embedding model
  • Optimized for:
    • semantic search
    • RAG pipelines
    • dense retrieval
    • domain-specific embeddings
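
To make the dense-retrieval use case concrete, here is a minimal sketch of ranking documents against a query by cosine similarity. The helper names (cosine_similarity, rank_documents) and the toy 3-dimensional vectors are assumptions for illustration only; in practice the vectors would be embeddings produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for embeddings (hypothetical values, not model output).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

In a RAG pipeline, the top-ranked documents from a step like this would typically be passed to the generator as context.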

Model Details

Base Model

  • Architecture: XLM-RoBERTa
  • Original model: BAAI/bge-m3
  • Embedding dimension: 1024
  • Transformer encoder model
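
The effect of the smaller vocabulary on the token-embedding table can be estimated with simple arithmetic. The sketch below assumes the embedding table is sized to match the tokenizer vocabulary, and it uses 250,002 as the original vocabulary size; that figure comes from the published XLM-R/bge-m3 configuration rather than from this card, so treat it as an assumption.

```python
EMBEDDING_DIM = 1024      # embedding dimension listed above
ORIGINAL_VOCAB = 250_002  # assumed: vocab_size in the BAAI/bge-m3 config
CUSTOM_VOCAB = 8_500      # vocabulary size of the custom tokenizer

def embedding_params(vocab_size, dim=EMBEDDING_DIM):
    """Number of weights in a vocab_size x dim token-embedding table."""
    return vocab_size * dim

original = embedding_params(ORIGINAL_VOCAB)
custom = embedding_params(CUSTOM_VOCAB)

print(f"original table: {original:,}")          # 256,002,048
print(f"custom table:   {custom:,}")            # 8,704,000
print(f"difference:     {original - custom:,}") # 247,298,048
```

This counts only the token-embedding table; the transformer layers themselves are unaffected by the vocabulary size.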

Tokenizer

The original tokenizer was replaced with a new tokenizer trained as follows:

tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=8500,
    min_frequency=2,
)
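
The snippet above leaves base_tokenizer and batch_iterator undefined. A minimal sketch of how they might look is given below. It assumes the base tokenizer is loaded from BAAI/bge-m3 with AutoTokenizer and that the corpus is an in-memory list of strings; the names texts, batch_size, and train_custom_tokenizer are hypothetical and not taken from the original training script.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive lists of strings from the corpus."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def train_custom_tokenizer(texts, base_model="BAAI/bge-m3", vocab_size=8500):
    """Train a new tokenizer with the same pipeline as the base tokenizer.

    Downloads the base tokenizer, so it needs network access and the
    transformers package.
    """
    # Imported lazily so batch_iterator can be used without transformers.
    from transformers import AutoTokenizer

    base_tokenizer = AutoTokenizer.from_pretrained(base_model)
    return base_tokenizer.train_new_from_iterator(
        batch_iterator(texts),
        vocab_size=vocab_size,
        min_frequency=2,  # kept from the snippet above; forwarded to the trainer
    )

# Batching behaviour on a tiny hypothetical corpus.
sample = ["first text", "second text", "third text"]
batches = list(batch_iterator(sample, batch_size=2))
print(batches)  # [['first text', 'second text'], ['third text']]
```

Calling train_custom_tokenizer(corpus_texts) on a real corpus would return the new tokenizer, which could then be written to disk with save_pretrained.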
