Minang Embedder

Minang Embedder is a SentenceTransformers model fine-tuned for Minangkabau semantic similarity, retrieval, cross-lingual alignment, and code-switching robustness.

It is fine-tuned from jinaai/jina-embeddings-v5-text-nano-retrieval and maps text to 768-dimensional normalized embeddings.

Code, benchmark construction, result JSONs, and figures are available at:

https://github.com/menara-research/mina-recx

Model

  • Model ID: apsys/minang-embedder
  • Base model: jinaai/jina-embeddings-v5-text-nano-retrieval
  • Architecture: SentenceTransformers Transformer + last-token pooling + normalization
  • Embedding size: 768
  • Primary language: Minangkabau (min)
  • Additional alignment languages: Indonesian (id) and English (en)
  • Intended tasks: semantic search, sentence similarity, bitext retrieval, low-resource evaluation

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("apsys/minang-embedder", trust_remote_code=True)

queries = [
    "Paliang suko bana makan siang di siko ayam jo ladonyo lamak bana."
]
documents = [
    "I love having lunch here because the chicken and sambal are delicious.",
    "The train ticket was booked yesterday.",
    "Kurang pas kalau bakunjuang ka banduang tanpa mancicipi batagor.",
]

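# encode_query and encode_document use the query and document prompts when the model defines them.
query_embeddings = model.encode_query(queries, normalize_embeddings=True)
document_embeddings = model.encode_document(documents, normalize_embeddings=True)
# Similarity matrix with one row per query and one column per document.
scores = model.similarity(query_embeddings, document_embeddings)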
print(scores)

Data Attribution

This model was fine-tuned and evaluated on locally generated training and evaluation artifacts derived from NusaX resources.

The NusaX dataset pages list the source data license as CC BY-SA 4.0. Users of this model should preserve NusaX attribution and respect the upstream dataset terms.

Primary NusaX reference:

@misc{winata2022nusax,
  title={NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages},
  author={Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian},
  year={2022},
  eprint={2205.15960},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Training Data

The training pipeline builds contrastive examples from:

  • English-Minangkabau translation pairs.
  • English-Indonesian translation bridge pairs.
  • Minangkabau same-sentiment sentence pairs.
  • Minangkabau hard-negative different-sentiment pairs.
  • Minangkabau-English and Minangkabau-Indonesian sentiment-aligned pairs.
  • Synthetic Minangkabau/Indonesian code-switch pairs.

The local training artifacts contain:

| Artifact | Count |
| --- | --- |
| Generated training pairs | 4,096 |
| Contrastive examples with BM25 hard negatives | 3,308 |
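
How the project mines its BM25 hard negatives is not described here. As a rough illustration only, one common approach is to score the corpus against the anchor with BM25 and keep the highest-scoring text that is not the known positive. The function names, the whitespace tokenizer, the k1 and b values, and the toy English corpus below are all assumptions made for this sketch.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_hard_negative(anchor, positive_index, corpus):
    """Return the index of the best BM25 match that is not the positive."""
    tokens = [doc.lower().split() for doc in corpus]  # assumed tokenizer
    scores = bm25_scores(anchor.lower().split(), tokens)
    candidates = [i for i in range(len(corpus)) if i != positive_index]
    return max(candidates, key=lambda i: scores[i])


# Hypothetical toy corpus; index 0 is the known positive for the anchor.
corpus = [
    "the chicken and sambal are delicious",
    "the chicken was cold and bland",
    "the train ticket was booked yesterday",
]
hard_negative = mine_hard_negative("chicken and sambal taste delicious here", 0, corpus)
print(corpus[hard_negative])
```

A lexically close text with a different meaning makes a harder negative than a random one, which is the usual motivation for mining with BM25.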

Training used MultipleNegativesRankingLoss.
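
For each anchor, MultipleNegativesRankingLoss treats the paired text as the positive and the other positives in the batch as negatives, then applies cross-entropy over scaled similarity scores. The pure-Python sketch below shows that objective on toy vectors. The function names, the vectors, and the scale of 20.0 are illustrative assumptions, not settings taken from this project's training run.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over this anchor's logits.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy batch: each anchor is close to its own positive in `aligned`,
# and close to the other anchor's positive in `swapped`.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each anchor ranks its own positive first and grows when another item in the batch scores higher.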

Benchmark

The local benchmark is MinSTS-Retrieval, generated by the project code in src/data/prepare_data.py.

| Section | Construction | Count |
| --- | --- | --- |
| Monolingual retrieval | Minangkabau sentiment test text retrieves same-label Minangkabau texts | 400 queries, 400 corpus docs |
| Cross EN to MIN retrieval | English sentiment test text retrieves same-label Minangkabau texts | 400 queries |
| STS | Translation pairs, same-sentiment pairs, and different-sentiment pairs with heuristic similarity scores | 731 pairs |
| Cross-lingual bitext | Shared NusaX IDs for Minangkabau-English and Minangkabau-Indonesian | 400 each |
| Code-switching | Synthetic Minangkabau/Indonesian mixed text paired with original Minangkabau or English | 100 examples |

Important: MinSTS-Retrieval is a constructed benchmark from NusaX labels and alignments. It is not a manually human-annotated Minangkabau STS benchmark.
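
The retrieval and bitext sections are scored with Accuracy@1, Recall@10, and MRR@10. The sketch below gives the textbook definitions of those metrics over a query-by-document score matrix. It is not the project's evaluation code, and the exact definitions in src/data/prepare_data.py or the evaluation scripts may differ. The function names, the toy scores, and the relevance sets are hypothetical.

```python
def rank_documents(score_row):
    """Return document indices sorted from highest to lowest score."""
    return sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)


def accuracy_at_1(scores):
    """Bitext retrieval: query i is correct when document i ranks first."""
    hits = sum(1 for i, row in enumerate(scores) if rank_documents(row)[0] == i)
    return hits / len(scores)


def recall_at_k(scores, relevant, k=10):
    """Mean fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for row, rel in zip(scores, relevant):
        top_k = rank_documents(row)[:k]
        total += len(rel.intersection(top_k)) / len(rel)
    return total / len(scores)


def mrr_at_k(scores, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for row, rel in zip(scores, relevant):
        for rank, doc in enumerate(rank_documents(row)[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(scores)


# Toy 3x3 score matrix (rows are queries, columns are documents).
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
# Hypothetical label-based relevance: documents sharing the query's label.
relevant = [{0, 1}, {1}, {2}]

acc1 = accuracy_at_1(scores)
rec2 = recall_at_k(scores, relevant, k=2)
mrr2 = mrr_at_k(scores, relevant, k=2)
print(acc1, rec2, mrr2)
```

With label-based relevance, a query can have many relevant documents, so Recall@k under this definition depends on how many same-label texts exist per query as well as on ranking quality.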

Results

Source files:

Final model metrics:

| Metric | Value |
| --- | --- |
| STS Spearman | 0.7975 |
| Min-En Accuracy@1 | 0.7825 |
| Min-ID Accuracy@1 | 0.9075 |
| Monolingual Recall@10 | 0.0410 |
| Monolingual MRR@10 | 0.0760 |
| Cross-En Recall@10 | 0.0455 |
| Code-switch cosine | 0.6736 |
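
STS Spearman is the rank correlation between the model's similarity scores and the reference scores, which in this benchmark are heuristic rather than human-annotated. The sketch below computes Spearman correlation as the Pearson correlation of average ranks. The function names and both score lists are hypothetical, and the project's own evaluation may use a library routine instead.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation with average ranks for ties."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical reference scores (with a tie) and model cosine similarities.
reference = [1.0, 0.5, 0.0, 0.5]
predicted = [0.92, 0.61, 0.10, 0.55]
rho = spearman(reference, predicted)
print(rho)
```

Because only ranks matter, the metric rewards ordering pairs correctly and is insensitive to the absolute scale of the cosine scores.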

The best ablation in the local study was temp_0.2, which scores highest among the rows below on every metric except Cross-En R@10 (where the baseline is highest):

| Model | STS Spearman | Min-En Acc@1 | Min-ID Acc@1 | Mono R@10 | Mono MRR@10 | Cross-En R@10 | Code-switch Cosine |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 0.4943 | 0.7025 | 0.9300 | 0.0400 | 0.0809 | 0.0510 | 0.7255 |
| temp_0.2 | 0.7992 | 0.8700 | 0.9450 | 0.0500 | 0.0902 | 0.0450 | 0.8618 |
| final export | 0.7975 | 0.7825 | 0.9075 | 0.0410 | 0.0760 | 0.0455 | 0.6736 |

Limitations

  • The benchmark is synthetic/constructed from labels and alignments, not a direct human STS annotation effort.
  • Retrieval relevance is approximated through sentiment labels, so retrieval scores measure label-coherent semantic grouping rather than exact document relevance.
  • The model inherits constraints from the base Jina model and from the NusaX-derived training data.
  • Performance should be validated on any production domain before deployment.

License

This derived model is released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Upstream terms:

Users are responsible for complying with the upstream base-model and dataset licenses.

Citation

If you use this model, cite the code/artifact repository and the upstream NusaX and Jina resources:

@software{mina_recx_2026,
  title={Minang Embedder: Minangkabau Embedding Benchmark Artifacts and Training Code},
  author={Menara Research and apsys},
  year={2026},
  url={https://github.com/menara-research/mina-recx}
}