Minang Embedder

Minang Embedder is a SentenceTransformers model fine-tuned for Minangkabau semantic similarity, retrieval, cross-lingual alignment, and code-switching robustness.

It is fine-tuned from jinaai/jina-embeddings-v5-text-nano-retrieval and maps text to 768-dimensional normalized embeddings.

Code, benchmark construction, result JSONs, and figures are available at:

https://github.com/menara-research/mina-recx

Model

Model ID: apsys/minang-embedder
Base model: jinaai/jina-embeddings-v5-text-nano-retrieval
Architecture: SentenceTransformers Transformer + last-token pooling + normalization
Embedding size: 768
Primary language: Minangkabau (min)
Additional alignment languages: Indonesian (id) and English (en)
Intended tasks: semantic search, sentence similarity, bitext retrieval, low-resource evaluation
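
The architecture line above names last-token pooling followed by normalization. The following is a minimal NumPy sketch of that idea, not the model's actual code: it assumes right-padded batches, and the function name `last_token_pool` and the random toy inputs are illustrative only.

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last real token per sequence, then L2-normalize.

    hidden: (batch, seq_len, dim) token states.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Assumes right padding (an assumption of this sketch).
    """
    last_idx = attention_mask.sum(axis=1) - 1                 # position of last real token
    pooled = hidden[np.arange(hidden.shape[0]), last_idx]     # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)               # unit-length embeddings

# Toy input: 2 sequences, 4 token positions, 768 dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 768))
mask = np.array([[1, 1, 1, 0],    # 3 real tokens, 1 pad
                 [1, 1, 1, 1]])   # 4 real tokens
emb = last_token_pool(hidden, mask)
print(emb.shape)
```

Because the outputs are unit length, a dot product between two embeddings equals their cosine similarity.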

Usage

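from sentence_transformers import SentenceTransformer

# trust_remote_code=True allows custom model code from the Hub repository to be executed.
model = SentenceTransformer("apsys/minang-embedder", trust_remote_code=True)

# One Minangkabau query.
queries = [
    "Paliang suko bana makan siang di siko ayam jo ladonyo lamak bana."
]
# Candidates: a matching English sentence, an unrelated English sentence,
# and a Minangkabau sentence on a different topic.
documents = [
    "I love having lunch here because the chicken and sambal are delicious.",
    "The train ticket was booked yesterday.",
    "Kurang pas kalau bakunjuang ka banduang tanpa mancicipi batagor.",
]

# encode_query / encode_document require sentence-transformers v5.0 or later.
query_embeddings = model.encode_query(queries, normalize_embeddings=True)
document_embeddings = model.encode_document(documents, normalize_embeddings=True)
# Similarity matrix of shape (len(queries), len(documents)); higher means closer.
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)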

Data Attribution

This model was fine-tuned and evaluated using locally generated training/evaluation artifacts derived from NusaX resources:

mteb/NusaXBitextMining, specifically eng-min and eng-ind.
mteb/nusa_x_senti, specifically min, ind, and eng.
The original NusaX sentiment dataset is also available as indonlp/NusaX-senti.

The NusaX dataset pages list the source data license as CC BY-SA 4.0. Users of this model should preserve NusaX attribution and respect the upstream dataset terms.

Primary NusaX reference:

@misc{winata2022nusax,
  title={NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages},
  author={Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian},
  year={2022},
  eprint={2205.15960},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Training Data

The training pipeline builds contrastive examples from:

English-Minangkabau translation pairs.
English-Indonesian translation bridge pairs.
Minangkabau same-sentiment sentence pairs.
Minangkabau hard-negative different-sentiment pairs.
Minangkabau-English and Minangkabau-Indonesian sentiment-aligned pairs.
Synthetic Minangkabau/Indonesian code-switch pairs.

The local training artifacts contain:

Artifact	Count
Generated training pairs	4,096
Contrastive examples with BM25 hard negatives	3,308
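
The table above mentions BM25 hard negatives. How the project mines them is not described here, so the following is only a sketch of the general technique under stated assumptions: whitespace tokenization, standard Okapi BM25 with `k1=1.5` and `b=0.75`, and the hypothetical helper names `bm25_scores` and `mine_hard_negative`. The idea is to pick, for each anchor, the highest-scoring candidate that is not its known positive.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for one tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in corpus_tokens for t in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def mine_hard_negative(anchor, positive_idx, corpus):
    """Return the index of the best-scoring document that is not the positive."""
    tokens = [doc.lower().split() for doc in corpus]
    scores = bm25_scores(anchor.lower().split(), tokens)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    for i in ranked:
        if i != positive_idx:
            return i
    return None

# Toy corpus (illustrative English text, not project data).
corpus = [
    "the chicken and sambal here are delicious",   # known positive (index 0)
    "the chicken here was cold and bland",         # lexically close, different meaning
    "the train ticket was booked yesterday",       # unrelated
]
anchor = "chicken and sambal for lunch here"
hard_negative_idx = mine_hard_negative(anchor, 0, corpus)
print(corpus[hard_negative_idx])
```

A lexically similar but semantically different candidate is harder for the model to separate than a random one, which is the motivation for mining negatives this way.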

Training used MultipleNegativesRankingLoss.
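
For readers unfamiliar with this loss, here is a minimal NumPy sketch of the computation it performs: each anchor is scored against every positive in the batch, and cross-entropy pushes the matching pair above the in-batch negatives. The function name `mnr_loss` is illustrative; `scale=20.0` matches the SentenceTransformers default, but the value used in this project's training runs is not stated here.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss over (anchor_i, positive_i) pairs.

    anchors, positives: (batch, dim) embeddings; row i of each forms a pair.
    scale: multiplier on cosine similarity (inverse temperature).
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                              # (batch, batch) scaled cosines
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy check: aligned pairs give a much lower loss than misaligned ones.
eye = np.eye(3)
aligned_loss = mnr_loss(eye, eye)
misaligned_loss = mnr_loss(eye, eye[::-1])
print(aligned_loss < misaligned_loss)
```

When explicit hard negatives are supplied, they are appended as extra candidate columns, so each anchor must also outrank them.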

Benchmark

The local benchmark is MinSTS-Retrieval, generated by the project code in src/data/prepare_data.py.

Section	Construction	Count
Monolingual retrieval	Minangkabau sentiment test text retrieves same-label Minangkabau texts	400 queries, 400 corpus docs
Cross EN to MIN retrieval	English sentiment test text retrieves same-label Minangkabau texts	400 queries
STS	Translation pairs, same-sentiment pairs, and different-sentiment pairs with heuristic similarity scores	731 pairs
Cross-lingual bitext	Shared NusaX IDs for Minangkabau-English and Minangkabau-Indonesian	400 each
Code-switching	Synthetic Minangkabau/Indonesian mixed text paired with original Minangkabau or English	100 examples

Important: MinSTS-Retrieval is constructed from NusaX labels and alignments. It is not a human-annotated Minangkabau STS benchmark.
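
To make the label-based retrieval setup concrete, the sketch below computes Recall@k and MRR@k when a document counts as relevant if it shares the query's sentiment label. It uses the common textbook definitions; the exact definitions in src/data/prepare_data.py and the evaluation code may differ (for example, in how a query's own text is handled), and the function name and toy data are illustrative.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so dot products equal cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def label_retrieval_metrics(query_emb, query_labels, doc_emb, doc_labels, k=10):
    """Recall@k and MRR@k with relevance defined as 'same label as the query'."""
    scores = normalize(query_emb) @ normalize(doc_emb).T
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best k doc indices per query
    query_labels = np.asarray(query_labels)
    doc_labels = np.asarray(doc_labels)
    recalls, reciprocal_ranks = [], []
    for q, ranked in enumerate(top_k):
        relevant = doc_labels == query_labels[q]        # boolean mask over all docs
        hits = relevant[ranked]                         # relevance of the top-k docs
        n_relevant = relevant.sum()
        recalls.append(hits.sum() / n_relevant if n_relevant else 0.0)
        first_hit = np.flatnonzero(hits)
        reciprocal_ranks.append(1.0 / (first_hit[0] + 1) if first_hit.size else 0.0)
    return float(np.mean(recalls)), float(np.mean(reciprocal_ranks))

# Toy data: two labels, two documents per label.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
doc_labels = ["pos", "pos", "neg", "neg"]
queries = [[1.0, 0.0], [0.0, 1.0]]
query_labels = ["pos", "neg"]
recall_at_1, mrr_at_1 = label_retrieval_metrics(queries, query_labels, docs, doc_labels, k=1)
print(recall_at_1, mrr_at_1)
```

Under this definition, Recall@k cannot exceed k divided by the number of relevant documents, so recall stays small whenever many documents share a label, even if every top-k result is relevant.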

Results

Source files:

results/finetuned_minang-embedder.json
results/all_ablation_results.json
Figures and result tables: https://github.com/menara-research/mina-recx

Final model metrics:

Metric	Value
STS Spearman	0.7975
Min-En Accuracy@1	0.7825
Min-ID Accuracy@1	0.9075
Monolingual Recall@10	0.0410
Monolingual MRR@10	0.0760
Cross-En Recall@10	0.0455
Code-switch cosine	0.6736
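
As a reading aid for the table above, the sketch below shows one plausible way to compute the two pair-based metrics: Accuracy@1 for bitext retrieval (row i of the source should retrieve row i of the target) and a mean cosine over aligned pairs for the code-switching score. These are assumptions about the metric definitions, not the project's evaluation code; the function names and toy vectors are illustrative.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so dot products equal cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def bitext_accuracy_at_1(src_emb, tgt_emb):
    """Fraction of source rows whose nearest target row is the aligned one."""
    sims = normalize(src_emb) @ normalize(tgt_emb).T
    return float(np.mean(sims.argmax(axis=1) == np.arange(sims.shape[0])))

def mean_paired_cosine(a_emb, b_emb):
    """Average cosine similarity between row i of a_emb and row i of b_emb."""
    return float(np.mean(np.sum(normalize(a_emb) * normalize(b_emb), axis=1)))

# Toy aligned pairs: only the second source row retrieves its own target.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
accuracy = bitext_accuracy_at_1(src, tgt)
paired_cosine = mean_paired_cosine(src, tgt)
print(accuracy, paired_cosine)
```

Accuracy@1 depends on every other candidate in the pool, while the paired cosine looks only at each aligned pair, so the two can move independently.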

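In the local ablation study, temp_0.2 scored highest on six of the seven metrics below; the exception is Cross-En R@10, where the baseline leads. The final export trails temp_0.2 on those same six metrics: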

Model	STS Spearman	Min-En Acc@1	Min-ID Acc@1	Mono R@10	Mono MRR@10	Cross-En R@10	Code-switch Cosine
baseline	0.4943	0.7025	0.9300	0.0400	0.0809	0.0510	0.7255
temp_0.2	0.7992	0.8700	0.9450	0.0500	0.0902	0.0450	0.8618
final export	0.7975	0.7825	0.9075	0.0410	0.0760	0.0455	0.6736

Limitations

The benchmark is synthetic/constructed from labels and alignments, not a direct human STS annotation effort.
Retrieval relevance is approximated through sentiment labels, so retrieval scores measure label-coherent semantic grouping rather than exact document relevance.
The model inherits constraints from the base Jina model and from the NusaX-derived training data.
Performance should be validated on any production domain before deployment.

License

This derived model is released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Upstream terms:

Base model jinaai/jina-embeddings-v5-text-nano-retrieval: CC BY-NC 4.0.
NusaX-derived datasets used for training/evaluation: CC BY-SA 4.0 as listed on the Hugging Face dataset pages.

Users are responsible for complying with the upstream base-model and dataset licenses.

Citation

If you use this model, cite the code/artifact repository and the upstream NusaX and Jina resources:

@software{mina_recx_2026,
  title={Minang Embedder: Minangkabau Embedding Benchmark Artifacts and Training Code},
  author={Menara Research and apsys},
  year={2026},
  url={https://github.com/menara-research/mina-recx}
}