Minang Embedder
Minang Embedder is a SentenceTransformers model fine-tuned for Minangkabau semantic similarity, retrieval, cross-lingual alignment, and code-switching robustness.
It is fine-tuned from jinaai/jina-embeddings-v5-text-nano-retrieval and maps text to 768-dimensional normalized embeddings.
Code, benchmark construction, result JSONs, and figures are available at:
https://github.com/menara-research/mina-recx
Model
- Model ID: apsys/minang-embedder
- Base model: jinaai/jina-embeddings-v5-text-nano-retrieval
- Architecture: SentenceTransformers Transformer + last-token pooling + normalization
- Embedding size: 768
- Primary language: Minangkabau (min)
- Additional alignment languages: Indonesian (id) and English (en)
- Intended tasks: semantic search, sentence similarity, bitext retrieval, low-resource evaluation
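The architecture entry above names last-token pooling followed by normalization. As a rough illustration of what those two steps do, here is a minimal pure-Python sketch. It is not the model's actual implementation, which runs on tensors inside SentenceTransformers; the function names `last_token_pool` and `l2_normalize` and the toy 2-dimensional hidden states are assumptions made for this example.

```python
import math


def last_token_pool(token_embeddings, attention_mask):
    """Return the hidden state of the last non-padding token.

    token_embeddings: list of per-token vectors for one sequence.
    attention_mask:   list of 0/1 flags, 1 marking real (non-padding) tokens.
    """
    # Index of the last position whose mask is 1; valid for left or right padding.
    last_index = max(i for i, flag in enumerate(attention_mask) if flag == 1)
    return token_embeddings[last_index]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy sequence (hypothetical values): three real tokens, then one padding token.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = last_token_pool(hidden_states, mask)   # hidden state of the third token
sentence_embedding = l2_normalize(pooled)       # unit-length sentence vector
print(sentence_embedding)
```

Because the output vectors have unit length, the cosine similarity between two embeddings reduces to their dot product.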
Usage
from sentence_transformers import SentenceTransformer

# trust_remote_code=True allows the custom model code hosted on the Hub to run locally.
model = SentenceTransformer("apsys/minang-embedder", trust_remote_code=True)

queries = [
    "Paliang suko bana makan siang di siko ayam jo ladonyo lamak bana."
]
documents = [
    "I love having lunch here because the chicken and sambal are delicious.",
    "The train ticket was booked yesterday.",
    "Kurang pas kalau bakunjuang ka banduang tanpa mancicipi batagor.",
]

# encode_query / encode_document apply the query or document prompt if the model defines one.
query_embeddings = model.encode_query(queries, normalize_embeddings=True)
document_embeddings = model.encode_document(documents, normalize_embeddings=True)

# Similarity matrix with one row per query and one column per document.
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
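For semantic search, the similarity scores are typically turned into a ranked list per query. The sketch below shows that step on toy unit-length vectors rather than real model outputs, so it runs without downloading the model. The helper name `rank_documents` and the vectors are assumptions for illustration only.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy unit-length vectors standing in for real query/document embeddings.
query = [1.0, 0.0]
docs = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

ranking = rank_documents(query, docs, top_k=2)
print(ranking)  # document 0 first, then document 2
```

With real embeddings, the same ordering can be read off one row of the score matrix returned by `model.similarity`.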
Data Attribution
This model was fine-tuned and evaluated using locally generated training/evaluation artifacts derived from NusaX resources:
- mteb/NusaXBitextMining, specifically eng-min and eng-ind.
- mteb/nusa_x_senti, specifically min, ind, and eng.
- The original NusaX sentiment dataset is also available as indonlp/NusaX-senti.
The NusaX dataset pages list the source data license as CC BY-SA 4.0. Users of this model should preserve NusaX attribution and respect the upstream dataset terms.
Primary NusaX reference:
@misc{winata2022nusax,
title={NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages},
author={Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian},
year={2022},
eprint={2205.15960},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Training Data
The training pipeline builds contrastive examples from:
- English-Minangkabau translation pairs.
- English-Indonesian translation bridge pairs.
- Minangkabau same-sentiment sentence pairs.
- Minangkabau hard-negative different-sentiment pairs.
- Minangkabau-English and Minangkabau-Indonesian sentiment-aligned pairs.
- Synthetic Minangkabau/Indonesian code-switch pairs.
The local training artifacts contain:
| Artifact | Count |
|---|---|
| Generated training pairs | 4,096 |
| Contrastive examples with BM25 hard negatives | 3,308 |
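The card states only that BM25 hard negatives were used; it does not describe the selection rule. One common approach, sketched here as an assumption rather than as the project's method, scores every candidate against the anchor with BM25 and keeps the highest-scoring text that is not the known positive. The helpers `bm25_scores` and `mine_hard_negative`, the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, and the English toy sentences are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query (Lucene-style IDF)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def mine_hard_negative(anchor, positive_index, corpus):
    """Index of the lexically closest corpus text that is not the positive."""
    tokenized = [doc.lower().split() for doc in corpus]
    scores = bm25_scores(anchor.lower().split(), tokenized)
    candidates = [i for i in range(len(corpus)) if i != positive_index]
    return max(candidates, key=lambda i: scores[i])


corpus = [
    "the chicken and sambal are delicious",           # known positive
    "the chicken was cold and the sambal was bland",  # similar words, different meaning
    "the train ticket was booked yesterday",          # unrelated
]
anchor = "chicken and sambal here are delicious"

hard_negative_index = mine_hard_negative(anchor, positive_index=0, corpus=corpus)
print(corpus[hard_negative_index])
```

A negative chosen this way shares vocabulary with the anchor, so the model has to separate texts by meaning rather than by word overlap.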
Training used MultipleNegativesRankingLoss.
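MultipleNegativesRankingLoss treats every other positive in a batch as a negative for a given anchor and applies cross-entropy over scaled cosine similarities. The pure-Python sketch below illustrates that idea on toy vectors; it is not the sentence-transformers implementation and does not reflect this project's hyperparameters. The default `scale=20.0` matches the library default (an inverse temperature), while the function name and toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and all other positives in the batch serve as in-batch negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical vectors): each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[0.9, 0.1], [0.1, 0.9]]
shuffled_positives = [[0.1, 0.9], [0.9, 0.1]]  # positives swapped between anchors

aligned_loss = multiple_negatives_ranking_loss(anchors, aligned_positives)
shuffled_loss = multiple_negatives_ranking_loss(anchors, shuffled_positives)
print(aligned_loss, shuffled_loss)  # the aligned batch gives the much smaller loss
```

Hard negatives, when present, are appended to the candidate list for each anchor, which adds more columns to the logits without changing the target index.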
Benchmark
The local benchmark is MinSTS-Retrieval, generated by the project code in src/data/prepare_data.py.
| Section | Construction | Count |
|---|---|---|
| Monolingual retrieval | Minangkabau sentiment test text retrieves same-label Minangkabau texts | 400 queries, 400 corpus docs |
| Cross EN to MIN retrieval | English sentiment test text retrieves same-label Minangkabau texts | 400 queries |
| STS | Translation pairs, same-sentiment pairs, and different-sentiment pairs with heuristic similarity scores | 731 pairs |
| Cross-lingual bitext | Shared NusaX IDs for Minangkabau-English and Minangkabau-Indonesian | 400 each |
| Code-switching | Synthetic Minangkabau/Indonesian mixed text paired with original Minangkabau or English | 100 examples |
Important: MinSTS-Retrieval is a constructed benchmark from NusaX labels and alignments. It is not a manually human-annotated Minangkabau STS benchmark.
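To make the label-based construction concrete, the sketch below derives relevance judgments from sentiment labels: a document counts as relevant to a query when both carry the same label. This is an illustrative guess at the idea, not the code in src/data/prepare_data.py. The function name `same_label_relevance`, the item IDs, and the rule that a text cannot retrieve itself are assumptions.

```python
def same_label_relevance(queries, corpus):
    """Map each query id to the set of corpus ids that share its label.

    queries, corpus: lists of (item_id, label) tuples.
    A document with the same id as the query is skipped (assumed rule).
    """
    relevance = {}
    for query_id, query_label in queries:
        relevance[query_id] = {
            doc_id
            for doc_id, doc_label in corpus
            if doc_label == query_label and doc_id != query_id
        }
    return relevance


# Toy items with hypothetical IDs; the same set serves as queries and corpus.
items = [
    ("s1", "positive"),
    ("s2", "negative"),
    ("s3", "positive"),
    ("s4", "neutral"),
]

qrels = same_label_relevance(items, items)
print(qrels["s1"])  # {'s3'}
print(qrels["s4"])  # set(), no other neutral item exists
```

Relevance built this way groups texts by sentiment, so two sentences on unrelated topics can still count as a match.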
Results
Source files:
- results/finetuned_minang-embedder.json
- results/all_ablation_results.json
- Figures and result tables: https://github.com/menara-research/mina-recx
Final model metrics:
| Metric | Value |
|---|---|
| STS Spearman | 0.7975 |
| Min-En Accuracy@1 | 0.7825 |
| Min-ID Accuracy@1 | 0.9075 |
| Monolingual Recall@10 | 0.0410 |
| Monolingual MRR@10 | 0.0760 |
| Cross-En Recall@10 | 0.0455 |
| Code-switch cosine | 0.6736 |
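For readers unfamiliar with the retrieval metrics in this table, here is a minimal sketch of the usual definitions of Accuracy@1, Recall@k, and MRR@k for a single query. The project's evaluation code may define them differently (for example, how recall is normalized), so treat the function names and toy document IDs as assumptions.

```python
def accuracy_at_1(ranked_ids, gold_id):
    """1.0 if the top-ranked item is the gold match, else 0.0 (bitext-style retrieval)."""
    return 1.0 if ranked_ids and ranked_ids[0] == gold_id else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy ranking for one query (hypothetical document IDs).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

top1 = accuracy_at_1(ranked, gold_id="d3")
recall = recall_at_k(ranked, relevant, k=3)  # 1 of 3 relevant docs is in the top 3
mrr = mrr_at_k(ranked, relevant, k=3)        # first relevant doc sits at rank 2
print(top1, recall, mrr)
```

Under this recall definition, a query with R relevant documents and R greater than k cannot score above k / R, which is worth keeping in mind when relevance is label-based and each query can have many relevant documents.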
The best ablation in the local study was temp_0.2, which scores highest on every metric below except Cross-En Recall@10, where the baseline leads:
| Model | STS Spearman | Min-En Acc@1 | Min-ID Acc@1 | Mono R@10 | Mono MRR@10 | Cross-En R@10 | Code-switch Cosine |
|---|---|---|---|---|---|---|---|
| baseline | 0.4943 | 0.7025 | 0.9300 | 0.0400 | 0.0809 | 0.0510 | 0.7255 |
| temp_0.2 | 0.7992 | 0.8700 | 0.9450 | 0.0500 | 0.0902 | 0.0450 | 0.8618 |
| final export | 0.7975 | 0.7825 | 0.9075 | 0.0410 | 0.0760 | 0.0455 | 0.6736 |
Limitations
- The benchmark is synthetic/constructed from labels and alignments, not a direct human STS annotation effort.
- Retrieval relevance is approximated through sentiment labels, so retrieval scores measure label-coherent semantic grouping rather than exact document relevance.
- The model inherits constraints from the base Jina model and from the NusaX-derived training data.
- Performance should be validated on any production domain before deployment.
License
This derived model is released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Upstream terms:
- Base model jinaai/jina-embeddings-v5-text-nano-retrieval: CC BY-NC 4.0.
- NusaX-derived datasets used for training/evaluation: CC BY-SA 4.0 as listed on the Hugging Face dataset pages.
Users are responsible for complying with the upstream base-model and dataset licenses.
Citation
If you use this model, cite the code/artifact repository and the upstream NusaX and Jina resources:
@software{mina_recx_2026,
title={Minang Embedder: Minangkabau Embedding Benchmark Artifacts and Training Code},
author={Menara Research and apsys},
year={2026},
url={https://github.com/menara-research/mina-recx}
}