SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the statictable-triplets-all dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Model Size: 118M parameters (F32 tensors)
  • Training Dataset: statictable-triplets-all

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
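
The Pooling module above uses mean pooling (pooling_mode_mean_tokens: True): the sentence embedding is the average of the token embeddings, with padding positions masked out. As a rough sketch of what that step computes (not the library's internal code):

import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) output of the Transformer module.
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
    return summed / counts

Mean pooling makes the embedding reflect the whole sentence rather than a single position, which is why pooling_mode_cls_token is disabled here.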

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-2")
# Run inference
sentences = [
    'Status pernikahan penduduk (10+) tiap provinsi, data 2012',
    'Persentase Penduduk Berumur 10 Tahun ke Atas menurut Provinsi, Jenis Kelamin, dan Status Perkawinan, 2009-2018',
    'Ekspor Batu Bara Menurut Negara Tujuan Utama, 2012-2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
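
Beyond pairwise similarity, the embeddings can drive semantic search over a collection of table titles, which is this model's intended use case. A minimal sketch using util.semantic_search from the same library; the corpus below is a toy example, not the model's evaluation data:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-2")

# Toy corpus of table titles; substitute your own collection.
corpus = [
    "Persentase Penduduk Berumur 10 Tahun ke Atas menurut Provinsi, Jenis Kelamin, dan Status Perkawinan, 2009-2018",
    "Ekspor Batu Bara Menurut Negara Tujuan Utama, 2012-2023",
    "Bank dan Kantor Bank, 2010-2017",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "Status pernikahan penduduk (10+) tiap provinsi, data 2012"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar titles by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]}  (score: {hit['score']:.4f})")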

Evaluation

Metrics

Information Retrieval

  • Dataset: bps-statictable-ir

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.899  |
| cosine_accuracy@3   | 0.9739 |
| cosine_accuracy@5   | 0.9805 |
| cosine_accuracy@10  | 0.987  |
| cosine_precision@1  | 0.899  |
| cosine_precision@3  | 0.3518 |
| cosine_precision@5  | 0.23   |
| cosine_precision@10 | 0.1342 |
| cosine_recall@1     | 0.7038 |
| cosine_recall@3     | 0.7774 |
| cosine_recall@5     | 0.7896 |
| cosine_recall@10    | 0.8148 |
| cosine_ndcg@10      | 0.8242 |
| cosine_mrr@10       | 0.9362 |
| cosine_map@100      | 0.7641 |
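
These metric names match the standard outputs of the library's InformationRetrievalEvaluator. A toy sketch of how such an evaluation is wired up; the IDs and texts here are illustrative, not the actual bps-statictable-ir data:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-2")

# Illustrative IDs and texts only; the reported numbers come from the
# actual bps-statictable-ir evaluation set, not from this toy setup.
queries = {"q1": "Impor semen Indonesia, negara asal utama, 2021"}
corpus = {
    "d1": "Impor Semen Menurut Negara Asal Utama, 2017-2023",
    "d2": "Bank dan Kantor Bank, 2010-2017",
}
relevant_docs = {"q1": {"d1"}}  # maps each query id to the set of relevant doc ids

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="bps-statictable-ir")
results = evaluator(model)
print(results)  # dict of accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100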

Training Details

Training Dataset

statictable-triplets-all

  • Dataset: statictable-triplets-all at 24979b4
  • Size: 967,831 training samples
  • Columns: query, pos, and neg
  • Approximate statistics based on the first 1000 samples:
    |         | query                                             | pos                                               | neg                                               |
    |:--------|:--------------------------------------------------|:--------------------------------------------------|:--------------------------------------------------|
    | type    | string                                            | string                                            | string                                            |
    | details | min: 5 tokens, mean: 18.35 tokens, max: 37 tokens | min: 4 tokens, mean: 25.22 tokens, max: 58 tokens | min: 4 tokens, mean: 25.78 tokens, max: 58 tokens |
  • Samples:
    | query | pos | neg |
    |:------|:----|:----|
    | Jumlah bank dan kantor bank di Indonesia, 2010-2017 | Bank dan Kantor Bank, 2010-2017 | Rata-Rata Pengeluaran per Kapita Sebulan Menurut Kelompok Barang (rupiah), 1998-2012 |
    | Konsumsi makanan mingguan per orang di Sulteng: beda tingkat pengeluaran (2021) | Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Sulawesi Selatan, 2018-2023 | IHK, Upah Nominal, Indeks Upah Nominal dan Riil Buruh Industri Berstatus di bawah Mandor Menurut Wilayah, 2008-2014 (2007=100) |
    | Impor semen Indonesia, negara asal utama, 2021 | Impor Semen Menurut Negara Asal Utama, 2017-2023 | Penerimaan dari Wisatawan Mancanegara Menurut Negara Tempat Tinggal (juta US$), 2000-2014 |
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
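
For reference, the loss above can be instantiated directly with these parameters. A sketch only; the actual training script is not published in this card:

from sentence_transformers import SentenceTransformer, util
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# scale=20.0 and cosine similarity mirror the parameters listed above.
loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

With this loss, each (query, pos) pair in a batch treats every other sample's pos (and the explicit neg column) as an in-batch negative, which is why the no_duplicates batch sampler is used during training.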
    

Evaluation Dataset

statictable-triplets-all

  • Dataset: statictable-triplets-all at 24979b4
  • Size: 967,831 evaluation samples
  • Columns: query, pos, and neg
  • Approximate statistics based on the first 1000 samples:
    |         | query                                             | pos                                               | neg                                               |
    |:--------|:--------------------------------------------------|:--------------------------------------------------|:--------------------------------------------------|
    | type    | string                                            | string                                            | string                                            |
    | details | min: 5 tokens, mean: 18.39 tokens, max: 37 tokens | min: 4 tokens, mean: 25.22 tokens, max: 50 tokens | min: 4 tokens, mean: 25.33 tokens, max: 58 tokens |
  • Samples:
    | query | pos | neg |
    |:------|:----|:----|
    | Bagaimana hubungan antara bidang pekerjaan utama dan pendidikan pekerja 15+ di minggu lalu (tahun 2016)? | Penduduk Berumur 15 Tahun Ke Atas yang Bekerja Selama Seminggu yang Lalu Menurut Lapangan Pekerjaan Utama dan Pendidikan Tertinggi yang Ditamatkan, 2008 - 2024 | Bank dan Kantor Bank, 2010-2017 |
    | Tren indikator kondisi perumahan, 2001 | Indikator Perumahan 1993-2023 | Banyaknya Desa/Kelurahan Menurut Keberadaan Kelompok Pertokoan, Pasar, dan Kios Sarana Produksi Pertanian (Saprotan), 2014 & 2018 |
    | Gaji bersih rata-rata: Per pendidikan & lapangan kerja utama, Indonesia, 2021 | Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut Pendidikan Tertinggi dan Lapangan Pekerjaan Utama, 2021 | [Seri 2000] Laju Pertumbuhan Kumulatif PDB Menurut Lapangan Usaha (Persen), 2001-2014 |
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True
  • batch_sampler: no_duplicates
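
As a sketch, the non-default hyperparameters above map onto a SentenceTransformerTrainer run roughly as follows. The dataset repo id and the held-out split are assumptions (the card only names the dataset and its revision), so the actual training script may differ:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical dataset repo id and split; the card only states the dataset
# name "statictable-triplets-all" and revision 24979b4.
dataset = load_dataset("yahyaabd/statictable-triplets-all", split="train")
split = dataset.train_test_split(test_size=0.01, seed=42)

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-search-mini-v1-2",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    eval_on_start=True,
    load_best_model_at_end=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate in-batch negatives
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    loss=MultipleNegativesRankingLoss(model, scale=20.0),
)
trainer.train()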

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch | Step | Training Loss | Validation Loss | bps-statictable-ir_cosine_ndcg@10
0 0 - 1.1084 0.4644
0.0070 20 1.0801 0.8303 0.5117
0.0139 40 0.6994 0.4459 0.6310
0.0209 60 0.3674 0.2510 0.7155
0.0278 80 0.2814 0.1829 0.7521
0.0348 100 0.1746 0.1303 0.7751
0.0418 120 0.1867 0.1001 0.7772
0.0487 140 0.1047 0.0819 0.7857
0.0557 160 0.1032 0.0739 0.7960
0.0626 180 0.0783 0.0645 0.7861
0.0696 200 0.0575 0.0567 0.7849
0.0765 220 0.0969 0.0454 0.7945
0.0835 240 0.0769 0.0433 0.7890
0.0905 260 0.0864 0.0507 0.7848
0.0974 280 0.0495 0.0347 0.8052
0.1044 300 0.0429 0.0398 0.7955
0.1113 320 0.0432 0.0343 0.7915
0.1183 340 0.0392 0.0295 0.8177
0.1253 360 0.0211 0.0298 0.8052
0.1322 380 0.043 0.0339 0.8052
0.1392 400 0.0453 0.0322 0.8050
0.1461 420 0.0309 0.0286 0.8120
0.1531 440 0.0147 0.0321 0.8181
0.1601 460 0.0491 0.0273 0.8178
0.1670 480 0.0229 0.0232 0.8176
0.1740 500 0.0317 0.0210 0.8198
0.1809 520 0.0193 0.0207 0.8159
0.1879 540 0.034 0.0175 0.8191
0.1949 560 0.0292 0.0168 0.8166
0.2018 580 0.0431 0.0184 0.8228
0.2088 600 0.0306 0.0183 0.7963
0.2157 620 0.0134 0.0147 0.8216
0.2227 640 0.0155 0.0161 0.8166
0.2296 660 0.0201 0.0187 0.8170
0.2366 680 0.0301 0.0133 0.8272
0.2436 700 0.0164 0.0119 0.8274
0.2505 720 0.0254 0.0119 0.8223
0.2575 740 0.0129 0.0146 0.8165
0.2644 760 0.0208 0.0136 0.8162
0.2714 780 0.0157 0.0138 0.8120
0.2784 800 0.0169 0.0143 0.8248
0.2853 820 0.0158 0.0119 0.8166
0.2923 840 0.0227 0.0115 0.8153
0.2992 860 0.0196 0.0117 0.8163
0.3062 880 0.0137 0.0112 0.8225
0.3132 900 0.0299 0.0090 0.8155
0.3201 920 0.0073 0.0106 0.8157
0.3271 940 0.0248 0.0088 0.8174
0.3340 960 0.0179 0.0087 0.8215
0.3410 980 0.0171 0.0077 0.8285
0.3479 1000 0.0123 0.0096 0.8175
0.3549 1020 0.0081 0.0098 0.8152
0.3619 1040 0.0097 0.0094 0.8139
0.3688 1060 0.0379 0.0107 0.8236
0.3758 1080 0.0104 0.0078 0.8208
0.3827 1100 0.0067 0.0065 0.8189
0.3897 1120 0.0128 0.0080 0.8221
0.3967 1140 0.0049 0.0078 0.8181
0.4036 1160 0.0084 0.0092 0.8218
0.4106 1180 0.0173 0.0081 0.8248
0.4175 1200 0.0144 0.0080 0.8272
0.4245 1220 0.0025 0.0077 0.8260
0.4315 1240 0.0086 0.0072 0.8312
0.4384 1260 0.0114 0.0073 0.8242
0.4454 1280 0.0065 0.0067 0.8245
0.4523 1300 0.0132 0.0069 0.8248
0.4593 1320 0.003 0.0066 0.8233
0.4662 1340 0.0125 0.0066 0.8245
0.4732 1360 0.0016 0.0070 0.8281
0.4802 1380 0.0041 0.0066 0.8418
0.4871 1400 0.0117 0.0073 0.8361
0.4941 1420 0.0095 0.0073 0.8337
0.5010 1440 0.0184 0.0071 0.8282
0.5080 1460 0.0042 0.0069 0.8259
0.5150 1480 0.0077 0.0065 0.8235
0.5219 1500 0.0213 0.0059 0.8209
0.5289 1520 0.0037 0.0059 0.8277
0.5358 1540 0.0053 0.0053 0.8186
0.5428 1560 0.0045 0.0071 0.8238
0.5498 1580 0.0013 0.0101 0.8257
0.5567 1600 0.017 0.0051 0.8292
0.5637 1620 0.0053 0.0045 0.8234
0.5706 1640 0.0077 0.0044 0.8235
0.5776 1660 0.0135 0.0046 0.8200
0.5846 1680 0.0013 0.0045 0.8242
0.5915 1700 0.0067 0.0048 0.8266
0.5985 1720 0.0154 0.0049 0.8232
0.6054 1740 0.0037 0.0048 0.8222
0.6124 1760 0.0012 0.0049 0.8232
0.6193 1780 0.0112 0.0051 0.8212
0.6263 1800 0.0173 0.0056 0.8228
0.6333 1820 0.0044 0.0059 0.8177
0.6402 1840 0.0193 0.0059 0.8197
0.6472 1860 0.0028 0.0060 0.8203
0.6541 1880 0.005 0.0054 0.8278
0.6611 1900 0.0077 0.0049 0.8227
0.6681 1920 0.0126 0.0040 0.8267
0.6750 1940 0.008 0.0039 0.8258
0.6820 1960 0.0131 0.0039 0.8251
0.6889 1980 0.0114 0.0042 0.8310
0.6959 2000 0.0083 0.0041 0.8314
0.7029 2020 0.006 0.0037 0.8303
0.7098 2040 0.0048 0.0036 0.8269
0.7168 2060 0.0165 0.0040 0.8262
0.7237 2080 0.0093 0.0035 0.8158
0.7307 2100 0.007 0.0031 0.8167
0.7376 2120 0.0065 0.0030 0.8248
0.7446 2140 0.0042 0.0029 0.8274
0.7516 2160 0.0111 0.0026 0.8258
0.7585 2180 0.0066 0.0028 0.8249
0.7655 2200 0.0034 0.0034 0.8244
0.7724 2220 0.0013 0.0033 0.8238
0.7794 2240 0.0025 0.0034 0.8253
0.7864 2260 0.0065 0.0034 0.8240
0.7933 2280 0.0049 0.0035 0.8258
0.8003 2300 0.0007 0.0035 0.8277
0.8072 2320 0.004 0.0034 0.8298
0.8142 2340 0.0013 0.0033 0.8293
0.8212 2360 0.0122 0.0032 0.8300
0.8281 2380 0.0008 0.0033 0.8285
0.8351 2400 0.0019 0.0032 0.8266
0.8420 2420 0.0033 0.0032 0.8266
0.8490 2440 0.0078 0.0024 0.8284
0.8559 2460 0.0087 0.0022 0.8272
0.8629 2480 0.003 0.0021 0.8255
0.8699 2500 0.0039 0.0021 0.8232
0.8768 2520 0.0054 0.0021 0.8225
0.8838 2540 0.0015 0.0021 0.8236
0.8907 2560 0.0043 0.0021 0.8245
0.8977 2580 0.0083 0.0022 0.8237
0.9047 2600 0.0029 0.0024 0.8233
0.9116 2620 0.0095 0.0025 0.8257
0.9186 2640 0.0013 0.0025 0.8263
0.9255 2660 0.0025 0.0025 0.8268
0.9325 2680 0.006 0.0025 0.8264
0.9395 2700 0.0078 0.0026 0.8247
0.9464 2720 0.0061 0.0025 0.8248
0.9534 2740 0.001 0.0025 0.8238
0.9603 2760 0.0041 0.0025 0.8233
0.9673 2780 0.0157 0.0024 0.8249
0.9743 2800 0.0039 0.0024 0.8248
0.9812 2820 0.0047 0.0024 0.8242
0.9882 2840 0.0058 0.0024 0.8243
0.9951 2860 0.0018 0.0024 0.8242
  • Because load_best_model_at_end is enabled, the saved checkpoint is the one with the best validation performance.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}