SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the query-hard-pos-neg-doc-pairs-statictable dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Number of Parameters: ~118M
  • Similarity Function: Cosine Similarity
  • Training Dataset: query-hard-pos-neg-doc-pairs-statictable

Model Sources

  • Model page on the Hugging Face Hub: https://huggingface.co/yahyaabd/allstats-search-miniLM-v1-5
  • Sentence Transformers documentation: https://www.sbert.net

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
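
The architecture above is a two-module stack: a BERT transformer encoder followed by mean pooling over token embeddings. For illustration only, the same stack can be assembled by hand from sentence_transformers building blocks; this is a minimal sketch, not how the released checkpoint was produced (loading the published model, as shown under Usage, is the normal path).

from sentence_transformers import SentenceTransformer, models

# Sketch: rebuild the same module stack (base transformer + mean pooling).
word_embedding = models.Transformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    max_seq_length=128,
)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 384
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding, pooling])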

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-miniLM-v1-5")
# Run inference
sentences = [
    'Arus dana Q3 2006',
    'Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)',
    'Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
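
Since the model was trained to match free-text queries against BPS statistical table titles, a typical use is ranking candidate titles for a query. The snippet below is a small illustrative sketch reusing the strings from the example above; model.similarity defaults to cosine similarity.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yahyaabd/allstats-search-miniLM-v1-5")

query = "Arus dana Q3 2006"
docs = [
    "Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)",
    "Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012",
]

query_emb = model.encode([query])
doc_embs = model.encode(docs)

# Cosine similarities between the query and every candidate: shape [1, len(docs)]
scores = model.similarity(query_emb, doc_embs)[0]

# Print candidates from most to least similar
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {docs[idx]}")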

Evaluation

Metrics

Binary Classification

Metric                    | allstats-semantic-mini-v1_test | allstats-semantic-mini-v1_dev
cosine_accuracy           | 0.977                          | 0.977
cosine_accuracy_threshold | 0.747                          | 0.747
cosine_f1                 | 0.9649                         | 0.9649
cosine_f1_threshold       | 0.7452                         | 0.7452
cosine_precision          | 0.9553                         | 0.9553
cosine_recall             | 0.9746                         | 0.9746
cosine_ap                 | 0.9927                         | 0.9927
cosine_mcc                | 0.9479                         | 0.9479
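
Metrics like these can be computed with sentence_transformers' BinaryClassificationEvaluator, which scores (query, doc) pairs against 0/1 relevance labels. The sketch below shows the general pattern; the two pairs are placeholders, and in practice the dev/test splits of the training dataset would be passed in.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

model = SentenceTransformer("yahyaabd/allstats-search-miniLM-v1-5")

# Placeholder pairs; replace with the query/doc/label columns of an evaluation split.
queries = [
    "Arus dana Q3 2006",
    "Status pekerjaan utama penduduk usia 15+ yang bekerja, 2020",
]
docs = [
    "Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)",
    "Jumlah Penghuni Lapas per Kanwil",
]
labels = [1, 0]  # 1 = relevant pair, 0 = irrelevant pair

evaluator = BinaryClassificationEvaluator(queries, docs, labels, name="allstats-semantic-mini-v1_dev")
results = evaluator(model)
print(results)  # dict with accuracy, F1, precision, recall, AP, MCC per similarity function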

Training Details

Training Dataset

query-hard-pos-neg-doc-pairs-statictable

  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 25,580 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
            | query              | doc                | label
    type    | string             | string             | int
    details | min: 7 tokens      | min: 5 tokens      | 0: ~70.80%
            | mean: 20.14 tokens | mean: 24.9 tokens  | 1: ~29.20%
            | max: 55 tokens     | max: 47 tokens     |
  • Samples:
    query | doc | label
    Status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
    status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
    STATUS PEKERJAAN UTAMA PENDUDUK USIA 15+ YANG BEKERJA, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
  • Loss: OnlineContrastiveLoss
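
The dataset described above can be pulled from the Hugging Face Hub with the datasets library. The sketch below makes one assumption: the repository id yahyaabd/query-hard-pos-neg-doc-pairs-statictable (model namespace plus the dataset name); the revision pins the 7b28b96 snapshot listed above.

from datasets import load_dataset

# Assumed repository id; the revision pins the snapshot referenced in this card.
dataset = load_dataset(
    "yahyaabd/query-hard-pos-neg-doc-pairs-statictable",
    revision="7b28b96",
)

train_ds = dataset["train"]       # 25,580 query/doc/label rows
print(train_ds.column_names)      # ['query', 'doc', 'label']
print(train_ds[0])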

Evaluation Dataset

query-hard-pos-neg-doc-pairs-statictable

  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 5,479 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
            | query              | doc                | label
    type    | string             | string             | int
    details | min: 7 tokens      | min: 4 tokens      | 0: ~71.50%
            | mean: 20.78 tokens | mean: 26.28 tokens | 1: ~28.50%
            | max: 52 tokens     | max: 43 tokens     |
  • Samples:
    query | doc | label
    Bagaimana perbandingan PNS pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
    bagaimana perbandingan pns pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
    BAGAIMANA PERBANDINGAN PNS PRIA DAN WANITA DI BERBAGAI GOLONGAN TAHUN 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
  • Loss: OnlineContrastiveLoss

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 24
  • per_device_eval_batch_size: 24
  • num_train_epochs: 2
  • warmup_ratio: 0.2
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True
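
Taken together, the non-default hyperparameters above correspond to a SentenceTransformerTrainer run with OnlineContrastiveLoss. The following is a hedged sketch of how such a run could be configured, not the exact training script; the dataset repository id, evaluation split name, and output directory are assumptions.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import OnlineContrastiveLoss

# Assumed dataset repository id; the evaluation split name may differ.
dataset = load_dataset("yahyaabd/query-hard-pos-neg-doc-pairs-statictable", revision="7b28b96")

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
loss = OnlineContrastiveLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-search-miniLM-v1-5",  # assumed output directory
    num_train_epochs=2,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    warmup_ratio=0.2,
    fp16=True,
    eval_strategy="steps",
    eval_on_start=True,
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name is an assumption
    loss=loss,
)
trainer.train()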

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 24
  • per_device_eval_batch_size: 24
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.2
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-mini-v1_test_cosine_ap allstats-semantic-mini-v1_dev_cosine_ap
-1 -1 - - 0.8789 -
0 0 - 0.7267 - 0.8789
0.0188 20 0.668 0.6453 - 0.8848
0.0375 40 0.6117 0.4411 - 0.9003
0.0563 60 0.3108 0.3592 - 0.9130
0.0750 80 0.3824 0.2899 - 0.9336
0.0938 100 0.2118 0.2530 - 0.9442
0.1126 120 0.232 0.1945 - 0.9582
0.1313 140 0.1233 0.1663 - 0.9656
0.1501 160 0.1293 0.1655 - 0.9654
0.1689 180 0.0714 0.2142 - 0.9578
0.1876 200 0.1198 0.1455 - 0.9702
0.2064 220 0.1081 0.1258 - 0.9766
0.2251 240 0.0484 0.1210 - 0.9753
0.2439 260 0.1463 0.1100 - 0.9792
0.2627 280 0.0422 0.1228 - 0.9777
0.2814 300 0.1187 0.1302 - 0.9725
0.3002 320 0.0635 0.1257 - 0.9733
0.3189 340 0.0422 0.1125 - 0.9736
0.3377 360 0.0479 0.0882 - 0.9796
0.3565 380 0.119 0.1319 - 0.9697
0.3752 400 0.099 0.1445 - 0.9702
0.3940 420 0.0409 0.1434 - 0.9706
0.4128 440 0.1053 0.1520 - 0.9686
0.4315 460 0.1035 0.1382 - 0.9727
0.4503 480 0.0848 0.1150 - 0.9789
0.4690 500 0.0387 0.0944 - 0.9826
0.4878 520 0.0097 0.1041 - 0.9811
0.5066 540 0.0667 0.1041 - 0.9783
0.5253 560 0.1028 0.1386 - 0.9736
0.5441 580 0.0543 0.1350 - 0.9769
0.5629 600 0.0859 0.1254 - 0.9776
0.5816 620 0.0853 0.1483 - 0.9728
0.6004 640 0.024 0.1159 - 0.9781
0.6191 660 0.0762 0.1046 - 0.9784
0.6379 680 0.0433 0.1275 - 0.9686
0.6567 700 0.0772 0.0592 - 0.9882
0.6754 720 0.0185 0.0542 - 0.9889
0.6942 740 0.0376 0.1123 - 0.9801
0.7129 760 0.0612 0.1002 - 0.9817
0.7317 780 0.0156 0.0948 - 0.9809
0.7505 800 0.0474 0.0778 - 0.9817
0.7692 820 0.0427 0.0824 - 0.9828
0.7880 840 0.0289 0.0911 - 0.9833
0.8068 860 0.0175 0.0991 - 0.9827
0.8255 880 0.0241 0.0951 - 0.9824
0.8443 900 0.0527 0.0816 - 0.9860
0.8630 920 0.0535 0.0707 - 0.9875
0.8818 940 0.0211 0.0767 - 0.9868
0.9006 960 0.013 0.0758 - 0.9872
0.9193 980 0.0079 0.0781 - 0.9848
0.9381 1000 0.0406 0.0820 - 0.9845
0.9568 1020 0.0277 0.0685 - 0.9874
0.9756 1040 0.0132 0.0760 - 0.9859
0.9944 1060 0.0268 0.0881 - 0.9833
1.0131 1080 0.0089 0.0772 - 0.9857
1.0319 1100 0.0276 0.0773 - 0.9850
1.0507 1120 0.0181 0.0729 - 0.9860
1.0694 1140 0.0065 0.0683 - 0.9867
1.0882 1160 0.01 0.0639 - 0.9873
1.1069 1180 0.0068 0.0662 - 0.9870
1.1257 1200 0.0 0.0722 - 0.9863
1.1445 1220 0.0067 0.0710 - 0.9866
1.1632 1240 0.0069 0.0666 - 0.9877
1.1820 1260 0.0 0.0639 - 0.9880
1.2008 1280 0.0244 0.0610 - 0.9882
1.2195 1300 0.0143 0.0630 - 0.9877
1.2383 1320 0.0173 0.0530 - 0.9896
1.2570 1340 0.0171 0.0496 - 0.9907
1.2758 1360 0.0225 0.0521 - 0.9909
1.2946 1380 0.011 0.0569 - 0.9900
1.3133 1400 0.0088 0.0605 - 0.9898
1.3321 1420 0.0 0.0619 - 0.9897
1.3508 1440 0.0135 0.0608 - 0.9894
1.3696 1460 0.0 0.0593 - 0.9892
1.3884 1480 0.0145 0.0578 - 0.9894
1.4071 1500 0.0 0.0608 - 0.9896
1.4259 1520 0.0069 0.0567 - 0.9906
1.4447 1540 0.0 0.0561 - 0.9907
1.4634 1560 0.0224 0.0531 - 0.9912
1.4822 1580 0.0 0.0523 - 0.9911
1.5009 1600 0.0066 0.0503 - 0.9912
1.5197 1620 0.0 0.0472 - 0.9915
1.5385 1640 0.018 0.0452 - 0.9923
1.5572 1660 0.0117 0.0449 - 0.9925
1.5760 1680 0.0 0.0456 - 0.9925
1.5947 1700 0.0 0.0448 - 0.9925
1.6135 1720 0.0 0.0448 - 0.9925
1.6323 1740 0.0072 0.0458 - 0.9924
1.6510 1760 0.0 0.0456 - 0.9923
1.6698 1780 0.0163 0.0482 - 0.9925
1.6886 1800 0.0063 0.0463 - 0.9926
1.7073 1820 0.0078 0.0482 - 0.9925
1.7261 1840 0.0179 0.0472 - 0.9927
1.7448 1860 0.0 0.0477 - 0.9927
1.7636 1880 0.0 0.0477 - 0.9927
1.7824 1900 0.0065 0.0461 - 0.9926
1.8011 1920 0.0077 0.0458 - 0.9926
1.8199 1940 0.0065 0.0453 - 0.9927
1.8386 1960 0.0 0.0451 - 0.9927
1.8574 1980 0.0 0.0451 - 0.9927
1.8762 2000 0.0 0.0451 - 0.9927
1.8949 2020 0.0 0.0451 - 0.9927
1.9137 2040 0.0 0.0451 - 0.9927
1.9325 2060 0.0 0.0451 - 0.9927
1.9512 2080 0.0 0.0451 - 0.9927
1.9700 2100 0.007 0.0442 - 0.9927
1.9887 2120 0.0067 0.0441 - 0.9927
-1 -1 - - 0.9927 -
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}