SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the query-pos-neg-doc-pairs-statictable dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
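
The Pooling module above has only pooling_mode_mean_tokens enabled, so the sentence embedding is the average of the token vectors. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask; the function name mean_pool and the toy 2-dimensional vectors are illustrative and not part of the library (the real model uses 384 dimensions).

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions (mask == 1)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]


# Toy input: three token positions, the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```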

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-miniLM-v1-7")
# Run inference
sentences = [
    'Aliran dana Rupiah: Q1 2008',
    'IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor (Supervisor), 2012-2014 (2012=100)',
    'Ringkasan Neraca Arus Dana, 2012 (Miliar Rupiah)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
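
To make the similarity step concrete, here is a small sketch of a pairwise cosine similarity matrix in plain Python. It assumes the model's similarity function is cosine similarity, which is consistent with the cosine-based metrics reported in this card but is not stated explicitly here; the helper names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """All-pairs cosine similarities, analogous in shape to the [3, 3] result above."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 2-dimensional "embeddings" standing in for the 384-dimensional ones.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(value, 4) for value in row])
```

For search, the first row of such a matrix can be read as the scores of a query against every document, and sorting by score gives a ranking.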

Evaluation

Metrics

Binary Classification

Metric                      allstats-semantic-mini-v1_test   allstats-semantic-mini-v1_dev
cosine_accuracy             0.9679                           0.9678
cosine_accuracy_threshold   0.7482                           0.7902
cosine_f1                   0.9678                           0.9674
cosine_f1_threshold         0.7444                           0.7875
cosine_precision            0.9596                           0.9617
cosine_recall               0.9762                           0.9731
cosine_ap                   0.9922                           0.993
cosine_mcc                  0.9359                           0.9357
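
The thresholds in the table turn a cosine similarity into a relevant / not relevant decision. The sketch below shows one way to apply the reported test accuracy threshold (0.7482) and compute the same kinds of metrics from the resulting predictions. The scores and labels are made up for illustration, and treating a score equal to the threshold as positive is an assumption.

```python
import math


def classify(scores, threshold):
    """Label a pair 1 (relevant) when its cosine similarity reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def binary_metrics(labels, preds):
    """Accuracy, precision, recall, F1 and MCC from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


THRESHOLD = 0.7482  # cosine_accuracy_threshold reported for the test split
scores = [0.91, 0.80, 0.74, 0.30, 0.76, 0.10]  # made-up cosine similarities
labels = [1, 1, 1, 0, 0, 0]                    # made-up gold labels
preds = classify(scores, THRESHOLD)
metrics = binary_metrics(labels, preds)
print(preds)
print(metrics)
```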

Training Details

Training Dataset

query-pos-neg-doc-pairs-statictable

  • Dataset: query-pos-neg-doc-pairs-statictable at a31b58d
  • Size: 110,773 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
               query                 doc                   label
      type     string                string                int
      details  min: 9 tokens         min: 6 tokens         0: ~43.90%
               mean: 21.22 tokens    mean: 28.24 tokens    1: ~56.10%
               max: 50 tokens        max: 50 tokens
  • Samples:
      query: Data orang yang naik/turun kapal, di pelabuhan yang dikelola maupun tidak, sekitar 2015
      doc:   Tabel Input-Output Indonesia Transaksi Total Atas Dasar Harga Dasar (185 Produk), 2016 (Juta Rupiah)
      label: 0

      query: data orang yang naik/turun kapal, di pelabuhan yang dikelola maupun tidak, sekitar 2015
      doc:   Tabel Input-Output Indonesia Transaksi Total Atas Dasar Harga Dasar (185 Produk), 2016 (Juta Rupiah)
      label: 0

      query: DATA ORANG YANG NAIK/TURUN KAPAL, DI PELABUHAN YANG DIKELOLA MAUPUN TIDAK, SEKITAR 2015
      doc:   Tabel Input-Output Indonesia Transaksi Total Atas Dasar Harga Dasar (185 Produk), 2016 (Juta Rupiah)
      label: 0
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
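
As a rough illustration of what these parameters mean, the sketch below computes a contrastive loss over labelled pairs using cosine distance (1 minus cosine similarity), a margin of 0.5, and averaging over pairs for size_average. It follows the Hadsell et al. formulation with a 0.5 scaling factor, which is my understanding of how the library computes it; treat that detail, the helper names, and the toy vectors as assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so identical directions have distance 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def contrastive_loss(pairs, labels, margin=0.5, size_average=True):
    """Pull label-1 pairs together; push label-0 pairs at least `margin` apart."""
    losses = []
    for (u, v), label in zip(pairs, labels):
        d = cosine_distance(u, v)
        positive_term = label * d ** 2
        negative_term = (1 - label) * max(0.0, margin - d) ** 2
        losses.append(0.5 * (positive_term + negative_term))
    total = sum(losses)
    return total / len(losses) if size_average else total


same = ([1.0, 0.0], [1.0, 0.0])        # distance 0
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # distance 1
pairs = [same, orthogonal, same, orthogonal]
labels = [1, 0, 0, 1]  # the last two pairs are "wrong" and get penalised
loss = contrastive_loss(pairs, labels)
print(loss)
```

Negative pairs that are already further apart than the margin contribute nothing, so training effort concentrates on negatives the model still scores as similar.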
    

Evaluation Dataset

query-pos-neg-doc-pairs-statictable

  • Dataset: query-pos-neg-doc-pairs-statictable at a31b58d
  • Size: 23,763 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
               query                 doc                   label
      type     string                string                int
      details  min: 7 tokens         min: 6 tokens         0: ~50.20%
               mean: 20.75 tokens    mean: 27.44 tokens    1: ~49.80%
               max: 57 tokens        max: 43 tokens
  • Samples:
      query: Cek penghasilan bulanan (gaji bersih) buruh/pegawai, per provinsi dan jenis pekerjaannya, 2019
      doc:   Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama, 2021
      label: 1

      query: cek penghasilan bulanan (gaji bersih) buruh/pegawai, per provinsi dan jenis pekerjaannya, 2019
      doc:   Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama, 2021
      label: 1

      query: CEK PENGHASILAN BULANAN (GAJI BERSIH) BURUH/PEGAWAI, PER PROVINSI DAN JENIS PEKERJAANNYA, 2019
      doc:   Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama, 2021
      label: 1
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 1
  • warmup_ratio: 0.2
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.2
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch    Step   Training Loss   Validation Loss   allstats-semantic-mini-v1_test_cosine_ap   allstats-semantic-mini-v1_dev_cosine_ap
-1       -1     -               -                 0.8699                                     -
0        0      -               0.0489            -                                          0.8658
0.0578   100    0.0222          0.0101            -                                          0.9458
0.1155   200    0.0087          0.0073            -                                          0.9631
0.1733   300    0.007           0.0059            -                                          0.9710
0.2311   400    0.0056          0.0049            -                                          0.9828
0.2889   500    0.0045          0.0044            -                                          0.9837
0.3466   600    0.0042          0.0041            -                                          0.9862
0.4044   700    0.0038          0.0038            -                                          0.9888
0.4622   800    0.0037          0.0037            -                                          0.9890
0.5199   900    0.0029          0.0036            -                                          0.9889
0.5777   1000   0.0031          0.0034            -                                          0.9907
0.6355   1100   0.0029          0.0033            -                                          0.9923
0.6932   1200   0.0025          0.0034            -                                          0.9922
0.7510   1300   0.0025          0.0033            -                                          0.9929
0.8088   1400   0.0024          0.0033            -                                          0.9928
0.8666   1500   0.0022          0.0033            -                                          0.9926
0.9243   1600   0.0023          0.0033            -                                          0.9929
0.9821   1700   0.0022          0.0032            -                                          0.993
-1       -1     -               -                 0.9922                                     -
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}

Model size: 118M params (Safetensors, tensor type F32)
