metadata

base_model: vinai/phobert-base-v2
datasets: []
language: []
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:102178
  - loss:TripletLoss
widget:
  - source_sentence: >-
      Bàn cho thấy các thiết_kế và sản_xuất kiến_thức cần_thiết để thực_hiện
      nhiều quyết_định thông_báo hơn .
    sentences:
      - Nixon quyết_định rằng hồ chí minh có_thể ở lại miền nam Việt_Nam .
      - Không có gì cần_thiết để đưa ra một quyết_định thông_tin .
      - >-
        Bảng Hiển_thị thiết_kế và sản_xuất thông_tin cần_thiết để đưa ra
        quyết_định .
  - source_sentence: 95 gói nước_tiểu miễn_phí trong túi của họ .
    sentences:
      - Tây_ban nha trượt từ vị_trí quyền_lực của họ .
      - >-
        Đội đã bước vào phòng thí_nghiệm mang theo tổng_cộng 99 đơn_vị
        trong_sạch , thử_nghiệm thân_thiện .
      - >-
        Túi được yêu_cầu cho nhà toàn_bộ 95 đơn_vị phục_vụ trong_sạch nước_tiểu
        giữa các nhà cung_cấp các sản_phẩm .
  - source_sentence: >-
      Tuyển một chiếc xe rất đắt tiền , và những gì có để xem_thường là gần
      những con đường chính .
    sentences:
      - >-
        Thuê một chiếc xe rất rẻ nhưng có_thể không đáng_giá_như những cảnh_sát
        ở xa con đường .
      - Có một nhà_thờ hình_tròn ở orangerie ở Paris .
      - >-
        Thuê một chiếc xe đến với chi_phí lớn và hầu_hết các điểm đến đều gần
        đường .
  - source_sentence: Người da đen là 12 phần_trăm dân_số .
    sentences:
      - Người da đen tạo ra 50 % tổng_số dân_số .
      - Người Mỹ Châu_Phi là một nhóm_thiểu_số .
      - Tôi đoán là barney fife .
  - source_sentence: >-
      Báo đen đã editorialized chống lại những cuộc viếng_thăm của farrakhan với
      các nhà độc_tài châu phi .
    sentences:
      - Báo đen đã viết về quá_khứ của farrakhan .
      - Khi bạn đi đến radda , bạn nên kiểm_tra piccolo bảo del chianti .
      - Báo đen từ_chối yểm_trợ cho farrakhan .
model-index:
  - name: SentenceTransformer based on vinai/phobert-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts dev
          type: sts-dev
        metrics:
          - type: pearson_cosine
            value: 0.42030854811305457
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.5147968030818376
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.5605026901702432
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.5792048311109484
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.4710386131519505
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.5087153254455983
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.3923969498466928
            name: Pearson Dot
          - type: spearman_dot
            value: 0.4338097270757405
            name: Spearman Dot
          - type: pearson_max
            value: 0.5605026901702432
            name: Pearson Max
          - type: spearman_max
            value: 0.5792048311109484
            name: Spearman Max

SentenceTransformer based on vinai/phobert-base-v2

This is a sentence-transformers model fine-tuned from vinai/phobert-base-v2. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: vinai/phobert-base-v2
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
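
The Pooling module above has only pooling_mode_mean_tokens enabled, so each sentence vector is the average of its token vectors. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the function name mean_pool and the toy 2-dimensional vectors are illustrative assumptions (the real model works on 768-dimensional tensors).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (non-padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three token vectors, the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters because padded positions would otherwise pull the average toward arbitrary values.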

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("huudan123/stage1")
# Run inference
sentences = [
    'Báo đen đã editorialized chống lại những cuộc viếng_thăm của farrakhan với các nhà độc_tài châu phi .',
    'Báo đen đã viết về quá_khứ của farrakhan .',
    'Báo đen từ_chối yểm_trợ cho farrakhan .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
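
Since the model's similarity function is cosine similarity, the pairwise score matrix can be pictured as below. This is a hedged pure-Python sketch on toy vectors, not the library code; the helper names cosine_similarity and similarity_matrix are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """All-pairs cosine similarity; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 2-dimensional stand-ins for the 768-dimensional sentence embeddings.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(score, 4) for score in row])
```

The diagonal is 1 because every vector is compared with itself, and the matrix is symmetric.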

Evaluation

Metrics

Semantic Similarity

Dataset: sts-dev
Evaluated with EmbeddingSimilarityEvaluator

Metric	Value
pearson_cosine	0.4203
spearman_cosine	0.5148
pearson_manhattan	0.5605
spearman_manhattan	0.5792
pearson_euclidean	0.471
spearman_euclidean	0.5087
pearson_dot	0.3924
spearman_dot	0.4338
pearson_max	0.5605
spearman_max	0.5792
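
Each metric above correlates the model's similarity scores with gold similarity labels: Pearson measures linear agreement, Spearman measures agreement of rankings. In this table the two max rows equal the Manhattan rows, the largest values among the four similarity functions. The sketch below is a pure-Python illustration with made-up scores, not the evaluator's code; the names pearson, ranks, and spearman are assumptions.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Made-up gold labels and predicted similarities (illustrative only).
gold = [0.0, 1.0, 2.0, 3.0, 5.0]
predicted = [0.10, 0.20, 0.30, 0.40, 0.99]
print(round(pearson(gold, predicted), 4), round(spearman(gold, predicted), 4))
```

In the toy data the predictions preserve the gold ordering exactly, so Spearman is 1 while Pearson is lower because the relationship is not perfectly linear.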

Training Details

Training Dataset

Unnamed Dataset

Size: 102,178 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 4 tokens, mean: 27.28 tokens, max: 147 tokens	min: 4 tokens, mean: 14.99 tokens, max: 44 tokens	min: 4 tokens, mean: 14.34 tokens, max: 34 tokens

Samples:

anchor	positive	negative
`Tem đầy màu_sắc của madeira , cũng như tiền xu , ghi_chép ngân_hàng , và các mặt_hàng khác như bưu_thiếp là mối quan_tâm đến nhiều nhà sưu_tập .`	`Các nhà sưu_tập sẽ thích ghé thăm madeira bởi_vì những phân_chia lớn của tem , ghi_chép ngân_hàng , bưu_thiếp , và nhiều mặt_hàng khác họ có_thể đọc được .`	`Mọi người quan_tâm đến việc bắt_đầu bộ sưu_tập mới nên thoát madeira và đi du_lịch phía bắc , nơi họ có khả_năng tìm thấy các cửa_hàng tốt .`
`Cẩn_thận đấy , ông inglethorp . Poirot bị bồn_chồn .`	`Hãy chăm_sóc ông inglethorp .`	`Không cần phải cẩn_thận với anh ta .`
`Phải có một_chút hoài_nghi về trải nghiệm cá_nhân của sperling với trò_chơi .`	`Hãy suy_nghĩ về những tác_động khi nhìn vào kinh_nghiệm của anh ấy .`	`Một người có_thể lấy trải nghiệm cá_nhân của sperling với giá_trị mặt .`

Loss: TripletLoss with these parameters:

{
    "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
    "triplet_margin": 5
}
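
With a Euclidean distance metric and a margin of 5, the triplet objective asks the negative to be at least 5 units farther from the anchor than the positive is. The following is a minimal per-triplet sketch in pure Python on toy 2-dimensional vectors, assuming the standard hinge formulation max(d(a, p) - d(a, n) + margin, 0); it is not the training code used for this model.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 5.0) -> float:
    """Hinge on the distance gap: zero once the negative is `margin` farther than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [6.0, 8.0]    # distance 10: margin satisfied, no loss
near_negative = [0.0, 1.0]   # distance 1: closer than the positive, penalized

print(triplet_loss(anchor, positive, far_negative))
print(triplet_loss(anchor, positive, near_negative))
```

A batch loss would typically average this value over all triplets in the batch.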

Evaluation Dataset

Unnamed Dataset

Size: 12,772 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 4 tokens, mean: 27.81 tokens, max: 164 tokens	min: 3 tokens, mean: 14.94 tokens, max: 42 tokens	min: 4 tokens, mean: 14.4 tokens, max: 39 tokens

Samples:

anchor	positive	negative
`Tình_yêu , anh có muốn em trở_thành kassandra lubbock của anh không ?`	`Tôi có_thể là kassandra lubbock của anh .`	`Tôi từ_chối trở_thành kassandra lubbock của anh .`
`Ví_dụ , trong mùa thu năm 1997 , ủy ban điều_trị hạt_nhân ( nrc ) văn_phòng thanh_tra tướng liệu nrc để có được quan_điểm của họ trên văn_hóa an_toàn của đại_lý .`	`Nhân_viên nrc đã được hỏi về quan_điểm của họ trên văn_hóa an_toàn của đại_lý .`	`Các nhân_viên không bao_giờ quan_sát về quan_điểm của họ về văn_hóa an_toàn của đại_lý trong mùa thu năm 1997 .`
`Mỗi năm kem của trẻ nghệ và comedic tài_năng làm cho nó đường đến edinburgh , và fringe đã lớn lên trong việc huấn_luyện lớn nhất trong khung_cảnh lớn nhất cho các diễn_viên phát_triển trên thế_giới .`	`Tài_năng mới đến edinburgh .`	`Tài_năng mới đến dublin .`

Loss: TripletLoss with these parameters:

{
    "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
    "triplet_margin": 5
}

Training Hyperparameters

Non-Default Hyperparameters

overwrite_output_dir: True
eval_strategy: epoch
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
num_train_epochs: 20
lr_scheduler_type: cosine
warmup_ratio: 0.05
fp16: True
load_best_model_at_end: True
gradient_checkpointing: True

All Hyperparameters

overwrite_output_dir: True
do_predict: False
eval_strategy: epoch
prediction_loss_only: True
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 20
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.05
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: True
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	sts-dev_spearman_cosine
0	0	-	-	0.6643
0.0626	50	4.6946	-	-
0.1252	100	4.031	-	-
0.1877	150	2.7654	-	-
0.2503	200	2.4176	-	-
0.3129	250	2.1111	-	-
0.3755	300	2.0263	-	-
0.4380	350	1.9296	-	-
0.5006	400	1.7793	-	-
0.5632	450	1.7903	-	-
0.6258	500	1.7638	-	-
0.6884	550	1.7042	-	-
0.7509	600	1.7038	-	-
0.8135	650	1.6221	-	-
0.8761	700	1.6172	-	-
0.9387	750	1.6227	-	-
1.0	799	-	1.5275	0.5219
1.0013	800	1.6264	-	-
1.0638	850	1.364	-	-
1.1264	900	1.4447	-	-
1.1890	950	1.4161	-	-
1.2516	1000	1.3575	-	-
1.3141	1050	1.3554	-	-
1.3767	1100	1.378	-	-
1.4393	1150	1.3806	-	-
1.5019	1200	1.3089	-	-
1.5645	1250	1.4314	-	-
1.6270	1300	1.3672	-	-
1.6896	1350	1.3777	-	-
1.7522	1400	1.3282	-	-
1.8148	1450	1.3432	-	-
1.8773	1500	1.3101	-	-
1.9399	1550	1.2919	-	-
2.0	1598	-	1.3643	0.5667
2.0025	1600	1.2969	-	-
2.0651	1650	0.9629	-	-
2.1277	1700	0.9878	-	-
2.1902	1750	0.9437	-	-
2.2528	1800	0.9832	-	-
2.3154	1850	0.9584	-	-
2.3780	1900	1.0689	-	-
2.4406	1950	1.0579	-	-
2.5031	2000	0.9888	-	-
2.5657	2050	0.9452	-	-
2.6283	2100	0.9378	-	-
2.6909	2150	0.9553	-	-
2.7534	2200	0.9337	-	-
2.8160	2250	1.0184	-	-
2.8786	2300	0.9663	-	-
2.9412	2350	0.9686	-	-
3.0	2397	-	1.3488	0.5442
3.0038	2400	0.9618	-	-
3.0663	2450	0.6878	-	-
3.1289	2500	0.6883	-	-
3.1915	2550	0.6498	-	-
3.2541	2600	0.6651	-	-
3.3166	2650	0.6554	-	-
3.3792	2700	0.7033	-	-
3.4418	2750	0.6416	-	-
3.5044	2800	0.7068	-	-
3.5670	2850	0.6834	-	-
3.6295	2900	0.7099	-	-
3.6921	2950	0.7306	-	-
3.7547	3000	0.7105	-	-
3.8173	3050	0.7072	-	-
3.8798	3100	0.7248	-	-
3.9424	3150	0.7216	-	-
4.0	3196	-	1.3358	0.5307
4.0050	3200	0.693	-	-
4.0676	3250	0.4741	-	-
4.1302	3300	0.4593	-	-
4.1927	3350	0.449	-	-
4.2553	3400	0.4326	-	-
4.3179	3450	0.4488	-	-
4.3805	3500	0.4762	-	-
4.4431	3550	0.4723	-	-
4.5056	3600	0.4713	-	-
4.5682	3650	0.4612	-	-
4.6308	3700	0.4537	-	-
4.6934	3750	0.4928	-	-
4.7559	3800	0.4568	-	-
4.8185	3850	0.4771	-	-
4.8811	3900	0.4688	-	-
4.9437	3950	0.4549	-	-
5.0	3995	-	1.4027	0.5360
5.0063	4000	0.5048	-	-
5.0688	4050	0.2822	-	-
5.1314	4100	0.3069	-	-
5.1940	4150	0.2971	-	-
5.2566	4200	0.3191	-	-
5.3191	4250	0.3023	-	-
5.3817	4300	0.3224	-	-
5.4443	4350	0.3114	-	-
5.5069	4400	0.3098	-	-
5.5695	4450	0.3071	-	-
5.6320	4500	0.3478	-	-
5.6946	4550	0.3288	-	-
5.7572	4600	0.3373	-	-
5.8198	4650	0.3577	-	-
5.8824	4700	0.331	-	-
5.9449	4750	0.3132	-	-
6.0	4794	-	1.4036	0.5148

The bold row denotes the saved checkpoint.

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.0.1
Transformers: 4.42.4
PyTorch: 2.3.1+cu121
Accelerate: 0.32.1
Datasets: 2.20.0
Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification}, 
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}