# SentenceTransformer
This is a sentence-transformers model trained on the corpus dataset. It maps English and Russian sentences & paragraphs into a shared 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details

### Model Description
- Model Type: Sentence Transformer
- Base Model: answerdotai/ModernBERT-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: corpus
### Model Sources
- Documentation: [Sentence Transformers Documentation](https://sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
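The Pooling module above has `pooling_mode_mean_tokens` set to True, so a sentence embedding is the average of the token embeddings with padding masked out. A minimal sketch of the equivalent computation using plain `transformers`, assuming the repository loads as a standard ModernBERT checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("whitemouse84/ModernBERT-base-en-ru-v1")
model = AutoModel.from_pretrained("whitemouse84/ModernBERT-base-en-ru-v1")

inputs = tokenizer(["Transparency is absolutely critical to this."],
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average the token vectors, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```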
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("whitemouse84/ModernBERT-base-en-ru-v1")

# Run inference
sentences = [
    'Transparency is absolutely critical to this.',
    'Прозрачность - абсолютно критична в этом процессе.',
    'Мы покупаем его нашим детям.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
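Because English and Russian text share one embedding space, the model supports cross-lingual retrieval out of the box. A minimal sketch using `sentence_transformers.util.semantic_search` (the corpus sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("whitemouse84/ModernBERT-base-en-ru-v1")

# Search a small Russian corpus with an English query (illustrative data)
corpus = [
    'Прозрачность - абсолютно критична в этом процессе.',
    'Мы покупаем его нашим детям.',
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode('Transparency is absolutely critical to this.',
                               convert_to_tensor=True)

# Nearest neighbors by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(hits[0][0])  # {'corpus_id': 0, 'score': ...}
```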
## Evaluation

### Metrics

#### Knowledge Distillation
- Datasets: `small_content` and `big_content`
- Evaluated with `MSEEvaluator`

| Metric | small_content | big_content |
|:---|:---:|:---:|
| negative_mse | -4.3569 | -3.5414 |
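Here `negative_mse` is the mean squared error between the student's embeddings and the stored teacher embeddings, scaled by 100 and negated so that higher is better (assuming `MSEEvaluator`'s usual reporting convention). A sketch of the computation:

```python
import numpy as np

def negative_mse(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    # MSE between student and teacher embeddings, scaled by 100 and negated
    # so that higher is better (assumed MSEEvaluator convention).
    return -float(np.mean((student_emb - teacher_emb) ** 2)) * 100.0
```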
#### Translation
- Datasets: `small_content` and `big_content`
- Evaluated with `TranslationEvaluator`

| Metric | small_content | big_content |
|:---|:---:|:---:|
| src2trg_accuracy | 0.7375 | 0.8285 |
| trg2src_accuracy | 0.665 | 0.668 |
| mean_accuracy | 0.7013 | 0.7483 |
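`src2trg_accuracy` is the fraction of source (English) sentences whose paired translation is their nearest neighbor by cosine similarity among all target sentences; `trg2src_accuracy` is the reverse direction, and `mean_accuracy` averages the two. A self-contained sketch of that protocol:

```python
import numpy as np

def translation_accuracy(src_emb: np.ndarray, trg_emb: np.ndarray) -> float:
    # Row i of src_emb and row i of trg_emb are a translation pair.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    trg = trg_emb / np.linalg.norm(trg_emb, axis=1, keepdims=True)
    sims = src @ trg.T                     # cosine similarity matrix (n, n)
    nearest = sims.argmax(axis=1)          # best-matching target per source
    return float((nearest == np.arange(len(src))).mean())

# mean_accuracy averages both directions:
# 0.5 * (translation_accuracy(src, trg) + translation_accuracy(trg, src))
```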
#### Encodechka

Scores on the Encodechka benchmark of Russian sentence encoders (higher is better):

| Model | STS | PI | NLI | SA | TI | IA | IC | ICX |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ModernBERT-base-en-ru-v1 | 0.602 | 0.521 | 0.355 | 0.722 | 0.892 | 0.704 | 0.747 | 0.591 |
| ModernBERT-base | 0.498 | 0.239 | 0.358 | 0.643 | 0.786 | 0.623 | 0.593 | 0.104 |
| EuroBERT-210m | 0.619 | 0.452 | 0.369 | 0.702 | 0.875 | 0.703 | 0.647 | 0.192 |
| xlm-roberta-base | 0.552 | 0.439 | 0.362 | 0.752 | 0.940 | 0.768 | 0.695 | 0.520 |
## Training Details

### Training Dataset

#### corpus
- Dataset: corpus
- Size: 2,000,000 training samples
- Columns: `english`, `non_english`, and `label`
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
|:---|:---|:---|:---|
| type | string | string | list |
| details | min: 4 tokens, mean: 29.26 tokens, max: 133 tokens | min: 7 tokens, mean: 71.46 tokens, max: 285 tokens | size: 768 elements |

- Samples:

| english | non_english | label |
|:---|:---|:---|
| Hence it can be said that Voit is a well-satisfied customer, and completely convinced of the potential offered by Voortman machines for his firm. | В конечном итоге можно утверждать, что компания Voit довольна своим выбором, ведь она имела возможность убедиться в качественных характеристиках оборудования Voortman. | [0.1702279895544052, -0.6711388826370239, -0.5062062740325928, 0.14078450202941895, 0.15188495814800262, ...] |
| We want to feel good, we want to be happy, in fact happiness is our birthright. | Мы хотим чувствовать себя хорошо, хотим быть счастливы. | [0.556108295917511, -0.42819586396217346, -0.25372204184532166, 0.099883534014225, 0.7299532294273376, ...] |
| In Germany, Arcandor - a major holding company in the mail order, retail and tourism industries that reported €21 billion in 2007 sales - threatens to become the first victim of tighter credit terms. | В Германии Arcandor - ключевая холдинговая компания в сфере посылочной и розничной торговли, а также индустрии туризма, в финансовых отчетах которой за 2007 год значился торговый оборот в размере €21 миллиардов - грозит стать первой жертвой ужесточения условий кредитования. | [-0.27140647172927856, -0.5173773169517517, -0.6571329236030579, 0.21765929460525513, -0.01978394016623497, ...] |

- Loss: `MSELoss` (see the distillation sketch below)
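The `label` column holds a teacher model's 768-dimensional embedding of the English sentence; `MSELoss` then trains the student so that its embeddings of both the English and the Russian text regress onto that target vector. A minimal sketch of this setup; the card does not name the teacher, so the one below is purely illustrative:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical teacher: any 768-dimensional English encoder would fit here.
teacher = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
# Student initialized from the base checkpoint (mean pooling is added automatically).
student = SentenceTransformer("answerdotai/ModernBERT-base")

english = ["Transparency is absolutely critical to this."]
non_english = ["Прозрачность - абсолютно критична в этом процессе."]

# label = teacher embedding of the English side; the student learns to map
# both the English and the Russian sentence onto this target vector.
labels = teacher.encode(english).tolist()
train_dataset = Dataset.from_dict(
    {"english": english, "non_english": non_english, "label": labels}
)

loss = losses.MSELoss(model=student)
```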
### Evaluation Datasets

#### small_content
- Dataset: small_content
- Size: 2,000 evaluation samples
- Columns: `english`, `non_english`, and `label`
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
|:---|:---|:---|:---|
| type | string | string | list |
| details | min: 4 tokens, mean: 24.13 tokens, max: 252 tokens | min: 5 tokens, mean: 53.83 tokens, max: 406 tokens | size: 768 elements |

- Samples:

| english | non_english | label |
|:---|:---|:---|
| Thank you so much, Chris. | Спасибо, Крис. | [1.0408389568328857, 0.3253674805164337, -0.12651680409908295, 0.45153331756591797, 0.4052223563194275, ...] |
| And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful. | Это огромная честь, получить возможность выйти на эту сцену дважды. Я неимоверно благодарен. | [0.6990637183189392, -0.4462655782699585, -0.5292129516601562, 0.23709823191165924, 0.32307693362236023, ...] |
| I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night. | Я в восторге от этой конференции, и я хочу поблагодарить вас всех за благожелательные отзывы о моем позавчерашнем выступлении. | [0.8470447063446045, -0.17461800575256348, -0.7178670167922974, 0.6488378047943115, 0.6101466417312622, ...] |

- Loss: `MSELoss`
#### big_content
- Dataset: big_content
- Size: 2,000 evaluation samples
- Columns: `english`, `non_english`, and `label`
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
|:---|:---|:---|:---|
| type | string | string | list |
| details | min: 6 tokens, mean: 43.84 tokens, max: 141 tokens | min: 10 tokens, mean: 107.9 tokens, max: 411 tokens | size: 768 elements |

- Samples:

| english | non_english | label |
|:---|:---|:---|
| India has recorded a surge in COVID-19 cases in the past weeks, with over 45,000 new cases detected every day since July 23. | Индия зафиксировала резкий всплеск случаев заражения COVID-19 за последние недели, с 23 июля каждый день выявляется более 45 000 новых случаев. | [-0.12528948485851288, -0.49428656697273254, -0.07556094229221344, 0.8069225549697876, 0.20946118235588074, ...] |
| A bloom the Red Tide extends approximately 130 miles of coastline from northern Pinellas to southern Lee counties. | Цветение Красного Прилива простирается примерно на 130 миль вдоль береговой линии от Пинеллас на севере до округа Ли на юге. | [0.027262285351753235, -0.4401558041572571, -0.3353440463542938, 0.11166133731603622, -0.2294958084821701, ...] |
| Among those affected by the new rules is Transport Secretary Grant Shapps, who began his holiday in Spain on Saturday. | Среди тех, кого затронули новые правила, оказался министр транспорта Грант Шэппс, у которого в субботу начался отпуск в Испании. | [0.1868007630109787, -0.18781621754169464, -0.48890581727027893, 0.328614205121994, 0.36041054129600525, ...] |

- Loss: `MSELoss`
### Training Hyperparameters

#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 4
- `gradient_accumulation_steps`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `bf16`: True
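These settings map directly onto the sentence-transformers 3.x trainer API. A sketch (the output path is illustrative; `student`, `train_dataset`, and `loss` are the names from the distillation sketch above):

```python
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

args = SentenceTransformerTrainingArguments(
    output_dir="models/ModernBERT-base-en-ru-v1",  # illustrative path
    eval_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=16,  # effective train batch size: 4 * 16 = 64
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.1,
    bf16=True,
)

# trainer = SentenceTransformerTrainer(
#     model=student, args=args, train_dataset=train_dataset, loss=loss
# )
# trainer.train()
```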
#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 4
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 16
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

</details>
### Framework Versions
- Python: 3.13.2
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu126
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### MSELoss

```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
```