A version of chcaa/dfm-encoder-large-v1 trained using SimCSE. It was trained as part of the Scandinavian Embedding Benchmark (SEB) to establish a naive baseline for SimCSE.
Note: We do not recommend using this model. Instead, check out the current best model on SEB or the recommendation by the Danish Foundation Models team.
Hyperparameters
Trained using the SimCSE implementation with:
# --train_file contains paragraphs extracted from Danish Gigaword
CUDA_VISIBLE_DEVICES=0 python train.py \
--train_file data/dfm_paragraphs.txt \
--model_name_or_path chcaa/dfm-encoder-large-v1 \
--num_train_epochs 1 \
--per_device_train_batch_size 128 \
--learning_rate 1e-5 \
--max_seq_length 32 \
--evaluation_strategy steps \
--metric_for_best_model stsb_spearman \
--load_best_model_at_end \
--pooler_type cls \
--mlp_only_train \
--do_mlm \
--overwrite_output_dir \
--temp 0.05 \
--do_train \
--fp16
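To illustrate what the `--temp 0.05` setting controls, here is a minimal sketch of the contrastive (InfoNCE) objective that unsupervised SimCSE is built around, written in plain Python. It is not the code in `train.py`: the function names `cosine` and `simcse_loss` and the toy vectors are assumptions made for illustration, and the sketch leaves out the auxiliary MLM term enabled by `--do_mlm` as well as the encoder itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temp=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from train.py).

    view_a[i] and view_b[i] are two embeddings of the same sentence
    (in unsupervised SimCSE, two forward passes with different dropout masks).
    Every other row of view_b acts as an in-batch negative for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        # Temperature-scaled similarities of the anchor to every candidate.
        logits = [cosine(anchor, other) / temp for other in view_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching index i as the correct class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two sentences, each embedded twice with slight noise.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]

aligned = simcse_loss(view_a, view_b)          # positives on the diagonal
shuffled = simcse_loss(view_a, view_b[::-1])   # positives deliberately mismatched
print(aligned < shuffled)
```

A lower temperature sharpens the softmax, so a mismatch between the two views of a sentence is penalised more heavily relative to the in-batch negatives.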
Citation
To cite this work please refer to the following article:
Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72
or use the following BibTeX:
@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
url = {https://openreview.net/forum?id=pJl_i7HIA72},
language = {en},
urldate = {2024-04-12},
author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
month = feb,
year = {2024},
}