
A version of chcaa/dfm-encoder-large-v1 trained using SimCSE. It was trained as part of the Scandinavian Embedding Benchmarks (SEB) to establish a naive SimCSE baseline.

Note: We do not recommend this model; instead, we encourage users to check out the current best model on SEB or the recommendations of the Danish Foundation Models team.

Hyperparameters

Trained using the SimCSE implementation with:

# data/dfm_paragraphs.txt contains paragraphs extracted from Danish Gigaword
CUDA_VISIBLE_DEVICES=0 python train.py \
    --train_file data/dfm_paragraphs.txt \
    --model_name_or_path chcaa/dfm-encoder-large-v1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --pooler_type cls \
    --mlp_only_train \
    --do_mlm \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --fp16 
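The --temp 0.05 flag above sets the temperature of SimCSE's in-batch contrastive (InfoNCE) objective: each sentence's two dropout views form a positive pair, and the other sentences in the batch serve as negatives. As a rough illustration (not this model's actual training code), a minimal NumPy sketch of that loss, using random vectors as stand-ins for encoder outputs:

```python
import numpy as np

def simcse_loss(z1, z2, temp=0.05):
    """In-batch contrastive (InfoNCE) loss as used by SimCSE."""
    # Normalize so that dot products become cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temp  # sim[i, j] = cos(z1_i, z2_j) / temp
    # Cross-entropy where the positive for row i is column i (its paired view).
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))                      # stand-in embeddings
positives = anchors + 0.01 * rng.normal(size=(4, 8))   # near-identical "views"
negatives = rng.normal(size=(4, 8))                    # unrelated embeddings

loss_pos = simcse_loss(anchors, positives)  # low: positives align with anchors
loss_neg = simcse_loss(anchors, negatives)  # high: no pair stands out
```

A lower temperature sharpens the softmax over in-batch similarities, penalizing hard negatives more strongly; 0.05 is the default from the original SimCSE paper.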

Citation

To cite this work, please refer to the following article:

Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72

or use the following BibTeX:

@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
    title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
    shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
    url = {https://openreview.net/forum?id=pJl_i7HIA72},
    language = {en},
    urldate = {2024-04-12},
    author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
    month = feb,
    year = {2024},
}