German RoBERTa for Sentence Embeddings V2

The newer T-Systems-onsite/cross-en-de-roberta-sentence-transformer model is slightly better for German. It is also currently the best model for English and works cross-lingually. Please consider using that model instead.

This model is intended to compute sentence (text) embeddings for German text. These embeddings can then be compared with cosine similarity to find sentences with a similar semantic meaning. This is useful, for example, for semantic textual similarity, semantic search, or paraphrase mining. To do this, you need the Sentence Transformers Python framework.
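As a minimal sketch of the comparison step, the following computes cosine similarity between two vectors in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, which have many more dimensions, and the function name is chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings (made-up values).
emb_a = [0.2, 0.7, 0.1]
emb_b = [0.4, 1.4, 0.2]   # same direction as emb_a, so similarity is close to 1
emb_c = [0.7, -0.2, 0.0]  # orthogonal to emb_a, so similarity is close to 0

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
```

Because only the direction of the vectors matters, scaling an embedding does not change its similarity to other embeddings.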

Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.

Source: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

This model was fine-tuned by Philip May and open-sourced by T-Systems-onsite. Special thanks to Nils Reimers for your awesome open-source work, the Sentence Transformers, the models and your help on GitHub.

How to use

The usage description above - provided by Hugging Face - is wrong for sentence embeddings! Please use this:

To use this model, install the sentence-transformers package (for example with pip install -U sentence-transformers) and load it like this:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('T-Systems-onsite/german-roberta-sentence-transformer-v2')
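Once the model is loaded, a semantic search over a few sentences could look like the sketch below. The helper rank_by_similarity and the example sentences are illustrative and not part of the model or the library. The only library call used is the model's encode method, which returns one vector per input sentence. The demo function is defined but not called, because it downloads the model on first use.

```python
import math


def rank_by_similarity(model, query, candidates):
    """Return (candidate, score) pairs sorted by cosine similarity to the query.

    `model` is any object with an `encode(list_of_str)` method that returns
    one vector per sentence, for example a SentenceTransformer instance.
    """
    vectors = model.encode([query] + list(candidates))
    query_vec, cand_vecs = vectors[0], vectors[1:]

    def cos(a, b):
        dot = sum(float(x) * float(y) for x, y in zip(a, b))
        norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
        norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
        return dot / (norm_a * norm_b)

    scored = [(c, cos(query_vec, v)) for c, v in zip(candidates, cand_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def demo():
    # Requires the sentence-transformers package and downloads the model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('T-Systems-onsite/german-roberta-sentence-transformer-v2')
    query = "Ein Mann spielt Gitarre."
    candidates = [
        "Jemand spielt ein Instrument.",
        "Das Wetter ist heute schön.",
    ]
    for sentence, score in rank_by_similarity(model, query, candidates):
        print(f"{score:.3f}  {sentence}")
```

Calling demo() prints the candidates with their scores, most similar first.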

For details of usage and examples see here:


The base model is xlm-roberta-base. This model has been further trained by Nils Reimers on a large-scale paraphrase dataset for 50+ languages. Nils Reimers wrote about this on GitHub:

A paper is upcoming for the paraphrase models.

These models were trained on various datasets with millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI-entailment pairs with in-batch-negative loss, etc.

In internal tests, they perform much better than the NLI+STSb models as they have seen more and broader types of training data. NLI+STSb has the issue that they are rather narrow in their domain and do not contain any domain-specific words / sentences (like from chemistry, computer science, math etc.). The paraphrase models have seen plenty of sentences from various domains.

More details with the setup, all the datasets, and a wider evaluation will follow soon.

The resulting model, called xlm-r-distilroberta-base-paraphrase-v1, has been released here:

Building on this cross-lingual model, we fine-tuned it for German on our German STSbenchmark dataset.

We ran an automatic hyperparameter search of 102 trials with Optuna. Using 10-fold cross-validation on the test and dev dataset, we found the following best hyperparameters:

  • batch_size = 15
  • num_epochs = 4
  • lr = 2.2995320905210864e-05
  • eps = 1.8979875906303792e-06
  • weight_decay = 0.003314045812507563
  • warmup_steps_proportion = 0.46141685205829014
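The cross-validation part of such a search can be sketched in plain Python as follows. This is not the authors' code. The function names, the shuffling seed, and the evaluate_fold callback (which would train on one index set and return a score on the other) are assumptions made for illustration.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validated_score(n_samples, evaluate_fold, k=10):
    """Average the score of evaluate_fold(train_idx, val_idx) over k folds."""
    scores = [evaluate_fold(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)
```

In a hyperparameter search, each trial would typically pick one set of hyperparameters, compute such an averaged score, and report it as the value to optimize.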

The final model was trained with these hyperparameters on the combination of sts_de_train.csv and sts_de_dev.csv. The sts_de_test.csv was left for testing.
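To give a feel for what the warmup setting means in practice, the sketch below converts a warmup proportion into a number of steps. It assumes that warmup_steps_proportion is the fraction of all optimizer steps spent on learning-rate warmup and that the result is rounded down. Neither assumption is confirmed by the text above, and the example count of 7000 training examples is a placeholder, not the real size of the combined files.

```python
import math


def warmup_steps(n_examples, batch_size, num_epochs, warmup_proportion):
    """Warmup steps, assuming the proportion refers to all optimizer steps."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return int(total_steps * warmup_proportion)  # rounding down is an assumption


# n_examples=7000 is a placeholder value for illustration only.
steps = warmup_steps(
    7000,
    batch_size=15,
    num_epochs=4,
    warmup_proportion=0.46141685205829014,
)
```

With these inputs there are 467 steps per epoch and 1868 steps in total, of which a little under half would be warmup.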


The evaluation was done on the test set of our German STSbenchmark dataset. The code is available on Colab. As the evaluation metric we use Spearman’s rank correlation between the cosine similarity of the sentence embeddings and the STSbenchmark labels.
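The metric itself can be sketched in plain Python as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. This is an illustration rather than the evaluation code from the Colab notebook, and the similarity scores and labels below are made up. In practice a library routine such as scipy.stats.spearmanr would normally be used.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up cosine similarities and gold labels (STS labels range from 0 to 5).
predicted = [0.91, 0.35, 0.62, 0.10]
gold = [4.8, 2.0, 3.5, 0.4]
rho = spearman(predicted, gold)
```

Since only the ordering matters, the different scales of cosine similarity and the gold labels do not affect the result.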

Model Name                                      Spearman rank correlation
xlm-r-distilroberta-base-paraphrase-v1          0.8079
xlm-r-100langs-bert-base-nli-stsb-mean-tokens   0.8194
xlm-r-bert-base-nli-stsb-mean-tokens            0.8194