---
pipeline_tag: sentence-similarity
language:
  - de
tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
  - setfit
license: mit
base_model: deepset/gbert-large
datasets:
  - deutsche-telekom/ger-backtrans-paraphrase
---

# German BERT large paraphrase euclidean

This is a sentence-transformers model: it maps sentences and paragraphs to a 1024-dimensional dense vector space. It is intended to be used together with SetFit to improve German few-shot text classification. It has a sibling model: deutsche-telekom/gbert-large-paraphrase-cosine.

This model is based on deepset/gbert-large. Many thanks to deepset!

## Training

### Loss Function

As loss function we used BatchHardSoftMarginTripletLoss with Euclidean distance (note that sentence-transformers spells the identifier `eucledian_distance`):

```python
from sentence_transformers import losses
from sentence_transformers.losses import BatchHardTripletLossDistanceFunction

train_loss = losses.BatchHardSoftMarginTripletLoss(
    model=model,
    distance_metric=BatchHardTripletLossDistanceFunction.eucledian_distance,
)
```

### Training Data

The model was trained on a carefully filtered subset of deutsche-telekom/ger-backtrans-paraphrase. We removed sentence pairs where:

- `min_char_len` is less than 15
- `jaccard_similarity` is greater than 0.3
- `de_token_count` is greater than 30
- `en_de_token_count` is greater than 30
- `cos_sim` is less than 0.85
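The filter criteria above can be sketched as a single predicate over the dataset columns (column names as in ger-backtrans-paraphrase); the `keep_pair` helper below is illustrative, not the actual filtering script:

```python
def keep_pair(row):
    """Return True for sentence pairs that survive the filtering described above."""
    return (
        row["min_char_len"] >= 15
        and row["jaccard_similarity"] <= 0.3
        and row["de_token_count"] <= 30
        and row["en_de_token_count"] <= 30
        and row["cos_sim"] >= 0.85
    )
```

With the Hugging Face `datasets` library, such a predicate could be applied via `dataset.filter(keep_pair)`.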

### Hyperparameters

- learning_rate: 5.5512022294147105e-06
- num_epochs: 7
- train_batch_size: 68
- num_gpu: ???

## Evaluation Results

We use the NLU Few-shot Benchmark - English and German dataset to evaluate this model in a German few-shot scenario.

### Qualitative results

## Licensing

Copyright (c) 2023 Philip May, Deutsche Telekom AG
Copyright (c) 2022 deepset GmbH

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.