---
pipeline_tag: sentence-similarity
language:
  - de
tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
  - setfit
license: mit
datasets:
  - deutsche-telekom/ger-backtrans-paraphrase

---

# German BERT large paraphrase euclidean
This is a [sentence-transformers](https://www.SBERT.net) model.
It maps sentences and paragraphs to a 1024-dimensional dense vector space.
The model is intended to be used together with [SetFit](https://github.com/huggingface/setfit)
to improve German few-shot text classification.
It has a sibling model called
[deutsche-telekom/gbert-large-paraphrase-cosine](https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-cosine).

This model is based on [deepset/gbert-large](https://huggingface.co/deepset/gbert-large).
Many thanks to [deepset](https://www.deepset.ai/)!

## Training

**Loss Function**\
We used [BatchHardSoftMarginTripletLoss](https://www.sbert.net/docs/package_reference/losses.html#batchhardsoftmargintripletloss) with Euclidean distance as the loss function:

```python
from sentence_transformers import losses
from sentence_transformers.losses import BatchHardTripletLossDistanceFunction

# note: the distance function name is spelled "eucledian" in the library itself
train_loss = losses.BatchHardSoftMarginTripletLoss(
    model=model,
    distance_metric=BatchHardTripletLossDistanceFunction.eucledian_distance,
)
```
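
For intuition: the soft-margin variant replaces the fixed triplet margin with `log(1 + exp(d(a, p) - d(a, n)))`, so the loss shrinks smoothly as the positive moves closer to the anchor than the negative. A minimal NumPy sketch with toy vectors (this omits the batch-hard mining of the hardest positive and negative per anchor, and is not the actual training code):

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # close to the anchor
negative = np.array([-1.0, 0.0])  # far from the anchor

# soft-margin triplet loss: log(1 + exp(d(anchor, positive) - d(anchor, negative)))
loss = np.log1p(np.exp(euclidean(anchor, positive) - euclidean(anchor, negative)))
# loss stays small when the positive is much closer than the negative
```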

**Training Data**\
The model is trained on a carefully filtered dataset of
[deutsche-telekom/ger-backtrans-paraphrase](https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase).
We removed sentence pairs matching any of the following criteria:
- `min_char_len` less than 15
- `jaccard_similarity` greater than 0.3
- `de_token_count` greater than 30
- `en_de_token_count` greater than 30
- `cos_sim` less than 0.85
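
Expressed as a filter, the surviving pairs satisfy all five thresholds. A sketch assuming the dataset columns are loaded into a pandas DataFrame (the row values are made up for illustration):

```python
import pandas as pd

# toy rows with the dataset's filter columns (values are illustrative)
df = pd.DataFrame({
    "min_char_len": [10, 40, 50],
    "jaccard_similarity": [0.10, 0.20, 0.50],
    "de_token_count": [20, 25, 12],
    "en_de_token_count": [22, 28, 14],
    "cos_sim": [0.90, 0.95, 0.99],
})

# keep only pairs that pass every threshold
kept = df[
    (df["min_char_len"] >= 15)
    & (df["jaccard_similarity"] <= 0.3)
    & (df["de_token_count"] <= 30)
    & (df["en_de_token_count"] <= 30)
    & (df["cos_sim"] >= 0.85)
]
```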

**Hyperparameters**
- learning_rate: 5.5512022294147105e-06
- num_epochs: 7
- train_batch_size: 68
- num_gpu: ???

## Evaluation Results
We use the [NLU Few-shot Benchmark - English and German](https://huggingface.co/datasets/deutsche-telekom/NLU-few-shot-benchmark-en-de)
dataset to evaluate this model in a German few-shot scenario.

**Qualitative results**
- multilingual sentence embeddings provide the worst results
- Electra models also deliver poor results
- the German BERT base-size model ([deepset/gbert-base](https://huggingface.co/deepset/gbert-base)) provides good results
- the German BERT large-size model ([deepset/gbert-large](https://huggingface.co/deepset/gbert-large)) provides very good results
- our fine-tuned models (this model and [deutsche-telekom/gbert-large-paraphrase-cosine](https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-cosine)) provide the best results

## Licensing
Copyright (c) 2023 [Philip May](https://may.la/), [Deutsche Telekom AG](https://www.telekom.com/)\
Copyright (c) 2022 [deepset GmbH](https://www.deepset.ai/)

Licensed under the **MIT License** (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License by reviewing the file
[LICENSE](https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-euclidean/blob/main/LICENSE) in the repository.