deutsche-telekom
/

gbert-large-paraphrase-euclidean

Sentence Similarity

sentence-transformers

feature-extraction

Inference Endpoints

Model card Files Files and versions Community

gbert-large-paraphrase-euclidean / README.md

PhilipMay's picture

fix license link

24321a2 over 1 year ago

|

raw history blame contribute delete

No virus

2.99 kB

	---
	pipeline_tag: sentence-similarity
	language:
	- de
	tags:
	- sentence-transformers
	- sentence-similarity
	- transformers
	- setfit
	license: mit
	datasets:
	- deutsche-telekom/ger-backtrans-paraphrase

	---

	# German BERT large paraphrase euclidean
	This is a [sentence-transformers](https://www.SBERT.net) model.
	It maps sentences & paragraphs (text) into a 1024 dimensional dense vector space.
	The model is intended to be used together with [SetFit](https://github.com/huggingface/setfit)
	to improve German few-shot text classification.
	It has a sibling model called
	[deutsche-telekom/gbert-large-paraphrase-cosine](https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-cosine).

	This model is based on [deepset/gbert-large](https://huggingface.co/deepset/gbert-large).
	Many thanks to [deepset](https://www.deepset.ai/)!

	## Training

	Loss Function\
	We have used [BatchHardSoftMarginTripletLoss](https://www.sbert.net/docs/package_reference/losses.html#batchhardsoftmargintripletloss) with eucledian distance as the loss function:

	``` python
	train_loss = losses.BatchHardSoftMarginTripletLoss(
	model=model,
	distance_metric=BatchHardTripletLossDistanceFunction.eucledian_distance,
	)
	```

	Training Data\
	The model is trained on a carefully filtered dataset of
	[deutsche-telekom/ger-backtrans-paraphrase](https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase).
	We deleted the following pairs of sentences:
	- `min_char_len` less than 15
	- `jaccard_similarity` greater than 0.3
	- `de_token_count` greater than 30
	- `en_de_token_count` greater than 30
	- `cos_sim` less than 0.85

	Hyperparameters
	- learning_rate: 5.5512022294147105e-06
	- num_epochs: 7
	- train_batch_size: 68
	- num_gpu: ???

	## Evaluation Results
	We use the [NLU Few-shot Benchmark - English and German](https://huggingface.co/datasets/deutsche-telekom/NLU-few-shot-benchmark-en-de)
	dataset to evaluate this model in a German few-shot scenario.

	Qualitative results
	- multilingual sentence embeddings provide the worst results
	- Electra models also deliver poor results
	- German BERT base size model ([deepset/gbert-base](https://huggingface.co/deepset/gbert-base)) provides good results
	- German BERT large size model ([deepset/gbert-large](https://huggingface.co/deepset/gbert-large)) provides very good results
	- our fine-tuned models (this model and [deutsche-telekom/gbert-large-paraphrase-cosine](https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-cosine)) provide best results

	## Licensing
	Copyright (c) 2023 [Philip May](https://may.la/), [Deutsche Telekom AG](https://www.telekom.com/)\
	Copyright (c) 2022 [deepset GmbH](https://www.deepset.ai/)

	Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License.
	You may obtain a copy of the License by reviewing the file
	[LICENSE](https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-euclidean/blob/main/LICENSE) in the repository.