|
--- |
|
base_model: intfloat/multilingual-e5-large-instruct |
|
language: |
|
- de |
|
- en |
|
library_name: sentence-transformers |
|
metrics: |
|
- cosine_accuracy |
|
- cosine_accuracy_threshold |
|
- cosine_f1 |
|
- cosine_f1_threshold |
|
- cosine_precision |
|
- cosine_recall |
|
- cosine_ap |
|
- dot_accuracy |
|
- dot_accuracy_threshold |
|
- dot_f1 |
|
- dot_f1_threshold |
|
- dot_precision |
|
- dot_recall |
|
- dot_ap |
|
- manhattan_accuracy |
|
- manhattan_accuracy_threshold |
|
- manhattan_f1 |
|
- manhattan_f1_threshold |
|
- manhattan_precision |
|
- manhattan_recall |
|
- manhattan_ap |
|
- euclidean_accuracy |
|
- euclidean_accuracy_threshold |
|
- euclidean_f1 |
|
- euclidean_f1_threshold |
|
- euclidean_precision |
|
- euclidean_recall |
|
- euclidean_ap |
|
- max_accuracy |
|
- max_accuracy_threshold |
|
- max_f1 |
|
- max_f1_threshold |
|
- max_precision |
|
- max_recall |
|
- max_ap |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- dataset_size:2122578 |
|
- multilingual |
|
widget: |
|
- source_sentence: >- |
|
Instruct: Retrieve semantically similar text. |
|
|
|
Query: Finally, we will pay a great deal of attention to the preparation of |
|
your White Paper on the reform of the EU budget. We ask you to involve our |
|
Parliament in this task. |
|
sentences: |
|
- Sie lehren den Mönchen auch, dass der Dalai Lama eine Gefahr dafür ist. |
|
- >- |
|
Des Weiteren werden wir auch sehr intensiv die Vorbereitung Ihres Weißbuchs |
|
über die Reform des Gemeinschaftshaushaltes im Auge behalten und Sie bitten, |
|
das Parlament zu dieser Aufgabe einzubeziehen. |
|
- >- |
|
Frau Maij-Weggen, als erstaunlichermaßen geübter Parlamentarist meiner |
|
Herkunft kann ich sagen, dass ich über 20 Jahre im Parlament bin. |
|
- source_sentence: >- |
|
Instruct: Retrieve semantically similar text. |
|
|
|
Query: I wonder if, Commissioner, you could confirm that the European |
|
Network on Independent Living, as a user-led organisation run for and by |
|
disabled people, represents a very important interest group for disabled |
|
people in Europe and that you would join me in hoping that they could |
|
benefit from the support for coordination of disability organisations |
|
Europe-wide currently benefiting organisations such as the European Blind |
|
Union and the European Union of the Deaf in the future. |
|
sentences: |
|
- >- |
|
Meiner Meinung nach muss die Union sicherstellen, dass sie diese |
|
Unterstützung erhalten, sowie dafür sorgen, dass ihre volle Anerkennung |
|
durch die Gesellschaft gefördert wird. |
|
- >- |
|
Daher muss bei der Sitzung des Menschenrechtsrates in Genf festgelegte und |
|
spezifische Prioritäten gesetzt werden. Dazu gehört natürlich die Bekämpfung |
|
von Diskriminierung in ihrer verschiedenen Form im Hinblick auf Rasse, |
|
sexuelle Orientierung, Religion, politische Orientierung, während es |
|
andererseits auch darum geht, Minderheiten und verletzliche Gruppen zu |
|
schützen. |
|
- >- |
|
Könnten Sie, Frau Kommissarin, bestätigen, dass das Europäische Netz für ein |
|
selbstständiges Leben, als eine für und von Behinderten geführte |
|
Organisation unter Beteiligung der Nutzer, eine sehr bedeutende |
|
Interessengruppe für Behinderte in Europa darstellt, und würden Sie sich mit |
|
mir in der Hoffnung fühlen, dass sie in Zukunft die Unterstützung zur |
|
Koordinierung von Behindertenorganisationen in ganz Europa erhalten können, |
|
von der bereits Organisationen wie die Europäische Blinde Union und die |
|
- source_sentence: >- |
|
Instruct: Retrieve semantically similar text. |
|
|
|
Query: I would definitely like to call on you, Mrs Fischer Boel, to |
|
reiterate what you have done. |
|
sentences: |
|
- >- |
|
Ich würde Sie, Frau Fischer Boel, ersuchen, zu wiederholen, was Sie alles |
|
getan haben. |
|
- >- |
|
Wir müssen uns gründlich mit dem Spektrum vorhandener Optionen der Union |
|
befassen, wozu auch eine Vereinbarung über eine gemeinsame Politik der |
|
legalen Einwanderung gehört. |
|
- >- |
|
Verglichen mit einem identischen Lebenslauf, gleicher Ausbildung und |
|
gleicher Laufbahn besitzt ein junger Mann, der sich nach einem französischen |
|
Namen bedient, die vier- bis fünffache Rekrutierungschancen als ein junger |
|
Mann mit einem nördlich-afrikanischen Namen und doppelte als ein junger |
|
Mann, der sich nach einem portugiesischen oder spanischen Namen |
|
identifiziert hat. |
|
- source_sentence: >- |
|
Instruct: Retrieve semantically similar text. |
|
|
|
Query: Als Abgeordnete des Europäischen Parlaments sind wir direkt gewählt |
|
und unseren Wählern gegenüber rechenschaftspflichtig. |
|
sentences: |
|
- >- |
|
Die Abgeordneten sind direkt gewählt und haben vor unseren Wählern zu |
|
rechenschaftspflichtig zu sein. |
|
- >- |
|
Wenn die Türkei ihren Verpflichtungen gegenüber Zypern nachkommt, dann durch |
|
die Anerkennung der Republik Zypern und durch das Ende der Besetzung |
|
Zyperns. |
|
- >- |
|
Die brasilianische Regierung ist dabei, Rechtsvorschriften durchzusetzen, |
|
die das völlige Verbot der Herstellung und des Verkaufs von Zigaretten mit |
|
verschiedenen Inhaltsstoffen, bekannt als Tabakmischungen, vorsehen. |
|
- source_sentence: >- |
|
Instruct: Retrieve semantically similar text. |
|
|
|
Query: Secondly, Mr President, we should reflect on whether the |
|
disappearance of Mr Ben Ali from the political scene is sufficient to |
|
guarantee that a real democratic transition will be brought about. |
|
sentences: |
|
- >- |
|
Man schätzt, dass der Höhepunkt der Arbeitslosenrate erst in den nächsten |
|
zwei oder drei Quartalen zu verzeichnen ist. |
|
- Das ist immer das Risiko wenn man Shorts machen will. |
|
- >- |
|
Zweitens, Herr Präsident, sollten wir darüber nachdenken, ob das |
|
Verschwinden von Herrn Ben Ali von der politischen Bühne ausreicht, um einen |
|
wirklich demokratischen Übergang zu gewährleisten. |
|
model-index: |
|
- name: SentenceTransformer based on intfloat/multilingual-e5-large-instruct |
|
results: |
|
- task: |
|
type: binary-classification |
|
name: Binary Classification |
|
dataset: |
|
name: euro parl binary |
|
type: euro-parl-binary |
|
metrics: |
|
- type: cosine_accuracy |
|
value: 0.9980020293006387 |
|
name: Cosine Accuracy |
|
- type: cosine_accuracy_threshold |
|
value: 0.4145926237106323 |
|
name: Cosine Accuracy Threshold |
|
- type: cosine_f1 |
|
value: 0.9980003135684363 |
|
name: Cosine F1 |
|
- type: cosine_f1_threshold |
|
value: 0.4145926237106323 |
|
name: Cosine F1 Threshold |
|
- type: cosine_precision |
|
value: 0.9988580703701746 |
|
name: Cosine Precision |
|
- type: cosine_recall |
|
value: 0.9971440286784372 |
|
name: Cosine Recall |
|
- type: cosine_ap |
|
value: 0.9998129501710422 |
|
name: Cosine Ap |
|
- type: dot_accuracy |
|
value: 0.9980020293006387 |
|
name: Dot Accuracy |
|
- type: dot_accuracy_threshold |
|
value: 0.4145926237106323 |
|
name: Dot Accuracy Threshold |
|
- type: dot_f1 |
|
value: 0.9980003135684363 |
|
name: Dot F1 |
|
- type: dot_f1_threshold |
|
value: 0.4145926237106323 |
|
name: Dot F1 Threshold |
|
- type: dot_precision |
|
value: 0.9988580703701746 |
|
name: Dot Precision |
|
- type: dot_recall |
|
value: 0.9971440286784372 |
|
name: Dot Recall |
|
- type: dot_ap |
|
value: 0.9998129502134064 |
|
name: Dot Ap |
|
- type: manhattan_accuracy |
|
value: 0.997992685283613 |
|
name: Manhattan Accuracy |
|
- type: manhattan_accuracy_threshold |
|
value: 27.316463470458984 |
|
name: Manhattan Accuracy Threshold |
|
- type: manhattan_f1 |
|
value: 0.9979907371441837 |
|
name: Manhattan F1 |
|
- type: manhattan_f1_threshold |
|
value: 27.327720642089844 |
|
name: Manhattan F1 Threshold |
|
- type: manhattan_precision |
|
value: 0.9989602482187091 |
|
name: Manhattan Precision |
|
- type: manhattan_recall |
|
value: 0.9970231061051609 |
|
name: Manhattan Recall |
|
- type: manhattan_ap |
|
value: 0.9998111879697278 |
|
name: Manhattan Ap |
|
- type: euclidean_accuracy |
|
value: 0.9980020293006387 |
|
name: Euclidean Accuracy |
|
- type: euclidean_accuracy_threshold |
|
value: 1.0820419788360596 |
|
name: Euclidean Accuracy Threshold |
|
- type: euclidean_f1 |
|
value: 0.9980003135684363 |
|
name: Euclidean F1 |
|
- type: euclidean_f1_threshold |
|
value: 1.0820419788360596 |
|
name: Euclidean F1 Threshold |
|
- type: euclidean_precision |
|
value: 0.9988580703701746 |
|
name: Euclidean Precision |
|
- type: euclidean_recall |
|
value: 0.9971440286784372 |
|
name: Euclidean Recall |
|
- type: euclidean_ap |
|
value: 0.9998129500756795 |
|
name: Euclidean Ap |
|
- type: max_accuracy |
|
value: 0.9980020293006387 |
|
name: Max Accuracy |
|
- type: max_accuracy_threshold |
|
value: 27.316463470458984 |
|
name: Max Accuracy Threshold |
|
- type: max_f1 |
|
value: 0.9980003135684363 |
|
name: Max F1 |
|
- type: max_f1_threshold |
|
value: 27.327720642089844 |
|
name: Max F1 Threshold |
|
- type: max_precision |
|
value: 0.9989602482187091 |
|
name: Max Precision |
|
- type: max_recall |
|
value: 0.9971440286784372 |
|
name: Max Recall |
|
- type: max_ap |
|
value: 0.9998129502134064 |
|
name: Max Ap |
|
- task: |
|
type: triplet |
|
name: Triplet |
|
dataset: |
|
name: euro parl triplet |
|
type: euro-parl-triplet |
|
metrics: |
|
- type: cosine_accuracy |
|
value: 0.9997504597806025 |
|
name: Cosine Accuracy |
|
- type: dot_accuracy |
|
value: 0.00024954021939751976 |
|
name: Dot Accuracy |
|
- type: manhattan_accuracy |
|
value: 0.9997515590767232 |
|
name: Manhattan Accuracy |
|
- type: euclidean_accuracy |
|
value: 0.9997504597806025 |
|
name: Euclidean Accuracy |
|
- type: max_accuracy |
|
value: 0.9997515590767232 |
|
name: Max Accuracy |
|
--- |
|
|
|
# SentenceTransformer based on intfloat/multilingual-e5-large-instruct |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct). It was trained on the en-de subset of [parallel-sentences-europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl), augmented by translating the English texts to German with [t5-large](https://huggingface.co/google-t5/t5-large). The model maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity and paraphrase mining.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) <!-- at revision baa7be480a7de1539afce709c8f13f833a510e0a --> |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 1024 dimensions
|
- **Similarity Function:** Cosine Similarity |
|
|
|
|
### Model Sources |
|
|
|
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) |
|
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) |
|
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) |
|
|
|
### Full Model Architecture |
|
|
|
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
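The last two modules above (mean pooling over tokens, then normalization) can be illustrated with a small dependency-free sketch. It is not the library's implementation; `mean_pool` and `l2_normalize` are hypothetical helper names, and the toy vectors have 2 dimensions instead of 1024.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / count for value in total]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

# Toy input: three token vectors; the last one is padding and is masked out.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # mean of the two unmasked vectors
embedding = l2_normalize(pooled)      # unit-length sentence embedding
```

Because the final embeddings have unit length, the dot product of two embeddings equals their cosine similarity.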
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash
pip install -U sentence-transformers
```
|
|
|
Then you can load this model and run inference. |
|
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Sami92/multilingual-e5-large-instruct-eu-parl-de")

# Run inference. The first sentence is the query and carries the
# "Instruct: ...\nQuery: ..." prefix; the other two are plain candidates.
sentences = [
    'Instruct: Retrieve semantically similar text.\nQuery: Secondly, Mr President, we should reflect on whether the disappearance of Mr Ben Ali from the political scene is sufficient to guarantee that a real democratic transition will be brought about.',
    'Zweitens, Herr Präsident, sollten wir darüber nachdenken, ob das Verschwinden von Herrn Ben Ali von der politischen Bühne ausreicht, um einen wirklich demokratischen Übergang zu gewährleisten.',
    'Man schätzt, dass der Höhepunkt der Arbeitslosenrate erst in den nächsten zwei oder drei Quartalen zu verzeichnen ist.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
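In the examples in this card, only the query (anchor) carries the instruction prefix, while the candidate sentences are passed as plain text. A small helper can keep that formatting consistent. This is a sketch: `format_query` is a hypothetical name, not part of the Sentence Transformers API, and the instruction string is copied from the examples above.

```python
# Instruction text as it appears in the examples in this card.
TASK = "Retrieve semantically similar text."

def format_query(text, task=TASK):
    """Prefix a query with the instruction, matching the anchor format."""
    return f"Instruct: {task}\nQuery: {text}"

query = format_query(
    "I would definitely like to call on you, Mrs Fischer Boel, "
    "to reiterate what you have done."
)
candidates = [
    "Ich würde Sie, Frau Fischer Boel, ersuchen, zu wiederholen, was Sie alles getan haben.",
    "Das ist immer das Risiko wenn man Shorts machen will.",
]

# Query first, candidates after it, ready to pass to model.encode(...).
sentences = [query] + candidates
```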
|
|
|
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
#### Binary Classification |
|
* Dataset: `euro-parl-binary` |
|
* Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator) |
|
|
|
| Metric                       | Value      |
|:-----------------------------|:-----------|
| cosine_accuracy              | 0.998      |
| cosine_accuracy_threshold    | 0.4146     |
| cosine_f1                    | 0.998      |
| cosine_f1_threshold          | 0.4146     |
| cosine_precision             | 0.9989     |
| cosine_recall                | 0.9971     |
| cosine_ap                    | 0.9998     |
| dot_accuracy                 | 0.998      |
| dot_accuracy_threshold       | 0.4146     |
| dot_f1                       | 0.998      |
| dot_f1_threshold             | 0.4146     |
| dot_precision                | 0.9989     |
| dot_recall                   | 0.9971     |
| dot_ap                       | 0.9998     |
| manhattan_accuracy           | 0.998      |
| manhattan_accuracy_threshold | 27.3165    |
| manhattan_f1                 | 0.998      |
| manhattan_f1_threshold       | 27.3277    |
| manhattan_precision          | 0.999      |
| manhattan_recall             | 0.997      |
| manhattan_ap                 | 0.9998     |
| euclidean_accuracy           | 0.998      |
| euclidean_accuracy_threshold | 1.082      |
| euclidean_f1                 | 0.998      |
| euclidean_f1_threshold       | 1.082      |
| euclidean_precision          | 0.9989     |
| euclidean_recall             | 0.9971     |
| euclidean_ap                 | 0.9998     |
| max_accuracy                 | 0.998      |
| max_accuracy_threshold       | 27.3165    |
| max_f1                       | 0.998      |
| max_f1_threshold             | 27.3277    |
| max_precision                | 0.999      |
| max_recall                   | 0.9971     |
| **max_ap**                   | **0.9998** |
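The thresholds in this table can be used to turn a similarity score into a yes/no decision. The sketch below is illustrative only: `is_match` is a hypothetical helper, the toy vectors are 2-dimensional, and the `>=` comparison is an assumption about how ties at the threshold are handled. It also shows why the cosine and Euclidean rows agree: for unit-length embeddings, the squared Euclidean distance equals `2 - 2 * cosine`.

```python
import math

COSINE_THRESHOLD = 0.4146  # cosine_accuracy_threshold from the table

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_match(u, v, threshold=COSINE_THRESHOLD):
    """Label a pair as semantically equivalent if it clears the threshold."""
    return cosine(u, v) >= threshold

# Toy unit vectors standing in for sentence embeddings.
anchor = [1.0, 0.0]
close = [0.6, 0.8]   # cosine with anchor is about 0.6
far = [0.0, 1.0]     # cosine with anchor is 0.0

close_is_match = is_match(anchor, close)
far_is_match = is_match(anchor, far)

# Euclidean distance that corresponds to the cosine threshold on unit vectors.
euclidean_equivalent = math.sqrt(2.0 - 2.0 * COSINE_THRESHOLD)
```

The derived distance is approximately 1.082, which agrees with the reported `euclidean_accuracy_threshold`.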
|
|
|
#### Triplet |
|
* Dataset: `euro-parl-triplet` |
|
* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator) |
|
|
|
| Metric             | Value      |
|:-------------------|:-----------|
| cosine_accuracy    | 0.9998     |
| dot_accuracy       | 0.0002     |
| manhattan_accuracy | 0.9998     |
| euclidean_accuracy | 0.9998     |
| **max_accuracy**   | **0.9998** |
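Triplet accuracy is the share of (anchor, positive, negative) triplets in which the anchor scores closer to the positive than to the negative. The sketch below computes it on toy vectors; `triplet_accuracy` is a hypothetical helper, not the evaluator's code. Note that the unrounded `dot_accuracy` and `cosine_accuracy` values in this card's metadata sum to 1. That is what one would see if dot-product scores were compared in the reverse direction (treated as distances), but this is an inference from the numbers and is not confirmed by the source.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def triplet_accuracy(triplets, similarity, higher_is_closer=True):
    """Fraction of triplets where the positive is ranked above the negative."""
    correct = 0
    for anchor, positive, negative in triplets:
        pos_score = similarity(anchor, positive)
        neg_score = similarity(anchor, negative)
        if higher_is_closer:
            correct += pos_score > neg_score
        else:
            correct += pos_score < neg_score
    return correct / len(triplets)

# Toy unit vectors standing in for sentence embeddings.
triplets = [
    ([1.0, 0.0], [0.8, 0.6], [0.0, 1.0]),  # positive is closer
    ([0.0, 1.0], [0.6, 0.8], [1.0, 0.0]),  # positive is closer
    ([1.0, 0.0], [0.0, 1.0], [0.6, 0.8]),  # negative is closer
    ([0.6, 0.8], [0.6, 0.8], [0.8, 0.6]),  # positive is closer
]

accuracy = triplet_accuracy(triplets, dot)
# Same scores, comparison reversed: the result is the complement when no ties occur.
reversed_accuracy = triplet_accuracy(triplets, dot, higher_is_closer=False)
```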
|
|
|
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
#### Unnamed Dataset |
|
|
|
|
|
* Size: 2,122,578 training samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 1000 samples:
  |         | anchor                                                                              | positive                                                                            | negative                                                                            |
  |:--------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
  | type    | string                                                                              | string                                                                              | string                                                                              |
  | details | <ul><li>min: 25 tokens</li><li>mean: 52.46 tokens</li><li>max: 144 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 42.14 tokens</li><li>max: 132 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 41.42 tokens</li><li>max: 142 tokens</li></ul> |
|
* Samples: |
|
| anchor | positive | negative | |
|
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
|
| <code>Instruct: Retrieve semantically similar text.<br>Query: What is more, the issues that were being disputed had already been resolved in the Convention, including the scope of the Charter of Fundamental Rights.</code> | <code>Und worüber da argumentiert wurde, ist alles schon im Konvent geregelt worden, auch die Dimension der Grundrechtecharta.</code> | <code>Ein kleines Beispiel aus dem Bundesland, aus dem ich komme: Da gibt es mehrere Universitäten, aber nehmen wir einmal eine als Beispiel.</code> | |
|
| <code>Instruct: Retrieve semantically similar text.<br>Query: Wie Sie wissen, werden wir nach den Dringlichkeiten kurz bis 17.30 Uhr unterbrechen, und dann wird entschieden, ob die Aussprache Deprez stattfindet oder nicht.</code> | <code>Infolge der Dringlichkeitsdebatte haben wir bekanntlich eine kurze Pause bis 17.30 Uhr. Danach wird über die Vertagung oder Nichtvertagung für den Deprez-Bericht entschieden.</code> | <code>Die heutige Erklärung von Romano Prodi auf der Grundlage einer schwedischen Zeitung, Schweden könne von der WWU ausgeschlossen bleiben, ist daher in wirtschaftlicher Hinsicht aufgeschlossen und auf Vertragsebene zweifelhaft.</code> | |
|
| <code>Instruct: Retrieve semantically similar text.<br>Query: The Committee of the Regions has made a proposal to that effect and I would recommend that you consider it, at the same time as you consider the wording of our resolution.</code> | <code>Der Ausschuss der Regionen hat einen Vorschlag in diese Richtung gebracht, und ich empfehle, diesen zu prüfen, so wie Sie die Formulierung unserer Entschließung prüfen müssen.</code> | <code>Hoffentlich wird die vom Parlament eingesetzte Arbeitsgruppe für Finanzkrisen zu einer neuen Quelle von Ratschlägen gelangen.</code> | |
|
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters: |
|
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
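To make the two parameters concrete, the following is a minimal dependency-free sketch of a multiple-negatives ranking objective: each anchor is scored against every candidate in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the anchor's own positive as the target. It is an illustration under these assumptions, not the library's implementation; `mnr_loss` is a hypothetical name and the vectors are toy values.

```python
import math

SCALE = 20.0  # "scale" from the parameters above

def cos_sim(u, v):
    """Cosine similarity, matching "similarity_fct": "cos_sim"."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, candidates, scale=SCALE):
    """Mean cross-entropy where candidates[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

# Candidates are the positives followed by the explicit negatives, so every
# other positive and every negative in the batch acts as a distractor.
loss_aligned = mnr_loss(anchors, positives + negatives)
# Swapping the positives pairs each anchor with the wrong target.
loss_swapped = mnr_loss(anchors, positives[::-1] + negatives)
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity translate into larger differences in loss.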
|
|
|
### Evaluation Dataset |
|
|
|
#### Unnamed Dataset |
|
|
|
|
|
* Size: 909,673 evaluation samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 1000 samples:
  |         | anchor                                                                              | positive                                                                            | negative                                                                            |
  |:--------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
  | type    | string                                                                              | string                                                                              | string                                                                              |
  | details | <ul><li>min: 26 tokens</li><li>mean: 53.62 tokens</li><li>max: 169 tokens</li></ul> | <ul><li>min: 13 tokens</li><li>mean: 43.02 tokens</li><li>max: 127 tokens</li></ul> | <ul><li>min: 13 tokens</li><li>mean: 42.03 tokens</li><li>max: 154 tokens</li></ul> |
|
* Samples: |
|
| anchor | positive | negative | |
|
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------| |
|
| <code>Instruct: Retrieve semantically similar text.<br>Query: I believe that it is a positive regulation and is very important if the European Union is to have efficient maritime and port services.</code> | <code>Ich denke, es ist eine positive Verordnung und äußerst wichtig, wenn die Europäische Union leistungsfähige maritime und Hafendienste haben soll.</code> | <code>Dazu gehören Kohäsion, Forschung und Entwicklung, Energie, Verkehr und das ganze Kapitel Außenhilfe, Entwicklung und Erweiterung.</code> | |
|
| <code>Instruct: Retrieve semantically similar text.<br>Query: For this reason, this report on the recovery of Community funds takes as its starting point a specific example: never before have we had a bigger sum to recover - it is put at almost EUR 100 million - and the circumstances surrounding the missing money are worse still.</code> | <code>Deshalb wird in diesem Bericht über die Einziehung von Gemeinschaftsmitteln ein konkretes Beispiel als Ausgangspunkt gewählt: Nie zuvor mussten wir einen höheren Betrag einziehen - er wird auf nahezu 100 Millionen Euro beziffert -, und die Umstände, wie es zu der Fehlsumme kam, sind umso schlimmer.</code> | <code>B5-0433/2000 vom Abgeordneten Van den Bos im Namen der ELDR-Fraktion</code> | |
|
| <code>Instruct: Retrieve semantically similar text.<br>Query: I am therefore pleased that the Tuberculosis Vaccine Initiative model has been drawn up, because all patients suffering from tuberculosis worldwide will be able to benefit from the results of research and work activities.</code> | <code>Ich bin daher sehr erfreut, dass das Modell der Tuberkulose-Impfstoffinitiative aufgestellt wurde, denn alle Tuberkulosepatienten weltweit können von den Ergebnissen von Forschungs- und Arbeitsaktivitäten profitieren.</code> | <code>Ja, die Mitgliedstaaten und die Abgeordneten sind proportional vertreten, wobei im einzelnen noch die genaue Anzahl zu erörtern sein wird.</code> | |
|
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters: |
|
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
|
|
|
### Training Hyperparameters |
|
#### Non-Default Hyperparameters |
|
|
|
- `eval_strategy`: steps |
|
- `per_device_train_batch_size`: 32 |
|
- `per_device_eval_batch_size`: 16 |
|
- `gradient_accumulation_steps`: 4 |
|
- `learning_rate`: 0.0001 |
|
- `num_train_epochs`: 1 |
|
- `fp16`: True |
|
- `load_best_model_at_end`: True |
|
- `push_to_hub`: True |
|
- `hub_model_id`: Sami92/multilingual-e5-large-instruct-eu-parl-de |
|
- `gradient_checkpointing`: True |
|
- `push_to_hub_model_id`: multilingual-e5-large-instruct-eu-parl-de |
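As a rough cross-check of these settings against the training log, the effective batch size and the number of optimizer steps per epoch can be estimated as follows. The sketch assumes a single training device (the card does not state the device count) and approximates the step count as the number of batches divided by the accumulation steps, rounded down; the trainer's exact bookkeeping may differ.

```python
import math

TRAIN_SAMPLES = 2_122_578   # training set size reported in this card
PER_DEVICE_BATCH = 32       # per_device_train_batch_size
GRAD_ACCUM = 4              # gradient_accumulation_steps
NUM_DEVICES = 1             # assumption: not stated in the card

# Samples that contribute to one optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES  # 128

# Approximate optimizer steps in one epoch.
batches_per_epoch = math.ceil(TRAIN_SAMPLES / (PER_DEVICE_BATCH * NUM_DEVICES))
steps_per_epoch = batches_per_epoch // GRAD_ACCUM  # 16582

def epoch_fraction(step):
    """Fraction of the epoch completed after a given optimizer step."""
    return step / steps_per_epoch

fraction_at_500 = round(epoch_fraction(500), 4)
fraction_at_14000 = round(epoch_fraction(14000), 4)
```

Under the single-device assumption, these fractions agree with the `Epoch` column of the training log at steps 500 and 14000.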
|
|
|
#### All Hyperparameters |
|
<details><summary>Click to expand</summary> |
|
|
|
- `overwrite_output_dir`: False |
|
- `do_predict`: False |
|
- `eval_strategy`: steps |
|
- `prediction_loss_only`: True |
|
- `per_device_train_batch_size`: 32 |
|
- `per_device_eval_batch_size`: 16 |
|
- `per_gpu_train_batch_size`: None |
|
- `per_gpu_eval_batch_size`: None |
|
- `gradient_accumulation_steps`: 4 |
|
- `eval_accumulation_steps`: None |
|
- `learning_rate`: 0.0001 |
|
- `weight_decay`: 0.0 |
|
- `adam_beta1`: 0.9 |
|
- `adam_beta2`: 0.999 |
|
- `adam_epsilon`: 1e-08 |
|
- `max_grad_norm`: 1.0 |
|
- `num_train_epochs`: 1 |
|
- `max_steps`: -1 |
|
- `lr_scheduler_type`: linear |
|
- `lr_scheduler_kwargs`: {} |
|
- `warmup_ratio`: 0.0 |
|
- `warmup_steps`: 0 |
|
- `log_level`: passive |
|
- `log_level_replica`: warning |
|
- `log_on_each_node`: True |
|
- `logging_nan_inf_filter`: True |
|
- `save_safetensors`: True |
|
- `save_on_each_node`: False |
|
- `save_only_model`: False |
|
- `restore_callback_states_from_checkpoint`: False |
|
- `no_cuda`: False |
|
- `use_cpu`: False |
|
- `use_mps_device`: False |
|
- `seed`: 42 |
|
- `data_seed`: None |
|
- `jit_mode_eval`: False |
|
- `use_ipex`: False |
|
- `bf16`: False |
|
- `fp16`: True |
|
- `fp16_opt_level`: O1 |
|
- `half_precision_backend`: auto |
|
- `bf16_full_eval`: False |
|
- `fp16_full_eval`: False |
|
- `tf32`: None |
|
- `local_rank`: 0 |
|
- `ddp_backend`: None |
|
- `tpu_num_cores`: None |
|
- `tpu_metrics_debug`: False |
|
- `debug`: [] |
|
- `dataloader_drop_last`: False |
|
- `dataloader_num_workers`: 0 |
|
- `dataloader_prefetch_factor`: None |
|
- `past_index`: -1 |
|
- `disable_tqdm`: False |
|
- `remove_unused_columns`: True |
|
- `label_names`: None |
|
- `load_best_model_at_end`: True |
|
- `ignore_data_skip`: False |
|
- `fsdp`: [] |
|
- `fsdp_min_num_params`: 0 |
|
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} |
|
- `fsdp_transformer_layer_cls_to_wrap`: None |
|
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} |
|
- `deepspeed`: None |
|
- `label_smoothing_factor`: 0.0 |
|
- `optim`: adamw_torch |
|
- `optim_args`: None |
|
- `adafactor`: False |
|
- `group_by_length`: False |
|
- `length_column_name`: length |
|
- `ddp_find_unused_parameters`: None |
|
- `ddp_bucket_cap_mb`: None |
|
- `ddp_broadcast_buffers`: False |
|
- `dataloader_pin_memory`: True |
|
- `dataloader_persistent_workers`: False |
|
- `skip_memory_metrics`: True |
|
- `use_legacy_prediction_loop`: False |
|
- `push_to_hub`: True |
|
- `resume_from_checkpoint`: None |
|
- `hub_model_id`: Sami92/multilingual-e5-large-instruct-eu-parl-de |
|
- `hub_strategy`: every_save |
|
- `hub_private_repo`: False |
|
- `hub_always_push`: False |
|
- `gradient_checkpointing`: True |
|
- `gradient_checkpointing_kwargs`: None |
|
- `include_inputs_for_metrics`: False |
|
- `eval_do_concat_batches`: True |
|
- `fp16_backend`: auto |
|
- `push_to_hub_model_id`: multilingual-e5-large-instruct-eu-parl-de |
|
- `push_to_hub_organization`: None |
|
- `mp_parameters`: |
|
- `auto_find_batch_size`: False |
|
- `full_determinism`: False |
|
- `torchdynamo`: None |
|
- `ray_scope`: last |
|
- `ddp_timeout`: 1800 |
|
- `torch_compile`: False |
|
- `torch_compile_backend`: None |
|
- `torch_compile_mode`: None |
|
- `dispatch_batches`: None |
|
- `split_batches`: None |
|
- `include_tokens_per_second`: False |
|
- `include_num_input_tokens_seen`: False |
|
- `neftune_noise_alpha`: None |
|
- `optim_target_modules`: None |
|
- `batch_eval_metrics`: False |
|
- `batch_sampler`: batch_sampler |
|
- `multi_dataset_batch_sampler`: proportional |
|
|
|
</details> |
|
|
|
### Training Logs |
|
| Epoch  | Step  | Training Loss | loss   | euro-parl-binary_max_ap | euro-parl-triplet_max_accuracy |
|:------:|:-----:|:-------------:|:------:|:-----------------------:|:------------------------------:|
| 0      | 0     | -             | -      | 0.9998                  | 0.9998                         |
| 0.0302 | 500   | 0.0179        | -      | -                       | -                              |
| 0.0603 | 1000  | 0.0221        | -      | -                       | -                              |
| 0.0905 | 1500  | 0.0163        | -      | -                       | -                              |
| 0.1206 | 2000  | 0.0163        | 0.0144 | 0.9997                  | 0.9996                         |
| 0.1508 | 2500  | 0.017         | -      | -                       | -                              |
| 0.1809 | 3000  | 0.0136        | -      | -                       | -                              |
| 0.2111 | 3500  | 0.0157        | -      | -                       | -                              |
| 0.2412 | 4000  | 0.0161        | 0.0135 | 0.9997                  | 0.9996                         |
| 0.2714 | 4500  | 0.0188        | -      | -                       | -                              |
| 0.3015 | 5000  | 0.024         | -      | -                       | -                              |
| 0.3317 | 5500  | 0.0178        | -      | -                       | -                              |
| 0.3618 | 6000  | 0.0119        | 0.0114 | 0.9997                  | 0.9996                         |
| 0.3920 | 6500  | 0.0132        | -      | -                       | -                              |
| 0.4221 | 7000  | 0.0117        | -      | -                       | -                              |
| 0.4523 | 7500  | 0.0127        | -      | -                       | -                              |
| 0.4824 | 8000  | 0.0112        | 0.0108 | 0.9997                  | 0.9997                         |
| 0.5126 | 8500  | 0.0109        | -      | -                       | -                              |
| 0.5427 | 9000  | 0.0098        | -      | -                       | -                              |
| 0.5729 | 9500  | 0.0084        | -      | -                       | -                              |
| 0.6030 | 10000 | 0.0085        | 0.0098 | 0.9998                  | 0.9997                         |
| 0.6332 | 10500 | 0.0083        | -      | -                       | -                              |
| 0.6633 | 11000 | 0.0081        | -      | -                       | -                              |
| 0.6935 | 11500 | 0.007         | -      | -                       | -                              |
| 0.7236 | 12000 | 0.0088        | 0.0088 | 0.9998                  | 0.9997                         |
| 0.7538 | 12500 | 0.0065        | -      | -                       | -                              |
| 0.7839 | 13000 | 0.0066        | -      | -                       | -                              |
| 0.8141 | 13500 | 0.0067        | -      | -                       | -                              |
| 0.8443 | 14000 | 0.0059        | 0.0076 | 0.9998                  | 0.9998                         |
|
|
|
|
|
### Framework Versions |
|
- Python: 3.10.12 |
|
- Sentence Transformers: 3.0.1 |
|
- Transformers: 4.41.2 |
|
- PyTorch: 2.3.1+cu121 |
|
- Accelerate: 0.32.0 |
|
- Datasets: 2.20.0 |
|
- Tokenizers: 0.19.1 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Sentence Transformers |
|
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
|
|
|