Update README.md

README.md CHANGED
@@ -45,7 +45,7 @@ language:

# Salamandra Model Card

-SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base. The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data. The model is proficent in 37 european languages and support translation-related tasks, namely: sentence-level-translation, paragraph-level-translation, document-level-translation, automatic post
+SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base. The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data. The model is proficient in 37 European languages and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, document-level translation, automatic post-editing, machine translation evaluation, multi-reference translation, named-entity recognition and context-aware translation.

> [!WARNING]
> **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
@@ -83,8 +83,7 @@ You can translate between the following 37 languages:

Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Valencian, Welsh.

-
-The instruction-following models use the commonly adopted ChatML template:
+The instruction-following model uses the commonly adopted ChatML template:

```
<|im_start|>system
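As a minimal sketch of how this ChatML template can be applied for sentence-level translation with `transformers` (the checkpoint name `BSC-LT/salamandraTA-7b-instruct` and the exact prompt wording are assumptions, not quoted from the card):

```python
# Minimal sketch: checkpoint name and prompt wording are assumptions, not taken from the card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandraTA-7b-instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

# A single user turn; apply_chat_template wraps it in the ChatML delimiters shown above.
messages = [{
    "role": "user",
    "content": "Translate the following text from Spanish into English.\n"
               "Spanish: Hola, ¿cómo estás?\nEnglish:",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens (the translation).
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```

`apply_chat_template` reads the chat template stored with the tokenizer, so the `<|im_start|>`/`<|im_end|>` delimiters do not have to be written by hand.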
@@ -152,7 +151,7 @@ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the

### Post-edition

-For post-
+For post-editing tasks, you can try using the following prompt template:

```python
source = 'Catalan'
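The card's post-editing snippet continues beyond this hunk. Purely as an illustration of the pattern, a post-editing prompt could be assembled along these lines (the wording and variable names below are assumptions, not the card's template):

```python
# Illustrative sketch only: the prompt wording below is an assumption, not the card's template.
source = 'Catalan'
target = 'English'
src_sentence = "Ahir vam sopar a casa dels meus pares."
mt_sentence = "Yesterday we had dinner at my parent's house."  # contains a small error to be fixed

# Ask the model to correct (or keep) the machine translation of the source sentence.
prompt = (
    f"Please fix any mistakes in the following {source}-{target} machine translation "
    f"or keep it unedited if it's correct.\n"
    f"Source: {src_sentence}\n"
    f"MT: {mt_sentence}\n"
    f"Corrected:"
)

# The prompt is then wrapped in the ChatML template, e.g. via tokenizer.apply_chat_template
# with a single user message, before generation.
```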
@@ -318,7 +317,7 @@ The non-public portion of this dataset was jointly created by BSC, HiTZ, and CiT

## Evaluation

-Below are the evaluation results on the Flores+200 devtest set, compared against the state-of-the-art MADLAD400-7B model ([Kudugunta, S., et al.](https://arxiv.org/abs/2309.04662)). These results cover translation directions between CA-XX, ES-XX, EN-XX, as well as XX-CA, XX-ES, and XX-EN. The metrics have been computed excluding Asturian, Aranese, and Aragonese as we report them separately. The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:
+Below are the evaluation results on the [Flores+200 devtest set](https://huggingface.co/datasets/openlanguagedata/flores_plus), compared against the state-of-the-art MADLAD400-7B model ([Kudugunta, S., et al.](https://arxiv.org/abs/2309.04662)) and the SalamandraTA-7b-base model. These results cover translation directions between CA-XX, ES-XX, EN-XX, as well as XX-CA, XX-ES, and XX-EN. The metrics have been computed excluding Asturian, Aranese, and Aragonese, as we report them separately. The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:

<details>
<summary>Click to show metrics details</summary>
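As a rough sketch, the standard setting described above (beam search with beam size 5, translations capped at 500 tokens) corresponds to a `transformers` generation configuration like the following; this is only the decoding parameters the card reports, not the MT Lens harness itself:

```python
# Decoding parameters matching the reported standard setting:
# beam search with beam size 5 and translations limited to 500 tokens.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    num_beams=5,          # beam search, beam size 5
    do_sample=False,      # pure beam search, no sampling
    max_new_tokens=500,   # limit the translation length to 500 tokens
    early_stopping=True,  # stop once enough finished beam candidates are found
)
# output = model.generate(input_ids, generation_config=generation_config)
```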