---
license: apache-2.0
datasets:
- Helsinki-NLP/opus_paracrawl
- turuta/Multi30k-uk
language:
- uk
- en
metrics:
- bleu
library_name: peft
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- translation
model-index:
- name: Dragoman
  results:
  - task:
      type: translation
      name: English-Ukrainian Translation
    dataset:
      type: facebook/flores
      name: FLORES-101
      config: eng_Latn-ukr_Cyrl
      split: devtest
    metrics:
      - type: bleu
        value: 32.34
        name: Test BLEU
widget:
- text: "[INST] who holds this neighborhood? [/INST]"
---

# Dragoman: English-Ukrainian Machine Translation Model

## Model Description

Dragoman is a sentence-level English-Ukrainian translation model with state-of-the-art (SOTA) results. It is trained with a two-phase pipeline: pretraining on a cleaned [Paracrawl](https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl) dataset, followed by an unsupervised data selection phase on [turuta/Multi30k-uk](https://huggingface.co/datasets/turuta/Multi30k-uk).

With this two-phase approach of data cleaning and data selection, the model achieves SOTA performance on the FLORES-101 English-Ukrainian devtest subset, with a **BLEU** score of `32.34`.


## Model Details

- **Developed by:** Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov 
- **Model type:** Translation model
- **Language(s):**  
  - Source Language: English
  - Target Language: Ukrainian
- **License:** Apache 2.0
  
## Model Use Cases

We designed this model for sentence-level English -> Ukrainian translation.
Performance on multi-sentence texts is not guaranteed.
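
Since the model targets single sentences, longer inputs can be split before translation. The following is a minimal sketch, not part of the released code: `split_sentences` and `build_prompts` are hypothetical helpers, the regex splitter is deliberately naive (it will mis-split abbreviations such as "e.g."), and the `[INST] ... [/INST]` wrapper follows the prompt format shown in this card's widget example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def build_prompts(text: str) -> list[str]:
    """Wrap each sentence in the [INST] ... [/INST] prompt format."""
    return [f"[INST] {sentence} [/INST]" for sentence in split_sentences(text)]


prompts = build_prompts("Who holds this neighborhood? I do not know.")
for prompt in prompts:
    print(prompt)
```

Each resulting prompt can then be passed to the model separately, and the translated sentences joined afterwards. A dedicated sentence segmenter would be a more robust choice for real text.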


### Running the model


```python
# pip install bitsandbytes accelerate transformers peft torch
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the base model in 4-bit NF4 precision to reduce GPU memory use.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# device_map places the quantized weights on the GPU at load time,
# so no separate .to("cuda") call is needed for the model.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quant_config,
    device_map="auto",
)
# Attach the Dragoman adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, "lang-uk/dragoman")
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

input_text = "[INST] who holds this neighborhood? [/INST]"  # model input should adhere to this format
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# max_new_tokens is an illustrative limit; adjust it to the expected sentence length.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
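
The decoded output of a causal language model contains the prompt as well as the generated continuation. The sketch below shows one way to keep only the translation. It assumes the translation follows the last `[/INST]` marker and that the only special tokens to strip are Mistral's `<s>` and `</s>`. `extract_translation` is a hypothetical helper, and the sample string is hand-written for illustration, not real model output.

```python
def extract_translation(decoded: str) -> str:
    """Return the text after the last [/INST] marker, without special tokens."""
    # rpartition returns the whole string as the tail when the marker is absent.
    tail = decoded.rpartition("[/INST]")[2]
    for token in ("<s>", "</s>"):
        tail = tail.replace(token, "")
    return tail.strip()


# Hand-written placeholder standing in for tokenizer.decode(outputs[0]).
sample = "[INST] who holds this neighborhood? [/INST] Приклад перекладу.</s>"
print(extract_translation(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` removes the special tokens at decode time, in which case only the prompt needs to be stripped.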

### Training Dataset and Resources

Training code: [lang-uk/dragoman](https://github.com/lang-uk/dragoman)  
Cleaned Paracrawl: [lang-uk/paracrawl_3m](https://huggingface.co/datasets/lang-uk/paracrawl_3m)  
Cleaned Multi30K: [lang-uk/multi30k-extended-17k](https://huggingface.co/datasets/lang-uk/multi30k-extended-17k)



### Benchmark Results against other models on FLORES-101 devtest


| **Model**                                   | **BLEU** $\uparrow$ | **spBLEU** | **chrF** | **chrF++** |
|---------------------------------------------|---------------------|-------------|----------|------------|
| **Finetuned**                               |                     |             |          |            |
| Dragoman P, 10 beams                        | 30.38               | 37.93       | 59.49    | 56.41      |
| Dragoman PT, 10 beams                       | **32.34**           | **39.93**   | **60.72**| **57.82**  |
| **Zero shot and few shot**                  |                     |             |          |            |
| LLaMa-2-7B 2-shot                           | 20.1                | 26.78       | 49.22    | 46.29      |
| RWKV-5-World-7B 0-shot                      | 21.06               | 26.20       | 49.46    | 46.46      |
| gpt-4 10-shot                               | 29.48               | 37.94       | 58.37    | 55.38      |
| gpt-4-turbo-preview 0-shot                  | 30.36               | 36.75       | 59.18    | 56.19      |
| Google Translate 0-shot                     | 25.85               | 32.49       | 55.88    | 52.48      |
| **Pretrained**                              |                     |             |          |            |
| NLLB 3B, 10 beams                           | 30.46               | 37.22       | 58.11    | 55.32      |
| OPUS-MT, 10 beams                           | 32.2                | 39.76       | 60.23    | 57.38      |


## Citation

TBD