---
language:
  - en
  - ku
tags:
  - translation
  - ctranslate2
license: cc-by-nc-4.0
---
# Introduction
This is an English-Kurdish machine translation model.

# Demo
You can try this translator in the [Space](https://huggingface.co/spaces/lingvanex/lingvanex_en-ku_translator).

# Metrics
* Model performance measures: The English-Kurdish model was evaluated using SacreBLEU, TER, and chrF++, metrics widely adopted by the machine translation community.
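To illustrate what a character n-gram metric like chrF++ measures, here is a simplified, stdlib-only sketch of chrF-style scoring (averaged character n-gram precision and recall combined into an F-beta score). This is an illustrative approximation, not the official SacreBLEU implementation, and it omits chrF++'s word n-gram component.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with whitespace removed (simplification).
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average char n-gram precision/recall, F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall higher, as in chrF.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect match scores 1.0; for real evaluation, use the `sacrebleu` package, which implements the canonical SacreBLEU, TER, and chrF++ metrics.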

# Evaluation Data
* Datasets: The Lingvanex dataset is described in Section 4.
* Motivation: We used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200.
* Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.

# Training Data
* We used parallel multilingual data from a variety of sources to train the model. A detailed report on the data selection and construction process is given in Section 5 of the paper. We also used monolingual data constructed from Common Crawl; more details are provided in Section 5.2.


# Intended Use
* Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository, along with the training code and references to evaluation and training data.
* Primary intended users: Primary users are researchers and the machine translation research community.
* Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal documents. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences may result in quality degradation. NLLB-200 translations cannot be used as certified translations.
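Given the 512-token input limit noted above, a caller may want to guard against overly long inputs before translation. The following is a minimal sketch of a hypothetical chunking helper (not part of the released model or its tooling); in practice, splitting at sentence boundaries is preferable to splitting at a fixed token count.

```python
def split_into_chunks(tokens, max_len=512):
    """Split a token sequence into chunks of at most max_len tokens.

    Hypothetical helper: the model was trained on inputs of up to 512
    tokens, so longer sequences can be chunked before translation to
    avoid quality degradation.
    """
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```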