Terjman-Large (240M params)

Our model is built on the Transformer architecture and leverages state-of-the-art natural language processing techniques. It is a fine-tuned version of Helsinki-NLP/opus-mt-tc-big-en-ar on the darija_english dataset, enhanced with curated corpora to ensure high-quality and accurate translations.

It achieves the following results on the evaluation set:

  • Loss: 3.2078
  • Bleu: 8.3292
  • Gen Len: 34.4959

The fine-tuning was conducted on an A100-40GB GPU and took 23 hours.

Try it out on our dedicated Terjman-Large Space 🤗

Usage

Using our model for translation is simple and straightforward. You can integrate it into your projects or workflows via the Hugging Face Transformers library. Here's a basic example of how to use the model in Python:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")

# Define the English text to translate to Moroccan Darija
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)

Example

Let's see an example of translating English into Moroccan Darija (Arabic script):

Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

Output: "ู…ุฑุญุจุง ุตุฏูŠู‚ูŠุŒ ูŠู…ูƒู† ู„ูƒ ุชู‚ูˆู„ ู„ูŠ ู†ูƒุชุฉ ููŠ ุฏุงุฑูŠุฌุง ุงู„ู…ุบุฑุจูŠุฉุŸ ุณุฃูƒูˆู† ุณุนูŠุฏุง ุจุณู…ุงุนู‡ุง ู…ู†ูƒ!"

Limitations

This version has some limitations, mainly due to the tokenizer. We are currently collecting more data with the aim of continuous improvement.

Feedback

We are continuously working to improve the model's performance and usability, and we will keep improving it incrementally. If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 22
  • eval_batch_size: 22
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 88
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 40
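
For readers who want to set up a comparable run, the hyperparameters above map roughly onto Seq2SeqTrainingArguments as sketched below. This is an approximate reconstruction, not the original training script; output_dir, predict_with_generate, and bf16 are assumptions.

from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the hyperparameters listed above.
# Adam betas (0.9, 0.999) and epsilon 1e-08 match the Transformers defaults, so they are not set explicitly.
training_args = Seq2SeqTrainingArguments(
    output_dir="terjman-large-finetune",   # assumption
    learning_rate=3e-05,
    per_device_train_batch_size=22,
    per_device_eval_batch_size=22,
    gradient_accumulation_steps=4,         # total train batch size: 88 on a single GPU
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    num_train_epochs=40,
    predict_with_generate=True,            # assumption: needed to report BLEU and Gen Len
    bf16=True,                             # assumption, consistent with the published BF16 weights
)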

Training results

| Training Loss | Epoch   | Step  | Validation Loss | Bleu   | Gen Len |
|:-------------:|:-------:|:-----:|:---------------:|:------:|:-------:|
| No log        | 0.9982  | 407   | 4.3938          | 4.6056 | 22.6033 |
| 5.1616        | 1.9988  | 815   | 3.7257          | 5.8319 | 30.9201 |
| 3.902         | 2.9994  | 1223  | 3.5214          | 6.7311 | 32.9091 |
| 3.5737        | 4.0     | 1631  | 3.4204          | 7.3684 | 32.1433 |
| 3.4576        | 4.9982  | 2038  | 3.3562          | 7.8632 | 34.5399 |
| 3.4576        | 5.9988  | 2446  | 3.3151          | 7.9739 | 35.3278 |
| 3.3833        | 6.9994  | 2854  | 3.2884          | 8.0825 | 35.8292 |
| 3.3358        | 8.0     | 3262  | 3.2681          | 8.2765 | 34.5427 |
| 3.3069        | 8.9982  | 3669  | 3.2517          | 8.1019 | 33.584  |
| 3.2769        | 9.9988  | 4077  | 3.2404          | 8.106  | 33.3802 |
| 3.2769        | 10.9994 | 4485  | 3.2342          | 8.3037 | 33.303  |
| 3.2777        | 12.0    | 4893  | 3.2284          | 8.0674 | 33.3967 |
| 3.2476        | 12.9982 | 5300  | 3.2226          | 8.2883 | 33.8154 |
| 3.2611        | 13.9988 | 5708  | 3.2189          | 8.3537 | 34.0413 |
| 3.2511        | 14.9994 | 6116  | 3.2159          | 8.1365 | 34.5014 |
| 3.2437        | 16.0    | 6524  | 3.2140          | 8.3549 | 34.0606 |
| 3.2437        | 16.9982 | 6931  | 3.2131          | 8.2507 | 34.303  |
| 3.2498        | 17.9988 | 7339  | 3.2116          | 8.2928 | 33.9945 |
| 3.2341        | 18.9994 | 7747  | 3.2105          | 8.337  | 33.7052 |
| 3.2403        | 20.0    | 8155  | 3.2098          | 8.3179 | 34.3526 |
| 3.2229        | 20.9982 | 8562  | 3.2094          | 8.3848 | 34.2039 |
| 3.2229        | 21.9988 | 8970  | 3.2090          | 8.2042 | 34.6529 |
| 3.2379        | 22.9994 | 9378  | 3.2086          | 8.4227 | 34.0275 |
| 3.2257        | 24.0    | 9786  | 3.2082          | 8.3515 | 34.3306 |
| 3.2526        | 24.9982 | 10193 | 3.2085          | 8.4089 | 34.4986 |
| 3.2206        | 25.9988 | 10601 | 3.2082          | 8.476  | 34.6226 |
| 3.2288        | 26.9994 | 11009 | 3.2083          | 8.4452 | 33.697  |
| 3.2288        | 28.0    | 11417 | 3.2080          | 8.29   | 34.0331 |
| 3.2251        | 28.9982 | 11824 | 3.2080          | 8.35   | 34.2948 |
| 3.2302        | 29.9988 | 12232 | 3.2078          | 8.4408 | 33.416  |
| 3.21          | 30.9994 | 12640 | 3.2079          | 8.2934 | 34.0854 |
| 3.2271        | 32.0    | 13048 | 3.2079          | 8.4573 | 33.3912 |
| 3.2271        | 32.9982 | 13455 | 3.2078          | 8.4055 | 34.2452 |
| 3.2428        | 33.9988 | 13863 | 3.2079          | 8.5107 | 34.5152 |
| 3.2303        | 34.9994 | 14271 | 3.2080          | 8.3734 | 34.2562 |
| 3.2129        | 36.0    | 14679 | 3.2079          | 8.3193 | 34.4628 |
| 3.2119        | 36.9982 | 15086 | 3.2082          | 8.4122 | 34.2121 |
| 3.2119        | 37.9988 | 15494 | 3.2078          | 8.3585 | 33.8843 |
| 3.2445        | 38.9994 | 15902 | 3.2079          | 8.3968 | 34.6722 |
| 3.2356        | 39.9264 | 16280 | 3.2078          | 8.3292 | 34.4959 |

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1