	---
	license: cc-by-nc-4.0
	base_model: facebook/nllb-200-1.3B
	metrics:
	- bleu
	model-index:
	- name: Terjman-Ultra
	  results: []
	datasets:
	- atlasia/darija_english
	language:
	- ar
	- en
	---


	# Terjman-Ultra (1.3B)

	Terjman-Ultra is a Transformer-based model for translating English into Moroccan Darija.
	It is a fine-tuned version of [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) on the [darija_english](https://huggingface.co/datasets/atlasia/darija_english) dataset, enhanced with curated corpora to improve translation quality.

	It achieves the following results on the evaluation set:
	- Loss: 2.7070
	- Bleu: 4.6998
	- Gen Len: 35.6088

	The fine-tuning was conducted on an A100-40GB GPU and took 32 hours.

	Try it out on our dedicated [Terjman-Ultra Space](https://huggingface.co/spaces/atlasia/Terjman-Ultra) 🤗

	## Usage

	Using our model for translation is simple and straightforward.
	You can integrate it into your projects or workflows via the Hugging Face Transformers library.
	Here's a basic example of how to use the model in Python:

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Load the tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Ultra")
	model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Ultra")

	# Define your English text
	input_text = "Your english text goes here."

	# Tokenize the input text
	input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

	# Perform translation
	output_tokens = model.generate(**input_tokens)

	# Decode the output tokens
	output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

	print("Translation:", output_text)
	```
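
To translate more than one sentence, the same calls can be wrapped in a small batching helper. The following is a minimal sketch, not part of the original model card: the function name `translate_batch` and the `batch_size` and `max_new_tokens` defaults are illustrative assumptions, and it expects the `tokenizer` and `model` objects loaded as shown above.

```python
def translate_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=128):
    """Translate a list of English sentences in fixed-size batches.

    Hypothetical helper: the name and default values are assumptions.
    """
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch so the inputs form one tensor
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # max_new_tokens bounds how many tokens are generated per sentence
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # batch_decode converts every output sequence back to text in one call
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

It can then be called as `translate_batch(["Hello!", "How are you?"], tokenizer, model)`, which returns one translated string per input sentence, in the same order.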

	## Example

	Let's see an example of translating English to Moroccan Darija:

	Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

	Output: "أهلا صاحبي، تقدر تقولي مزحة بالدارجة المغربية؟ غادي نكون فرحان باش نسمعها منك!"

	## Limitations

	This version has some limitations, mainly due to the tokenizer.
	We're currently collecting more data with the aim of continuous improvement.

	## Feedback

	We're working to improve the model's performance and usability incrementally.
	If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.

	## Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 3e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.03
	- num_epochs: 25
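
Two of these values follow from the others. `total_train_batch_size` is `train_batch_size` × `gradient_accumulation_steps` (4 × 4 = 16, assuming a single device), and the learning rate at any step follows from the linear schedule with warmup. The sketch below works through both in plain Python. It assumes the schedule ramps linearly from 0 to the peak over the warmup steps and then decays linearly to 0, that the warmup length is the ratio times the total step count rounded up, and that the total step count is the last step reported in the training results (56050). The function name `linear_lr` is illustrative.

```python
import math

LEARNING_RATE = 3e-05
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
WARMUP_RATIO = 0.03
TOTAL_STEPS = 56050  # assumption: last step listed in the training results

# Effective batch size per optimizer update (assuming one device)
total_train_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Assumption: warmup length is the ratio times total steps, rounded up
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return LEARNING_RATE * (step / warmup_steps)
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * (remaining / (TOTAL_STEPS - warmup_steps))


print(total_train_batch_size)   # 16
print(warmup_steps)             # 1682
print(linear_lr(warmup_steps))  # 3e-05 (peak)
print(linear_lr(TOTAL_STEPS))   # 0.0
```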

	## Training results

	| Training Loss | Epoch | Step | Validation Loss | Bleu | Gen Len |
	|:-------------:|:-------:|:-----:|:---------------:|:------:|:-------:|
	| 3.203 | 0.9999 | 2242 | 2.9015 | 4.3057 | 36.7548 |
	| 2.9175 | 1.9998 | 4484 | 2.7602 | 4.4286 | 35.708 |
	| 2.8558 | 2.9997 | 6726 | 2.7303 | 4.629 | 35.562 |
	| 2.8696 | 4.0 | 8969 | 2.7195 | 4.6537 | 35.562 |
	| 2.8604 | 4.9999 | 11211 | 2.7144 | 4.6905 | 35.5702 |
	| 2.8509 | 5.9998 | 13453 | 2.7112 | 4.599 | 35.5427 |
	| 2.853 | 6.9997 | 15695 | 2.7098 | 4.6625 | 35.5317 |
	| 2.8475 | 8.0 | 17938 | 2.7081 | 4.6901 | 35.6419 |
	| 2.8192 | 8.9999 | 20180 | 2.7082 | 4.5474 | 35.6391 |
	| 2.8395 | 9.9998 | 22422 | 2.7077 | 4.722 | 35.6088 |
	| 2.8395 | 10.9997 | 24664 | 2.7076 | 4.752 | 35.5868 |
	| 2.8362 | 12.0 | 26907 | 2.7074 | 4.6664 | 35.562 |
	| 2.8673 | 12.9999 | 29149 | 2.7072 | 4.7004 | 35.6639 |
	| 2.8465 | 13.9998 | 31391 | 2.7076 | 4.6715 | 35.5923 |
	| 2.8281 | 14.9997 | 33633 | 2.7075 | 4.7045 | 35.5647 |
	| 2.8191 | 16.0 | 35876 | 2.7068 | 4.7487 | 35.6253 |
	| 2.874 | 16.9999 | 38118 | 2.7076 | 4.71 | 35.6006 |
	| 2.8666 | 17.9998 | 40360 | 2.7069 | 4.6047 | 35.6281 |
	| 2.8645 | 18.9997 | 42602 | 2.7063 | 4.6664 | 35.6088 |
	| 2.8458 | 20.0 | 44845 | 2.7070 | 4.6552 | 35.5813 |
	| 2.8501 | 20.9999 | 47087 | 2.7074 | 4.6919 | 35.5647 |
	| 2.8309 | 21.9998 | 49329 | 2.7074 | 4.623 | 35.6226 |
	| 2.854 | 22.9997 | 51571 | 2.7072 | 4.6495 | 35.5978 |
	| 2.8407 | 24.0 | 53814 | 2.7070 | 4.6879 | 35.5482 |
	| 2.8129 | 24.9972 | 56050 | 2.7070 | 4.6998 | 35.6088 |
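
The table can be scanned programmatically to see where each metric peaks. The snippet below is a small illustration using the (step, validation loss, BLEU) values copied from the rows above; the variable names are arbitrary. Validation loss is lowest at step 42602 and BLEU is highest at step 24664, while the final checkpoint (step 56050) is within 0.06 BLEU of that peak.

```python
# (step, validation_loss, bleu) copied from the training results table
rows = [
    (2242, 2.9015, 4.3057), (4484, 2.7602, 4.4286), (6726, 2.7303, 4.629),
    (8969, 2.7195, 4.6537), (11211, 2.7144, 4.6905), (13453, 2.7112, 4.599),
    (15695, 2.7098, 4.6625), (17938, 2.7081, 4.6901), (20180, 2.7082, 4.5474),
    (22422, 2.7077, 4.722), (24664, 2.7076, 4.752), (26907, 2.7074, 4.6664),
    (29149, 2.7072, 4.7004), (31391, 2.7076, 4.6715), (33633, 2.7075, 4.7045),
    (35876, 2.7068, 4.7487), (38118, 2.7076, 4.71), (40360, 2.7069, 4.6047),
    (42602, 2.7063, 4.6664), (44845, 2.7070, 4.6552), (47087, 2.7074, 4.6919),
    (49329, 2.7074, 4.623), (51571, 2.7072, 4.6495), (53814, 2.7070, 4.6879),
    (56050, 2.7070, 4.6998),
]

# Checkpoint with the lowest validation loss
best_loss = min(rows, key=lambda row: row[1])
# Checkpoint with the highest BLEU score
best_bleu = max(rows, key=lambda row: row[2])

print(best_loss)  # (42602, 2.7063, 4.6664)
print(best_bleu)  # (24664, 2.7076, 4.752)
```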


	## Framework versions

	- Transformers 4.40.2
	- Pytorch 2.2.1+cu121
	- Datasets 2.19.1
	- Tokenizers 0.19.1