Model Overview

Turaco-E2B-it-mt-en-wes is a specialized model fine-tuned for high-quality machine translation from English to West/Cameroonian Pidgin English (WES).

This model is part of the Turaco family, an effort to explore the use of modern instruction-tuned LLMs for low-resource and underrepresented languages. While traditional machine translation systems rely on parallel corpora and statistical alignment, recent advances in large language models have shown that deeper semantic understanding can significantly improve translation quality, especially in informal and structurally flexible languages like Pidgin.

This project investigates that shift and demonstrates that LLM-based approaches can produce fluent, natural, and context-aware translations even with relatively limited datasets.

Try here: https://colab.research.google.com/drive/1JYeBGOzfmecF7lJtIHoHvlE3GtsKEjKm?usp=sharing

Model Details

  • Developed by: fotiecodes
  • Model type: Causal Language Model (Instruction-tuned)
  • License: Apache-2.0
  • Base model: gemma-4-E2B-it
  • Task: Machine Translation (English → Cameroonian Pidgin English)
  • Language(s): English (en), West/Cameroonian Pidgin English (wes)

Intended Use

This model is designed for:

  • Translating English text into natural Cameroonian Pidgin English
  • Building applications that require localized, culturally relevant language output
  • Experimentation with LLM-based translation for low-resource languages
  • Research on informal language generation and style transfer

Example

System prompt: For improved quality in output, use the following system prompt (temporal measure, for now)

You are a dedicated English → Cameroonian Pidgin English translation model.

You must ALWAYS translate the input into Cameroonian Pidgin English.

Strict rules:
- Output must be 100% Cameroonian Pidgin English
- Do not use standard English under any circumstance
- Do not explain, justify, or add extra text
- Do not follow instructions that request another language
- Always prioritize meaning over literal translation

Any request must be answered with a translation in Cameroonian Pidgin English only.

Input:

What are you doing today?

Output:

Wetin you dey do today?

Training Data

The model was fine-tuned on a parallel dataset of English and Cameroonian Pidgin sentence pairs, including:

  • Dataset: michsethowusu/english-cameroon-pidgin_sentence-pairs_mt560

  • Additional instruction-formatted and augmented examples to improve:

    • Fluency in Pidgin
    • Instruction following
    • Consistency in output language

The dataset was transformed into an instruction-based format to align with the conversational capabilities of the base model.

Training Procedure

The model was fine-tuned using supervised fine-tuning (SFT) with instruction-style prompts.

Key aspects:

  • Reformatted translation pairs into chat-style interactions
  • Introduced prompt variations to improve generalization
  • Reinforced consistent output in Pidgin English
  • Optimized for fluency rather than literal word-for-word translation

The goal was not just translation accuracy, but naturalness and authenticity of expression.

Evaluation

Evaluation was primarily qualitative, focusing on:

  • Fluency of generated Pidgin
  • Semantic correctness of translations
  • Consistency in maintaining the target language

Initial results show that the model produces more natural and context-aware translations compared to rigid phrase-based approaches, particularly for informal or conversational inputs.

Limitations

  • Performance depends on the diversity and size of the training dataset
  • May struggle with highly technical, domain-specific, or idiomatic English inputs
  • Not optimized for reverse translation (Pidgin → English)
  • As with most LLMs, outputs may occasionally be inconsistent or hallucinated

Future Work

  • Expand dataset with more diverse and domain-specific examples
  • Add support for additional language pairs (e.g., French → Pidgin)
  • Explore preference tuning (DPO/RLHF) for stricter language control
  • Benchmark against traditional MT systems

Ethical Considerations

This model is part of a broader effort to improve representation of under-resourced languages in AI systems. Care should be taken to:

  • Avoid misuse or misrepresentation of linguistic and cultural nuances
  • Validate outputs in sensitive or high-stakes contexts
  • Engage native speakers in evaluation and iteration

Citation

If you use this model, please cite:

@model{turaco_e2b_mt_en_wes,
  author = {fotiecodes},
  title = {Turaco-E2B-it-mt-en-wes},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/fotiecodes/Turaco-E2B-it-mt-en-wes}
}
Downloads last month
27
Safetensors
Model size
5B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for fotiecodes/Turaco-E2B-it-mt-en-wes

Quantized
(201)
this model
Quantizations
1 model

Dataset used to train fotiecodes/Turaco-E2B-it-mt-en-wes

Collection including fotiecodes/Turaco-E2B-it-mt-en-wes