---
library_name: transformers
license: mit
datasets:
  - galsenai/centralized_wolof_french_translation_data
language:
  - wo
  - fr
base_model:
  - facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model

Model Details

Model Description

A fine-tuned version of Meta's NLLB-200 distilled 600M model, specialized for French-to-Wolof translation. The model was trained to make content more accessible between the French and Wolof languages.

  • Developed by: Lahad
  • Model type: Sequence-to-Sequence Translation Model
  • Language(s): French (fra_Latn) ↔️ Wolof (wol_Latn)
  • License: CC-BY-NC-4.0
  • Finetuned from model: facebook/nllb-200-distilled-600M

Model Sources

Uses

Direct Use

  • Text translation between French and Wolof
  • Content localization
  • Language learning assistance
  • Cross-cultural communication

Out-of-Scope Use

  • Commercial use without proper licensing
  • Translation of highly technical or specialized content
  • Legal or medical document translation where professional human translation is required
  • Real-time speech translation

Bias, Risks, and Limitations

  1. Language Variety Limitations:
    • Limited coverage of regional Wolof dialects
    • May not handle cultural nuances effectively
  2. Technical Limitations:
    • Maximum context window of 128 tokens
    • Reduced performance on technical or specialized content
    • May struggle with informal language and slang
  3. Potential Biases:
    • Training data may reflect cultural biases
    • May perform better on standard/formal language

Recommendations

  • Use for general communication and content translation
  • Verify translations for critical communications
  • Consider regional language variations
  • Implement human review for sensitive content
  • Test translations in intended context before deployment

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer.
# src_lang tags the input as French (fra_Latn is the NLLB-200 code for French).
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof", src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translate French text into Wolof
def translate(text, max_length=128):
    # Tokenize the source text, truncating to the model's 128-token context window
    inputs = tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        return_tensors="pt"
    )

    # Force the decoder to start with the Wolof language token
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
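
A minimal usage sketch of the helper above; the French sentence is arbitrary and the Wolof output depends on the model. Since the card describes the pair as bidirectional, the reverse direction should follow the same pattern with the language codes swapped, though this example only covers French to Wolof.

# Translate a short French sentence into Wolof
print(translate("Bonjour, comment allez-vous ?"))

# For Wolof -> French, re-load the tokenizer with src_lang="wol_Latn" and pass
# forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn") to generate().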

Training Details

Training Data

  • Dataset: galsenai/centralized_wolof_french_translation_data
  • Split: 80% training, 20% testing
  • Format: JSON pairs of French and Wolof translations
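
A sketch of how such a split can be reproduced with the datasets library; the seed is illustrative and this is not necessarily the exact code used for training (it assumes the data ships as a single train split):

from datasets import load_dataset

# Load the Wolof/French pairs and hold out 20% for testing
dataset = load_dataset("galsenai/centralized_wolof_french_translation_data")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data, test_data = split["train"], split["test"]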

Training Procedure

Preprocessing

  • Dynamic tokenization with padding
  • Maximum sequence length: 128 tokens
  • Source/target language tags: fra_Latn/wol_Latn
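
A sketch of this preprocessing step, assuming the train/test splits from the sketch above; the column names ("fr", "wo") are guesses about the dataset schema and may differ:

from transformers import AutoTokenizer

# Tokenizer of the base checkpoint, tagged with the source/target languages
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="fra_Latn", tgt_lang="wol_Latn"
)

def preprocess(example, max_length=128):
    # Tokenize source and target together; truncate to the 128-token window
    # ("fr"/"wo" column names are assumptions about the dataset schema)
    return tokenizer(
        example["fr"],
        text_target=example["wo"],
        max_length=max_length,
        truncation=True,
    )

tokenized_train = train_data.map(preprocess, remove_columns=train_data.column_names)
tokenized_test = test_data.map(preprocess, remove_columns=test_data.column_names)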

Training Hyperparameters

  • Learning rate: 2e-5
  • Batch size: 8 per device
  • Training epochs: 3
  • FP16 training: Enabled
  • Evaluation strategy: Per epoch
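
As a sketch, these settings map onto Seq2SeqTrainingArguments roughly as follows, reusing the tokenizer and tokenized splits from the preprocessing sketch above; the output directory and the dynamic-padding collator are assumptions, not taken from the original training script:

from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Fine-tune from the base checkpoint named in this card
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Dynamic padding: each batch is padded to its longest sequence at collation time
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",     # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",                   # "evaluation_strategy" on older releases
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
)
trainer.train()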

Evaluation

Testing Data, Factors & Metrics

  • Testing Data: 20% of dataset (held-out split)
  • Metrics: [Not Specified]
  • Evaluation Factors:
    • Translation accuracy
    • Semantic preservation
    • Grammar correctness
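
The card does not name a specific automatic metric. As one illustration of how translation accuracy could be checked on the held-out split, a corpus-level BLEU score can be computed with sacrebleu, reusing the translate() helper and test_data from the sketches above; this is an assumption, not the metric actually used:

import sacrebleu

# Score model translations of the French side against the Wolof references
# ("fr"/"wo" column names are assumed, as above)
hypotheses = [translate(ex["fr"]) for ex in test_data]
references = [[ex["wo"] for ex in test_data]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")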

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: 5
  • Cloud Provider: [Not Specified]
  • Compute Region: [Not Specified]
  • Carbon Emitted: [Not Calculated]

Technical Specifications

Model Architecture and Objective

  • Architecture: NLLB-200 (Distilled 600M version)
  • Objective: Neural Machine Translation
  • Parameters: 600M
  • Context Window: 128 tokens

Compute Infrastructure

  • Training Hardware: NVIDIA T4 GPU
  • Training Time: 5 hours
  • Software Framework: Hugging Face Transformers

Model Card Contact

For questions about this model, please create an issue on the model's Hugging Face repository.