---
datasets:
- abdulhade/TextCorpusKurdish_asosoft
language:
- ku
- en
library_name: adapter-transformers
license: mit
metrics:
- accuracy
- bleu
- meteor
pipeline_tag: translation
---

# Kurdish-English Machine Translation with Transformers

This repository focuses on fine-tuning a Kurdish-English machine translation model using Hugging Face's `transformers` library with MarianMT. The model is trained on a custom parallel corpus with a detailed pipeline that includes data preprocessing, bidirectional training, evaluation, and inference. This model is a product of the AI Center of Kurdistan University.

## Table of Contents

- [Introduction](#introduction)
- [Requirements](#requirements)
- [Setup](#setup)
- [Pipeline Overview](#pipeline-overview)
  - [Data Preparation](#data-preparation)
  - [Training SentencePiece Tokenizer](#training-sentencepiece-tokenizer)
  - [Model and Tokenizer Setup](#model-and-tokenizer-setup)
  - [Tokenization and Dataset Preparation](#tokenization-and-dataset-preparation)
  - [Training Configuration](#training-configuration)
  - [Evaluation and Metrics](#evaluation-and-metrics)
  - [Inference](#inference)
- [Results](#results)
- [License](#license)

## Introduction

This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured for bidirectional translation, so the model can be used in both language directions.

## Requirements

- Python 3.8+
- Hugging Face Transformers
- Datasets library
- SentencePiece
- PyTorch 1.9+
- CUDA (for GPU support)

## Setup

1. Clone the repository and install dependencies.
2. Ensure GPU availability.
3. Prepare your Kurdish-English corpus in CSV format.

## Pipeline Overview

The subsections below describe each stage of the pipeline; illustrative code sketches for the main steps follow the overview.

### Data Preparation

1. **Corpus**: A Kurdish-English parallel corpus in CSV format with columns `Source` (Kurdish) and `Target` (English).
2. **Path Definition**: Specify the corpus path in the configuration.

### Training SentencePiece Tokenizer

- **Vocabulary Size**: 32,000
- **Source Data**: The tokenizer is trained on both the Kurdish corpus and the English data so that the two languages share subword tokens.

### Model and Tokenizer Setup

- **Model**: `Helsinki-NLP/opus-mt-en-mul` pre-trained MarianMT model.
- **Tokenizer**: The MarianMT tokenizer aligned with the model, with source and target languages set dynamically.

### Tokenization and Dataset Preparation

- **Train-Validation Split**: 90% train, 10% validation.
- **Maximum Sequence Length**: 128 tokens for both source and target sequences.
- **Bidirectional Tokenization**: Tokenized sequences are prepared for both Kurdish-to-English and English-to-Kurdish translation.

### Training Configuration

- **Learning Rate**: 2e-5
- **Batch Size**: 4 (per device, for both training and evaluation)
- **Weight Decay**: 0.01
- **Evaluation Strategy**: Per epoch
- **Epochs**: 3
- **Logging**: Logs saved every 100 steps, with TensorBoard logging enabled
- **Output Directory**: `./results`
- **Device**: GPU 1 explicitly set

### Evaluation and Metrics

The following metrics are computed on the validation dataset:

- **BLEU**: Measures translation quality via n-gram precision with a brevity penalty.
- **METEOR**: Also considers synonymy and stem matches.
- **BERTScore**: Evaluates semantic similarity with BERT embeddings.

### Inference

Inference covers both translation directions:

- **Source to Target**: Kurdish to English translation.
- **Target to Source**: English to Kurdish translation.
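The sketches below illustrate the main pipeline steps under stated assumptions; they are not the exact training script. First, training the shared SentencePiece model. The input file names and model prefix are placeholders:

```python
import sentencepiece as spm

# Train a shared 32k-subword model on the concatenated Kurdish and English text.
# "kurdish.txt" / "english.txt" are hypothetical plain-text dumps of the
# Source and Target columns; "ku_en_sp" is an illustrative model prefix.
spm.SentencePieceTrainer.train(
    input="kurdish.txt,english.txt",
    model_prefix="ku_en_sp",
    vocab_size=32000,        # matches the vocabulary size listed above
    character_coverage=1.0,  # keep full coverage for the Kurdish script
)
```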
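Next, a minimal sketch of loading the pre-trained checkpoint and preparing the tokenized splits described above. `corpus.csv` is a placeholder path; the reverse direction can be built the same way by swapping the `Source` and `Target` columns:

```python
import pandas as pd
from datasets import Dataset
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Load the parallel corpus and make the 90/10 train-validation split.
df = pd.read_csv("corpus.csv")  # columns: Source (Kurdish), Target (English)
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1)

MAX_LENGTH = 128

def preprocess(batch):
    # Tokenize sources and targets, truncating both to 128 tokens.
    model_inputs = tokenizer(batch["Source"], max_length=MAX_LENGTH, truncation=True)
    labels = tokenizer(text_target=batch["Target"], max_length=MAX_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)
```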
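The training configuration above maps onto `Seq2SeqTrainingArguments` roughly as follows; this is a sketch, and argument names can vary slightly across `transformers` releases:

```python
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # pin training to GPU 1, as noted above

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # renamed to "eval_strategy" in newer releases
    logging_steps=100,
    report_to="tensorboard",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Persist the fine-tuned weights and tokenizer (path from the Results section).
trainer.save_model("./fine-tuned-marianmt")
tokenizer.save_pretrained("./fine-tuned-marianmt")
```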
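The three metrics can be scripted with the Hugging Face `evaluate` library; this is an assumption, as the original pipeline may use a different backend. The predictions and references here are illustrative only:

```python
import evaluate

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

# Placeholder outputs and gold translations; real scores come from the validation set.
preds = ["The children play in the garden."]
refs = ["The children are playing in the garden."]

print(bleu.compute(predictions=preds, references=[[r] for r in refs])["score"])
print(meteor.compute(predictions=preds, references=refs)["meteor"])
print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])
```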
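Finally, a minimal translation helper over the saved checkpoint. How the direction is selected depends on how the bidirectional pairs were constructed during training (multilingual Marian checkpoints typically use a target-language prefix token), so treat the input below as illustrative:

```python
from transformers import MarianMTModel, MarianTokenizer

model_dir = "./fine-tuned-marianmt"
tokenizer = MarianTokenizer.from_pretrained(model_dir)
model = MarianMTModel.from_pretrained(model_dir)

def translate(text: str) -> str:
    # Tokenize, generate, and decode a single sentence.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    output_ids = model.generate(**inputs, max_length=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(translate("ئەمە تاقیکردنەوەیەکە."))  # Kurdish → English ("This is a test.")
```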
## Results

The fine-tuned model and tokenizer are saved to `./fine-tuned-marianmt`, along with evaluation results for BLEU, METEOR, and BERTScore.

## License

This project is released under the MIT License.