Spaces: rooftopcoder/NMT_demo

rooftopcoder committed on Mar 15

Commit ce4167f

1 Parent(s): 9793cb6

Add requirements

Files changed (3)

README.md +122 -14
app.py +210 -0
requirements.txt +9 -0

README.md CHANGED

@@ -1,14 +1,122 @@
----
-title: NMT Demo
-emoji: 🐢
-colorFrom: yellow
-colorTo: indigo
-sdk: gradio
-sdk_version: 5.21.0
-app_file: app.py
-pinned: false
-license: apache-2.0
-short_description: NMT demo for BITS Assignment Group 54
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Neural Machine Translation for English-Hindi
+This project implements a Neural Machine Translation (NMT) system for English-to-Hindi translation. It uses a MarianMT model fine-tuned on a 100k split of Samanantar and provides a user-friendly Gradio interface.
+![NMT UI Screenshot](assets/nmt_ui_screenshot.png)
+## Features
+- Unidirectional translation from English to Hindi
+- User-friendly web interface built with Gradio
+- Example translations included
+- Built on Helsinki-NLP's MarianMT model
+## Installation
+### Local Setup with Virtual Environment
+1. Clone the repository:
+```bash
+git clone https://github.com/yourusername/NLPA_Assignment_2_Group_54.git
+cd NLPA_Assignment_2_Group_54
+```
+2. Create and activate a virtual environment:
+```bash
+python -m venv venv
+source venv/bin/activate  # On Windows, use: venv\Scripts\activate
+```
+3. Install the required packages:
+```bash
+pip install -r requirements.txt
+```
+## Usage
+1. Make sure your virtual environment is activated
+2. Run the UI:
+```bash
+python nmt_ui.py
+```
+3. Open your browser and navigate to `http://localhost:7860`
+## Supported Language Pairs
+- English -> Hindi (using rooftopcoder/opus-mt-en-hi-samanantar-100k model)
+## Training the Model
+The `train.py` script is used to train the MarianMT model on the Samanantar dataset. The script performs the following steps:
+- Loads the Samanantar dataset (English-Hindi subset).
+- Splits the dataset into training and validation sets.
+- Tokenizes the dataset.
+- Sets up training arguments optimized for GPU.
+- Trains the model using the Hugging Face `Trainer` class.
+- Saves the trained model to the specified directory.
+- Uploads the trained model to the Hugging Face Hub.
+To train the model, run:
+```bash
+python train.py
+```
+## Testing the Model
+The `model_test.py` script is used to test the trained MarianMT model. The script performs the following steps:
+- Loads the trained model and tokenizer from the Hugging Face Hub.
+- Translates a sample input text from English to Hindi.
+- Prints the translated text.
+To test the model, run:
+```bash
+python model_test.py
+```
+## User Interface
+The `nmt_ui.py` script provides a Gradio-based user interface for translating text between English and Hindi. The interface includes options for transliteration of Romanized Hindi text to Devanagari script.
+To launch the interface, run:
+```bash
+python nmt_ui.py
+```
+## Model Information
+This project uses the MarianMT model from Hugging Face Transformers.
+### Notes:
+- The model supports English-Hindi translation.
+- Based on the Helsinki-NLP/opus-mt-en-hi model.
+- Optimized for English -> Hindi translation pairs.
+- Includes transliteration support for Romanized Hindi text.
+### Supported Features:
+- English -> Hindi translation.
+- Romanized Hindi -> Devanagari Hindi transliteration.
+### Examples of Transliteration:
+- "namaste" → "नमस्ते"
+- "aap kaise ho" → "आप कैसे हो"
+- "mera naam" → "मेरा नाम"
+## Project Structure
+```
+NLPA_Assignment_2_Group_54/
+├── nmt_ui.py        # Main application file with Gradio interface
+├── requirements.txt  # Python dependencies
+└── README.md        # Project documentation
+```
+## License
+MIT
+## Group Members
+- Shubhra J Gadhwala: 2023aa05750
+- Sandeep Kumar Yadav: 2023ab05047
+- Ravi Krishna Mayura: 2023ab05157
+- Satheesh Kumar G: 2023ab05041
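The README's "Testing the Model" section lists three steps (load the model and tokenizer from the Hub, translate a sample sentence, print the result) but `model_test.py` itself is not part of this commit. A minimal sketch of those steps might look like the following. The function names `translate_text` and `load_and_translate` are hypothetical, and `MODEL_ID` is the id given in the README, which differs from the paths in `app.py`, so treat it as an assumption.

```python
from typing import Any

# Model id as written in the README (assumption: app.py in the same commit
# points at a differently named repository).
MODEL_ID = "rooftopcoder/opus-mt-en-hi-samanantar-100k"


def translate_text(text: str, tokenizer: Any, model: Any) -> str:
    """Translate one sentence with an already-loaded tokenizer/model pair."""
    # Tokenize as a batch of one and ask for PyTorch tensors.
    batch = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    # generate() returns the token ids of the translated sentence.
    generated = model.generate(**batch)
    # Decode the ids back to text, dropping padding and end-of-sequence markers.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def load_and_translate(text: str, model_id: str = MODEL_ID) -> str:
    """Load the model from the Hub (or the local cache), then translate `text`."""
    # Imported lazily so translate_text can be exercised without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)
    return translate_text(text, tokenizer, model)
```

Calling `print(load_and_translate("Hello, how are you?"))` would cover the "print the translated text" step; the first call needs network access to download the weights.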

app.py ADDED

@@ -0,0 +1,210 @@

+import gradio as gr
+from huggingface_hub import HfFolder
+from transformers import MarianMTModel, MarianTokenizer
+from indic_transliteration import sanscript
+from indic_transliteration.sanscript import transliterate
+import torch
+# Global variables to store models and tokenizers
+models = {}
+tokenizers = {}
+token = HfFolder.get_token()
+# Model configurations
+MODEL_CONFIGS = {
+    "en-hi": {
+        "model_path": "rooftopcoder/opus-mt-en-hi-samanantar-finetuned",
+        "name": "English to Hindi"
+    },
+    "hi-en": {
+        "model_path": "rooftopcoder/opus-mt-hi-en-samanantar-finetuned",
+        "name": "Hindi to English"
+    },
+    "en-mr": {
+        "model_path": "rooftopcoder/opus-mt-en-mr-samanantar-finetuned",
+        "name": "English to Marathi"
+    },
+    "mr-en": {
+        "model_path": "rooftopcoder/opus-mt-mr-en-samanantar-finetuned",
+        "name": "Marathi to English"
+    }
+}
+# Map display names to language codes
+language_codes = {
+    "English": "en",
+    "Hindi": "hi",
+    "Marathi": "mr"
+}
+# Reverse dictionary for display purposes
+language_names = {v: k for k, v in language_codes.items()}
+def load_models():
+    try:
+        print("Loading models...")
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        print(f"Using device: {device}")
+        for direction, config in MODEL_CONFIGS.items():
+            print(f"Loading {config['name']} model...")
+            tokenizers[direction] = MarianTokenizer.from_pretrained(config["model_path"], token=token)
+            models[direction] = MarianMTModel.from_pretrained(config["model_path"], token=token).to(device)
+        print("All models loaded successfully!")
+        return True
+    except Exception as e:
+        print(f"Error loading models: {e}")
+        return False
+# Function to transliterate romanized (ITRANS) text to Devanagari
+def transliterate_text(text, from_scheme=sanscript.ITRANS, to_scheme=sanscript.DEVANAGARI):
+    """
+    Transliterates text from one script to another
+    Default is from ITRANS (Roman) to Devanagari (Hindi)
+    """
+    try:
+        return transliterate(text, from_scheme, to_scheme)
+    except Exception as e:
+        print(f"Transliteration error: {e}")
+        return text
+# Function to perform translation with MarianMT
+def translate(input_text, source_lang, target_lang):
+    """
+    Translates text using MarianMT models
+    """
+    direction = f"{source_lang}-{target_lang}"
+    if direction not in models or direction not in tokenizers:
+        return "Error: Unsupported language pair"
+    if not input_text.strip():
+        return "Error: Please enter some text to translate."
+    try:
+        device = next(models[direction].parameters()).device
+        tokens = tokenizers[direction](input_text, return_tensors="pt", padding=True, truncation=True)
+        tokens = {k: v.to(device) for k, v in tokens.items()}
+        translated = models[direction].generate(**tokens)
+        translated = translated.cpu()
+        output = tokenizers[direction].batch_decode(translated, skip_special_tokens=True)
+        return output[0]
+    except Exception as e:
+        print(f"Translation error: {e}")
+        return f"Error during translation: {str(e)}"
+# Helper function for handling the UI translation process
+def perform_translation(input_text, source_lang, target_lang):
+    """Wrapper function for the Gradio interface"""
+    source_code = language_codes[source_lang]
+    target_code = language_codes[target_lang]
+    # Handle transliteration for Hindi and Marathi
+    if source_code == "en" and target_code in ["hi", "mr"]:
+        common_indic_words = {
+            "hi": ["namaste", "dhanyavad", "kaise", "hai", "aap", "tum", "main"],
+            "mr": ["namaskar", "dhanyawad", "kase", "ahe", "tumhi", "mi"]
+        }
+        words = input_text.lower().split()
+        if any(word in common_indic_words.get(target_code, []) for word in words):
+            transliterated = transliterate_text(input_text)
+            if transliterated != input_text:
+                translation = translate(input_text, source_code, target_code)
+                return f"Transliterated: {transliterated}\n\nTranslated: {translation}"
+    return translate(input_text, source_code, target_code)
+# Create Gradio interface
+def create_interface():
+    with gr.Blocks(title="Neural Machine Translation - Indian Languages") as demo:
+        gr.Markdown("# Neural Machine Translation for Indian Languages")
+        gr.Markdown("Translate between English, Hindi, and Marathi using MarianMT models")
+        with gr.Row():
+            with gr.Column():
+                source_lang = gr.Dropdown(
+                    choices=list(language_codes.keys()),
+                    label="Source Language",
+                    value="English"
+                )
+                input_text = gr.Textbox(
+                    lines=5,
+                    placeholder="Enter text to translate...",
+                    label="Input Text"
+                )
+            with gr.Column():
+                target_lang = gr.Dropdown(
+                    choices=list(language_codes.keys()),
+                    label="Target Language",
+                    value="Hindi"
+                )
+                output_text = gr.Textbox(
+                    lines=5,
+                    label="Translated Text",
+                    placeholder="Translation will appear here..."
+                )
+        translate_btn = gr.Button("Translate", variant="primary")
+        transliterate_btn = gr.Button("Transliterate Only", variant="secondary")
+        # Event handlers
+        translate_btn.click(
+            fn=perform_translation,
+            inputs=[input_text, source_lang, target_lang],
+            outputs=[output_text],
+            api_name="translate"
+        )
+        # Direct transliteration handler
+        def direct_transliterate(text):
+            if not text.strip():
+                return "Please enter text to transliterate"
+            return transliterate_text(text)
+        transliterate_btn.click(
+            fn=direct_transliterate,
+            inputs=[input_text],
+            outputs=[output_text],
+            api_name="transliterate"
+        )
+        # Examples for all language pairs
+        gr.Examples(
+            examples=[
+                ["Hello, how are you?", "English", "Hindi"],
+                ["नमस्ते, आप कैसे हैं?", "Hindi", "English"],
+                ["Hello, how are you?", "English", "Marathi"],
+                ["नमस्कार, तुम्ही कसे आहात?", "Marathi", "English"],
+            ],
+            inputs=[input_text, source_lang, target_lang],
+            fn=perform_translation,
+            outputs=output_text,
+            cache_examples=True
+        )
+        gr.Markdown("""
+        ## Model Information
+        This demo uses fine-tuned MarianMT models for translation between:
+        - English ↔️ Hindi
+        - English ↔️ Marathi
+        ### Features:
+        - Bidirectional translation support
+        - Transliteration support for romanized Indic text
+        - Optimized models for each language pair
+        """)
+    return demo
+# Launch the interface
+if __name__ == "__main__":
+    # Load all models before launching the interface
+    if load_models():
+        demo = create_interface()
+        demo.launch(share=False)
+    else:
+        print("Failed to load models. Please check the model paths and try again.")
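In `perform_translation`, the check for romanized input splits the lowercased text on whitespace and compares each word against a short hint list. Because punctuation stays attached to the words, an input such as "namaste, aap kaise ho?" only matches through "aap", and "namaste," alone would not match at all. One possible variant that strips surrounding punctuation first is sketched below. The helper name `looks_romanized` and the constant `COMMON_INDIC_WORDS` are hypothetical; the word lists are copied from `app.py`.

```python
import string

# Hint words copied from perform_translation in app.py.
COMMON_INDIC_WORDS = {
    "hi": {"namaste", "dhanyavad", "kaise", "hai", "aap", "tum", "main"},
    "mr": {"namaskar", "dhanyawad", "kase", "ahe", "tumhi", "mi"},
}


def looks_romanized(text: str, target_code: str) -> bool:
    """Return True if any word of `text` is a known romanized hint word."""
    hints = COMMON_INDIC_WORDS.get(target_code, set())
    # Strip surrounding ASCII punctuation so "namaste," still matches "namaste".
    words = (word.strip(string.punctuation) for word in text.lower().split())
    return any(word in hints for word in words)
```

This remains a heuristic: "main" is in the Hindi list, so an ordinary English sentence containing that word would also be flagged.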

requirements.txt ADDED

@@ -0,0 +1,9 @@

+gradio
+transformers[sentencepiece]
+torch
+sacremoses
+indic-transliteration
+datasets
+accelerate>=0.26.0
+evaluate
+sacrebleu
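After `pip install -r requirements.txt`, it can be useful to confirm which of these distributions are actually present in the active environment. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and the `REQUIRED` list is the file above with the `[sentencepiece]` extra and the `>=0.26.0` pin removed.

```python
from importlib import metadata

# Distribution names from requirements.txt, without extras or version pins.
REQUIRED = [
    "gradio",
    "transformers",
    "torch",
    "sacremoses",
    "indic-transliteration",
    "datasets",
    "accelerate",
    "evaluate",
    "sacrebleu",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

For example, `[n for n, v in installed_versions(REQUIRED).items() if v is None]` lists the packages that still need installing.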