English-to-Hindi Translation Model
Table of Contents
- Introduction
- Model Overview
- Installation
- Quick Start
- Detailed Usage
- Customization and Training
- Contributing
- Citation
- License
- Contact
- Acknowledgements
Introduction
This repository houses an English-to-Hindi translation model built on Google's mT5-small transformer. It is designed for accurate, context-aware translation, with particular attention to long inputs.
Model Overview
- Base Model: Fine-tuned from Google's mT5-small
- Language Pair: English to Hindi
- Training Data: Large-scale, diverse dataset ensuring nuanced translation
- Features: Customized tokenizer, handling of long sentences, and attention mechanisms for context preservation
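One common way to handle long sentences within a fixed length budget is to split the input on sentence boundaries and pack sentences into chunks that fit the model's input window, translating each chunk separately. The sketch below is illustrative only, not this repository's actual implementation; the `count_tokens` stand-in (whitespace splitting) is an assumption — with the real tokenizer you would use `len(tokenizer.encode(text))` instead.

```python
import re

def count_tokens(text):
    # Stand-in token counter (assumption): whitespace split.
    # With the MT5Tokenizer, use len(tokenizer.encode(text)) instead.
    return len(text.split())

def chunk_text(text, max_tokens=200):
    """Pack sentences into chunks that each stay within max_tokens."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            # Adding this sentence would overflow the budget: flush the chunk.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks

long_text = "This is one sentence. Here is another! And a third one? Plus a fourth."
print(chunk_text(long_text, max_tokens=8))
```

Each chunk can then be passed through the model independently and the Hindi outputs concatenated; a single sentence longer than the budget still comes through as one oversized chunk and will be truncated by the tokenizer.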
Installation
pip install transformers tensorflow pandas datasets tqdm
git lfs install
git clone https://huggingface.co/nashit93/EnglishToHindiTranslationLongContext
cd EnglishToHindiTranslationLongContext
Quick Start
#!/usr/bin/env python
# coding: utf-8
import tensorflow as tf
import os
from transformers import TFMT5ForConditionalGeneration, MT5Tokenizer
import pandas as pd
from datasets import Dataset
from tqdm import tqdm
# Set up tokenizer
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
# Function to load the model from the latest checkpoint
def load_model(checkpoint_dir):
    latest_checkpoint = None
    if os.path.exists(checkpoint_dir):
        checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)]
        checkpoints = [d for d in checkpoints if os.path.isdir(d)]
        if checkpoints:
            latest_checkpoint = max(checkpoints, key=os.path.getmtime)
    if latest_checkpoint:
        print("Loading model from:", latest_checkpoint)
        return TFMT5ForConditionalGeneration.from_pretrained(latest_checkpoint)
    else:
        print("No checkpoint found, loading default model")
        return TFMT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Load the model
model = load_model('model_checkpoints-small-on-1mill-dp')

# Function to prepare text for prediction
def prepare_text(text, tokenizer, max_length=200):
    inputs = tokenizer.encode(text, return_tensors="tf", max_length=max_length, truncation=True)
    return inputs

# Function to generate a prediction with adjustable settings
def generate_prediction(text, model, tokenizer, max_length=2000, num_beams=5):
    input_ids = prepare_text(text, tokenizer, max_length=max_length)
    output_ids = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example prediction
sample_text = "Hi, how are you?"
prediction = generate_prediction(sample_text, model, tokenizer)
print("Prediction:", prediction)
Detailed Usage
Explore advanced functionalities, including batch processing, handling different text lengths, and model fine-tuning.
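As a hedged sketch of batch processing, the helper below simply groups inputs into fixed-size batches; the commented usage shows how such batches would feed the `tokenizer` and `model.generate` from the Quick Start. The batch size of 8 is an arbitrary assumption, and `english_sentences` is a hypothetical list of input strings.

```python
def batched(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Sketch of usage with the Quick Start model (not run here):
# translations = []
# for batch in batched(english_sentences, batch_size=8):
#     enc = tokenizer(batch, return_tensors="tf", padding=True,
#                     truncation=True, max_length=200)
#     out = model.generate(enc["input_ids"], max_length=200, num_beams=5)
#     translations.extend(tokenizer.batch_decode(out, skip_special_tokens=True))

print(list(batched([1, 2, 3, 4, 5], 2)))
```

Batching amortizes the per-call overhead of `generate` across many inputs, which matters once you translate more than a handful of sentences.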
Contributing
We warmly welcome contributions from the community to improve and enhance the English-to-Hindi translation model. If you're looking to contribute, here's how you can get started:
How to Contribute
Reporting Bugs and Issues
- Identify and Report: If you find a bug or an issue, report it by creating a new issue on the GitHub repository.
- Details: Provide as much detail as possible about the bug/issue, including steps to reproduce it, error messages, and the environment in which it occurred.
Suggesting Enhancements
- Feature Requests: Ideas for new features or enhancements to existing ones are always appreciated.
- Use Cases: Describe the use case for your feature, helping us understand the context and benefit.
Code Contributions
- Fork the Repository: Start by forking the repository to your GitHub account.
- Clone Locally: Clone your forked repository to your local machine.
- Create a Branch: Create a branch for your feature or fix. Use a clear branch name that describes the purpose of your contribution.
- Commit Changes: Make your changes in your branch and commit them with clear, detailed messages.
- Write Tests: If applicable, write tests to ensure the robustness and reliability of your contribution.
- Lint and Test Your Code: Make sure your code adheres to the project's style guidelines and passes all tests.
- Pull Request: Submit a pull request to the main repository for review. Ensure you describe your changes and their impact.
Documentation
- Improvements: Improvements to documentation, including README, code comments, and docstrings, are always valuable.
- Clarity and Accuracy: Ensure the documentation is clear, accurate, and helps users understand and use the model effectively.
Review Process
After submitting your pull request:
- Code Review: Maintainers will review your code and may request changes or provide feedback.
- Discussion: Engage in discussions and revisions until your contribution meets the necessary standards.
Community Guidelines
- Respect and Inclusivity: All contributions should respect community guidelines of inclusivity and mutual respect.
- Code of Conduct: Adhere to the project's code of conduct to ensure a welcoming and open environment for everyone.
Staying Updated
- Syncing Fork: Keep your fork synced with the main repository to stay updated with changes and avoid merge conflicts.
Need Help?
If you need help or have questions about contributing, feel free to reach out at nashit93@gmail.com.
Your contributions, big or small, make a significant impact on the project and the community. We look forward to your ideas, bug reports, feature requests, and code contributions! 🌟
Citation
If you use this model in your research, please cite this repository.
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contact
Reach out to the maintainers at nashit93@gmail.com for any questions or collaboration inquiries.
Acknowledgements
- The Hugging Face and Google teams for their transformer models
- Dataset providers and contributors to the open-source NLP community