Hindi Tokenizer

A specialized tokenizer for Hindi language processing with a vocabulary size of 150,000 tokens.

Overview

This tokenizer is designed specifically for Hindi language processing tasks. With a vocabulary size of 150,000 tokens, it provides comprehensive coverage of Hindi words, subwords, and characters to effectively handle the unique characteristics of the Hindi language.

Features

  • Large Vocabulary: 150,000 tokens covering common Hindi words, subwords, and characters
  • Specialized for Hindi: Optimized for Hindi's morphological richness and linguistic structure
  • Unicode Support: Full support for Devanagari script and Unicode characters
  • Efficient Processing: Optimized for speed and memory usage with Hindi text
  • Hugging Face Integration: Compatible with the Transformers library ecosystem

Installation

pip install hindi-tokenizer
# or
pip install git+https://huggingface.co/YOUR_USERNAME/hindi-tokenizer

Usage

Basic Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/hindi-tokenizer")

# Example Hindi text
text = "नमस्ते, यह एक हिंदी पाठ का उदाहरण है।"

# Tokenize
tokens = tokenizer(text)
print(tokens)

# Decode
decoded = tokenizer.decode(tokens["input_ids"])
print(decoded)
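The repository ID above is a placeholder, so the snippet cannot be run as-is. To illustrate the id-to-token mapping that tokenization performs, here is a self-contained sketch using the Hugging Face `tokenizers` library with a hypothetical toy word-level vocabulary (the words and ids are invented for the example; the real tokenizer uses BPE over 150,000 entries):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary; the real tokenizer has 150,000 entries.
vocab = {"[UNK]": 0, "नमस्ते": 1, "यह": 2, "एक": 3, "उदाहरण": 4, "है": 5}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Encoding maps text to tokens and integer ids.
enc = tok.encode("यह एक उदाहरण है")
print(enc.tokens)  # ['यह', 'एक', 'उदाहरण', 'है']
print(enc.ids)     # [2, 3, 4, 5]

# Decoding maps the ids back to text.
print(tok.decode(enc.ids))
```

The same encode/decode round trip applies to the real tokenizer, except that BPE may split a rare word into several subword pieces rather than mapping it to `[UNK]`.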

Advanced Usage

# Batch processing
texts = [
    "भारत एक विशाल देश है।",
    "हिंदी भारत की प्रमुख भाषाओं में से एक है।"
]

# Tokenize with padding and truncation
tokens = tokenizer(texts, padding=True, truncation=True, max_length=128)
print(tokens)
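Padding and truncation ensure every sequence in a batch has the same length: short sequences are extended with a pad token, long ones are cut at max_length, and an attention mask marks which positions hold real tokens. The mechanics can be sketched in a self-contained way with the `tokenizers` library and a hypothetical toy vocabulary (the real tokenizer's vocabulary and merges differ):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary standing in for the real 150,000-token one.
vocab = {"[UNK]": 0, "[PAD]": 1, "यह": 2, "एक": 3, "उदाहरण": 4, "है": 5}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Mirrors padding=True, truncation=True, max_length=... from the snippet above.
tok.enable_padding(pad_id=1, pad_token="[PAD]")
tok.enable_truncation(max_length=4)

encs = tok.encode_batch(["यह एक उदाहरण है", "यह है"])
for e in encs:
    print(e.ids, e.attention_mask)
# The shorter sequence is padded to length 4; its attention mask
# is 0 at the padded positions so the model ignores them.
```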

Training Details

This tokenizer was trained on a large corpus of Hindi text using the byte-pair encoding (BPE) algorithm.

Model Details

  • Vocabulary Size: 150,000 tokens
  • Tokenization Algorithm: Byte-Pair Encoding (BPE)
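The exact training corpus and hyperparameters are not documented here, but a BPE tokenizer with these characteristics could be trained along the following lines with the `tokenizers` library. The two-sentence corpus, special tokens, and whitespace pre-tokenizer are assumptions for the sketch; a real run would stream a large Hindi corpus, and only then would the vocabulary approach the 150,000-token target:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size matches the card; with this tiny stand-in corpus the
# learned vocabulary will of course be far smaller than 150,000.
trainer = BpeTrainer(vocab_size=150_000, special_tokens=["[UNK]", "[PAD]"])
corpus = [
    "भारत एक विशाल देश है।",
    "हिंदी भारत की प्रमुख भाषाओं में से एक है।",
]
tokenizer.train_from_iterator(corpus, trainer)

enc = tokenizer.encode("भारत एक देश है।")
print(enc.tokens)

# Persist the trained tokenizer for later loading.
tokenizer.save("hindi-bpe-tokenizer.json")
```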

Citation

If you use this tokenizer in your research, please cite:

@misc{hindi-tokenizer,
  author = {Your Name},
  title = {Hindi Tokenizer: A Large Vocabulary Tokenizer for Hindi NLP},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/hindi-tokenizer}}
}

Contact

akshaw.ak4@gmail.com
