
Hindi Tokenizer

A specialized tokenizer for Hindi language processing with a vocabulary size of 150,000 tokens.

Overview

This tokenizer is designed specifically for Hindi language processing tasks. With a vocabulary size of 150,000 tokens, it provides comprehensive coverage of Hindi words, subwords, and characters to effectively handle the unique characteristics of the Hindi language.

Features

  • Large Vocabulary: 150,000 tokens covering common Hindi words, subwords, and characters
  • Specialized for Hindi: Optimized for Hindi's morphological richness and linguistic structure
  • Unicode Support: Full support for Devanagari script and Unicode characters
  • Efficient Processing: Optimized for speed and memory usage with Hindi text
  • Hugging Face Integration: Compatible with the Transformers library ecosystem
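
The Devanagari coverage mentioned above can be sanity-checked independently of the tokenizer with Python's built-in `unicodedata` module (a standalone illustration; Devanagari occupies the Unicode block U+0900–U+097F):

```python
import unicodedata

text = "नमस्ते"

# Every character of the sample word falls inside the Devanagari block
for ch in text:
    assert 0x0900 <= ord(ch) <= 0x097F
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Note that Hindi words mix base letters with combining signs (e.g. the virama `्` and vowel sign `े`), which is one reason a subword vocabulary trained on Devanagari text is preferable to a generic byte-level one.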

Installation

pip install hindi-tokenizer
# or
pip install git+https://huggingface.co/YOUR_USERNAME/hindi-tokenizer

Usage

Basic Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/hindi-tokenizer")

# Example Hindi text
text = "नमस्ते, यह एक हिंदी पाठ का उदाहरण है।"

# Tokenize
tokens = tokenizer(text)
print(tokens)

# Decode
decoded = tokenizer.decode(tokens["input_ids"])
print(decoded)

Advanced Usage

# Batch processing
texts = [
    "भारत एक विशाल देश है।",
    "हिंदी भारत की प्रमुख भाषाओं में से एक है।"
]

# Tokenize with padding and truncation
tokens = tokenizer(texts, padding=True, truncation=True, max_length=128)
print(tokens)
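
What `padding=True, truncation=True` does to a batch can be sketched in plain Python (a toy illustration with made-up token IDs, not the tokenizer's real output):

```python
def pad_and_truncate(batch_ids, pad_id=0, max_length=6):
    """Truncate each sequence to max_length, then pad shorter ones
    up to the longest remaining sequence in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (target - len(ids)) for ids in truncated]

batch = [
    [101, 7, 8, 102],                      # short sequence: gets padded
    [101, 7, 8, 9, 10, 11, 12, 102],       # long sequence: gets truncated
]
print(pad_and_truncate(batch))
```

The real tokenizer also returns an attention mask marking which positions are padding, so that models can ignore the pad tokens.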

Training Details

This tokenizer was trained on a large corpus of Hindi text using byte-pair encoding (BPE).

Model Details

  • Vocabulary Size: 150,000 tokens
  • Tokenization Algorithm: Byte-Pair Encoding (BPE)
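
The core of BPE training can be illustrated with a minimal pure-Python sketch (a toy merge step, not the actual training code): repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new vocabulary token.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (word -> frequency map)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    out = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out[tuple(new_word)] = freq
    return out

# Toy corpus: each word is split into characters, mapped to its frequency
corpus = {tuple("नमस्ते"): 5, tuple("नमक"): 3}
pair = most_frequent_pair(corpus)   # ('न', 'म') appears 8 times in total
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```

Real BPE training repeats this merge loop until the vocabulary reaches the target size (150,000 here), which is why frequent Hindi words end up as single tokens while rare words decompose into subwords.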

Citation

If you use this tokenizer in your research, please cite:

@misc{hindi-tokenizer,
  author = {Your Name},
  title = {Hindi Tokenizer: A Large Vocabulary Tokenizer for Hindi NLP},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/hindi-tokenizer}}
}

Contact

akshaw.ak4@gmail.com
