# Hindi Tokenizer
A specialized tokenizer for Hindi language processing with a vocabulary size of 150,000 tokens.
## Overview

This tokenizer is designed specifically for Hindi language processing tasks. Its 150,000-token vocabulary spans Hindi words, subwords, and individual characters, allowing it to handle the language's distinctive characteristics effectively.
## Features

- **Large Vocabulary**: 150,000 tokens covering common Hindi words, subwords, and characters
- **Specialized for Hindi**: Optimized for Hindi's morphological richness and linguistic structure
- **Unicode Support**: Full support for the Devanagari script and Unicode characters
- **Efficient Processing**: Optimized for speed and memory usage with Hindi text
- **Hugging Face Integration**: Compatible with the Transformers library ecosystem
## Installation

```bash
pip install hindi-tokenizer
# or install directly from the Hugging Face repository
pip install git+https://huggingface.co/YOUR_USERNAME/hindi-tokenizer
```
## Usage

### Basic Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/hindi-tokenizer")

# Example Hindi text ("Hello, this is an example of a Hindi passage.")
text = "नमस्ते, यह एक हिंदी पाठ का उदाहरण है।"

# Tokenize
tokens = tokenizer(text)
print(tokens)

# Decode the IDs back into text
decoded = tokenizer.decode(tokens["input_ids"])
print(decoded)
```
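To inspect how a sentence is split into subword units, you can look at the string tokens directly. Both methods below are part of the standard Transformers tokenizer API; the exact subwords you see will depend on this tokenizer's learned vocabulary.

```python
# Subword pieces rather than raw IDs
print(tokenizer.tokenize(text))

# Map encoded IDs back to their token strings
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
```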
### Advanced Usage

```python
# Batch processing. The sentences read: "India is a vast country." and
# "Hindi is one of India's major languages."
texts = [
    "भारत एक विशाल देश है।",
    "हिंदी भारत की प्रमुख भाषाओं में से एक है।"
]

# Tokenize with padding and truncation
tokens = tokenizer(texts, padding=True, truncation=True, max_length=128)
print(tokens)
```
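When the encoded batch will be fed to a model, you would normally request framework tensors instead of Python lists. This is standard Transformers behavior; PyTorch is used here purely as an example.

```python
# Return PyTorch tensors ready for a model's forward pass
batch = tokenizer(
    texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
)
print(batch["input_ids"].shape)  # (batch_size, sequence_length)
```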
## Training Details

This tokenizer was trained on a large corpus of Hindi text using byte-pair encoding (BPE).
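For reference, below is a minimal sketch of how a BPE tokenizer with this vocabulary size could be trained with the Hugging Face `tokenizers` library. The corpus file name and the special-token set are illustrative assumptions, not documented details of this tokenizer's actual training run.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model; the unknown token is an assumed choice
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Split on whitespace and punctuation before learning BPE merges
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Target the 150,000-token vocabulary described above
trainer = trainers.BpeTrainer(
    vocab_size=150_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],  # assumed
)

# "hindi_corpus.txt" is a placeholder for the undocumented training corpus
tokenizer.train(files=["hindi_corpus.txt"], trainer=trainer)
tokenizer.save("hindi-tokenizer.json")
```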
## Model Details

- **Vocabulary Size**: 150,000 tokens
- **Tokenization Algorithm**: Byte-pair encoding (BPE)
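As a quick sanity check, the vocabulary size can be read off the loaded tokenizer. `vocab_size` is a standard attribute on Transformers tokenizers; note that it excludes tokens added after training, which `len(tokenizer)` would include.

```python
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/hindi-tokenizer")
print(tokenizer.vocab_size)  # expected: 150000
```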
## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{hindi-tokenizer,
  author       = {Your Name},
  title        = {Hindi Tokenizer: A Large Vocabulary Tokenizer for Hindi NLP},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/hindi-tokenizer}}
}
```
## Contact