krinal/BertWordPieceTokenizer-hi

Model Details

BertWordPieceTokenizer

A WordPiece tokenizer for the Hindi language.

Usage

from transformers import AutoTokenizer

hi_tokenizer = AutoTokenizer.from_pretrained('krinal/BertWordPieceTokenizer-hi')

hi_str = "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है"

# Encode the text into a list of token IDs
encoded_str = hi_tokenizer.encode(hi_str)

# Decode the token IDs back into text
decoded_str = hi_tokenizer.decode(encoded_str)
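
To give a rough idea of what happens to each word during encoding, here is a minimal pure-Python sketch of the greedy longest-match-first splitting that WordPiece tokenizers generally use. It is not this tokenizer's actual implementation: the function name `wordpiece_tokenize`, the `toy_vocab` set, and the `##` and `[UNK]` conventions are assumptions chosen for illustration, and the real vocabulary ships with the tokenizer files in the model repository.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched at this position: treat the whole word as unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, invented for illustration only (hypothetical entries).
toy_vocab = {"कित", "##ना", "प्यार", "##ा", "है"}

for word in ["कितना", "प्यारा", "है", "शीतल"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, a word covered by the vocabulary is split into a leading piece plus `##`-prefixed continuation pieces, while a word with no matching pieces falls back to the unknown token. The real tokenizer additionally handles normalization, pre-tokenization on whitespace and punctuation, and the mapping from tokens to IDs.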

Language

hi

Training

For training details, see Train BertWordPieceTokenizer.

Dataset

Trained on BHAAV, a Hindi text corpus for emotion analysis
Dataset source: Bhaav
Corpus size: 20,304 sentences

Citation

@article{kumar2019bhaav,
  title={BHAAV-A Text Corpus for Emotion Analysis from Hindi Stories},
  author={Kumar, Yaman and Mahata, Debanjan and Aggarwal, Sagar and Chugh, Anmol and Maheshwari, Rajat and Shah, Rajiv Ratn},
  journal={arXiv preprint arXiv:1910.04073},
  year={2019}
}
