
Odia SentencePiece Tokenizer Model

This repository hosts a SentencePiece tokenizer model for the Odia language, created to support efficient tokenization of Odia text in NLP applications. The tokenizer was trained on a diverse corpus of Odia text for broad language coverage and accurate tokenization.

Model Details

  • Model Prefix: odia_tokenizers_test
  • Model Type: BPE (Byte-Pair Encoding)
  • Vocabulary Size: 50,000 tokens

File Structure

  • odia_tokenizers_test.model: SentencePiece tokenizer model file.
  • odia_tokenizers_test.vocab: Vocabulary file containing all token mappings.
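
Both files can be fetched directly from this repository. The .vocab file is human-readable; assuming it follows SentencePiece's usual tab-separated piece/score layout, its first entries can be inspected with the sketch below (using hf_hub_download from the huggingface_hub package, as in the usage section that follows):

from huggingface_hub import hf_hub_download

# Download the vocabulary file from this repository
vocab_path = hf_hub_download(
    repo_id="shantipriya/OdiaTokenizer",
    filename="odia_tokenizers_test.vocab",
)

# Each line is expected to be "<piece>\t<score>"; print the first ten entries
with open(vocab_path, encoding="utf-8") as f:
    for line in list(f)[:10]:
        piece, score = line.rstrip("\n").split("\t")
        print(piece, score)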

Installation and Usage

To load and use this tokenizer model, make sure the sentencepiece and huggingface_hub packages are installed:

pip install sentencepiece huggingface_hub

import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the model file from Hugging Face
model_path = hf_hub_download(repo_id="shantipriya/OdiaTokenizer", filename="odia_tokenizers_test.model")

# Load the tokenizer model
sp = spm.SentencePieceProcessor()
sp.load(model_path)

# Sample text for tokenization
text = "ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।"

# Tokenize the text into pieces (subwords or tokens)
tokens = sp.encode_as_pieces(text)

# Tokenize the text into token IDs (integer representations of the tokens)
token_ids = sp.encode_as_ids(text)

# Print the tokenized output
print("Tokens:", tokens)
print("Token IDs:", token_ids)

Sample Tokenization

The model was trained on a diverse corpus of Odia text. Here is an example of how it tokenizes an Odia sentence:

Input: ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।

Tokens: ['▁ଦୀପାବଳି', '▁ଏକ', '▁ଭାରତୀୟ', '▁ପର୍ବ', '▁।']

Token IDs: [1234, 5678, 91011, 121314, 1516] (example IDs)
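
The IDs shown above are placeholders; the actual values depend on the trained vocabulary. The mapping for any individual piece can be checked directly (a short sketch, assuming sp is the processor loaded in the usage example):

# Look up the ID assigned to a piece and map it back again
piece_id = sp.piece_to_id('▁ଦୀପାବଳି')
print(piece_id, sp.id_to_piece(piece_id))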

Vocabulary Coverage

The 50,000-token vocabulary balances memory efficiency with language coverage, making the tokenizer suitable for applications ranging from language modeling to text classification.

Vocabulary Statistics

  • Total Tokens: 50,000
  • Average Token Length: 6.46
  • Max Token Length: 16
  • Min Token Length: 1
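
These figures can be recomputed from the model itself by iterating over its vocabulary (a minimal sketch, assuming sp is the processor loaded in the usage example and that token length is counted in Unicode characters, including the leading '▁' word marker):

# Collect every piece in the vocabulary and summarize its length distribution
pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
lengths = [len(p) for p in pieces]

print("Total tokens:", len(pieces))
print("Average token length:", sum(lengths) / len(lengths))
print("Max token length:", max(lengths))
print("Min token length:", min(lengths))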

Training and Configuration Details

The tokenizer was trained using the SentencePiece library with the following configurations:

  • Character Coverage: 99.995%
  • Input Sentence Size: 200 million sentences
  • Maximum Sentence Length: 4192 bytes (SentencePiece max_sentence_length)

Model Training Parameters:

  • shuffle_input_sentence=True
  • split_by_unicode_script=True
  • split_by_whitespace=True
  • byte_fallback=True
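
Putting the configuration above together, a run of this shape could be reproduced with SentencePieceTrainer. This is a hedged sketch: the input file name odia_corpus.txt is a placeholder, and the original corpus and any additional flags are not documented in this card.

import sentencepiece as spm

# Train a BPE model with the configuration listed above.
# "odia_corpus.txt" is a hypothetical placeholder for the Odia training corpus.
spm.SentencePieceTrainer.train(
    input="odia_corpus.txt",
    model_prefix="odia_tokenizers_test",
    model_type="bpe",
    vocab_size=50000,
    character_coverage=0.99995,
    input_sentence_size=200000000,
    max_sentence_length=4192,
    shuffle_input_sentence=True,
    split_by_unicode_script=True,
    split_by_whitespace=True,
    byte_fallback=True,
)

Training writes odia_tokenizers_test.model and odia_tokenizers_test.vocab, matching the file structure above.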

Intended Use

This model is intended for use in various NLP applications involving the Odia language, such as:

  • Language Modeling
  • Text Classification
  • Named Entity Recognition (NER)
  • Translation tasks involving Odia

License

This model is released under the CC BY-NC-SA 4.0 license.

Acknowledgments

This model was developed as part of a project to support low-resource language processing. Thanks to OdiaGenAI for providing the initial training data, which made this model possible.

Contributors

  • Shantipriya Parida
  • Sambit Sekhar
  • Sahil Khan