Odia SentencePiece Tokenizer Model
This repository hosts the SentencePiece tokenizer model for the Odia language, created to support efficient tokenization of Odia text in NLP applications. The tokenizer was trained on a diverse corpus of Odia text to provide broad language coverage and accurate tokenization.
Model Details
- Model Prefix: odia_tokenizers_test
- Model Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 50,000 tokens
File Structure
- odia_tokenizers_test.model: SentencePiece tokenizer model file.
- odia_tokenizers_test.vocab: Vocabulary file containing all token mappings.
Installation and Usage
To load and use this tokenizer model, make sure you have the sentencepiece package installed:

```
pip install sentencepiece
```
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the model file from Hugging Face
model_path = hf_hub_download(repo_id="shantipriya/OdiaTokenizer", filename="odia_tokenizers_test.model")

# Load the tokenizer model
sp = spm.SentencePieceProcessor()
sp.load(model_path)

# Sample text for tokenization
text = "ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।"

# Tokenize the text into pieces (subwords or tokens)
tokens = sp.encode_as_pieces(text)

# Tokenize the text into token IDs (integer representations of the tokens)
token_ids = sp.encode_as_ids(text)

# Print the tokenized output
print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
Sample Tokenization
The model was trained on a diverse corpus of Odia text to give high-quality tokenization. Here is an example of how it tokenizes an Odia sentence:
Input: ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।
Tokens: ['▁ଦୀପାବଳି', '▁ଏକ', '▁ଭାରତୀୟ', '▁ପର୍ବ', '▁।']
Token IDs: [1234, 5678, 91011, 121314, 1516] (illustrative placeholder IDs; actual values fall within the 50,000-token vocabulary)
Vocabulary Coverage
The vocabulary size was chosen to balance memory efficiency with language coverage, making it suitable for applications ranging from language modeling to text classification.
Vocabulary Statistics
- Total Tokens: 50,000
- Average Token Length: 6.46
- Max Token Length: 16
- Min Token Length: 1
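These statistics can be recomputed from the released model file. A minimal sketch, assuming the sp processor from the usage example above (piece lengths here include the ▁ word-boundary marker):

```python
# Measure the length of every piece in the vocabulary
piece_lengths = [len(sp.id_to_piece(i)) for i in range(sp.get_piece_size())]

print("Total Tokens:", sp.get_piece_size())
print("Average Token Length:", round(sum(piece_lengths) / len(piece_lengths), 2))
print("Max Token Length:", max(piece_lengths))
print("Min Token Length:", min(piece_lengths))
```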
Training and Configuration Details
The tokenizer was trained using the SentencePiece library with the following configurations:
- Character Coverage: 99.995%
- Input Sentence Size: 200 million sentences
- Maximum Sentence Length: 4192 (SentencePiece max_sentence_length, measured in bytes)
Model Training Parameters:
- shuffle_input_sentence=True
- split_by_unicode_script=True
- split_by_whitespace=True
- byte_fallback=True
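For reference, the configuration above corresponds to a SentencePiece training call like the following. This is a hedged sketch rather than the exact command used for this model; odia_corpus.txt is a placeholder for the actual training corpus, which is not distributed here:

```python
import sentencepiece as spm

# Sketch of a training run matching the configuration listed above.
# "odia_corpus.txt" is a placeholder for the real training corpus.
spm.SentencePieceTrainer.train(
    input="odia_corpus.txt",
    model_prefix="odia_tokenizers_test",
    model_type="bpe",
    vocab_size=50000,
    character_coverage=0.99995,
    input_sentence_size=200000000,
    max_sentence_length=4192,
    shuffle_input_sentence=True,
    split_by_unicode_script=True,
    split_by_whitespace=True,
    byte_fallback=True,
)
```

A run like this produces odia_tokenizers_test.model and odia_tokenizers_test.vocab, the two files listed under File Structure.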
Intended Use
This model is intended for use in various NLP applications involving the Odia language, such as:
- Language Modeling
- Text Classification
- Named Entity Recognition (NER)
- Translation tasks involving Odia
License
This model is released under the cc-by-nc-sa-4.0 License.
Acknowledgments
This model was developed as part of a project to support low-resource language processing. Thanks to OdiaGenAI for providing the initial training data, which made this model possible.
Contributors
- Shantipriya Parida
- Sambit Sekhar
- Sahil Khan