Assamese Tokenizer (50K Vocabulary)

Model Details

This repository contains a custom tokenizer for the Assamese language with a vocabulary of 50,000 tokens. It was trained on the Assamese subset of the CC-100 multilingual dataset and can be used for Natural Language Processing (NLP) tasks that involve Assamese text.

Repository Details

  • Repository Name: tamang0000/assamese-tokenizer-50k
  • Tokenizer Vocabulary Size: 50,000 tokens
  • Training Dataset: CC-100 Multilingual Dataset (Assamese Language Subset)
  • Model Type: Tokenizer
  • Framework: Hugging Face Transformers
  • License: MIT License

Tokenizer Usage

You can load and use this tokenizer in your projects with the Hugging Face transformers library.
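
A minimal loading sketch follows. It assumes the repository ships files that `AutoTokenizer` can resolve (for example a `tokenizer.json`), which this card does not confirm. The helper name `roundtrip` is hypothetical and exists only for illustration.

```python
from transformers import AutoTokenizer

# Repository name as given in this model card.
REPO_ID = "tamang0000/assamese-tokenizer-50k"


def roundtrip(text: str, repo_id: str = REPO_ID):
    """Tokenize `text` and decode it back (hypothetical helper for illustration)."""
    # Downloads the tokenizer files from the Hub on first use, then uses the local cache.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # Encode the text into vocabulary ids.
    input_ids = tokenizer(text)["input_ids"]
    # Map ids back to token strings to inspect how the text was split.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # Decode the ids back to text, dropping any special tokens the tokenizer added.
    decoded = tokenizer.decode(input_ids, skip_special_tokens=True)
    return tokens, decoded
```

Calling `roundtrip("অসমীয়া ভাষা")` returns the token strings and the decoded text. Comparing the decoded text with the input is a quick way to see what the tokenizer's normalization changes.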

Training Details

  • Dataset: The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
  • Vocabulary Size: 50,000 tokens.
  • Normalization: Applies steps such as lowercasing and stripping accents.
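
The normalization described above could be expressed with the Hugging Face `tokenizers` library roughly as follows. This is a sketch under assumptions, not the configuration actually used for this tokenizer: the card does not state the exact normalizer sequence, and the `NFD` step is an assumption added because accent stripping operates on decomposed characters.

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Assumed pipeline: Unicode-decompose, lowercase, then drop combining marks.
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Latin example: accents are removed and letters are lowercased.
latin = normalizer.normalize_str("Héllo Wörld")
print(latin)  # hello world

# Inspect the effect on Assamese text before relying on it.
assamese = normalizer.normalize_str("অসমীয়া ভাষা")
print(assamese)
```

Assamese script has no letter case, so lowercasing mainly affects embedded Latin text. Printing the normalized form of your own text shows what accent stripping changes.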