rishiraj/gemma-2-9b-bn

This repository extends the google/gemma-2-9b tokenizer by training it on Bengali text. The original tokenizer splits many Bengali words into several subword pieces, which lengthens the token sequence and fragments meaning. The extended tokenizer keeps more words intact, so the same text needs fewer tokens and each token carries a more meaningful unit.

Token Information

| Tokenizer | Number of Tokens |
| --- | --- |
| google/gemma-2-9b | 256,000 |
| rishiraj/gemma-2-9b-bn | 392,402 |
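
The difference between the two counts above can be worked out directly. The sketch below uses only the numbers from the table; `vocab_growth` is an illustrative helper name, not part of any library.

```python
from typing import Tuple


def vocab_growth(base_size: int, extended_size: int) -> Tuple[int, float]:
    """Return (tokens added, growth relative to the base vocabulary)."""
    added = extended_size - base_size
    return added, added / base_size


# Vocabulary sizes taken from the table above.
added, growth = vocab_growth(256_000, 392_402)
print(f"added tokens: {added:,}")        # added tokens: 136,402
print(f"relative growth: {growth:.1%}")  # relative growth: 53.3%
```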

Why Fewer Tokens for Bengali?

The Bengali extension adds fewer tokens than the base vocabulary contains. While Bengali is very expressive and flexible, it has not absorbed as many words from other languages as English has, so fewer tokens are needed to cover it.

Tokenizer Comparison

Text:

আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি

| Tokenizer | Output |
| --- | --- |
| google/gemma-2-9b | ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি'] |
| rishiraj/gemma-2-9b-bn | ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি'] |
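
The gain in this comparison can be expressed as tokens per whitespace-separated word. The following is a minimal sketch that uses only the sentence and the two token lists shown above, copied verbatim. No tokenizer is loaded, and `tokens_per_word` is an illustrative helper name.

```python
text = "আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি"

# Token lists copied from the comparison above.
base_tokens = ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং',
               '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ',
               'ন্দ', '▁কর', 'ি']
extended_tokens = ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল',
                   '▁খেলতে', '▁পছন্দ', '▁করি']


def tokens_per_word(tokens, sentence):
    """Average number of tokens per whitespace-separated word."""
    return len(tokens) / len(sentence.split())


print(len(base_tokens), len(extended_tokens))  # 26 10
print(tokens_per_word(base_tokens, text))      # 2.6
print(tokens_per_word(extended_tokens, text))  # 1.0
```

For this sentence the extended tokenizer produces exactly one token per word, while the original produces 26 tokens for the same 10 words.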

Usage

  1. Install dependencies:

    pip install transformers
    
  2. Load and use the tokenizer:

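    from transformers import AutoTokenizer

    # Download the extended tokenizer from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")

    # Split a Bengali sentence into tokens and print them
    tokens = tokenizer.tokenize("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি")
    print(tokens)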
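
To repeat the comparison on other sentences, both tokenizers can be run side by side. The sketch below is one possible way to do this. `count_tokens` and `load_hub_tokenizers` are illustrative names, not part of transformers; the only library calls assumed are `AutoTokenizer.from_pretrained` and `tokenizer.tokenize`, both used in the snippet above. Loading google/gemma-2-9b may require accepting its license on the Hub and logging in.

```python
from typing import Callable, Dict, List

Tokenize = Callable[[str], List[str]]


def count_tokens(text: str, tokenizers: Dict[str, Tokenize]) -> Dict[str, int]:
    """Map each tokenizer name to the number of tokens it produces for text."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


def load_hub_tokenizers() -> Dict[str, Tokenize]:
    """Load both tokenizers from the Hub (needs network access)."""
    # Imported lazily so count_tokens can be used without transformers installed.
    from transformers import AutoTokenizer

    names = ["google/gemma-2-9b", "rishiraj/gemma-2-9b-bn"]
    return {name: AutoTokenizer.from_pretrained(name).tokenize for name in names}


# Offline demonstration with two stand-in tokenizers (characters vs. words).
demo = count_tokens("a b c", {"chars": list, "words": str.split})
print(demo)  # {'chars': 5, 'words': 3}
```

With network access, `count_tokens(sentence, load_hub_tokenizers())` returns the token count per tokenizer for any sentence.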
    
Model size: 9.24B params (Safetensors, tensor type BF16)