Regional Bengali text to IPA transcription - ByT5-small

This model is a fine-tuned version of google/byt5-small that generates IPA transcriptions from regional Bengali text. It was trained on the dataset of the Bengali.AI competition “ভাষামূল: মুখের ভাষার খোঁজে”.

Model performance:

  • Word error rate (WER): 0.0124279344454407
  • Character error rate (CER): 0.00427635805681347
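
For readers unfamiliar with these metrics, the following is a minimal sketch of how word and character error rates are commonly computed: the Levenshtein edit distance between the reference and the predicted transcription, divided by the reference length. The function names are hypothetical, and this is an assumption about the usual definition, not the authors' evaluation script, which may differ (for example in text normalization or in averaging over the dataset).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (not grapheme clusters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a WER of about 0.012 would correspond to roughly one word-level edit per 80 reference words.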

Supported district tokens:

  • Kishoreganj
  • Narail
  • Narsingdi
  • Chittagong
  • Rangpur
  • Tangail
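
Since the model expects the district in front of the text, a small helper can guard against unsupported districts. This is a sketch under stated assumptions: the helper name `build_input` is hypothetical, and it assumes the angle brackets around the district are literal and that the district is spelled exactly as in the list above. The model card does not confirm either detail, so check against the training data before relying on it.

```python
# Districts listed in this model card.
SUPPORTED_DISTRICTS = (
    "Kishoreganj", "Narail", "Narsingdi",
    "Chittagong", "Rangpur", "Tangail",
)


def build_input(district, bengali_text):
    """Return '<district> text' for a supported district.

    Assumption: the angle brackets are part of the token and the
    casing matches the list above; neither is confirmed by the card.
    """
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(f"unsupported district: {district!r}")
    return f"<{district}> {bengali_text.strip()}"
```

For example, `build_input("Narail", "bengali_text_here")` returns `"<Narail> bengali_text_here"`.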

Loading & using the model

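# Load the model and tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-byt5small")

# The input text MUST be in the format: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# A seq2seq model is decoded with generate(); a bare forward pass
# without decoder inputs raises an error.
output_ids = model.generate(text_ids, max_length=1024)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(ipa)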

Using the pipeline

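# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-byt5small", device=device)

# Each entry in `texts` must be in the format: <district> <contents>
texts = ["<district> bengali_text_here"]
batch_size = 8  # example value; adjust to the available memory

# Returns one dict per input, with the transcription under "generated_text"
outputs = pipe(texts, max_length=1024, batch_size=batch_size)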

Credits

Developed by S M Jishanul Islam, Sadia Ahmmed, and Sahid Hossain Mustakim.
