---
license: apache-2.0
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
  example_title: Narail Text
- text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
  example_title: Rangpur Text
- text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
  example_title: Chittagong Text
- text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
  example_title: Kishoreganj Text
- text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
  example_title: Narsingdi Text
- text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
  example_title: Tangail Text
---

# Regional Bengali text to IPA transcription - umt5-base


This is a fine-tuned version of [google/umt5-base](https://huggingface.co/google/umt5-base) for generating IPA transcriptions from regional Bengali text.
It was trained on the dataset of the competition [“ভাষামূল: মুখের ভাষার খোঁজে”](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.

Test scores achieved so far:
- **Word error rate (WER)**: 0.02390405721962450
- **Character error rate (CER)**: 0.01011514943093060
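
For readers unfamiliar with these metrics, the sketch below shows one common way to compute WER and CER: the Levenshtein edit distance between reference and prediction, divided by the reference length. It is an illustration only. The function names are hypothetical, and the competition's exact scoring code may differ (for example in text normalisation). Both functions assume a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both metrics; a score of 0 means the prediction matches the reference exactly.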

Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
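
Since every input must start with one of the district tokens above, a small helper can build the prompt and catch unsupported districts early. This is a minimal sketch: `format_input` and `SUPPORTED_DISTRICTS` are hypothetical names and are not part of the model repository.

```python
# Districts listed on this model card as supported tokens.
SUPPORTED_DISTRICTS = {
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
}


def format_input(district: str, text: str) -> str:
    """Build a model input of the form '<District> text' (hypothetical helper)."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(f"Unsupported district: {district!r}")
    return f"<{district}> {text.strip()}"


print(format_input("Narail", "আমি সে বাবুর মামু বাড়ি গিছিলাম।"))
```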

---

## Loading & using the model
```python
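# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-umt5base")

# The input text MUST follow the format: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# Seq2seq models need generate() to produce output tokens;
# calling model(text_ids) directly fails without decoder inputs.
output_ids = model.generate(text_ids, max_length=512)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(ipa)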
```


## Using the pipeline
```python
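# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-umt5base", device=device)

# Each entry in `texts` must follow the format: <district> <contents>
texts = ["<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"]
batch_size = 8  # example value (an assumption); adjust to your hardware

outputs = pipe(texts, max_length=512, batch_size=batch_size)
print(outputs[0]["generated_text"])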
```

## Credits
Developed by [S M Jishanul Islam](https://huggingface.co/smji), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).