---
language: fa
tags:
- persian
- mobilebert
license: apache-2.0
pipeline_tag: fill-mask
mask_token: '[MASK]'
widget:
  - text: 'در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم.'
---

<p align="center">
  
# <img src="https://avatars.githubusercontent.com/u/75159340?s=60&v=4" alt="Logo" width="50" height="50"> <a href="https://lifewebco.com"> Lifeweb </a>

</p>

### Shiraz Language Model
Welcome to Shiraz, the repository for Lifeweb's language model.
The first versions of our models are all trained on our own dataset, **Divan**, which contains more than **164 million documents** and more than **10B tokens**. The dataset is carefully normalized and deduplicated to improve its quality and coverage. A better dataset leads to a better model!


# Use Model
You can easily access the models using the sample code provided below.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline
# v1.0
model_name = "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
print(tokenizer.tokenize(text))

# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.']

# fill mask task
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."

# FillMaskPipeline returns candidates sorted by score; result[0] is the best one
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = fill_mask(text)
print(result[0])
#{'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
```
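
To inspect more than the single best candidate, the pipeline's predictions can be filtered by score. The following is a minimal sketch rather than part of the released code: the helper names `load_fill_mask` and `confident_candidates` and the `min_score` threshold are hypothetical, and the filter only assumes that each prediction is a dict with `token_str` and `score` keys, as in the output shown above.

```python
from typing import Dict, List, Tuple


def load_fill_mask(model_name: str = "lifeweb-ai/shiraz"):
    """Build a fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so the filter below works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name)


def confident_candidates(
    predictions: List[Dict], min_score: float = 0.05
) -> List[Tuple[str, float]]:
    """Return (token, score) pairs at or above min_score, best first."""
    kept = [
        (p["token_str"], round(float(p["score"]), 4))
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Usage (needs network access to download the model; top_k=5 is an
# arbitrary illustrative choice):
# fill_mask = load_fill_mask()
# predictions = fill_mask(text, top_k=5)
# print(confident_candidates(predictions))
```

Keeping the filter separate from the model call makes it easy to try different thresholds without running the model again.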


# Results

**Shiraz** is evaluated on three downstream NLP tasks: **NER**, **Sentiment Analysis**, and **Emotion Detection**. It is built on the MobileBERT architecture, so it is considerably smaller and faster than BERT-base-sized models while its accuracy remains competitive. According to the [**MobileBERT paper**](https://arxiv.org/pdf/2004.02984.pdf), MobileBERT is 4.3× smaller and 5.5× faster than BERT-base.


The table below reports the macro F1 score for each task. Each score links to a Colab notebook with the corresponding code, which can also be used as a tutorial.

<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow">Model</th>
    <th class="tg-c3ow" colspan="2">NER</th>
    <th class="tg-c3ow" colspan="2">Sentiment</th>
    <th class="tg-c3ow" colspan="1">Emotion</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-c3ow">Arman</td>
    <td class="tg-c3ow">Peyma</td>
    <td class="tg-c3ow"> Sentipers (multi) </td>
    <td class="tg-c3ow"> Snappfood </td>
    <td class="tg-c3ow"> Arman </td>
  </tr>
  <tr>
    <td class="tg-0pky">lifeweb-ai/tehran</td>
    <td class="tg-c3ow"><strong> 71.87% </strong></td>
    <td class="tg-c3ow"><strong> 90.79% </strong></td>
    <td class="tg-c3ow"><strong> 63.75% </strong></td>
    <td class="tg-c3ow"><strong> 88.74% </strong></td>
    <td class="tg-c3ow"><strong> 77.73% </strong></td>
  </tr>
  <tr>
    <td class="tg-0pky">lifeweb-ai/shiraz</td>
    <td class="tg-c3ow"> 67.62% <br><a href="https://colab.research.google.com/drive/15PUAGy9MUSBO3LPdMJ4h9DVKibREv9oY"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 86.24% <br><a href="https://colab.research.google.com/drive/1lzVsDpl6_WhxsW8mtUNjhXzQPBMNL6Q2"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 59.17% <br><a href="https://colab.research.google.com/drive/1L87oYYDBY1Fi0GGvjRGSdSk2rZ5vshUV"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 88.01% <br><a href="https://colab.research.google.com/drive/1-S-VE83IGGGS9lZVydVKa4SnxshFSvT6"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 66.97% <br><a href="https://colab.research.google.com/drive/12SpUEsOP1I2cCp-gQsifONyu9yDUGuKG"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
  </tr>
  <tr>
    <td class="tg-0pky">HooshvareLab/bert-fa-zwnj-base</td>
    <td class="tg-c3ow"> 67.49% <br><a href="https://colab.research.google.com/drive/1HApEhtOm2p0ra1NwHLbptaxNeKqXC_TM"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 85.73% <br><a href="https://colab.research.google.com/drive/1e67UzkbX1HPgayfi8Z1rNNy79AACr1lV"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 59.61% <br><a href="https://colab.research.google.com/drive/1pub2tq2Qvb08s2w4cE-AfOwzWYXH6rsM"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 87.58% <br><a href="https://colab.research.google.com/drive/1PyjCTXFB-SXfrG8Bjjpr9py39Q9J8oGZ"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 59.27% <br><a href="https://colab.research.google.com/drive/13jUeb2694W9SHWNYa1KMbvmeCAhnDpv0"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
  </tr>
  <tr>
    <td class="tg-0pky">HooshvareLab/roberta-fa-zwnj-base</td>
    <td class="tg-c3ow"> 69.73% <br><a href="https://colab.research.google.com/drive/1a0o6Mx3jlK8ItWdIQgThM81hlSTE6sur"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 86.21% <br><a href="https://colab.research.google.com/drive/1fMXN5OeWmeLlLnG1gdznvq9ruBmP3UTv"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 56.23% <br><a href="https://colab.research.google.com/drive/18OzPDKH1mB6-uDVmN0WWZz_etwrsZ_A3"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 87.19% <br><a href="https://colab.research.google.com/drive/1E-rfJYZmid3a-bEpskU_j_3S4q_SQmGH"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 57.96% <br><a href="https://colab.research.google.com/drive/1NRphgik9y0fmZP_7MDUjMq6zTP2AfTMj"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
  </tr>
  <tr>
    <td class="tg-0pky">ViraIntelligentDataMining/AriaBERT</td>
    <td class="tg-c3ow"> 69.12% <br><a href="https://colab.research.google.com/drive/1s0aSjPYntinkupgaAiGZIvwzKXWjNHgA"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 87.15% <br><a href="https://colab.research.google.com/drive/1qPy0nFHC8bYj9OskUyksF0gQRQ6hRgbT"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 59.26% <br><a href="https://colab.research.google.com/drive/1P9YaP9Fem5pSlJqPxP2jG2IBq9TsLbaz"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 87.96% <br><a href="https://colab.research.google.com/drive/1wuGFELbqx0eE1cvmPZRgfklTTa3SkpyW"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 69.11% <br><a href="https://colab.research.google.com/drive/1UINarSRMy4yKbSeXKgSUf84IvJh-JC4q"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
  </tr>
  <tr>
    <td class="tg-0pky">sbunlp/fabert</td>
    <td class="tg-c3ow"> 71.23% <br><a href="https://colab.research.google.com/drive/1NHUG8GdGEx1R76jr1MBC8sqDFWdsAxQk"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 88.53% <br><a href="https://colab.research.google.com/drive/1I6Nl9W_Br-WVV4odUcw0um_-dypjFyrp"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 58.51% <br><a href="https://colab.research.google.com/drive/1jdLotilq7hedyQ8x9aTUdgJ2IP-EDLWv"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 88.60% <br><a href="https://colab.research.google.com/drive/1DsIFzDrC_HNDaQyltJtiT3DjGA9blg_B"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
    <td class="tg-c3ow"> 72.65% <br><a href="https://colab.research.google.com/drive/12H95pFpFUSYfxpRHWuS-gOQFi81hZhX-"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></a></td>
  </tr>
</tbody>
</table>
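
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below illustrates the metric in plain Python. It is not the evaluation code behind the table (the linked Colab notebooks are the reference for that), and the function name `macro_f1` and the toy labels are illustrative only. NER evaluation is often done at the entity level, which this label-level sketch does not cover.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no
        # predictions contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration (not taken from any dataset above).
gold = ["pos", "pos", "neg", "neg", "neu"]
pred = ["pos", "neg", "neg", "neg", "pos"]
print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

In this toy case the missed `neu` class pulls the average down as much as a frequent class would, which is why macro F1 is a common choice for imbalanced label sets.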

If you have tested our models on a public dataset and would like to add your results to the table above, open a pull request or contact us. Please also make your code available online so that we can add a reference to it.


# Cite

You are welcome to use our language models in your work or research. If you do, we kindly ask you to cite them using the following entry:
```
@misc{Shiraz,
    author = {Mehrdad Azizi and Reza Salehi Chegeni and Parisa Mousavi and Iman Hashemi},
    title = {[Optimizing Pre-trained BERT-based Models for Persian Language Processing]},
    year = {2024},
    publisher = {LifeWeb}
}
```

# Contributors

- Mehrdad Azizi: [**Linkedin**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**Github**](https://github.com/mehrazi)
- Reza Salehi Chegeni: [**Linkedin**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**Github**](https://github.com/rezasalehichegeni)
- Parisa Mousavi: [**Linkedin**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**Github**](https://github.com/Mousavi-Parisa)
- Iman Hashemi: [**Linkedin**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**Github**](https://github.com/hashemiiman)
- Lifeweb: [**HuggingFace**](https://huggingface.co/lifeweb-ai), [**Official Website**](https://lifewebco.com/), [**Linkedin**](https://www.linkedin.com/company/lifewebir/mycompany/)

# Releases

**v1.0 (2024-03-09)**

First version of the **Shiraz** model, trained on **Divan**.