shiraz / README.md
RezaSalehiChegeni's picture
Update README.md
5b2ff8c verified
|
raw
history blame
11.9 kB
---
language: fa
tags:
- persian
- mobilebert
license: apache-2.0
pipeline_tag: fill-mask
mask_token: '[MASK]'
widget:
- text: 'در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم.'
---
<p align="center">
# <img src="https://avatars.githubusercontent.com/u/75159340?s=60&v=4" alt="Logo" width="50" height="50"> <a href="https://lifewebco.com"> Lifeweb </a>
</p>
### Shiraz Language Model
Welcome to Shiraz, the repository for Lifeweb's language model.
First versions of our models are all trained on our own dataset called **Divan** with more than **164 million documents** and more than **10B tokens** which is normalized and deduplicated meticulously to ensure its enrichment and comprehensiveness. A better dataset leads to a better model!
# Use Model
You can easily access the models using the sample code provided below.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline
# v1.0
model_name = "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
print(tokenizer.tokenize(text))
# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.']
# fill mask task
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
classifier = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = classifier(text)
print(result[0])
#{'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
```
# Results
The **Shiraz** is evaluated on three downstream NLP tasks comprising **NER**, **Sentiment Analysis**, and **Emotion Detection**. Shiraz is considerably faster, and its accuracy remains highly competitive without compromising much on speed. According to [**MobileBERT paper**](https://arxiv.org/pdf/2004.02984.pdf), this model is 4.3× smaller and 5.5× faster than BERT-base.
Obvious from the table below, you can find the colab codes for each task to use as a tutorial besides the macro F1 score.
<table class="tg">
<thead>
<tr>
<th class="tg-c3ow">Model</th>
<th class="tg-c3ow" colspan="2">NER</th>
<th class="tg-c3ow" colspan="2">Sentiment</th>
<th class="tg-c3ow" colspan="1">Emotion</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky"></td>
<td class="tg-c3ow">Arman</td>
<td class="tg-c3ow">Peyma</td>
<td class="tg-c3ow"> Sentipers (multi) </td>
<td class="tg-c3ow"> Snappfood </td>
<td class="tg-c3ow"> Arman </td>
</tr>
<tr>
<td class="tg-0pky">lifeweb-ai/tehran</td>
<td class="tg-c3ow"><strong> 71.87% <br>
<td class="tg-c3ow"><strong> 90.79% <br>
<td class="tg-c3ow"><strong> 63.75% <br>
<td class="tg-c3ow"><strong> 88.74% <br>
<td class="tg-c3ow"><strong> 77.73% <br>
</tr>
<tr>
<td class="tg-0pky">lifeweb-ai/shiraz</td>
<td class="tg-c3ow"> 67.62% <br><a href="https://colab.research.google.com/drive/15PUAGy9MUSBO3LPdMJ4h9DVKibREv9oY"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 86.24% <br><a href="https://colab.research.google.com/drive/1lzVsDpl6_WhxsW8mtUNjhXzQPBMNL6Q2"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 59.17% <br><a href="https://colab.research.google.com/drive/1L87oYYDBY1Fi0GGvjRGSdSk2rZ5vshUV"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 88.01% <br><a href="https://colab.research.google.com/drive/1-S-VE83IGGGS9lZVydVKa4SnxshFSvT6"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 66.97% <br><a href="https://colab.research.google.com/drive/12SpUEsOP1I2cCp-gQsifONyu9yDUGuKG"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
</tr>
<tr>
<td class="tg-0pky">HooshvareLab/bert-fa-zwnj-base</td>
<td class="tg-c3ow"> 67.49% <br><a href="https://colab.research.google.com/drive/1HApEhtOm2p0ra1NwHLbptaxNeKqXC_TM"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 85.73% <br><a href="https://colab.research.google.com/drive/1e67UzkbX1HPgayfi8Z1rNNy79AACr1lV"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 59.61% <br><a href="https://colab.research.google.com/drive/1pub2tq2Qvb08s2w4cE-AfOwzWYXH6rsM"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 87.58% <br><a href="https://colab.research.google.com/drive/1PyjCTXFB-SXfrG8Bjjpr9py39Q9J8oGZ"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 59.27% <br><a href="https://colab.research.google.com/drive/13jUeb2694W9SHWNYa1KMbvmeCAhnDpv0"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
</tr>
<tr>
<td class="tg-0pky">HooshvareLab/roberta-fa-zwnj-base</td>
<td class="tg-c3ow"> 69.73% <br><a href="https://colab.research.google.com/drive/1a0o6Mx3jlK8ItWdIQgThM81hlSTE6sur"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 86.21% <br><a href="https://colab.research.google.com/drive/1fMXN5OeWmeLlLnG1gdznvq9ruBmP3UTv"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 56.23% <br><a href="https://colab.research.google.com/drive/18OzPDKH1mB6-uDVmN0WWZz_etwrsZ_A3"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 87.19% <br><a href="https://colab.research.google.com/drive/1E-rfJYZmid3a-bEpskU_j_3S4q_SQmGH"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 57.96% <br><a href="https://colab.research.google.com/drive/1NRphgik9y0fmZP_7MDUjMq6zTP2AfTMj"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
</tr>
<tr>
<td class="tg-0pky">ViraIntelligentDataMining/AriaBERT</td>
<td class="tg-c3ow"> 69.12% <br><a href="https://colab.research.google.com/drive/1s0aSjPYntinkupgaAiGZIvwzKXWjNHgA"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 87.15% <br><a href="https://colab.research.google.com/drive/1qPy0nFHC8bYj9OskUyksF0gQRQ6hRgbT"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 59.26% <br><a href="https://colab.research.google.com/drive/1P9YaP9Fem5pSlJqPxP2jG2IBq9TsLbaz"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 87.96% <br><a href="https://colab.research.google.com/drive/1wuGFELbqx0eE1cvmPZRgfklTTa3SkpyW"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 69.11% <br><a href="https://colab.research.google.com/drive/1UINarSRMy4yKbSeXKgSUf84IvJh-JC4q"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="" width="87" height="15"></a></td>
</tr>
<tr>
<td class="tg-0pky">sbunlp/fabert</td>
<td class="tg-c3ow"> 71.23% <br><a href="https://colab.research.google.com/drive/1NHUG8GdGEx1R76jr1MBC8sqDFWdsAxQk"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 88.53% <br><a href="https://colab.research.google.com/drive/1I6Nl9W_Br-WVV4odUcw0um_-dypjFyrp"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 58.51% <br><a href="https://colab.research.google.com/drive/1jdLotilq7hedyQ8x9aTUdgJ2IP-EDLWv"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 88.60% <br><a href="https://colab.research.google.com/drive/1DsIFzDrC_HNDaQyltJtiT3DjGA9blg_B"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td>
<td class="tg-c3ow"> 72.65% <br><a href="https://colab.research.google.com/drive/12H95pFpFUSYfxpRHWuS-gOQFi81hZhX-"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="" width="87" height="15"></a></td>
</tr>
</tbody>
</table>
If you tested our models on a public dataset, and you wanted to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so that we can add a reference.
# Cite
You are welcome to use our LM models in your work or research, if so, we kindly ask you to cite it using the following entry:
```
@misc{Shiraz,
author = {Mehrdad Azizi, Reza Salehi Chegeni, Parisa Mousavi, Iman Hashemi},
title = {[Optimizing Pre-trained BERT-based Models for Persian Language Processing]},
year = {2024},
publisher = {LifeWeb}
}
```
# Contributors
- Mehrdad Azizi: [**Linkedin**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**Github**](https://github.com/mehrazi)
- Reza Salehi Chegeni: [**Linkedin**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**Github**](https://github.com/rezasalehichegeni)
- Parisa Mousavi: [**Linkedin**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**Github**](https://github.com/Mousavi-Parisa)
- Iman Hashemi: [**Linkedin**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**Github**](https://github.com/hashemiiman)
- Lifeweb: [**HuggingFace**](https://huggingface.co/lifeweb-ai), [**Official Website**](https://lifewebco.com/), [**Linkedin**](https://www.linkedin.com/company/lifewebir/mycompany/)
# Releases
**v1.0(2024-03-09)**
First version of **Shiraz** model trained on **DIVAN**.