|
--- |
|
language: fa |
|
tags: |
|
- persian |
|
- RoBERTa |
|
license: apache-2.0 |
|
pipeline_tag: fill-mask |
|
mask_token: '[MASK]' |
|
widget: |
|
- text: 'در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم.'
|
|
|
extra_gated_prompt: "This model is NOT free. Please enter your contact information and we will reach out to you."
|
extra_gated_fields: |
|
contact information: text |
|
--- |
|
|
|
<p align="center"> |
|
|
|
# <img src="https://avatars.githubusercontent.com/u/75159340?s=60&v=4" alt="Logo" width="50" height="50"> <a href="https://lifewebco.com"> Lifeweb </a> |
|
|
|
</p> |
|
|
|
### Tehran Language Model |
|
Welcome to Tehran, the repository for Lifeweb's language model. |
|
The first versions of our models are all trained on our own dataset, **Divan**, which contains more than **164 million documents** and more than **10B tokens**, meticulously normalized and deduplicated to ensure its richness and comprehensiveness. A better dataset leads to a better model!
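
The exact Divan preprocessing pipeline is not published here; as a rough, generic sketch of what corpus normalization plus exact deduplication can look like (the helper functions below are illustrative, not Lifeweb's code):

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Unicode compatibility normalization plus whitespace collapsing.
    # A real Persian pipeline would also unify Arabic/Persian letter
    # variants and fix ZWNJ usage.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def deduplicate(documents):
    # Yield each normalized document once, keyed by a content hash.
    seen = set()
    for doc in documents:
        doc = normalize(doc)
        key = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc
```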
|
|
|
|
|
|
|
# Use the Model
|
You can easily access the model using the sample code below.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline |
|
# v1.0 |
|
model_name = "lifeweb-ai/tehran" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForMaskedLM.from_pretrained(model_name) |
|
|
|
text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیونها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمعآوری، پردازش و تحلیل این کلان داده (Big Data) میپردازیم." |
|
print(tokenizer.tokenize(text)) |
|
|
|
# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.'] |
|
|
|
# fill mask task |
|
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیونها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمعآوری، پردازش و تحلیل این کلان داده (Big Data) میپردازیم." |
|
|
|
classifier = FillMaskPipeline(model=model, tokenizer=tokenizer) |
|
result = classifier(text) |
|
print(result[0]) |
|
#{'score': 0.3825972378253937, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'} |
|
``` |
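
Equivalently, the high-level `pipeline` factory loads the tokenizer and model in a single call; the optional `top_k` argument below just limits the number of returned candidates:

```python
from transformers import pipeline

# Same fill-mask task via the high-level pipeline API.
fill_mask = pipeline("fill-mask", model="lifeweb-ai/tehran")

masked_text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است."
for prediction in fill_mask(masked_text, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 4))
```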
|
|
|
|
|
# Results |
|
|
|
**Tehran** is evaluated on three downstream NLP tasks: **NER**, **Sentiment Analysis**, and **Emotion Detection**. On these benchmarks, **Tehran** outperforms the other Persian language models listed below in terms of accuracy and macro F1.
|
|
|
|
|
In the table below, alongside the macro F1 scores, you can find Colab notebooks for each task to use as tutorials. All of these notebooks were run on identical hardware (4x RTX 2080 Ti GPUs). A minimal, generic fine-tuning sketch follows the table.
|
|
|
<table class="tg"> |
|
<thead> |
|
<tr> |
|
<th class="tg-c3ow">Model</th> |
|
<th class="tg-c3ow" colspan="2">NER</th> |
|
<th class="tg-c3ow" colspan="2">Sentiment</th> |
|
<th class="tg-c3ow" colspan="1">Emotion</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td class="tg-0pky"></td> |
|
<td class="tg-c3ow">Arman</td> |
|
<td class="tg-c3ow">Peyma</td> |
|
<td class="tg-c3ow"> Sentipers (multi) </td> |
|
<td class="tg-c3ow"> Snappfood </td> |
|
<td class="tg-c3ow"> Arman </td> |
|
</tr> |
|
|
|
<tr> |
|
<td class="tg-0pky">lifeweb-ai/tehran</td> |
|
<td class="tg-c3ow"><strong> 71.87% <br> |
|
<td class="tg-c3ow"><strong> 90.79% <br> |
|
<td class="tg-c3ow"><strong> 63.75% <br> |
|
<td class="tg-c3ow"><strong> 88.74% <br> |
|
<td class="tg-c3ow"><strong> 77.73% <br> |
|
</tr> |
|
<tr> |
|
<td class="tg-0pky">lifeweb-ai/shiraz</td> |
|
<td class="tg-c3ow"> 67.62% <br><a href="https://colab.research.google.com/drive/15PUAGy9MUSBO3LPdMJ4h9DVKibREv9oY"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 86.24% <br><a href="https://colab.research.google.com/drive/1lzVsDpl6_WhxsW8mtUNjhXzQPBMNL6Q2"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 59.17% <br><a href="https://colab.research.google.com/drive/1L87oYYDBY1Fi0GGvjRGSdSk2rZ5vshUV"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 88.01% <br><a href="https://colab.research.google.com/drive/1-S-VE83IGGGS9lZVydVKa4SnxshFSvT6"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 66.97% <br><a href="https://colab.research.google.com/drive/12SpUEsOP1I2cCp-gQsifONyu9yDUGuKG"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
</tr> |
|
<tr> |
|
<td class="tg-0pky">sbunlp/fabert</td> |
|
<td class="tg-c3ow"> 71.23% <br><a href="https://colab.research.google.com/drive/1NHUG8GdGEx1R76jr1MBC8sqDFWdsAxQk"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 88.53% <br><a href="https://colab.research.google.com/drive/1I6Nl9W_Br-WVV4odUcw0um_-dypjFyrp"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 58.51% <br><a href="https://colab.research.google.com/drive/1jdLotilq7hedyQ8x9aTUdgJ2IP-EDLWv"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 88.60% <br><a href="https://colab.research.google.com/drive/1DsIFzDrC_HNDaQyltJtiT3DjGA9blg_B"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 72.65% <br><a href="https://colab.research.google.com/drive/12H95pFpFUSYfxpRHWuS-gOQFi81hZhX-"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="" width="87" height="15"></a></td> |
|
</tr> |
|
<tr> |
|
<td class="tg-0pky">ViraIntelligentDataMining/AriaBERT</td> |
|
<td class="tg-c3ow"> 69.12% <br><a href="https://colab.research.google.com/drive/1s0aSjPYntinkupgaAiGZIvwzKXWjNHgA"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 87.15% <br><a href="https://colab.research.google.com/drive/1qPy0nFHC8bYj9OskUyksF0gQRQ6hRgbT"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 59.26% <br><a href="https://colab.research.google.com/drive/1P9YaP9Fem5pSlJqPxP2jG2IBq9TsLbaz"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 87.96% <br><a href="https://colab.research.google.com/drive/1wuGFELbqx0eE1cvmPZRgfklTTa3SkpyW"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 69.11% <br><a href="https://colab.research.google.com/drive/1UINarSRMy4yKbSeXKgSUf84IvJh-JC4q"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="" width="87" height="15"></a></td> |
|
</tr> |
|
<tr> |
|
<td class="tg-0pky">HooshvareLab/bert-fa-zwnj-base</td> |
|
<td class="tg-c3ow"> 67.49% <br><a href="https://colab.research.google.com/drive/1HApEhtOm2p0ra1NwHLbptaxNeKqXC_TM"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 85.73% <br><a href="https://colab.research.google.com/drive/1e67UzkbX1HPgayfi8Z1rNNy79AACr1lV"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 59.61% <br><a href="https://colab.research.google.com/drive/1pub2tq2Qvb08s2w4cE-AfOwzWYXH6rsM"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 87.58% <br><a href="https://colab.research.google.com/drive/1PyjCTXFB-SXfrG8Bjjpr9py39Q9J8oGZ"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 59.27% <br><a href="https://colab.research.google.com/drive/13jUeb2694W9SHWNYa1KMbvmeCAhnDpv0"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
</tr> |
|
<tr> |
|
<td class="tg-0pky">HooshvareLab/roberta-fa-zwnj-base</td> |
|
<td class="tg-c3ow"> 69.73% <br><a href="https://colab.research.google.com/drive/1a0o6Mx3jlK8ItWdIQgThM81hlSTE6sur"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 86.21% <br><a href="https://colab.research.google.com/drive/1fMXN5OeWmeLlLnG1gdznvq9ruBmP3UTv"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 56.23% <br><a href="https://colab.research.google.com/drive/18OzPDKH1mB6-uDVmN0WWZz_etwrsZ_A3"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 87.19% <br><a href="https://colab.research.google.com/drive/1E-rfJYZmid3a-bEpskU_j_3S4q_SQmGH"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
<td class="tg-c3ow"> 57.96% <br><a href="https://colab.research.google.com/drive/1NRphgik9y0fmZP_7MDUjMq6zTP2AfTMj"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Colab Code" width="87" height="15"></td> |
|
</tr> |
|
|
|
|
|
</tbody> |
|
</table> |
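
The Colab notebooks linked above contain the exact training code for each task. As a minimal, generic sketch of how such a fine-tuning run can be set up with the `Trainer` API (the toy dataset and hyperparameters below are illustrative assumptions, not the settings behind the table):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "lifeweb-ai/tehran"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels depends on the task, e.g. 2 for binary sentiment.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in data; replace with the real task dataset (e.g. Snappfood).
data = Dataset.from_dict({
    "text": ["غذا عالی بود", "کیفیت خیلی بد بود"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="tehran-finetuned",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)

# Passing the tokenizer enables dynamic padding via the default data collator.
trainer = Trainer(model=model, args=args, train_dataset=data, tokenizer=tokenizer)
trainer.train()
```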
|
|
|
|
|
If you have tested our models on a public dataset and want to add your results to the table above, open a pull request or contact us. Please also make sure your code is available online so that we can add a reference.
|
|
|
|
|
# Cite |
|
|
|
You are welcome to use our language models in your work or research. If you do, we kindly ask you to cite them using the following entry:
|
``` |
|
@misc{Tehran,
  author = {Mehrdad Azizi and Reza Salehi Chegeni and Parisa Mousavi and Iman Hashemi},
  title = {Optimizing Pre-trained BERT-based Models for Persian Language Processing},
  year = {2024},
  publisher = {LifeWeb}
}
|
``` |
|
|
|
# Contributors |
|
|
|
- Mehrdad Azizi: [**Linkedin**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**Github**](https://github.com/mehrazi) |
|
- Reza Salehi Chegeni: [**Linkedin**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**Github**](https://github.com/rezasalehichegeni) |
|
- Parisa Mousavi: [**Linkedin**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**Github**](https://github.com/Mousavi-Parisa) |
|
- Iman Hashemi: [**Linkedin**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**Github**](https://github.com/hashemiiman) |
|
- Lifeweb: [**HuggingFace**](https://huggingface.co/lifeweb-ai), [**Official Website**](https://lifewebco.com/), [**Linkedin**](https://www.linkedin.com/company/lifewebir/mycompany/) |
|
|
|
# Releases |
|
|
|
**v1.0 (2024-03-09)**
|
|
|
First version of the **Tehran** model, trained on **Divan**.
|
|