Hafez_Bert_based language model
Hafez is a language model named after the famous Persian poet from Shiraz, Iran. The main points are summarized below:
Model Type: Hafez is based on the BERT architecture, which is a popular model for natural language processing (NLP).
Cultural Reference: The model is named after Hafez, a renowned Persian poet known for his deeply emotional and philosophical verses. The name signals a connection to Persian literature and an intention to handle the language in a way that reflects the poet's cultural significance.
Training Data: The model was trained on a dataset of over 12 billion tokens. The training text consists of two parts: 90% is educational material, including research papers, dissertations, and theses, and the remaining 10% is general text. This selection is intended to give the model a strong foundation in academic language and discourse.
Text Cleaning and Preprocessing: The training data was cleaned and preprocessed before training, a step that helps ensure the data is of high enough quality for training a machine learning model. The cleaning and preparation were done with "Viravirast text tools," which are likely specialized tools for this kind of text processing.
How to use
from transformers import pipeline
unmasker = pipeline('fill-mask', model='ViravirastSHZ/Hafez_Bert')
print(unmasker("شیراز یکی از زیباترین [MASK] ایران است."))
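The pipeline call above returns a list of candidate fills, each a dictionary with `score`, `token`, `token_str`, and `sequence` keys. The following is a minimal sketch of how that output could be condensed into a short ranked list. The helper names `load_unmasker` and `top_predictions` are hypothetical and not part of the model or of `transformers`, and the sketch assumes a single `[MASK]` token in the input.

```python
def load_unmasker(model_name="ViravirastSHZ/Hafez_Bert"):
    """Build the fill-mask pipeline (hypothetical helper; downloads the model on first use)."""
    # Imported lazily so the formatting helper below works without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_name)


def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs.

    `results` is the list of dicts a fill-mask pipeline returns for one
    input string containing a single mask token.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Example usage (requires transformers and network access):
#   unmasker = load_unmasker()
#   results = unmasker("شیراز یکی از زیباترین [MASK] ایران است.", top_k=5)
#   for token, score in top_predictions(results, k=3):
#       print(token, score)
```

The `top_k` argument in the usage comment is the pipeline's own call parameter for controlling how many candidates are returned; the helper only trims and reformats whatever comes back.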
Results
We evaluated the Hafez language model on a text classification task, using the F1 score as the primary metric. We welcome others to explore its performance on other downstream tasks and would greatly appreciate any efforts to test Hafez across different applications.
Model | Text classification (Msobhi/virgool_62k) |
---|---|
ViravirastSHZ/Hafez_Bert | test-F1 score: 0.437764 |
lifeweb-ai/shiraz | test-F1 score: 0.349834 |
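For readers who want to report comparable numbers on their own tasks, here is a minimal pure-Python sketch of an F1 computation for multi-class classification. The card does not say which averaging the scores above use, so macro-averaging here is an assumption, and the function names and label values are illustrative only.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Compute the F1 score of each label from parallel lists of labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not confirmed by the card)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Illustrative labels only, not data from the evaluation above.
y_true = ["tech", "tech", "art", "art"]
y_pred = ["tech", "art", "art", "art"]
print(f1_per_class(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

If the reported scores were micro- or weighted-averaged instead, the numbers from this sketch would not be directly comparable.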
Cite
@misc{hafez_language_model,
  author = {Amin Rahmani},
  title = {Pre-trained BERT-based language Model for Persian Language},
  year = {2024},
  publisher = {Viravirast}
}
Contributor
- Amin Rahmani (Viravirast)
- Amin Rahmani (LinkedIn)