EgyBERT

EgyBERT is a large language model focused exclusively on Egyptian dialectal texts. It was pretrained on two large-scale corpora: the Egyptian Tweets Corpus (ETC), which contains over 34 million tweets, and the Egyptian Forum Corpus, which includes over 44 million sentences collected from various online forums. Together, the two corpora comprise 10.4 GB of text. The code files and results are available in the repository.
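
As a quick-start sketch, the snippet below loads the model with the Hugging Face transformers library and runs a fill-mask query. The repo ID faisalq/EgyBERT, the [MASK] token, and the assumption that the checkpoint ships with a masked-language-modeling head are not confirmed by this card, so adjust them to match the actual release.

```python
from transformers import pipeline

# Assumed Hugging Face repo ID -- replace with the actual model ID.
model_id = "faisalq/EgyBERT"

# Assuming a BERT-style checkpoint with an MLM head, the fill-mask
# pipeline is the simplest way to probe the model.
fill_mask = pipeline("fill-mask", model=model_id)

# Egyptian Arabic prompt: "Cairo is the [MASK] of Egypt."
for prediction in fill_mask("القاهرة هي [MASK] مصر."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

The same checkpoint can also be loaded with AutoTokenizer and AutoModel for fine-tuning on downstream Egyptian-dialect tasks.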

BibTeX

If you use the EgyBERT model in your scientific publication, or if you find the resources in this repository useful, kindly cite our paper as follows (citation details to be updated):

@article{qarah2024egybert,
  title={EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora},
  author={Qarah, Faisal},
  journal={arXiv preprint arXiv:2408.03524},
  year={2024}
}


Model size: 144M parameters (Safetensors, F32 tensors)