Papers
arxiv:2103.06678

The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models

Published on Mar 11, 2021
Authors: Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, Nizar Habash

Abstract

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.
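The paper's pre-trained models are published as Hugging Face checkpoints (see the models listed below). As a minimal sketch of how one of them might be fine-tuned on a classification task such as sentiment analysis, the snippet below uses the transformers library; the checkpoint ID and label count are assumptions, not taken from the paper itself.

```python
# Minimal sketch: load an Arabic pre-trained checkpoint and prepare it for
# sequence classification fine-tuning. The model ID below is an assumption;
# substitute any checkpoint from the "Models citing this paper" list.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "CAMeL-Lab/bert-base-arabic-camelbert-mix"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Tokenize a sample MSA sentence and run a forward pass to check the setup.
inputs = tokenizer("هذا الفيلم رائع", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```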


Models citing this paper 37


Datasets citing this paper 0


Spaces citing this paper 10

Collections including this paper 0
