arxiv:2407.13657

FuLG: 150B Romanian Corpus for Language Model Pretraining

Published on Jul 18, 2024
Abstract

Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a 150-billion-token Romanian corpus extracted from CommonCrawl. We present our filtering methodology and compare FuLG against existing Romanian corpora via ablation studies.
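The abstract only names the pipeline at a high level (extraction from CommonCrawl plus filtering). As a rough illustration of what such a filtering pass can look like, the sketch below combines language identification with a length heuristic and exact deduplication. The use of fastText's `lid.176.bin` model, the 0.65 confidence threshold, the 200-character minimum, and SHA-256 exact dedup are assumptions for illustration only, not the configuration described in the paper.

```python
# Illustrative CommonCrawl-style filtering pass for a Romanian corpus.
# Assumptions (not from the paper): fastText lid.176.bin for language ID,
# a 0.65 confidence threshold, a minimum length, and exact-hash deduplication.
import hashlib

import fasttext

LID_MODEL = fasttext.load_model("lid.176.bin")  # pretrained language-ID model


def is_romanian(text: str, threshold: float = 0.65) -> bool:
    """Keep a document only if fastText labels it Romanian with enough confidence."""
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__ro" and probs[0] >= threshold


def filter_documents(docs):
    """Yield documents that pass language ID, a length heuristic, and exact dedup."""
    seen_hashes = set()
    for doc in docs:
        text = doc.strip()
        if len(text) < 200:  # drop very short pages (assumed heuristic)
            continue
        if not is_romanian(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # exact duplicate removal
            continue
        seen_hashes.add(digest)
        yield text
```

Production pipelines of this kind typically add fuzzy deduplication (e.g. MinHash) and quality classifiers on top of such heuristics; the paper's ablation studies are what justify its specific filter choices.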
