Model D3
Multilingual model
This model was trained with a causal language modeling objective. Under this objective, the model predicts the single next token conditioned on all tokens that precede it in the input sequence; the attention mask ensures the model only has access to prior tokens, never future ones. This differs from the masked language modeling objective used in sequence-to-sequence models, which reconstructs masked tokens using both left and right context.
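As a rough illustration of this objective (not the actual training code for D3), the sketch below shows its two core ingredients: a lower-triangular attention mask that hides future positions, and labels obtained by shifting the inputs one step to the left. The token values are arbitrary placeholders.

```python
# Minimal sketch of causal (next-token) language modeling; illustrative only.
import torch

seq_len = 5
# Lower-triangular mask: position i may attend to positions <= i only,
# so the model never sees future tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# For training, labels are the inputs shifted one step to the left:
# the model at position i is asked to predict the token at position i + 1.
input_ids = torch.tensor([101, 7, 42, 13, 9])  # placeholder token ids
inputs, labels = input_ids[:-1], input_ids[1:]
```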
Model Details
For the experiments with a decoder-only model, we selected BLOOM as the architecture.
Model Description
- Developed by: Ronny Paul
- Model type: BLOOM
- Language(s) (NLP): Northern Sami, Norwegian and Finnish
- Finetuned from model: TurkuNLP/gpt3-finnish-xl
Uses
This model was used in an experiment to determine which architecture is favourable in a low-resource setting for Northern Sami.
Dataset
The model is trained on the rpa020/SALT dataset. The SAmi LLM Token (SALT) dataset contains around 22 million tokens and approximately 2 million sentences, i.e. around ten tokens per sentence on average. The dataset was designed to support the pretraining phase of foundation model development.
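Assuming the dataset is published on the Hugging Face Hub under the ID given above, it could be inspected along these lines (the `train` split name is an assumption and may differ in the actual repository):

```python
# Sketch of loading and inspecting the SALT dataset from the Hub.
from datasets import load_dataset

salt = load_dataset("rpa020/SALT", split="train")
print(salt)     # number of rows and column names
print(salt[0])  # one example sentence
```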
The Finnish dataset is a filtered and post-processed corpus of Reddit comments published between 2006 and 2022, comprising 4,524,360 unique messages. It was released by Finnish-NLP.

The Norwegian dataset, NoReC, was created as part of the SANT project (Sentiment Analysis for Norwegian Text), a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo, the Norwegian Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media. The first release of the corpus comprises 35,194 reviews extracted from eight different news sources. The reviews mainly cover the period from 2003 to 2017, although a handful date back as far as 1998.
How to Get Started with the Model
```python
from transformers import BloomForCausalLM

model = BloomForCausalLM.from_pretrained("rpa020/D3")
```
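A hedged usage sketch follows. It assumes a tokenizer is published under rpa020/D3 alongside the weights (if not, the base model's tokenizer from TurkuNLP/gpt3-finnish-xl would be the natural fallback); the Northern Sami prompt is only an example.

```python
# Sketch: generating a continuation with the fine-tuned model.
from transformers import AutoTokenizer, BloomForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rpa020/D3")  # assumed to exist in the repo
model = BloomForCausalLM.from_pretrained("rpa020/D3")

inputs = tokenizer("Bures boahtin", return_tensors="pt")  # "welcome" in Northern Sami
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```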
Performance
- CE loss (overall): 4.15
- Perplexity (overall): 63.5
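These two figures are mutually consistent, since perplexity is the exponentiated cross-entropy loss:

$$\mathrm{PPL} = \exp(\mathcal{L}_{\mathrm{CE}}), \qquad \exp(4.15) \approx 63.4,$$

which matches the reported perplexity up to rounding.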