GPT-SW3 is a 3.5B parameter autoregressive language model, trained on a 100 GB Swedish corpus.

## Model details

GPT-SW3 follows the GPT architecture as implemented in the Megatron-LM framework. The model consists of 30 Transformer layers with 32 attention heads each; the embedding dimension is 3072 and the feed-forward dimension is 12288. Tokenization uses byte-pair encoding (BPE) with a vocabulary size of 50304.
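As a sanity check, these hyperparameters are consistent with the stated 3.5B parameter count. The back-of-the-envelope estimate below ignores biases, layer norms, and positional embeddings, so it is a rough sketch rather than the exact Megatron-LM accounting:

```python
# Rough parameter count from the hyperparameters stated above.
# Ignores biases, layer norms, and positional embeddings, so this is
# an approximation, not the exact Megatron-LM accounting.
n_layers, d_model, d_ff, vocab = 30, 3072, 12288, 50304

attn_per_layer = 4 * d_model * d_model  # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ff      # up- and down-projection
embedding = vocab * d_model             # token embedding matrix

total = n_layers * (attn_per_layer + ffn_per_layer) + embedding
print(f"~{total / 1e9:.2f}B parameters")  # prints "~3.55B parameters"
```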

## Training data

GPT-SW3 was trained on a 100 GB Swedish corpus, consisting of the following datasets:

| Dataset | Genre | Size (GB) |
|---|---|---|
| Anföranden (parliamentary speeches) | Politics | 0.9 |
| DCEP | Politics | 0.6 |
| DGT | Politics | 0.7 |
| Fass | Medical | 0.6 |
| Flashback | Forum | 22.0 |
| Författningar (statutes) | Legal | 0.1 |
| Web data | Misc | 45.0 |
| JRC | Legal | 0.4 |
| Litteraturbanken | Books | 0.3 |
| OSCAR | Misc | 28.0 |
| SOU | Politics | 5.3 |
| Subtitles | Drama | 1.3 |
| Wikipedia | Facts | 1.8 |
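The per-dataset sizes listed above sum to roughly 107 GB, in line with the stated ~100 GB corpus. A quick tally (sizes copied from the table):

```python
# Dataset sizes in GB, copied from the table above.
sizes = {
    "Anföranden": 0.9, "DCEP": 0.6, "DGT": 0.7, "Fass": 0.6,
    "Flashback": 22.0, "Författningar": 0.1, "Web data": 45.0,
    "JRC": 0.4, "Litteraturbanken": 0.3, "OSCAR": 28.0,
    "SOU": 5.3, "Subtitles": 1.3, "Wikipedia": 1.8,
}
total_gb = sum(sizes.values())
print(f"{total_gb:.1f} GB")  # prints "107.0 GB"
```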

## Training details

GPT-SW3 was trained for 97651 steps on 16 DGX A100 servers from the Berzelius SuperPOD, using a global batch size of 512 and a maximum sequence length of 2048.
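These figures imply a total training budget of about 102 billion tokens (steps × global batch size × sequence length); a minimal sketch:

```python
# Total tokens seen during training, from the figures above:
# steps * global batch size * maximum sequence length.
steps, batch_size, seq_len = 97651, 512, 2048

tokens = steps * batch_size * seq_len
print(f"{tokens:,} tokens (~{tokens / 1e9:.1f}B)")  # prints "102,394,494,976 tokens (~102.4B)"
```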
