gpt2-dutch-instruct

A GPT-2 small language model (124M parameters) trained from scratch on Dutch text, then adapted to follow instructions with supervised fine-tuning (SFT). It accepts prompts in Dutch and generates Dutch text.

Model details

| Property | Value |
|---|---|
| Architecture | GPT-2 small |
| Parameters | 123.8M |
| Layers | 12 |
| Attention heads | 12 |
| Hidden dimension | 768 |
| Context length | 512 tokens |
| Vocabulary size | 50,000 (Dutch BPE) |
| Weights | fp32 / safetensors (473 MB) |
| Inference speed (CPU) | 0.9 tok/s |
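
The parameter count can be reproduced from the vocabulary size, context length, hidden dimension, and layer count. The sketch below counts parameters for a standard GPT-2 layout, assuming tied input/output embeddings, a 4x MLP expansion, and bias terms on every linear layer. These are the usual GPT-2 defaults rather than something this card states, and the function name is illustrative.

```python
def gpt2_param_count(vocab_size: int, n_ctx: int, d_model: int, n_layer: int) -> int:
    """Count parameters of a standard GPT-2 stack (assumed layout, tied embeddings)."""
    token_emb = vocab_size * d_model   # token embeddings, shared with the output head
    pos_emb = n_ctx * d_model          # learned positional embeddings
    layer_norm = 2 * d_model           # weight + bias

    attn_qkv = d_model * 3 * d_model + 3 * d_model   # fused query/key/value projection
    attn_out = d_model * d_model + d_model           # attention output projection
    mlp_up = d_model * 4 * d_model + 4 * d_model     # MLP expansion
    mlp_down = 4 * d_model * d_model + d_model       # MLP contraction
    block = 2 * layer_norm + attn_qkv + attn_out + mlp_up + mlp_down

    # Embeddings, all transformer blocks, and the final LayerNorm.
    return token_emb + pos_emb + n_layer * block + layer_norm


total = gpt2_param_count(vocab_size=50_000, n_ctx=512, d_model=768, n_layer=12)
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

The number of attention heads does not appear in the formula: splitting the hidden dimension across heads changes how the projections are used, not how many weights they hold.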

Files

| File | Format | Size |
|---|---|---|
| model.safetensors | fp32 | 473 MB |
| dutch-gpt2-f16.gguf | GGUF F16 | 249 MB |
| dutch-gpt2-q8_0.gguf | GGUF Q8_0 | 132 MB |
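
The GGUF sizes follow roughly from the parameter count and the bytes stored per weight. This is a back-of-the-envelope sketch, assuming the 123.8M parameter figure from the model details, 2 bytes per weight for F16, and Q8_0 blocks of 32 int8 weights plus one 16-bit scale (34 bytes per block). File metadata, the embedded tokenizer, and any tensors kept at higher precision are ignored, so real files come out slightly larger.

```python
N_PARAMS = 123.8e6  # parameter count from the model details table

# Approximate bytes per weight for each GGUF tensor type.
BYTES_PER_WEIGHT = {
    "F16": 2.0,
    "Q8_0": 34 / 32,  # 32 int8 weights + one fp16 scale per block
}


def estimate_size_mb(n_params: float, tensor_type: str) -> float:
    """Estimate tensor payload size in MB (10**6 bytes), ignoring file metadata."""
    return n_params * BYTES_PER_WEIGHT[tensor_type] / 1e6


for tensor_type in BYTES_PER_WEIGHT:
    print(f"{tensor_type}: ~{estimate_size_mb(N_PARAMS, tensor_type):.0f} MB")
```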

Use with llama.cpp

# Download
wget https://huggingface.co/Thorstin/gpt2-dutch-instruct/resolve/main/dutch-gpt2-q8_0.gguf

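# Run (-e tells llama-cli to interpret the \n escapes in the prompt;
# newer builds do this by default)
llama-cli -m dutch-gpt2-q8_0.gguf \
  -e \
  -p "### Instructie:\nWat is de hoofdstad van Nederland?\n### Antwoord:\n" \
  -n 200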

Use with Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./dutch-gpt2-q8_0.gguf
TEMPLATE """### Instructie:
{{ .Prompt }}
### Antwoord:
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.3
PARAMETER num_ctx 512
EOF

ollama create dutch-gpt2 -f Modelfile
ollama run dutch-gpt2

Training

Phase 1: Pretraining from scratch

  • Dataset: CC-100 Dutch (~37 GB raw, ~6.6B tokens), streamed
  • Tokenizer: ByteLevel BPE trained on the first 500K CC-100 Dutch documents
  • Hardware: NVIDIA Tesla T4 (16 GB VRAM)
  • Tokens trained: ~5B
  • Steps: 154,000
  • Final loss: 3.54
  • Duration: ~70 GPU hours
  • Key settings: fp16=True, gradient_checkpointing=True, batch_size=32, lr=5e-4, cosine scheduler
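
The step count and token budget can be related with simple arithmetic. The sketch below assumes every sequence is packed to the full 512-token context and that batch_size counts sequences per forward pass. The gradient accumulation factor is not stated in this card, so it is left as a parameter, and the values tried for it are hypothetical.

```python
def tokens_seen(steps: int, batch_size: int, seq_len: int, grad_accum: int = 1) -> int:
    """Tokens processed after `steps` optimizer steps with fully packed sequences."""
    return steps * batch_size * grad_accum * seq_len


STEPS, BATCH_SIZE, SEQ_LEN = 154_000, 32, 512  # values from the card
CORPUS_TOKENS = 6.6e9                          # approximate CC-100 Dutch size from the card

for grad_accum in (1, 2):  # hypothetical values; the card does not state this setting
    total = tokens_seen(STEPS, BATCH_SIZE, SEQ_LEN, grad_accum)
    print(f"grad_accum={grad_accum}: {total / 1e9:.2f}B tokens, "
          f"{total / CORPUS_TOKENS:.2f} passes over the corpus")
```

Under these assumptions an accumulation factor of 2 lands closest to the reported ~5B tokens, but that is an inference from the arithmetic, not a documented setting.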

Phase 2: Instruction fine-tuning (SFT)

  • Dataset: BramVanroy/alpaca-cleaned-dutch (46,163 Dutch instruction/response pairs)
  • Framework: TRL 1.6.0 SFTTrainer
  • Epochs: 3
  • Steps: 4,329
  • Loss: 3.31 → 1.14
  • Duration: ~1.25 hours
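
The reported step count is consistent with the dataset size and epoch count. In the sketch below, the effective batch size of 32 is an assumption, not a value from this card: it was chosen because it reproduces the 4,329 steps. The sketch also assumes no sequence packing and that the last partial batch of each epoch is kept.

```python
import math


def sft_steps(n_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(n_examples / effective_batch) * epochs


# effective_batch=32 is an assumed value; the card does not state it.
print(sft_steps(n_examples=46_163, effective_batch=32, epochs=3))  # 4329
```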

Instruction format

### Instructie:
<vraag of instructie>
### Antwoord:
<antwoord>
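
The template can be built and parsed with two small string helpers. This is a minimal sketch: the helper names are illustrative, and cutting the answer at a follow-up instruction header is a design choice for tidier output rather than part of the format itself.

```python
INSTRUCTION_HEADER = "### Instructie:"
ANSWER_HEADER = "### Antwoord:"


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the template; the model continues after the answer header."""
    return f"{INSTRUCTION_HEADER}\n{instruction.strip()}\n{ANSWER_HEADER}\n"


def extract_answer(generated: str) -> str:
    """Return the text after the first answer header, cut at any follow-up instruction.

    Returns an empty string if the answer header is missing.
    """
    _, _, answer = generated.partition(ANSWER_HEADER)
    answer, _, _ = answer.partition(INSTRUCTION_HEADER)
    return answer.strip()


prompt = build_prompt("Wat is de hoofdstad van Nederland?")
generated = prompt + "De hoofdstad van Nederland is Amsterdam.\n### Instructie:\n..."
print(repr(prompt))
print(extract_answer(generated))
```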

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Thorstin/gpt2-dutch-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

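def chat(instruction: str, max_new_tokens: int = 200) -> str:
    # Wrap the instruction in the template used during fine-tuning.
    prompt = f"### Instructie:\n{instruction}\n### Antwoord:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens so the prompt is not echoed back.
    response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
    # Drop anything after a follow-up "### Instructie:" the model may emit.
    return response.split("### Instructie:")[0].strip()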

print(chat("Wat is de hoofdstad van Nederland?"))

Benchmark results (lm-evaluation-harness, limit=200)

| Task | Accuracy | Accuracy (norm) |
|---|---|---|
| hellaswag_nl | 24.50% | 28.50% |
| arc_nl | 19.00% | 29.00% |
| blimp_nl | 80.67% | 79.51% |

Random baseline: 50% for BLiMP-NL (binary), 25% for HellaSwag/ARC (4-way).
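
Because the evaluation was limited to a small sample, it helps to put an uncertainty estimate next to each score. The sketch below computes a binomial standard error, assuming limit=200 means 200 independently scored examples for each of the two 4-way tasks. That reading is an assumption, and it is not applied to BLiMP-NL here because its sample size may be counted per subtask.

```python
import math


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N = 200          # assumes limit=200 means 200 scored examples per task
BASELINE = 0.25  # 4-way random baseline from the card

for task, acc in [("hellaswag_nl", 0.2450), ("arc_nl", 0.1900)]:
    se = standard_error(acc, N)
    z = (acc - BASELINE) / se  # distance from the baseline in standard errors
    print(f"{task}: acc={acc:.2%} +/- {se:.2%} (z vs. baseline = {z:+.2f})")
```

At this sample size the standard error is about 3 percentage points, so differences of a few points from the 25% baseline should be read with caution.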

Sample outputs

| Prompt | Response |
|---|---|
| Wat is de hoofdstad van Nederland? | De hoofdstad van Nederland is Amsterdam.... |
| Leg uit wat fotosynthese is. | Fotosynthese is het proces waarbij planten lichtenergie van de zon omzetten in chemische energie die ze gebruiken om koo... |
| Schrijf een kort gedicht over de zee. | De golven slaan tegen het raam, Een kalmerende bries draagt de geur van zout en vers gezette koffie. Het geluid van gebr... |

Limitations

  • 124M parameters is a hard ceiling: expect occasional repetition, factual errors, and shorter coherent responses than larger models produce
  • Context window is limited to 512 tokens
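
Since the prompt and the generated text share the 512-token window, the generation length can be clamped to whatever room the prompt leaves. This is a minimal sketch with a hypothetical helper name. It only does the bookkeeping and assumes the caller has already counted the prompt tokens.

```python
CONTEXT_LENGTH = 512  # from the model details table


def clamp_new_tokens(prompt_tokens: int, requested: int, context: int = CONTEXT_LENGTH) -> int:
    """Largest generation length that keeps prompt + output inside the context window."""
    if prompt_tokens >= context:
        raise ValueError(f"prompt uses {prompt_tokens} tokens; the window holds {context}")
    return min(requested, context - prompt_tokens)


print(clamp_new_tokens(prompt_tokens=40, requested=200))   # 200, room to spare
print(clamp_new_tokens(prompt_tokens=400, requested=200))  # 112, clipped to the remaining window
```

In the Usage example, prompt_tokens would be inputs["input_ids"].shape[1], and the result would be passed as max_new_tokens.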

Framework versions

| Package | Version |
|---|---|
| TRL | 1.6.0 |
| Transformers | 4.48 |
| PyTorch | 2.9.1+cu128 |
| Datasets | 2.16 |
| Tokenizers | 0.21 |