Model Card for Qwen3-4B-CPT-Base

Continued pre-trained (CPT) variant of Qwen3-4B-Base, adapted to Indonesian on ~200M domain tokens. Base model — not instruction-tuned.

Model Details

Model Description

Qwen3-4B-CPT-Base extends Qwen/Qwen3-4B-Base with continued pre-training on a ~200M-token Indonesian corpus (news, Wikipedia, social media). The goal is Indonesian-domain adaptation as the foundation for downstream SFT. It is a base model: it performs raw text completion and is not tuned for instruction-following or chat. It is the first stage of the Model Narasi Isu pipeline (CPT -> SFT -> Deployment) for Indonesian public-issue monitoring and narrative analysis.

  • Developed by: AITF UGM 2026
  • Model type: Causal decoder-only LLM (continued pre-training)
  • Language(s) (NLP): Indonesian (Bahasa Indonesia); English technical terms preserved
  • License: Qwen License
  • Finetuned from model: Qwen/Qwen3-4B-Base

Uses

Direct Use

Indonesian-domain text completion. Perplexity benchmarking against vanilla Qwen3 baselines.

Downstream Use

Foundation for supervised fine-tuning (SFT) on Indonesian tasks: summarization, issue narrative analysis (aspect-based sentiment analysis, ABSA), dashboard previews, chatbot Q&A.

Out-of-Scope Use

Not for chat or instruction-following before SFT. Not for high-stakes decisions without human review. Not a safety-aligned assistant.

Bias, Risks, and Limitations

Not instruction-tuned: no reliable JSON, chat, or task behavior. Corpus is news-heavy (70%), so outputs may reflect media and social-media biases. Coverage skews to topics present in the corpus window.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Validate outputs; apply SFT before task deployment.

How to Get Started with the Model

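from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "aitf-ugm-2026/Qwen3-4B-CPT-Base"

# Load the tokenizer and the merged bf16 weights; device_map="auto" places
# the model on the available GPU(s).
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base model: pass raw text to be continued, not a chat-formatted prompt.
prompt = "Ibu kota Indonesia adalah"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64)

# The decoded output contains the prompt followed by the continuation.
print(tok.decode(out[0], skip_special_tokens=True))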

vLLM (use completions endpoint, not chat):

vllm serve aitf-ugm-2026/Qwen3-4B-CPT-Base \
  --gpu-memory-utilization 0.90 --max-model-len 8192
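
As a rough illustration of the "completions, not chat" note, the sketch below posts a raw prompt to the server's OpenAI-compatible /v1/completions route using only the Python standard library. It assumes the server is reachable at localhost:8000 (vLLM's default); the helper names build_completion_payload and complete are hypothetical and not part of this model card's tooling.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running locally on its default port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "aitf-ugm-2026/Qwen3-4B-CPT-Base"


def build_completion_payload(prompt, max_tokens=64, temperature=0.0):
    """Build the JSON body for POST /v1/completions (raw text, no chat template)."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, **kwargs):
    """Send the prompt to the completions endpoint and return the generated text."""
    body = json.dumps(build_completion_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Example (requires the server to be running):
# print(complete("Ibu kota Indonesia adalah"))
```

The payload carries a plain "prompt" string rather than a "messages" list, which is what distinguishes the completions route from the chat route.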

Training Details

Training Data

~200M tokens, group-aware split (train/val/test = 0.99 / 0.005 / 0.005).

Source            Share   Tokens
Berita (news)     70%     ~140M
Wikipedia (id)    20%     ~40M
Social media      10%     ~20M
Total             100%    ~200M

Train split: 325,860 records / ~198M tokens. Test: 1,655 records (news 1,098 / socmed 191 / wiki 366).
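
The card does not say how the group-aware split was implemented. One common way to get this behaviour is to hash a group key so that every record from the same group lands in the same split. The sketch below is an assumption-laden illustration: the group_id key (for example an article or thread identifier) and the assign_split helper are hypothetical, and only the 0.99 / 0.005 / 0.005 ratios come from the card.

```python
import hashlib


def assign_split(group_id, val_frac=0.005, test_frac=0.005):
    """Map a group key to 'train', 'val', or 'test' deterministically.

    All records sharing a group_id get the same split, so near-duplicate
    or related documents cannot leak from train into val/test.
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


# Hypothetical records: two chunks of the same article share a group_id.
records = [
    {"group_id": "article-001", "text": "..."},
    {"group_id": "article-001", "text": "..."},
    {"group_id": "thread-042", "text": "..."},
]
for rec in records:
    rec["split"] = assign_split(rec["group_id"])
```

Because the assignment depends only on the group key, it is reproducible across runs without storing a split file, although the realised ratios only approximate the targets.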

Training Procedure

Preprocessing

Group-aware train/val/test split to avoid leakage between splits. Sequence packing is enabled. Data is processed locally under /content/ (the Colab working directory) before being copied to Google Drive.
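
To make the idea of sequence packing concrete, here is a minimal greedy sketch: tokenized documents are concatenated with an end-of-sequence separator and cut into fixed-length blocks. This is not necessarily how the training stack used here packs sequences; pack_sequences and the toy token ids are hypothetical.

```python
def pack_sequences(docs, max_len, eos_id):
    """Concatenate token-id lists and cut them into blocks of max_len.

    Returns (blocks, leftover): full blocks ready for training and the
    trailing tokens that did not fill a complete block.
    """
    buffer = []
    blocks = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= max_len:
            blocks.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return blocks, buffer


# Toy example with max_len=4 and eos_id=0.
blocks, leftover = pack_sequences([[1, 2, 3], [4, 5], [6]], max_len=4, eos_id=0)
print(blocks)    # [[1, 2, 3, 0], [4, 5, 0, 6]]
print(leftover)  # [0]
```

Packing avoids padding short documents up to the 8192-token maximum, so most of each batch carries real tokens.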

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Method: LoRA, RSLoRA enabled
  • LoRA rank / alpha: 128 / 256
  • Extra modules: embed_tokens, lm_head included
  • LoRA dropout: 0.0
  • Max seq length: 8192
  • Packing: True; 4-bit load: False
  • Epochs: 1
  • Per-device batch: 12; grad accumulation: 16; effective batch: 192
  • Learning rate: 1e-5; embedding LR: 5e-6
  • Scheduler: cosine; warmup ratio: 0.03
  • Optimizer: adamw_8bit; weight decay: 0.01
  • Seed: 3407; early stopping enabled
  • Save format: merged_16bit
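
The arithmetic implied by these settings can be checked with a short script. It is a back-of-the-envelope sketch: it assumes a single GPU, perfectly packed 8192-token sequences, and the approximate 198M-token train split, and it uses the usual scaling definitions (alpha / r for standard LoRA, alpha / sqrt(r) for RSLoRA). The step counts are estimates, not logged values.

```python
import math

per_device_batch = 12
grad_accum = 16
num_devices = 1             # assumption: one A100
max_seq_len = 8192
train_tokens = 198_000_000  # approximate size of the train split
warmup_ratio = 0.03
rank, alpha = 128, 256

# Effective batch in sequences per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices

# Upper bound on tokens per step if every sequence is fully packed.
tokens_per_step = effective_batch * max_seq_len

# Estimated optimizer steps for one epoch, and the warmup share of them.
total_steps = math.ceil(train_tokens / tokens_per_step)
warmup_steps = math.ceil(warmup_ratio * total_steps)

# Adapter scaling factors: standard LoRA vs rank-stabilized LoRA.
standard_scale = alpha / rank
rslora_scale = alpha / math.sqrt(rank)

print(effective_batch, tokens_per_step, total_steps, warmup_steps)
print(standard_scale, round(rslora_scale, 2))
```

Under these assumptions one epoch is roughly 126 optimizer steps of about 1.57M tokens each, with about 4 warmup steps, and RSLoRA scales the adapter update by about 22.6 instead of 2.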

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out test set: 1,655 documents (news 1,098 / socmed 191 / wiki 366).

Factors

Disaggregated by source domain: news, social media, Wikipedia.

Metrics

Perplexity (lower is better). Eval: ~1M tokens, max_length=4096, stride=1024, bf16 / 4-bit.
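
The max_length and stride settings suggest a sliding-window evaluation, in which each window sees up to 4096 tokens of context but only the tokens not already scored by an earlier window contribute to the loss. The sketch below shows only the window bookkeeping and the final exponentiation; the model call is left out, and strided_windows and perplexity are hypothetical helper names rather than the authors' evaluation code.

```python
import math


def strided_windows(seq_len, max_length=4096, stride=1024):
    """Yield (begin, end, n_scored) for each evaluation window.

    n_scored is the number of tokens at the end of the window that have
    not been scored by a previous window.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(total_nll, n_tokens):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(total_nll / n_tokens)


# Bookkeeping check on a 10,000-token sequence: each token is scored once.
windows = list(strided_windows(10_000))
scored = sum(n for _, _, n in windows)
print(len(windows), scored)
```

In a full evaluation, the summed negative log-likelihood of the scored tokens in every window would be accumulated and passed to perplexity together with the total number of scored tokens.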

Results

Model                       Full    News    Socmed  Wiki
Qwen3-4B-CPT-Base (this)    4.561   4.108   4.418   6.492
Qwen3-4B-Base (vanilla)     5.930   5.389   6.438   7.757
Improvement                 ~23%    ~24%    ~31%    ~16%
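
The improvement row can be reproduced from the two perplexity rows as a relative reduction, (baseline - new) / baseline. The snippet below recomputes it from the reported figures; relative_reduction is a hypothetical helper name.

```python
# Perplexities as reported in the results table.
cpt = {"Full": 4.561, "News": 4.108, "Socmed": 4.418, "Wiki": 6.492}
base = {"Full": 5.930, "News": 5.389, "Socmed": 6.438, "Wiki": 7.757}


def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100


improvement = {k: relative_reduction(base[k], cpt[k]) for k in cpt}
for subset, pct in improvement.items():
    print(f"{subset}: {pct:.1f}%")
```

Rounded to whole percentages, the results match the table: 23, 24, 31, and 16.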

Summary

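CPT cuts perplexity by about 23% overall relative to vanilla Qwen3-4B-Base, with the largest gain on social media (about 31%) and the smallest on Wikipedia (about 16%). The model is also reported to beat vanilla Qwen3-8B-Base on all four subsets (the 8B figures are not listed in the results table), which suggests that domain adaptation matters more than raw parameter count for this Indonesian domain.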

Technical Specifications

Model Architecture and Objective

Qwen3 causal decoder-only transformer. Objective: continued causal language-model pre-training (next-token prediction).
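
For reference, the standard form of the next-token objective and its relation to the perplexity metric can be written as follows. The notation (N tokens x_1 ... x_N, parameters θ) is introduced here for illustration and does not come from the original card.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\left(x_t \mid x_{<t}\right),
\qquad
\mathrm{PPL} = \exp\left(\mathcal{L}(\theta)\right)
```

Lower average loss on held-out text therefore translates directly into lower perplexity.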

Compute Infrastructure

Hardware

NVIDIA A100 80GB (Google Colab Pro+).

Software

Unsloth, TRL, HuggingFace Transformers, PEFT, bitsandbytes. Monitoring: WandB.

Citation

BibTeX:

@misc{qwen3_4b_cpt_base,
  title  = {Qwen3-4B-CPT-Base: Indonesian Continued Pre-Training},
  author = {AITF UGM 2026},
  year   = {2026},
  note   = {Model Narasi Isu pipeline}
}

APA:

AITF UGM 2026. (2026). Qwen3-4B-CPT-Base: Indonesian Continued Pre-Training. Model Narasi Isu pipeline.

More Information

Model Narasi Isu: Indonesian public-issue monitoring and narrative analysis pipeline.

Model Card Authors

AITF UGM 2026

Model Card Contact

https://huggingface.co/aitf-ugm-2026
