Thai LLM 928M Base Preview

This is a Thai-focused Qwen2-style causal language model trained from scratch.

It is a base model, not an instruction-tuned or chat model. It is intended for continued pretraining, evaluation, research, and downstream supervised fine-tuning.

Model Details

  • Architecture: Qwen2ForCausalLM
  • Parameters: ~928M
  • Initialization: random (trained from scratch, no pretrained weights)
  • Vocabulary: 32,000-token Thai byte-level BPE tokenizer
  • Context length: 2,048 tokens
  • Hidden size: 2,048
  • Layers: 18
  • Attention heads: 16
  • KV heads: 4
  • Intermediate size: 5,504
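
As a sanity check, the ~928M figure can be reproduced from the dimensions listed above. The sketch below assumes the standard Qwen2 layer layout (biases on the q/k/v projections only, a three-matrix gated MLP, two RMSNorm weights per layer) and untied input and output embeddings; neither assumption is stated in this card, and `count_params` is a hypothetical helper, not part of any library.

```python
def count_params(vocab=32_000, hidden=2_048, layers=18, heads=16,
                 kv_heads=4, intermediate=5_504, tie_embeddings=False):
    """Count parameters of a Qwen2-style decoder (assumed layout, see above)."""
    head_dim = hidden // heads            # 128
    kv_dim = kv_heads * head_dim          # 512, shared across query-head groups

    attn = hidden * hidden + hidden       # q_proj weight + bias
    attn += 2 * (hidden * kv_dim + kv_dim)  # k_proj and v_proj, weight + bias
    attn += hidden * hidden               # o_proj, no bias

    mlp = 3 * hidden * intermediate       # gate_proj, up_proj, down_proj
    norms = 2 * hidden                    # input and post-attention RMSNorm

    embed = vocab * hidden
    lm_head = 0 if tie_embeddings else embed
    final_norm = hidden
    return layers * (attn + mlp + norms) + embed + lm_head + final_norm


untied = count_params()
tied = count_params(tie_embeddings=True)
print(f"untied embeddings: {untied:,}")
print(f"tied embeddings:   {tied:,}")
```

Under these assumptions the untied count is 928,645,120, which matches the stated ~928M; tying the embeddings would give 863,109,120 instead.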

Training Snapshot

  • Checkpoint: best checkpoint by validation loss
  • Step: 5,500
  • Validation loss: 2.51767520904541
  • Training objective: next-token prediction
  • Precision during training: bf16
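
For readers who prefer perplexity, the validation loss converts directly. This small sketch assumes the reported loss is the mean cross-entropy per token in nats, which is the usual convention for next-token prediction but is not stated explicitly here.

```python
import math

val_loss = 2.51767520904541  # validation loss reported at step 5,500

# Assumption: loss is mean cross-entropy per token, measured in nats.
perplexity = math.exp(val_loss)
bits_per_token = val_loss / math.log(2)

print(f"perplexity:     {perplexity:.2f}")
print(f"bits per token: {bits_per_token:.3f}")
```

This gives a per-token perplexity of about 12.4. Because it is measured over this model's own 32,000-token vocabulary, it is not directly comparable to models that use a different tokenizer.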

Training Data

The training configuration included Thai-focused corpora such as:

  • SPAISS6F1/slm-pretrain-corpus
  • SPAISS6F1/Finance
  • pythainlp/thai-wiki-dataset-v3
  • pythainlp/thailaw-v1.0
  • pythainlp/thai-constitution-corpus
  • pythainlp/thai-financial-dataset

Please audit dataset licenses and suitability before commercial use.

Usage

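import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the actual repository id of this model.
model_id = "YOUR_USERNAME/thai-llm-928m-base-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in half precision to reduce memory use.
# On older transformers releases the keyword is torch_dtype instead of dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16)
model.eval()

# This is a base model, so the prompt is a text prefix to continue,
# not an instruction.
prompt = "ประเทศไทยมี"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; no gradients are needed for inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,  # mild penalty against repeated tokens
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding id
    )

# The decoded text includes the prompt followed by the generated continuation.
print(tokenizer.decode(output[0], skip_special_tokens=True))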
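
Because the context length is 2,048 tokens, the prompt plus `max_new_tokens` should stay within that budget. The following is a minimal sketch of one way to enforce this on a plain list of token ids; `fit_prompt` is a hypothetical helper, and keeping the most recent tokens (dropping the oldest) is an assumed policy rather than anything this model requires.

```python
CONTEXT_LENGTH = 2_048  # context length listed under Model Details


def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep only the most recent prompt tokens that leave room for generation."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # A negative slice keeps the last `budget` tokens; shorter prompts
    # are returned unchanged.
    return token_ids[-budget:]


long_prompt = list(range(3_000))          # stand-in for real token ids
trimmed = fit_prompt(long_prompt, max_new_tokens=80)
print(len(trimmed))                       # 2048 - 80 = 1968 tokens kept
```

With a real tokenizer, the ids would come from `tokenizer(prompt)["input_ids"]`, and the trimmed list would be wrapped back into a tensor before calling `generate`.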

Limitations

  • This is an early base checkpoint, not a chat assistant.
  • Generation quality may be unstable or repetitive.
  • The model has not been aligned for safety.
  • The model may hallucinate or produce inappropriate text.
  • It should not be used for high-stakes decisions without further evaluation and alignment.
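
One rough way to quantify the repetition mentioned above is the fraction of distinct n-grams in a generated sequence. The sketch below works on token ids rather than whitespace-separated words, since Thai text is not space-delimited; `distinct_ngram_ratio` is a hypothetical helper, and the choice of n = 3 is arbitrary.

```python
def distinct_ngram_ratio(token_ids, n=3):
    """Return unique n-grams / total n-grams; values near 0 indicate heavy repetition."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 1.0  # sequence too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)


varied = [1, 2, 3, 4, 5]
looping = [7] * 10
print(distinct_ngram_ratio(varied))   # every trigram is unique
print(distinct_ngram_ratio(looping))  # a single trigram repeated
```

A low ratio on sampled outputs suggests adjusting decoding settings such as `repetition_penalty` or `temperature`; it is only a heuristic and says nothing about factual quality.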