IDK-1-Instruct is an instruction-tuned version of IDK-1, a 106M-parameter Indonesian small language model (SLM) trained from scratch.
Part of the I Don't Know (IDK) AI series by Deflated.
| Property | Value |
|---|---|
| Base model | IDK-1 (pre-trained, step 25k) |
| Parameters | 106.24M |
| Architecture | LLaMA-style decoder-only transformer |
| Vocab size | 40,002 (40k BPE + 2 special tokens) |
| Context length | 512 tokens |
| Language | Indonesian (Bahasa Indonesia) |
| License | Apache 2.0 |
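The parameter count in the table can be roughly reproduced from the architecture hyperparameters listed in this card. The sketch below is an estimate, not the author's accounting: it assumes a SwiGLU feed-forward block (three weight matrices), norm layers with a single weight vector each, no bias terms, and tied input/output embeddings. None of these details are confirmed by the card.

```python
# Rough parameter count from the hyperparameters listed in this card.
# Assumptions (not confirmed by the card): SwiGLU feed-forward, weight-only
# norms, no bias terms, tied input/output embeddings.
dim, n_layers, n_heads, n_kv_heads = 768, 12, 12, 4
ffn_dim, vocab_size = 2048, 40_002

head_dim = dim // n_heads        # width of one attention head
kv_dim = n_kv_heads * head_dim   # K/V projection width under GQA

embedding = vocab_size * dim
attention = 2 * dim * dim + 2 * dim * kv_dim   # Wq and Wo, plus Wk and Wv
feed_forward = 3 * dim * ffn_dim               # gate, up, and down projections
norms = 2 * dim                                # two norm weights per layer
per_layer = attention + feed_forward + norms

total = embedding + n_layers * per_layer + dim  # final norm adds `dim`
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

Under these assumptions the estimate comes to 106,238,208, which rounds to the 106.24M reported above.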
dim = 768
n_layers = 12
n_heads = 12
n_kv_heads = 4 (GQA)
ffn_dim = 2048
RoPE theta = 500,000
logit_cap = 30.0 (Gemma 2 style soft-capping)
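As an illustration of the `logit_cap` setting, Gemma 2 style soft-capping squashes logits through a scaled `tanh` so they stay within a fixed range. The sketch below uses plain Python lists; the helper name `soft_cap` is hypothetical, and where exactly IDK-1 applies the cap (presumably the output logits) is an assumption.

```python
import math

LOGIT_CAP = 30.0  # value listed in this card

def soft_cap(logits, cap=LOGIT_CAP):
    """Squash each logit into (-cap, cap) using cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]

# Small logits pass through almost unchanged; large ones saturate near the cap.
capped = soft_cap([-200.0, -1.0, 0.0, 1.0, 200.0])
print([round(v, 3) for v in capped])
```

The function is close to the identity near zero, so typical logits are barely affected, while extreme values are bounded.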
{"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]}
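A record in this format can be checked before training with a few lines of standard-library Python. This is a minimal sketch: it assumes one JSON object per line (JSONL), which the card does not state, and the helper name `parse_record` is hypothetical.

```python
import json

# Roles shown in this card; any other role is treated as an error here.
VALID_ROLES = {"user", "assistant"}

def parse_record(line):
    """Parse one JSON line and check it matches the messages schema."""
    record = json.loads(line)
    messages = record["messages"]
    if not messages:
        raise ValueError("record has no messages")
    for msg in messages:
        if msg["role"] not in VALID_ROLES:
            raise ValueError(f"unexpected role: {msg['role']!r}")
        if not isinstance(msg["content"], str):
            raise ValueError("content must be a string")
    return messages

sample = '{"messages": [{"role": "user", "content": "Halo"}, {"role": "assistant", "content": "Halo juga!"}]}'
messages = parse_record(sample)
print(len(messages), messages[-1]["role"])
```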
| Round | Base | Data | LR | Epochs | Best Val |
|---|---|---|---|---|---|
| v1 | IDK-1 step 25k | 1,390 pairs | 2e-5 | 3 | 3.0506 |
| v2 | IDK-1 step 25k | 3,010 pairs | 3e-5 | 5 | 2.1709 |
| v3 | sft_best v2 | 3,810 pairs | 1e-5 | 3 | 2.0808 |
| v4 | sft_best v3 | 4,810 pairs | 5e-6 | 3 | 1.3670 |
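To make the validation numbers easier to interpret, they can be converted to perplexity. This sketch assumes each "Best Val" value is a mean cross-entropy loss in nats per token, which the card does not state explicitly. If the validation split changed between rounds, the values are not directly comparable.

```python
import math

# Best validation values copied from the table above.
best_val = {"v1": 3.0506, "v2": 2.1709, "v3": 2.0808, "v4": 1.3670}

# Assumption: each value is a mean cross-entropy in nats per token,
# so perplexity is exp(loss).
perplexity = {name: math.exp(loss) for name, loss in best_val.items()}

for name, ppl in perplexity.items():
    print(f"{name}: loss={best_val[name]:.4f}  perplexity={ppl:.2f}")
```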
Training was done on Kaggle (T4 GPU) using PyTorch with loss masking on non-assistant tokens.
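Loss masking on non-assistant tokens can be sketched as follows. This is not the author's training code: the helper name `build_labels` and the toy token ids are hypothetical, and the sketch assumes the usual PyTorch convention of marking ignored positions with `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Only assistant tokens keep their ids as labels; everything else is
    replaced by IGNORE_INDEX so it contributes nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Toy ids for illustration only, not real tokenizer output.
# 40001 stands in for the <|im_end|> id listed in this card.
segments = [("user", [11, 12, 13]), ("assistant", [21, 22, 40001])]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

The one-position shift for next-token prediction is left to the loss computation and is not shown here.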
<|im_start|> → id 40000
<|im_end|> → id 40001
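The two special tokens delimit chat turns. The sketch below renders a `messages` list into text using the same single-turn layout as `build_prompt` in this card; the helper name `render_chat` is hypothetical, and extending the layout to multi-turn conversations is an assumption, since the card only shows single-turn prompts.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render_chat(messages, add_generation_prompt=True):
    """Render role/content messages as text delimited by the special tokens."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

text = render_chat([{"role": "user", "content": "Halo"}])
print(repr(text))
```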
import torch
from tokenizers import Tokenizer
# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
im_start = tokenizer.token_to_id("<|im_start|>")
im_end = tokenizer.token_to_id("<|im_end|>")
def build_prompt(user_message):
    return f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"
# Load model (see IDK-1 repo for model definition)
# model = IDK1Model(IDK1Config())
# ckpt = torch.load("sft_best.pt", map_location="cpu")
# model.load_state_dict(ckpt["model"])
prompt = build_prompt("Jelaskan apa itu kecerdasan buatan dalam 3 poin.")
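Once the prompt is tokenized, a reply can be produced with a decoding loop. The sketch below shows greedy decoding that stops at `<|im_end|>` or when the 512-token context is full. It is an illustration under assumptions: `next_token_logits` is a hypothetical callable standing in for a forward pass of the real model, whose exact interface depends on the IDK-1 model definition, and `fake_logits` is a stand-in used only to make the example runnable.

```python
IM_END_ID = 40001   # <|im_end|> id listed in this card
MAX_CONTEXT = 512   # context length listed in this card

def greedy_generate(next_token_logits, prompt_ids, max_new_tokens=128):
    """Greedy decoding.

    next_token_logits is a hypothetical callable that maps a list of token
    ids to a list of logits over the vocabulary for the next position.
    """
    ids = list(prompt_ids)
    generated = []
    while len(generated) < max_new_tokens and len(ids) < MAX_CONTEXT:
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == IM_END_ID:
            break  # the model closed its turn
        ids.append(next_id)
        generated.append(next_id)
    return generated

def fake_logits(ids):
    """Stand-in model: emits token 7 until 5 ids are present, then <|im_end|>."""
    logits = [0.0] * 40002
    logits[7 if len(ids) < 5 else IM_END_ID] = 1.0
    return logits

print(greedy_generate(fake_logits, [1, 2, 3]))
```

With the real model, the prompt ids would come from `tokenizer.encode(prompt).ids` and the generated ids would be turned back into text with `tokenizer.decode(...)`.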
IDK-1 was built as a learning and portfolio project to demonstrate that an Indonesian SLM can be trained from scratch on commodity hardware (Kaggle free tier).
idk-ai/IDK-1
idk-ai/IDK-1-Instruct-Data
@misc{idk1instruct2026,
  title = {IDK-1-Instruct: Instruction-tuned Indonesian Small Language Model},
  author = {Muhammad Rifky Firmansyah Sujana},
  year = {2026},
  url = {https://huggingface.co/idk-ai/IDK-1-Instruct}
}