IDK-1-Instruct

IDK-1-Instruct is the instruction-tuned version of IDK-1, a 106M-parameter Indonesian small language model (SLM) trained from scratch.

Part of the I Don't Know (IDK) AI series by Deflated.


Model Details

Property        Value
Base model      IDK-1 (pre-trained, step 25k)
Parameters      106.24M
Architecture    LLaMA-style decoder-only transformer
Vocab size      40,002 (40k BPE + 2 special tokens)
Context length  512 tokens
Language        Indonesian (Bahasa Indonesia)
License         Apache 2.0

Architecture Config

dim       = 768
n_layers  = 12
n_heads   = 12
n_kv_heads = 4   (GQA)
ffn_dim   = 2048
RoPE theta = 500,000
logit_cap  = 30.0 (Gemma 2 style soft-capping)
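
As a sanity check on the config above, the sketch below estimates the parameter count and shows the soft-capping formula. It rests on assumptions the card does not state: a SwiGLU feed-forward block with three weight matrices, no bias terms, two RMSNorm weights per layer plus a final norm, and tied input/output embeddings. Under those assumptions the total matches the reported 106.24M, which suggests (but does not prove) that the model is built this way. The function names are illustrative only.

```python
import math

# Values copied from the config above.
DIM, N_LAYERS, N_HEADS, N_KV_HEADS = 768, 12, 12, 4
FFN_DIM, VOCAB_SIZE, LOGIT_CAP = 2048, 40_002, 30.0


def count_params(tied_embeddings: bool = True) -> int:
    """Estimate the parameter count under the assumptions stated above."""
    head_dim = DIM // N_HEADS                   # 64
    kv_dim = N_KV_HEADS * head_dim              # 256: GQA shrinks the K and V projections
    attn = 2 * DIM * DIM + 2 * DIM * kv_dim     # Wq and Wo, plus Wk and Wv
    ffn = 3 * DIM * FFN_DIM                     # gate, up, down (SwiGLU is an assumption)
    norms = 2 * DIM                             # two norm weight vectors per layer (assumed)
    embed = VOCAB_SIZE * DIM
    total = N_LAYERS * (attn + ffn + norms) + embed + DIM  # + final norm
    if not tied_embeddings:
        total += embed                          # separate output projection
    return total


def soft_cap(logit: float, cap: float = LOGIT_CAP) -> float:
    """Gemma 2 style soft-capping: squash a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


print(f"{count_params() / 1e6:.2f}M")  # 106.24M
print(soft_cap(1.0), soft_cap(100.0))  # small logits barely change; large ones stay below 30
```

With 4 KV heads instead of 12, the K and V projections are a third of their full multi-head size, which is where GQA saves parameters and KV-cache memory.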

Training

SFT Data

  • 4,810 instruction pairs in ChatML format
  • Topics: factual Indonesian Q&A, summarization, ELI5 explanations, practical tips, conversations, count-following tasks
  • Format:
{"messages": [
  {"role": "user", "content": "..."},
  {"role": "assistant", "content": "..."}
]}
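
A record in this format can be flattened into a single ChatML training string roughly as follows. This is a minimal sketch: the exact template used during SFT is not given in this card, so the layout here is inferred from the `build_prompt` helper in the Usage section, and `render_chatml` is a hypothetical name.

```python
import json

IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def render_chatml(record: dict, add_generation_prompt: bool = False) -> str:
    """Flatten a {"messages": [...]} record into one ChatML string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n"
        for m in record["messages"]
    ]
    if add_generation_prompt:
        parts.append(f"{IM_START}assistant\n")  # leave the assistant turn open
    return "".join(parts)


# One JSON line from a hypothetical SFT file.
line = '{"messages": [{"role": "user", "content": "Halo"}, {"role": "assistant", "content": "Hai!"}]}'
print(render_chatml(json.loads(line)))
```

Passing `add_generation_prompt=True` with only the user turn yields the same string an inference-time prompt would use.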

SFT Rounds

Round  Base            Data         LR    Epochs  Best Val
v1     IDK-1 step 25k  1,390 pairs  2e-5  3       3.0506
v2     IDK-1 step 25k  3,010 pairs  3e-5  5       2.1709
v3     sft_best v2     3,810 pairs  1e-5  3       2.0808
v4     sft_best v3     4,810 pairs  5e-6  3       1.3670

Training was done on Kaggle (T4 GPU) using PyTorch with loss masking on non-assistant tokens.
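
Loss masking on non-assistant tokens is commonly implemented by replacing the labels of prompt tokens with an ignore index, so that only the assistant's tokens contribute to the loss. The sketch below shows that idea in plain Python. It is an illustration, not the training script used here: `build_labels` and the toy token IDs are hypothetical, and `-100` is chosen because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Build (input_ids, labels) from a list of (token_ids, is_assistant) pairs.

    Tokens from non-assistant segments keep their place in input_ids but get
    IGNORE_INDEX as their label, so they are skipped by the loss.
    """
    input_ids, labels = [], []
    for token_ids, is_assistant in segments:
        input_ids.extend(token_ids)
        if is_assistant:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy example: a 4-token user turn followed by a 3-token assistant reply.
ids, labels = build_labels([([11, 12, 13, 14], False), ([21, 22, 23], True)])
print(ids)     # [11, 12, 13, 14, 21, 22, 23]
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```

For next-token prediction the labels are then shifted by one position against the logits when the loss is computed.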

Special Tokens

<|im_start|>  → id 40000
<|im_end|>    → id 40001

Usage

import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
im_start = tokenizer.token_to_id("<|im_start|>")
im_end   = tokenizer.token_to_id("<|im_end|>")

def build_prompt(user_message):
    return f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"

# Load model (see IDK-1 repo for model definition)
# model = IDK1Model(IDK1Config())
# ckpt = torch.load("sft_best.pt", map_location="cpu")
# model.load_state_dict(ckpt["model"])

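# Prompt text: "Explain what artificial intelligence is in 3 points."
prompt = build_prompt("Jelaskan apa itu kecerdasan buatan dalam 3 poin.")
input_ids = tokenizer.encode(prompt).ids  # token IDs to feed to the model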
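
The snippet above stops at the prompt. One way to turn it into a reply is a greedy decoding loop that ends at `<|im_end|>` or when the 512-token context is full. The sketch below is written against a generic `next_logits` callable, because the forward signature of the IDK-1 model is not shown in this card; `greedy_generate` and `next_logits` are hypothetical names, and the decoding strategy the author actually uses may differ.

```python
from typing import Callable, List, Sequence

MAX_CONTEXT = 512  # context length from Model Details


def greedy_generate(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    stop_id: int,
    max_new_tokens: int = 128,
) -> List[int]:
    """Greedy decoding that stops at stop_id or when the context window is full."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        if len(ids) >= MAX_CONTEXT:
            break  # no room left in the 512-token window
        logits = next_logits(ids)  # scores for the next token, one per vocab entry
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == stop_id:
            break  # the model closed its turn
        ids.append(next_id)
        generated.append(next_id)
    return generated
```

With the real model, `next_logits` would wrap a forward pass that returns the last position's logits, `prompt_ids` would be `tokenizer.encode(prompt).ids`, `stop_id` would be `im_end`, and the reply text would come from `tokenizer.decode(generated)`.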

Limitations

  • Open-ended reasoning: complex topics may drift or produce incoherent output. Root cause: noisy CulturaX pre-training data and the ~100M-parameter ceiling.
  • Knowledge cutoff: pre-trained on Wikipedia ID + CulturaX ID snapshots. No real-time knowledge.
  • Context length: max 512 tokens. Not suitable for long-document tasks.
  • Language: optimized for Indonesian. English or mixed-language prompts may degrade quality.
  • Not for production: this is a research/learning project. Do not use for medical, legal, or safety-critical applications.

What Works Well

  • ✅ Count-following instructions ("Sebutkan 3 hal tentang...", i.e. "Name 3 things about...")
  • ✅ Short factual Q&A in Indonesian
  • ✅ Simple summarization
  • ✅ Practical tips and how-to explanations
  • ✅ Basic conversational responses

Project

IDK-1 was built as a learning and portfolio project to demonstrate training an Indonesian SLM from scratch on commodity hardware (Kaggle free tier).


Citation

@misc{idk1instruct2026,
  title  = {IDK-1-Instruct: Instruction-tuned Indonesian Small Language Model},
  author = {Muhammad Rifky Firmansyah Sujana},
  year   = {2026},
  url    = {https://huggingface.co/idk-ai/IDK-1-Instruct}
}