Bhineka-GPT-500M

Bhineka-GPT-500M is a bilingual Indonesian-English causal language model built with a Llama-style decoder-only architecture. The final pretraining checkpoint contains 556.3M trainable parameters and was validated on 53.7M held-out tokens across English web, Indonesian web, code, and math domains, reaching an overall validation loss of 2.5355 and perplexity of 12.62.

The project is designed as an end-to-end training pipeline, covering dataset download, text cleaning, language filtering, deduplication, tokenizer training, shard building, curriculum sampling, pretraining, supervised fine-tuning, direct preference optimization, and final Hugging Face export.

The model is intended for Indonesian and English text generation tasks such as question answering, summarization, rewriting, translation, technical drafting, code assistance, document understanding, and structured Markdown or JSON-style responses.

Model Details

  • Model type: decoder-only causal language model
  • Architecture: Llama-style Transformer
  • Languages: Indonesian and English
  • Parameters: 556,269,696 in the validated final pretraining checkpoint
  • Context length: 2048 tokens
  • Tokenizer: BPE tokenizer trained for this project
  • Vocabulary size: 64,000 tokens in the current pipeline config
  • Hidden size: 1152
  • Layers: 28
  • Attention heads: 16
  • Key/value heads: 8, using grouped-query attention
  • Feed-forward size: 3072, using SwiGLU-style activation
  • Positional encoding: RoPE
  • Normalization: RMSNorm
  • Precision target: bfloat16
  • Validation checkpoint: checkpoints/pretrain/final
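
The reported parameter count can be cross-checked against the hyperparameters listed above. The sketch below is a back-of-the-envelope calculation, not code from the project. It assumes bias-free linear layers, two RMSNorm weight vectors per layer plus one final norm, a head dimension of hidden size divided by attention heads, and separate (untied) input-embedding and output-projection matrices. None of these details are stated in this card, but under them the arithmetic matches the reported 556,269,696 parameters. The function name `llama_style_param_count` is hypothetical.

```python
def llama_style_param_count(vocab, hidden, layers, heads, kv_heads, ffn,
                            tied_embeddings=False):
    """Count parameters of a Llama-style decoder under the stated assumptions."""
    head_dim = hidden // heads                    # assumed: 1152 / 16 = 72
    q_proj = hidden * heads * head_dim            # query projection
    kv_proj = 2 * hidden * kv_heads * head_dim    # key and value projections (GQA)
    o_proj = heads * head_dim * hidden            # attention output projection
    mlp = 3 * hidden * ffn                        # gate, up, and down projections (SwiGLU)
    norms = 2 * hidden                            # two RMSNorm weight vectors per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms

    embedding = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return embedding + layers * per_layer + final_norm + lm_head


total = llama_style_param_count(
    vocab=64_000, hidden=1152, layers=28, heads=16, kv_heads=8, ffn=3072
)
print(f"{total:,}")  # 556,269,696
```

Under the same assumptions, sharing the embedding and output matrices would give 482,541,696 parameters, so the reported figure is consistent with untied embeddings.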

Training Pipeline

Bhineka-GPT-500M is produced through the following stages:

  1. Dataset download from public Hugging Face datasets
  2. Rule-based cleaning and quality filtering
  3. Exact and MinHash deduplication
  4. BPE tokenizer training
  5. Binary shard creation
  6. Domain-weighted curriculum sampling
  7. Causal language model pretraining
  8. Supervised fine-tuning on instruction/chat datasets
  9. Direct Preference Optimization
  10. Export to Hugging Face format with safetensors
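
As an illustration of stage 3, the following sketch combines exact deduplication (hashing normalized text) with MinHash near-duplicate detection. It is not the project's implementation. The shingle size, number of hash functions, similarity threshold, and all function names are assumptions chosen for the example, and the pairwise signature comparison is quadratic, whereas large-scale pipelines typically use locality-sensitive hashing to avoid comparing every pair.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) of a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=128):
    """For each salted hash function, keep the smallest hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s, digest_size=8, salt=salt).digest() for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates first, then near-duplicates of already kept documents."""
    seen_exact = set()
    kept, kept_signatures = [], []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue  # exact duplicate after normalization
        seen_exact.add(digest)
        signature = minhash_signature(doc)
        if any(estimated_jaccard(signature, other) >= threshold
               for other in kept_signatures):
            continue  # near-duplicate of a kept document
        kept.append(doc)
        kept_signatures.append(signature)
    return kept


base = " ".join(f"kata{i}" for i in range(60))
docs = [
    base,                                         # original
    base + "  ",                                  # exact duplicate after normalization
    base + " tambahan",                           # near-duplicate (one extra word)
    "teks lain yang sama sekali berbeda isinya",  # unrelated document
]
print(len(deduplicate(docs)))
```

In this toy input, the original and the unrelated document are expected to survive, while the whitespace variant and the one-word extension are filtered out.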

Training Data

The current project configuration targets a bilingual and technical mixture with approximately 12.5B total pretraining tokens:

| Domain | Approximate target tokens | Purpose |
|---|---|---|
| English high-quality web | 8.7B | General knowledge, reasoning, writing |
| Indonesian high-quality web | 2.15B | Indonesian language coverage and local text style |
| Code | 1.05B | Python, JavaScript, Go, SQL, and technical generation |
| Math / academic | 600M | Mathematical and academic text exposure |
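
The token targets above imply a fixed sampling proportion for each domain. The sketch below derives those proportions and draws a domain for each training batch. It is a simplified, static illustration: the actual curriculum schedule is not described in this card, and the dictionary keys and function names are hypothetical.

```python
import random

# Approximate target tokens per domain, taken from the table above.
TARGET_TOKENS = {
    "english_web": 8_700_000_000,
    "indonesian_web": 2_150_000_000,
    "code": 1_050_000_000,
    "math_academic": 600_000_000,
}


def mixture_weights(targets):
    """Convert token targets into sampling probabilities that sum to 1."""
    total = sum(targets.values())
    return {name: tokens / total for name, tokens in targets.items()}


def sample_domains(weights, k, seed=0):
    """Draw k domain names in proportion to their weights (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


weights = mixture_weights(TARGET_TOKENS)
for name, weight in weights.items():
    print(f"{name:15s} {weight:.3f}")
# english_web     0.696
# indonesian_web  0.172
# code            0.084
# math_academic   0.048

batch_domains = sample_domains(weights, k=8)
```

A curriculum would typically vary these weights over the course of training; here they stay constant for clarity.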

Main pretraining sources include FineWeb, FineWeb-Edu, CulturaX Indonesian, mC4 Indonesian, Indonesian Wikipedia, GitHub code subsets, and OpenWebMath.

Instruction tuning data configured in the project includes Alpaca-style and chat-style datasets such as Alpaca Cleaned, Dolly 15k, Alpaca Indonesian, Alpaca GPT-4 Indonesian, OpenHermes 2.5, and SlimOrca.

Intended Uses

This model is intended for research, experimentation, and application prototyping in Indonesian-English language tasks, including:

  • General chat and instruction following
  • Indonesian and English question answering
  • Indonesian-English translation
  • Summarization and rewriting
  • Technical explanation and drafting
  • Python, JavaScript, Go, and SQL code assistance
  • Markdown and structured response generation

Out-of-Scope Uses

This model should not be used as the sole source of truth for high-stakes decisions, including medical, legal, financial, safety-critical, or emergency contexts. It should also not be used to generate harmful instructions, impersonation, spam, fraud, or privacy-invasive content.

Limitations

  • The model may hallucinate facts, citations, code behavior, or numerical details.
  • Performance may vary across Indonesian dialects, informal registers, and domain-specific terminology.
  • The model can reflect biases and quality issues present in public web, code, math, and instruction datasets.
  • Smaller language models may struggle with long reasoning chains, complex tool use, and strict factuality.
  • The reported validation-loss results cover language-modeling loss only; broader instruction-following, safety, factuality, and downstream task evaluations are still recommended before production use.

Evaluation

Validation loss was measured with scripts/run_validation_loss.py on the final pretraining checkpoint:

  • Checkpoint: checkpoints/pretrain/final
  • Evaluation date: 2026-05-31
  • Device: CUDA
  • Evaluation dtype: float32
  • Context length: 2048 tokens
  • Batch size: 4
  • Tokens evaluated: 53,678,481
  • Batches evaluated: 6,558
  • Tokenizer vocabulary: 64,000
  • Model vocabulary: 64,000
  • Random-loss baseline: 11.0666
  • Parameter check: 556,269,696 trainable parameters, no non-finite values reported

| Domain | Loss | Perplexity | Tokens | Batches |
|---|---|---|---|---|
| Overall | 2.5355 | 12.6227 | 53,678,481 | 6,558 |
| Code | 1.4304 | 4.1804 | 15,121,189 | 1,847 |
| English high-quality web | 3.1551 | 23.4543 | 20,635,807 | 2,521 |
| Indonesian high-quality web | 2.7256 | 15.2651 | 11,481,623 | 1,403 |
| Math | 2.8062 | 16.5462 | 6,439,862 | 787 |
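
The reported figures are related by simple arithmetic: the random-loss baseline is the natural logarithm of the vocabulary size, perplexity is the exponential of the loss, and the overall loss is consistent with a token-weighted mean of the per-domain losses. The sketch below re-derives these from the reported numbers only. It does not reproduce scripts/run_validation_loss.py, and the variable names are illustrative.

```python
import math

VOCAB_SIZE = 64_000
REPORTED_OVERALL_LOSS = 2.5355

# (loss, tokens) per domain, copied from the reported results.
DOMAINS = {
    "code": (1.4304, 15_121_189),
    "english_web": (3.1551, 20_635_807),
    "indonesian_web": (2.7256, 11_481_623),
    "math": (2.8062, 6_439_862),
}

# A model that predicts a uniform distribution over the vocabulary has loss ln(V).
random_baseline = math.log(VOCAB_SIZE)

# Token-weighted mean of the per-domain losses.
total_tokens = sum(tokens for _, tokens in DOMAINS.values())
weighted_loss = sum(loss * tokens for loss, tokens in DOMAINS.values()) / total_tokens

print(total_tokens)                                # 53678481
print(f"{random_baseline:.4f}")                    # 11.0666
print(f"{weighted_loss:.4f}")                      # 2.5355
print(f"{math.exp(REPORTED_OVERALL_LOSS):.2f}")    # 12.62
```

Because the per-domain losses are rounded to four decimals, the recomputed mean agrees with the reported overall loss only up to rounding.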

Benchmark Comparison

The following table compares Bhineka-GPT with several small open-weight language models in roughly the same parameter range. Read these numbers as a rough orientation rather than a strictly fair leaderboard comparison, because evaluation harness settings, shot count, prompt format, checkpoint type, tokenizer, and instruction-tuning status may differ across sources.

For Bhineka-GPT, ARC, HellaSwag, and WinoGrande were evaluated in 0-shot mode, while GSM8K used 5-shot evaluation.

| Model | Params | ARC | HellaSwag | WinoGrande | GSM8K | Notes |
|---|---|---|---|---|---|---|
| Bhineka-GPT | 556M | 24.83 | 31.58 | 48.86 | 1.90 | 0-shot except GSM8K 5-shot |
| Pythia-410M-deduped | ~410M / 0.5B | 27.90 | 40.04 | 52.09 | 0.00 | Open LLM Leaderboard-style evaluation, mostly few-shot [1] |
| Pythia-1B-deduped | 1B | 29.10 | 49.65 | 53.59 | 1.14 | Larger model, trained with substantially more compute and data [2] |
| TinyLlama-1.1B Chat | 1.1B | 36.09 | 61.10 | 61.25 | n/a | Pretrained on approximately 3T tokens; target training setup reported as 16×A100 for about 90 days [3] |
| TinyLlama 1.1B variant | 1.1B | 30.29 | 55.12 | 55.80 | 0.53 | Fine-tuned variant, Open LLM Leaderboard-style evaluation [4] |
| Qwen2-0.5B | ~0.5B non-embedding | 61.10 | 49.30 | 74.40 | 36.50 | Much more mature model family; not a fair direct comparison against a from-scratch sub-$100 training experiment [5] |
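
One way to put the multiple-choice scores in context is to compare them with random guessing. The sketch below assumes four answer options per question for ARC and HellaSwag and two for WinoGrande; these chance levels are approximations added for illustration and are not taken from the evaluation logs.

```python
# Approximate accuracy (%) of uniform random guessing, under the stated assumptions.
CHANCE = {"ARC": 25.0, "HellaSwag": 25.0, "WinoGrande": 50.0}

# Bhineka-GPT 0-shot scores from the table above.
BHINEKA = {"ARC": 24.83, "HellaSwag": 31.58, "WinoGrande": 48.86}

margins = {task: round(BHINEKA[task] - CHANCE[task], 2) for task in CHANCE}
for task, margin in margins.items():
    print(f"{task:11s} {margin:+.2f} points vs. chance")
# ARC         -0.17 points vs. chance
# HellaSwag   +6.58 points vs. chance
# WinoGrande  -1.14 points vs. chance
```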

Interpretation:

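  • Bhineka-GPT scores below the other listed models on ARC, HellaSwag, and WinoGrande, while its GSM8K score of 1.90 is above the Pythia and fine-tuned TinyLlama entries. It remains a useful research baseline for a from-scratch 556M bilingual Indonesian-English model, especially considering its limited training budget.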
  • Larger or more mature models such as TinyLlama, Pythia-1B, and Qwen2-0.5B benefit from more training tokens, more mature infrastructure, and/or larger-scale optimization.
  • The comparison is most useful for positioning Bhineka-GPT as a lightweight experimental bilingual model, not as a claim of state-of-the-art performance.

The validation-loss results in the Evaluation section measure next-token prediction quality on held-out data. Recommended additional evaluations before release include Indonesian and English instruction-following benchmarks, translation quality checks, summarization and factuality tests, code generation tests, and safety testing.

Usage

After export and upload, the model can be loaded with Transformers. Because this project defines a custom bhineka model architecture, the model repository may need to include the custom modeling files and be loaded with trust_remote_code=True.

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "BhinekaIntiLabs/bhineka-gpt"

# trust_remote_code=True lets Transformers run the custom modeling code shipped in the repo.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# The chat markers are written directly into the prompt, so the tokenizer
# is told not to add further special tokens.
prompt = "<|user|> Jelaskan apa itu deduplikasi data dalam pelatihan model bahasa.<|sep|><|assistant|>"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
    use_cache=False,  # disables the key/value cache during generation
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
completion = outputs[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(completion, skip_special_tokens=True))

The exporter saves the model in Hugging Face format with safetensors, tokenizer files, config files, and generation config.
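
The usage example writes the chat markers by hand. A small helper can make the single-turn format less error-prone. The sketch below mirrors the marker layout of that example; the helper name `build_prompt` is hypothetical, and multi-turn or system-prompt formats are not specified in this card, so they are not covered.

```python
USER_TAG = "<|user|>"
SEP_TAG = "<|sep|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(user_message: str) -> str:
    """Wrap one user message in the single-turn format used in the usage example."""
    return f"{USER_TAG} {user_message.strip()}{SEP_TAG}{ASSISTANT_TAG}"


prompt = build_prompt("Jelaskan apa itu deduplikasi data dalam pelatihan model bahasa.")
print(prompt)
```

The resulting string can be passed to the tokenizer with add_special_tokens=False, as in the usage example.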

License

This model card declares the apache-2.0 license in the Hugging Face metadata. Please ensure that all training data usage, code dependencies, and released model artifacts are compatible with this license before publishing.

Citation

If you use this model or pipeline, cite the project repository:

@software{bhineka_llm_500m,
  title = {Bhineka-GPT-500M},
  author = {Bhineka-GPT contributors},
  year = {2026},
  note = {Bilingual Indonesian-English language model training pipeline}
}