🌌 Kshana-170M Base

Kshana-170M-Base is a compact 170M-parameter causal language model built by Abiray. It continues the architectural lineage of its predecessor, Sutra, and is trained from scratch on a Llama-style architecture with Grouped-Query Attention (GQA) for fast inference.

Despite its compact size, it scores competitively with other sub-500M models on SciQ, PIQA, ARC-Easy, and HellaSwag, making it a practical base for downstream fine-tuning or resource-constrained edge deployment.

Note: As a raw base model, it requires downstream instruction tuning to perform as a conversational chat agent.

🏆 Benchmarks

The base weights were evaluated against other sub-500M models using lm-evaluation-harness in an identical runtime environment. Following common open-source reporting practice, scores use `acc` for the science and single-token knowledge multiple-choice tasks and `acc_norm` (length-normalized accuracy) for the situational context-completion tasks.

| Benchmark | 🌌 Kshana-170M (Ours) | 🪵 SmolLM2-135M | 🌾 Nandi-Mini-150M | 📐 Pythia-160m | 🔹 OPT-125m | 🧮 Cerebras-256M | ⚙️ Pythia-410m |
|---|---|---|---|---|---|---|---|
| Parameters | 169.9M | 135M | 150M | 160M | 125M | 256M | 410M |
| SciQ (Sci) | 81.90% | 84.10% | 89.10% | 55.70% | 78.20% | 75.70% | 80.40% |
| PIQA (Logic) | 66.81% | 68.34% | 65.13% | 59.19% | 62.62% | 61.10% | 66.70% |
| ARC-Easy (Know) | 57.07% | 64.39% | 54.67% | 37.58% | 42.76% | 40.99% | 51.98% |
| HellaSwag (Ctx) | 39.84% | 43.17% | 37.11% | 30.49% | 31.62% | 28.60% | 40.02% |
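
The difference between `acc` and `acc_norm` is easy to miss, so here is a minimal sketch of how the two metrics pick an answer from per-choice log-likelihoods. It mirrors the general idea used by lm-evaluation-harness (normalizing by the character length of each choice), but the function names and the toy numbers are illustrative assumptions, not values from this evaluation.

```python
def pick_acc(logliks):
    """Index of the choice with the highest raw log-likelihood (acc)."""
    return max(range(len(logliks)), key=lambda i: logliks[i])


def pick_acc_norm(logliks, choices):
    """Index of the choice with the highest log-likelihood per character (acc_norm)."""
    return max(range(len(logliks)), key=lambda i: logliks[i] / len(choices[i]))


# Toy example (made-up numbers): the long answer has a lower total
# log-likelihood, but a higher log-likelihood per character.
choices = ["yes", "a much longer answer text"]
logliks = [-4.0, -10.0]

print(pick_acc(logliks))                # raw score favors the short choice
print(pick_acc_norm(logliks, choices))  # length-normalized score favors the long one
```

Raw accuracy tends to favor short completions, which is why length-normalized accuracy is usually reported for tasks whose answer options vary widely in length.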

🧠 Model Architecture

Kshana-170M is based on the LlamaForCausalLM architecture with native Grouped-Query Attention (GQA), which shares each key/value head across several query heads to reduce the KV-cache memory footprint:

| Parameter | Value |
|---|---|
| Parameters | 169,906,752 |
| Hidden size | 576 |
| Layers | 32 |
| Attention heads | 9 |
| KV heads (GQA) | 3 |
| Head dimension | 64 |
| Intermediate size | 1,536 |
| Activation | SwiGLU (`silu`) |
| Max context | 8,192 tokens |
| Vocabulary size | 49,152 |
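
As a sanity check, the parameter count can be reproduced from the other rows of the table. The sketch below assumes a standard Llama layout with no bias terms, one RMSNorm weight vector per norm, and untied input/output embeddings. These assumptions are not stated in the card; they are simply the combination that reproduces the reported total.

```python
def llama_param_count(vocab, hidden, layers, heads, kv_heads, head_dim,
                      intermediate, tie_embeddings=False):
    """Count weights in a Llama-style decoder (assumes no biases, RMSNorm)."""
    embed = vocab * hidden                          # token embedding matrix
    attn = hidden * heads * head_dim                # q_proj
    attn += 2 * hidden * kv_heads * head_dim        # k_proj and v_proj (GQA)
    attn += heads * head_dim * hidden               # o_proj
    mlp = 3 * hidden * intermediate                 # gate, up, and down projections
    norms = 2 * hidden                              # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tie_embeddings:
        total += vocab * hidden                     # separate lm_head (assumed untied)
    return total


# Values copied from the architecture table.
KSHANA = dict(vocab=49_152, hidden=576, layers=32, heads=9,
              kv_heads=3, head_dim=64, intermediate=1_536)

print(f"{llama_param_count(**KSHANA):,}")  # 169,906,752
```

Under these assumptions the two embedding matrices account for about 56.6M of the total, so roughly a third of the weights sit in the vocabulary layers.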

⚙️ Training Configuration

Parameter	Value
Optimizer	AdamW
Learning rate	3e-4
LR scheduler	Cosine Decay
Precision	`bfloat16` / `float16` hybrid
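
For readers unfamiliar with cosine decay, the following sketch shows the shape of the schedule starting from the peak learning rate of 3e-4 listed above. The total step count, the final learning rate (`min_lr`), and the absence of warmup are assumptions made for illustration; the card does not specify them.

```python
import math


def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=0.0):
    """Learning rate at `step` under cosine decay from peak_lr to min_lr.

    total_steps and min_lr are hypothetical; only peak_lr comes from the card.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few points of a hypothetical 1,000-step run.
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches `min_lr` at the final step.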

📚 Training Data

Trained on 65 billion tokens. The corpus combines high-quality deduplicated web extracts, structured synthetic reasoning texts, and educational literature subsets, with a focus on FineWeb-Edu, Wikipedia, and Cosmopedia. Data was deduplicated with MinHash LSH and filtered by language.
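
To illustrate the MinHash LSH deduplication mentioned above, here is a small, self-contained sketch of the technique. It is not the pipeline used to build this corpus: the shingle size, number of hash functions, band count, and all function names are assumptions chosen for readability.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; assumes the text has at least n words."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(shingle_set, num_hashes=128):
    """One minimum per seeded hash function (seed passed as the blake2b salt)."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_keys(signature, bands=32):
    """Split the signature into bands; documents sharing any key are candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "language models are trained on large collections of filtered web documents"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimate_jaccard(sig_a, sig_b))         # high: near-duplicates
print(estimate_jaccard(sig_a, sig_c))         # low: unrelated text
print(bool(lsh_keys(sig_a) & lsh_keys(sig_b)))  # candidate pair for removal
```

Banding is what makes the approach scale: only documents that collide in at least one band are compared directly, so the full corpus never needs an all-pairs comparison.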

🎯 Operational Scope & Intended Use

✅ Targeted Applications

Downstream Fine-Tuning (SFT/DPO): Acts as a clean, lightweight base for training specialized assistants, custom chat agents, or task-specific models.
Local & Edge Deployment: Grouped-Query Attention (GQA) keeps the KV cache small, and the compact weights can be converted and quantized with llama.cpp / GGUF, making the model well suited to low-power hardware such as consumer CPUs, laptops, and mobile devices.
Text Completion & Routing: Well-suited for low-latency text continuation, basic autocomplete features, or classification tasks like routing user queries quickly before passing them to larger models.
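
The memory benefit of GQA for edge deployment can be estimated with a back-of-the-envelope calculation using the architecture values listed earlier (32 layers, 3 KV heads, head dimension 64, 8,192-token context). The sketch assumes a float16 cache (2 bytes per element) and batch size 1; the 9-KV-head baseline is a hypothetical full multi-head variant used only for comparison.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_len


MIB = 1024 * 1024

gqa = kv_cache_bytes(layers=32, kv_heads=3, head_dim=64, context_len=8192)
mha = kv_cache_bytes(layers=32, kv_heads=9, head_dim=64, context_len=8192)  # hypothetical

print(f"GQA (3 KV heads): {gqa / MIB:.0f} MiB")  # 192 MiB
print(f"MHA (9 KV heads): {mha / MIB:.0f} MiB")  # 576 MiB
```

With three query heads sharing each KV head, the cache at full context is one third the size it would be with one KV head per query head.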

❌ Out-of-Scope Limits

Coding & Mathematics: The training data consists strictly of natural-language text (FineWeb-Edu, Wikipedia, and Cosmopedia). Because the model was never exposed to structured math datasets or code repositories, it should not be expected to write or debug code or to carry out mathematical calculations reliably.
Factual Knowledge Retrieval: With a training budget of 65 billion tokens and fewer than 200M parameters, the model lacks the capacity to serve as an open-domain factual encyclopedia. It will hallucinate facts about niche topics unless reference text is provided directly in the prompt (e.g., via RAG).
Interactive Chat (Out of the box): As a raw base model, it will naturally attempt to autocomplete text rather than hold a conversational dialogue. It requires standard instruction fine-tuning before it can be used as a traditional chatbot.
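
Because a base model continues text rather than following instructions, supplying reference text works best when the prompt is laid out as a document to be completed. The helper below is one possible layout, offered as a sketch: the function name and the "Reference / Question / Answer" template are assumptions, not a format the model was trained on.

```python
def build_grounded_prompt(reference_text, question):
    """Lay out reference text and a question so the model completes the answer.

    The template is a hypothetical example; adjust it to your own data.
    """
    return (
        f"Reference:\n{reference_text.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    reference_text="Kshana-170M-Base has 32 layers and a hidden size of 576.",
    question="How many layers does Kshana-170M-Base have?",
)
print(prompt)
```

Ending the prompt with `Answer:` gives the model an obvious continuation point; a small `max_new_tokens` limit helps stop it from drifting into unrelated text afterward.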

🚀 Inference & Edge Deployment

The model loads with the standard Hugging Face `transformers` API. It can also be converted to GGUF and quantized with llama.cpp to run on consumer CPUs or embedded devices, where the GQA layout keeps the KV cache small at long context lengths.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Abiray/Kshana-170M-Base"

# Load the tokenizer that matches the model's vocabulary.
# token=True reads your saved Hugging Face login token.
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load the weights in float16, the dtype they are stored in
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    token=True,
)

# Base model: give it text to continue, not an instruction
prompt = "The basic physical principle behind gravitational collapse is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.6,
        top_p=0.85,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the pad token
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
