🌌 Kshana-170M Base

Kshana-170M-Base is a compact, 170M-parameter foundational causal language model built by Abiray. Continuing the architectural lineage of its predecessor, Sutra, Kshana is trained from scratch on a Llama-style architecture with Grouped-Query Attention (GQA) for fast inference.

Despite its size, it is competitive with other sub-500M models on SciQ, PIQA, ARC-Easy, and HellaSwag, which makes it a practical base for downstream fine-tuning or resource-constrained edge deployment.

Note: As a raw base model, it requires downstream instruction tuning to perform as a conversational chat agent.

🏆 Benchmarks

The base weights were compared against other sub-500M models using lm-evaluation-harness in an identical runtime environment. Scores use the metric commonly reported for each task type: acc for the science and single-token knowledge multiple-choice tasks, and acc_norm (length-normalized accuracy) for the situational context completion task.

| Benchmark | Kshana-170M (Ours) | SmolLM2-135M | Nandi-Mini-150M | Pythia-160m | OPT-125m | Cerebras-256M | Pythia-410m |
|---|---|---|---|---|---|---|---|
| Parameters | 169.9M | 135M | 150M | 160M | 125M | 256M | 410M |
| SciQ (Sci) | 81.90% | 84.10% | 89.10% | 55.70% | 78.20% | 75.70% | 80.40% |
| PIQA (Logic) | 66.81% | 68.34% | 65.13% | 59.19% | 62.62% | 61.10% | 66.70% |
| ARC-Easy (Know) | 57.07% | 64.39% | 54.67% | 37.58% | 42.76% | 40.99% | 51.98% |
| HellaSwag (Ctx) | 39.84% | 43.17% | 37.11% | 30.49% | 31.62% | 28.60% | 40.02% |
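A run like the one described above could be scripted roughly as follows. This is a minimal sketch, not the exact evaluation setup: the task names, the zero-shot setting, the batch size, and the choice of acc for PIQA are assumptions, and `run_eval` and `pick_scores` are hypothetical helper names. Recent harness versions key metrics as `acc,none`; the helper also accepts the plain `acc` key used by older versions.

```python
# Metric reported per task. The PIQA entry is an assumption: the text above
# does not state which metric was used for it.
METRIC_BY_TASK = {
    "sciq": "acc",
    "piqa": "acc",            # assumption
    "arc_easy": "acc",
    "hellaswag": "acc_norm",
}


def run_eval(model_id="Abiray/Kshana-170M-Base"):
    """Run the harness on all four tasks and return its per-task results dict."""
    import lm_eval  # imported lazily so the helper below works without the harness

    output = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=float16",
        tasks=list(METRIC_BY_TASK),
        num_fewshot=0,        # assumption: zero-shot
        batch_size=8,         # assumption
    )
    return output["results"]


def pick_scores(results):
    """Select the reported metric for each task from a harness results dict."""
    scores = {}
    for task, metric in METRIC_BY_TASK.items():
        task_results = results[task]
        # Newer harness versions append the filter name to the metric key.
        value = task_results.get(f"{metric},none", task_results.get(metric))
        scores[task] = value
    return scores
```

With the harness installed and network access available, `pick_scores(run_eval())` would return one score per task as a fraction between 0 and 1.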

🧠 Model Architecture

Kshana-170M uses the LlamaForCausalLM architecture with Grouped-Query Attention (GQA), which shares each key/value head across several query heads to reduce the memory footprint:

| Parameter | Value |
|---|---|
| Parameters | 169,906,752 |
| Hidden size | 576 |
| Layers | 32 |
| Attention heads | 9 |
| KV heads (GQA) | 3 |
| Head dimension | 64 |
| Intermediate size | 1,536 |
| Activation | SwiGLU (silu) |
| Max context | 8,192 tokens |
| Vocabulary size | 49,152 |
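As a sanity check, the parameter count can be reproduced from the other values in the table. The sketch below assumes the usual Llama layout: no bias terms, one RMSNorm weight vector before attention and one before the MLP in each layer, a final norm, and an output head that is not tied to the input embedding. Under those assumptions the arithmetic matches the stated total, which suggests (but does not prove) that the embeddings are untied. `count_parameters` is a hypothetical helper name.

```python
# Values taken from the architecture table.
HIDDEN = 576
LAYERS = 32
HEADS = 9
KV_HEADS = 3
HEAD_DIM = 64
INTERMEDIATE = 1536
VOCAB = 49152


def count_parameters(tie_embeddings=False):
    """Count weights for a bias-free Llama-style decoder (assumed layout)."""
    embed = VOCAB * HIDDEN
    q_proj = HIDDEN * HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM   # K and V use only 3 heads under GQA
    o_proj = HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE              # gate, up, and down projections
    norms = 2 * HIDDEN                           # two RMSNorm weight vectors per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = HIDDEN
    lm_head = 0 if tie_embeddings else VOCAB * HIDDEN
    return embed + LAYERS * per_layer + final_norm + lm_head


print(f"{count_parameters():,}")  # prints 169,906,752
```

Roughly a third of the total (two 28.3M-parameter matrices) sits in the embedding and output head, which is typical for small models with a 49K vocabulary.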

⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| LR scheduler | Cosine decay |
| Precision | bfloat16 / float16 hybrid |
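For readers unfamiliar with cosine decay, the sketch below shows the general shape of such a schedule with the 3e-4 peak from the table. The warmup length, total step count, and minimum learning rate are not stated in this card, so the values used here are placeholders, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 3e-4           # from the table above
MIN_LR = 3e-5            # assumption: floor at 10% of the peak
WARMUP_STEPS = 1_000     # assumption
TOTAL_STEPS = 100_000    # assumption


def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs linearly to the peak, then follows half a cosine wave down to the floor, so most of the decay happens in the middle of training rather than at the start or end.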

📚 Training Data

Trained on 65 billion tokens. The corpus combines high-quality deduplicated web extracts, structured synthetic reasoning texts, and educational literature subsets, drawn mainly from FineWeb-Edu, Wikipedia, and Cosmopedia. The data was deduplicated with MinHash LSH and filtered by language.
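To illustrate the deduplication technique named above, here is a minimal pure-Python sketch of MinHash with LSH banding. It is not the pipeline used for this corpus: the signature length, band count, shingle size, and hash function are all assumptions, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64   # signature length (assumption)
BANDS = 16      # LSH bands, giving 4 rows per band (assumption)
SHINGLE = 5     # word n-gram size (assumption)


def shingles(text, n=SHINGLE):
    """Set of overlapping word n-grams for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, item):
    """64-bit hash of a shingle, salted so each seed acts as a separate hash function."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def signature(text):
    """MinHash signature: the minimum hash of any shingle under each seed."""
    items = shingles(text)
    return [min(_hash(seed, s) for s in items) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs):
    """Document pairs that share at least one LSH band, i.e. likely near-duplicates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = signature(text)
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs
```

In a real pipeline, each candidate pair would then be checked against a similarity threshold with `estimated_jaccard`, and one document of each confirmed pair dropped. Banding avoids comparing every document against every other one.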

🎯 Operational Scope & Intended Use

✅ Targeted Applications

  • Downstream Fine-Tuning (SFT/DPO): Acts as a clean, lightweight base for training specialized assistants, custom chat agents, or task-specific models.
  • Local & Edge Deployment: Grouped-Query Attention (GQA) keeps the KV cache small, and the Llama-style architecture can be quantized via llama.cpp / GGUF, making the model well suited to low-power hardware such as consumer CPUs, laptops, and mobile devices.
  • Text Completion & Routing: Well-suited for low-latency text continuation, basic autocomplete features, or classification tasks like routing user queries quickly before passing them to larger models.
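To make the memory benefit of GQA for edge deployment concrete, the sketch below estimates KV-cache size from the architecture values in this card. It assumes a single sequence, 2-byte (float16) cache entries, and no cache quantization, and it ignores the memory taken by weights and activations. `kv_cache_bytes` is a hypothetical helper name.

```python
# Values taken from the architecture table.
LAYERS = 32
KV_HEADS = 3
QUERY_HEADS = 9
HEAD_DIM = 64
MAX_CONTEXT = 8192


def kv_cache_bytes(num_tokens, kv_heads=KV_HEADS, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (assumed float16)."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * LAYERS * kv_heads * HEAD_DIM * bytes_per_value * num_tokens


gqa = kv_cache_bytes(MAX_CONTEXT)
mha = kv_cache_bytes(MAX_CONTEXT, kv_heads=QUERY_HEADS)  # if every query head had its own KV head
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB")
# prints: GQA: 192 MiB, MHA: 576 MiB
```

With 3 KV heads instead of 9, the cache at the full 8,192-token context is one third the size it would be under standard multi-head attention.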

❌ Out-of-Scope Limits

  • Coding & Mathematics: The model's training data is natural-language text (FineWeb-Edu, Wikipedia, and Cosmopedia). Because it was not trained on dedicated math datasets or code repositories, it should not be expected to write code, debug software, or carry out mathematical calculations reliably.
  • Factual Knowledge Retrieval: With a training budget of 65 billion tokens and fewer than 200M parameters, the model lacks the capacity to serve as an open-domain factual encyclopedia. It will hallucinate facts about niche topics unless reference text is provided directly in the prompt (e.g., via RAG).
  • Interactive Chat (Out of the box): As a raw base model, it will naturally attempt to autocomplete text rather than hold a conversational dialogue. It requires standard instruction fine-tuning before it can be used as a traditional chatbot.

🚀 Inference & Edge Deployment

The model loads with the standard Hugging Face transformers workflow. Its Llama-style architecture can also be converted to GGUF and quantized for llama.cpp, and the GQA layout keeps the KV cache small, which helps throughput on consumer CPUs and embedded devices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Abiray/Kshana-170M-Base"

# Load the tokenizer that matches the model's vocabulary.
# Pass token=True to from_pretrained only if the repository requires authentication;
# without a saved Hugging Face login, token=True raises an error.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in float16, matching the published checkpoint.
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# A base model continues text, so phrase the prompt as the start of a passage.
prompt = "The basic physical principle behind gravitational collapse is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.6,
        top_p=0.85,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```