Kshana-170M Base
Kshana-170M-Base is a compact 170M-parameter foundational causal language model built by Abiray. Following the architectural lineage of its predecessor, Sutra, Kshana is trained from scratch on a Llama-style architecture with Grouped-Query Attention (GQA), which shrinks the key/value cache and speeds up inference.
Despite its compact size, it ranks second or third among the seven sub-500M models compared in the benchmark table on each of the four tasks, making it a practical base for downstream fine-tuning or resource-constrained edge deployment.
Note: As a raw base model, it requires downstream instruction tuning to perform as a conversational chat agent.
Benchmarks
The base weights were evaluated head-to-head against other sub-500M models using lm-evaluation-harness in an identical runtime environment. In line with common open-source reporting practice, each score uses one metric per task: acc for science and single-token knowledge multiple-choice tasks, and acc_norm (length-normalized accuracy) for situational context completions.
| Benchmark | Kshana-170M (Ours) | SmolLM2-135M | Nandi-Mini-150M | Pythia-160m | OPT-125m | Cerebras-256M | Pythia-410m |
|---|---|---|---|---|---|---|---|
| Parameters | 169.9M | 135M | 150M | 160M | 125M | 256M | 410M |
| SciQ (Sci) | 81.90% | 84.10% | 89.10% | 55.70% | 78.20% | 75.70% | 80.40% |
| PIQA (Logic) | 66.81% | 68.34% | 65.13% | 59.19% | 62.62% | 61.10% | 66.70% |
| ARC-Easy (Know) | 57.07% | 64.39% | 54.67% | 37.58% | 42.76% | 40.99% | 51.98% |
| HellaSwag (Ctx) | 39.84% | 43.17% | 37.11% | 30.49% | 31.62% | 28.60% | 40.02% |
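A run like the one described above could be reproduced roughly as follows. This is a sketch, not the author's evaluation script. It assumes lm-evaluation-harness v0.4 or later (its `simple_evaluate` entry point and the `acc,none` / `acc_norm,none` result keys), the harness task names `sciq`, `piqa`, `arc_easy`, and `hellaswag`, and the harness's default few-shot setting, none of which the card states. The metric chosen for PIQA is also a guess, since the card does not place it in either metric group. `run_eval` and `pick_scores` are illustrative names.

```python
# Metric to report per task. SciQ/ARC-Easy -> acc and HellaSwag -> acc_norm follow
# the description above; the PIQA entry is an assumption.
REPORTED_METRIC = {
    "sciq": "acc",
    "arc_easy": "acc",
    "hellaswag": "acc_norm",
    "piqa": "acc",  # assumption: not specified in the card
}


def run_eval(model_id="Abiray/Kshana-170M-Base"):
    """Run the four tasks through lm-evaluation-harness (downloads the model)."""
    import lm_eval  # imported lazily so pick_scores works without it installed

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=float16",
        tasks=list(REPORTED_METRIC),
    )


def pick_scores(results):
    """Extract one percentage per task from a harness results dictionary."""
    scores = {}
    for task, metric in REPORTED_METRIC.items():
        value = results["results"][task][f"{metric},none"]
        scores[task] = round(100 * value, 2)
    return scores
```

Calling `pick_scores(run_eval())` would return a dictionary with one percentage per task. Whether the numbers match the table depends on the harness version, few-shot count, and batch settings actually used.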
Model Architecture
Kshana-170M is based on the LlamaForCausalLM architecture with a native Grouped-Query Attention (GQA) layout that reduces the memory footprint of the key/value cache:
| Parameter | Value |
|---|---|
| Parameters | 169,906,752 |
| Hidden size | 576 |
| Layers | 32 |
| Attention heads | 9 |
| KV heads (GQA) | 3 |
| Head dimension | 64 |
| Intermediate size | 1,536 |
| Activation | SwiGLU (silu) |
| Max Context | 8,192 tokens |
| Vocabulary size | 49,152 |
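As a sanity check, the parameter count can be derived from the table. The sketch below assumes a standard Llama layout that the table does not spell out: bias-free linear layers, two RMSNorm weight vectors per layer plus a final norm, and an output head that is not tied to the input embedding. Under those assumptions the total matches the reported 169,906,752.

```python
# Values from the architecture table.
hidden, layers, heads, kv_heads, head_dim = 576, 32, 9, 3, 64
intermediate, vocab = 1536, 49152


def llama_param_count(tied_embeddings=False):
    """Count weights for a bias-free Llama-style decoder (assumed layout)."""
    embed = vocab * hidden
    q_and_o = 2 * hidden * heads * head_dim      # query and output projections
    k_and_v = 2 * hidden * kv_heads * head_dim   # smaller because of GQA
    mlp = 3 * hidden * intermediate              # gate, up, and down projections
    norms = 2 * hidden                           # two RMSNorm vectors per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return embed + layers * per_layer + final_norm + lm_head


print(f"{llama_param_count():,}")  # 169,906,752
```

With a tied output head the same layout would come to 141,595,200 parameters, so the reported figure suggests separate input and output embedding matrices.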
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| LR scheduler | Cosine Decay |
| Precision | bfloat16 / float16 hybrid |
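For reference, a cosine decay schedule starting from the listed peak learning rate can be written as below. The card gives neither the number of training steps, a warmup phase, nor a final learning rate, so `total_steps`, `warmup_steps`, and `min_lr` are illustrative parameters with placeholder defaults, not the values used in training.

```python
import math

PEAK_LR = 3e-4  # from the table above


def cosine_lr(step, total_steps, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step`: optional linear warmup, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))  # clamp so late steps stay at min_lr
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop, `torch.optim.lr_scheduler.CosineAnnealingLR` provides the same decay curve without the warmup branch.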
Training Data
Trained on 65 billion tokens. The corpus combines high-quality deduplicated web extracts, structured synthetic reasoning texts, and educational literature subsets (primarily FineWeb-Edu, Wikipedia, and Cosmopedia). Data was filtered with MinHash LSH deduplication and language filters.
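The card names MinHash LSH but does not describe the deduplication pipeline. The standard-library sketch below illustrates the general technique only: documents are split into word shingles, each document gets a MinHash signature, and signatures are cut into bands so that documents sharing any band become duplicate candidates. The shingle size, signature length, band count, and function names are arbitrary choices for illustration, not the settings used for this corpus.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # signature length (assumed)
BANDS = 16       # LSH bands (assumed); each band covers NUM_HASHES // BANDS rows


def shingles(text, n=3):
    """Return the set of word n-grams in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(shingle_set):
    """Signature: the minimum hash value under each of NUM_HASHES seeded hashes."""
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(docs):
    """Return index pairs of documents that collide in at least one LSH band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        sig = minhash(shingles(doc))
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Candidate pairs would normally be confirmed with an exact or estimated Jaccard similarity before one copy is dropped. At corpus scale a dedicated library is the usual choice over hand-written code like this.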
Operational Scope & Intended Use
Targeted Applications
- Downstream Fine-Tuning (SFT/DPO): Acts as a clean, lightweight base for training specialized assistants, custom chat agents, or task-specific models.
- Local & Edge Deployment: Grouped-Query Attention (GQA) keeps the KV cache small, and the weights can be quantized (via llama.cpp / GGUF), making the model well suited to low-power hardware such as consumer CPUs, laptops, and mobile devices.
- Text Completion & Routing: Well suited for low-latency text continuation, basic autocomplete features, or classification tasks such as routing user queries before passing them to larger models.
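To make the memory benefit of GQA on edge hardware concrete, the following back-of-the-envelope calculation uses the layer count, KV head count, head dimension, and context length from the architecture table. It assumes a float16 cache (2 bytes per value) and a single sequence, and the 9-KV-head comparison is a hypothetical multi-head-attention variant of the same model, not a released checkpoint.

```python
LAYERS, HEAD_DIM, MAX_CONTEXT = 32, 64, 8192  # from the architecture table


def kv_cache_bytes(kv_heads, tokens=MAX_CONTEXT, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * bytes_per_value * tokens


gqa = kv_cache_bytes(kv_heads=3)  # layout used by Kshana-170M
mha = kv_cache_bytes(kv_heads=9)  # hypothetical: one KV head per attention head
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB")
# GQA: 192 MiB, MHA: 576 MiB
```

At the full 8,192-token context the GQA layout needs one third of the cache memory of the hypothetical full multi-head layout.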
Out-of-Scope Limits
- Coding & Mathematics: The model's training data consists strictly of natural language text (FineWeb-Edu, Wikipedia, and Cosmopedia). Because it was never exposed to structured math datasets or code repositories during training, it cannot reliably write code, debug software, or evaluate mathematical expressions.
- Factual Knowledge Retrieval: With a training budget of 65 billion tokens and fewer than 200M parameters, the model lacks the capacity to serve as an open-domain factual encyclopedia. It will hallucinate facts if asked about niche topics without being provided reference text directly in the prompt (e.g., via RAG).
- Interactive Chat (Out of the box): As a raw base model, it will naturally attempt to autocomplete text rather than hold a conversational dialogue. It requires standard instruction fine-tuning before it can be used as a traditional chatbot.
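Because a base model continues text rather than following instructions, tasks such as the query routing mentioned earlier are usually framed as a pattern for the model to complete. The helper below is one hypothetical way to build such a few-shot prompt. The route labels, example queries, and the `build_routing_prompt` name are invented for illustration and are not part of the model or its training data.

```python
# Hypothetical labeled examples; replace with queries from the target application.
EXAMPLES = [
    ("What time does the store open tomorrow?", "faq"),
    ("My order arrived damaged and I want a refund.", "support"),
    ("Summarize this 40-page contract for me.", "large_model"),
]


def build_routing_prompt(query, examples=EXAMPLES):
    """Format labeled examples followed by the new query, ending at 'Route:'."""
    blocks = [f"Query: {text}\nRoute: {route}\n" for text, route in examples]
    blocks.append(f"Query: {query}\nRoute:")
    return "\n".join(blocks)
```

The resulting string can be passed to `model.generate` with a small `max_new_tokens`, and the first generated word read as the label. How reliably a 170M-parameter model follows the pattern would need to be measured on real queries.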
Inference & Edge Deployment
The model can be loaded in minutes with the standard Hugging Face transformers workflow. Its native GQA layout keeps the KV cache small, and the weights can be quantized (via llama.cpp / GGUF) to run on consumer CPUs or embedded devices with high token throughput.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Abiray/Kshana-170M-Base"

# Load the tokenizer that matches the model's vocabulary.
# token=True sends the locally saved Hugging Face access token.
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load the weights in float16; device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    token=True,
)

prompt = "The basic physical principle behind gravitational collapse is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; gradients are not needed for inference.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.6,
        top_p=0.85,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```