AhiskaAI 25M Base v0.1 (Pre-trained From Scratch)

AhiskaAI 25M Base v0.1 is an ultra-lightweight, foundational causal language model trained entirely from scratch for the Turkish language ecosystem.

With only ~25 million active parameters, this model was trained from random initialization to explore how far Turkish grammar, syntax, and localized semantic knowledge can be compressed in highly constrained computational environments. It serves as the raw, unaligned backbone for the AhıskaAI research pipeline.
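
To make the parameter budget concrete, here is a minimal sketch of how parameters are counted in a GPT-2 style decoder, the architecture the quickstart loads through GPT2LMHeadModel. This card does not list the model's configuration, so the numbers below are the public GPT-2 small values, used only as a sanity check; the function name and its arguments are illustrative, not part of any library.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings."""
    d = n_embd
    # Token and learned position embeddings.
    embeddings = vocab_size * d + n_positions * d
    # One block: two LayerNorms (4d), fused QKV projection (3d^2 + 3d),
    # attention output projection (d^2 + d), and a 4x MLP (4d^2 + 4d, 4d^2 + d).
    per_block = 12 * d * d + 13 * d
    # All blocks plus the final LayerNorm (2d).
    blocks = n_layer * per_block + 2 * d
    return {"embeddings": embeddings, "blocks": blocks, "total": embeddings + blocks}


# Sanity check with the public GPT-2 small configuration (NOT this model's config).
counts = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(counts)
```

In small models the embedding table can take a large share of the total, so it is worth reporting embedding and non-embedding counts separately.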


🧠 Training Profile & Dataset

Despite its miniature footprint, the model captures core Turkish morphology and common factual associations, thanks to a clean, deliberately chosen 5.3 GB base corpus drawn from two sources:

  • CulturaX (Turkish Split): Provides web-scale text that teaches the network common token sequences, basic sentence boundaries, and general Turkish vocabulary.
  • Custom Filtered Wikipedia (75 MB): A curated, hand-filtered subset of Turkish Wikipedia prepared by AhıskaAI to emphasize historical timelines, cultural identity milestones, and factual knowledge. This dense subset is the primary driver behind the model's factual recall despite its size.
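
For a sense of scale, the snippet below works out how much of the corpus the Wikipedia subset represents by size. It assumes the 75 MB subset is counted inside the 5.3 GB total and that sizes use decimal units (1 GB = 1000 MB); neither detail is stated above.

```python
# Sizes as stated in this card. Decimal units (1 GB = 1000 MB) are an assumption.
TOTAL_CORPUS_MB = 5300  # 5.3 GB base corpus
WIKI_SUBSET_MB = 75     # custom filtered Wikipedia

# Fraction of the corpus (by bytes) that comes from the Wikipedia subset,
# assuming the subset is included in the 5.3 GB total.
wiki_share = WIKI_SUBSET_MB / TOTAL_CORPUS_MB
print(f"Wikipedia share of corpus: {wiki_share:.2%}")
```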

💻 Hardware & Infrastructure

  • Hardware: NVIDIA GeForce RTX 4050 Laptop GPU (6GB VRAM)
  • Training Depth: Trained under strict local VRAM constraints with an indie "Build in Public" ethos.
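
As a rough illustration of the VRAM budget, the sketch below estimates the memory taken by weights, gradients, and optimizer state. It assumes 32-bit values and an Adam-style optimizer with two moment buffers per parameter; the optimizer actually used is not stated here, and activations, which depend on batch size and sequence length, are left out.

```python
def training_state_bytes(n_params, bytes_per_value=4, optimizer_buffers=2):
    """Bytes for weights + gradients + optimizer buffers (activations excluded)."""
    # One copy each for weights and gradients, plus the optimizer's buffers
    # (two for Adam-style optimizers: first and second moments).
    copies = 1 + 1 + optimizer_buffers
    return n_params * bytes_per_value * copies


n_params = 25_000_000  # "~25 million" from this card
state_gb = training_state_bytes(n_params) / 1e9
print(f"~{state_gb:.2f} GB of training state before activations")
```

Under these assumptions the persistent training state is small compared with 6 GB, so activations are what mainly constrain batch size and context length.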

🛠️ Quickstart & Inference

You can easily load and run text generation using the Hugging Face transformers library.

from transformers import GPT2LMHeadModel, AutoTokenizer
import torch

model_name = "AhiskaAI/AhiskaAI-25m-Base-v0.1"

# Load model and custom Turkish tokenizer
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure pad token for GPT-2 architecture
tokenizer.pad_token = tokenizer.eos_token

# Sample generation
prompt = "Ahıska Türkleri,"
inputs = tokenizer(prompt, return_tensors="pt")

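with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=100,  # total length in tokens, prompt included
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id,  # explicit pad id avoids a generate() warning
    )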

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
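
To show what temperature, top_k, and top_p do to the next-token distribution, here is a simplified pure-Python sketch run on made-up logits. It mirrors the usual order (temperature, then top-k, then top-p) but is not the transformers implementation; the function names and toy values are illustrative only.

```python
import math
import random


def candidate_tokens(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_id: probability} for tokens that survive top-k and top-p."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # Top-k: keep only the k most likely tokens, then renormalise.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for i in ranked:
        p = weights[i] / total
        kept[i] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered candidates."""
    kept = candidate_tokens(logits, **kwargs)
    ids = list(kept)
    return rng.choices(ids, weights=[kept[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
survivors = candidate_tokens(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(sorted(survivors))
print(sample_token(toy_logits, random.Random(0), temperature=0.7, top_k=3, top_p=0.9))
```

Lowering temperature or top_p narrows the candidate set, which trades diversity for coherence.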

⚠️ Model Limitations & Intended Use

Pure Base Model: This is a raw text completer that has not been aligned with supervised fine-tuning (SFT) or RLHF. It will not act as a conversational chatbot out of the box and may fall into repetitive loops unless sampling parameters are chosen carefully.
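
The quickstart passes no_repeat_ngram_size=2 to discourage such loops. The idea behind that setting can be sketched as follows: before each step, find every token that would complete an n-gram already present in the output and exclude it. This is an illustration of the concept with a hypothetical helper, not the library's code.

```python
def banned_next_tokens(generated, ngram_size=2):
    """Return token ids that would repeat an n-gram already present in `generated`."""
    # With fewer than n tokens there is no complete n-gram to repeat.
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last n-1 tokens form the prefix of the n-gram the next token would complete.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = [5, 7, 9, 5]  # made-up token ids; the sequence currently ends in 5
print(banned_next_tokens(history, ngram_size=2))
```

With ngram_size=2, the pair (5, 7) has already occurred, so 7 may not follow the final 5.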

Intended Use: This model is meant as a starting point for downstream SFT, vocabulary adaptation, or continued pre-training from this checkpoint.
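
For readers planning downstream SFT, the sketch below shows one common way to prepare a training example for a causal language model: prompt tokens are masked out of the loss with -100, the default ignore index of PyTorch's cross-entropy loss, so only the response is learned. The helper name and the token ids are hypothetical, and this is not a description of how the AhıskaAI chat model was trained.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_sft_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; compute loss only on the response and EOS."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    # Prompt positions are masked so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


# Made-up token ids, for illustration only.
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=0)
print(example)
```

Labels are kept aligned with input_ids here because Hugging Face causal LM heads shift them internally when computing the loss.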

For the aligned chat version, please check out: AhiskaAI-25m-Chat-v0.1-Experimental

Safetensors · Model size: 50.5M params · Tensor type: F32