AhiskaAI GPT2 Base 125M (Pre-trained From Scratch)

AhiskaAI GPT2 Base 125M is a foundational, domain-agnostic causal language model trained entirely from scratch for the Turkish language ecosystem.

Rather than copying or fine-tuning existing multilingual weights, this model started from randomly initialized weights and learned Turkish morphology, syntax, and semantics directly from a Turkish raw-text corpus. It serves as the core foundational asset for the AhıskaAI research lab.


🧠 Training Methodology & Dataset

To build a sound foundation for Turkish NLP, the model was pre-trained on a 5.3 GB filtered and deduplicated text corpus drawn from two sources:

  • CulturaX (Turkish Split): Used for web-scale text, giving the model broad world knowledge and conversational fluency.
  • Wikipedia (Turkish): Included for dense, factual, encyclopedic text, with the aim of reducing hallucinations in downstream tasks.
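
The deduplication step mentioned above can be illustrated with a minimal sketch. This is not the AhıskaAI pipeline, whose details are not documented here; it assumes exact-match deduplication on whitespace-normalized, case-folded text, and the helper names `normalize` and `deduplicate` are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and case-fold so trivially different copies collide."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def deduplicate(docs):
    """Yield each document whose normalized content has not been seen before."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Illustrative sample: the second entry differs from the first only in spacing.
sample = [
    "Ankara Türkiye'nin başkentidir.",
    "Ankara  Türkiye'nin başkentidir. ",
    "İstanbul bir şehirdir.",
]
unique = list(deduplicate(sample))
```

Exact hashing only catches identical copies after normalization; near-duplicate detection (for example MinHash) would be a separate, heavier step.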

💻 Hardware Constraints & "Fail Forward" Philosophy

At AhıskaAI, we firmly believe in transparency and the indie "Build in Public" spirit. This model was not trained on expensive industrial server farms; it was hammered out on local, consumer-grade hardware:

  • Hardware: NVIDIA GeForce RTX 4050 Laptop GPU (6GB VRAM)
  • Training Depth: Due to strict local VRAM and hardware limitations, the model was trained for only 0.3 epochs (about 30% of a single pass over the corpus).
  • Current Convergence State: The weights have not converged, but the model has captured fundamental Turkish grammatical structures and token-nesting patterns. It is well suited to checkpoint continuation (resumed training) or downstream supervised fine-tuning (SFT).
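
To give a sense of why 6 GB of VRAM is tight, here is a rough back-of-the-envelope estimate. It assumes float32 training with an Adam-style optimizer (two extra state tensors per parameter); the optimizer actually used is not stated in this card, and activation memory, which depends on batch size and sequence length, is left out. The function name `training_memory_gib` is hypothetical.

```python
def training_memory_gib(n_params: int, bytes_per_value: int = 4) -> float:
    """Estimate memory for weights, gradients, and Adam states (no activations)."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value  # first and second moments
    return (weights + grads + adam_states) / 2**30


# ~125M parameters, as listed in the specifications.
est = training_memory_gib(125_000_000)
print(f"Static training footprint: about {est:.2f} GiB")
```

Under these assumptions the static footprint is a little under 2 GiB, so the remaining VRAM has to cover activations, the CUDA context, and anything else running on the laptop GPU.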

🛠️ Custom Optimized Turkish Tokenizer

Standard English-centric GPT-2 tokenizers tend to break Turkish words into many meaningless, high-loss sub-tokens. This model instead uses a custom BPE (Byte Pair Encoding) tokenizer trained from scratch on our Turkish corpus.

  • Efficiency: Its vocabulary covers common agglutinative Turkish suffixes, which lowers the token-to-word ratio and makes better use of the 512-token context window.
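
The token-to-word ratio can be measured directly. The sketch below is only illustrative: `tokens_per_word` is a hypothetical helper, and `toy_tokenize` is a stand-in that cuts words into 4-character pieces so the example runs without downloading anything. No measured ratio for the real tokenizer is claimed here.

```python
def tokens_per_word(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words if total_words else 0.0


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): split each word into 4-character chunks."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


ratio = tokens_per_word(toy_tokenize, ["evlerimizden geliyoruz"])
print(ratio)
```

To compare real tokenizers, pass the `tokenize` method of a loaded Hugging Face tokenizer in place of `toy_tokenize` and run it over the same Turkish sample for each one.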

📊 Model Architecture & Specifications

  • Architecture: GPT2LMHeadModel (Causal Language Modeling)
  • Parameters: ~125 Million
  • Hidden Size (n_embd): 768
  • Layers (n_layer): 12
  • Attention Heads (n_head): 12
  • Context Length (n_ctx): 512 tokens
  • Precision: float32
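
As a sanity check on the ~125 Million figure, the parameter count of a GPT-2 style model can be derived from the values above. This sketch assumes the standard GPT-2 layout (tied input and output embeddings, learned position embeddings, 4x MLP expansion). The vocabulary size is not listed in this card, so `50_257` below is purely a placeholder borrowed from the original GPT-2 and not the size of the custom tokenizer.

```python
def gpt2_param_count(vocab_size, n_positions=512, n_embd=768, n_layer=12):
    """Count parameters of a standard GPT-2 with tied input/output embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Per block: attention and MLP weights (12 * d^2) plus biases and LayerNorms (13 * d).
    per_block = 12 * n_embd**2 + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_block + final_layer_norm


# Placeholder vocabulary size (assumption); substitute the real tokenizer's size.
total = gpt2_param_count(vocab_size=50_257)
print(f"{total:,} parameters")
```

The number of attention heads does not appear in the formula because splitting the hidden size across heads does not change the size of the projection matrices.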

🛠️ Quickstart Usage

You can load the model for inference or continued pre-training with the Hugging Face transformers library:

from transformers import GPT2LMHeadModel, AutoTokenizer

model_name = "AhiskaAI/ahiska-gpt2-base-125m"

# Load model and custom Turkish tokenizer
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Sample generation
inputs = tokenizer("Ahıska Türkleri,", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
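
For continued pre-training, training sequences must fit within the 512-token context. One common approach, shown here as an assumption rather than the documented AhıskaAI procedure, is to concatenate tokenized documents into one long stream and cut it into fixed-size blocks. The helper name `group_into_blocks` is hypothetical.

```python
def group_into_blocks(token_ids, block_size=512):
    """Split a flat list of token ids into full blocks, dropping the remainder."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# Stand-in for a concatenated stream of token ids from several documents.
stream = list(range(1300))
blocks = group_into_blocks(stream)
print(len(blocks), len(blocks[0]))
```

For causal language modeling, each block typically serves as both the input and the labels, since the model shifts the labels internally when computing the loss.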