AhiskaAI GPT2 Base 125M (Pre-trained From Scratch)
AhiskaAI GPT2 Base 125M is a foundational, domain-agnostic causal language model trained entirely from scratch for the Turkish language ecosystem.
Rather than copying or fine-tuning existing multilingual weights, this model started from randomly initialized weights and learned Turkish morphology, syntax, and semantics directly from a Turkish-language corpus. It serves as the foundation model for the AhıskaAI research lab.
🧠 Training Methodology & Dataset
To build a sound foundation for Turkish NLP, the model was pre-trained on a 5.3 GB text corpus that was filtered, deduplicated, and cleaned, drawn from two sources:
- CulturaX (Turkish Split): Used for web-scale text, giving the model broad coverage of topics and conversational fluency.
- Wikipedia (Turkish): Included for dense, factual, encyclopedic text, with the aim of reducing hallucinations in downstream tasks.
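The card does not describe how the corpus was deduplicated, so the following is only a minimal sketch of one common approach: exact-duplicate removal by hashing whitespace-normalized text. The function names and sample documents are hypothetical and are not taken from the AhıskaAI pipeline.

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse runs of whitespace. Case is left untouched on purpose:
    # Turkish casing (I/ı, İ/i) is locale-sensitive and easy to get wrong.
    return " ".join(text.split())


def dedup_exact(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical sample documents; the second differs from the first only in spacing.
docs = [
    "Ankara Türkiye'nin başkentidir.",
    "Ankara  Türkiye'nin   başkentidir.",
    "İstanbul bir şehirdir.",
]
unique_docs = dedup_exact(docs)
print(len(unique_docs))
```

Exact hashing only catches verbatim repeats; near-duplicate detection (for example MinHash) would be a separate step and is not shown here.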
💻 Hardware Constraints & "Fail Forward" Philosophy
At AhıskaAI, we believe in transparency and the indie "Build in Public" spirit. This model was not trained on expensive industrial server farms; it was hammered out on local, consumer-grade hardware:
- Hardware: NVIDIA GeForce RTX 4050 Laptop GPU (6GB VRAM)
- Training Depth: Due to strict local VRAM and hardware limitations, the model was trained for 0.3 epochs.
- Current Convergence State: The weights have not fully converged, since training covered only 0.3 epochs, but the model has captured fundamental Turkish grammatical structures and token-nesting patterns. It is a suitable starting point for checkpoint continuation (resume training) or for downstream instruction fine-tuning (SFT).
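To make the VRAM constraint concrete, here is a back-of-the-envelope estimate of the memory taken by the training state alone. It assumes an Adam-style optimizer that keeps two float32 moment buffers per parameter; the card does not say which optimizer was used. Activations, which grow with batch size and sequence length, are excluded.

```python
def training_state_gib(n_params, bytes_per_value=4, optimizer_slots=2):
    """Rough memory for weights + gradients + optimizer buffers, in GiB.

    Assumption: every copy stores one value per parameter at the same precision.
    """
    copies = 1 + 1 + optimizer_slots  # weights, gradients, optimizer moment buffers
    return n_params * bytes_per_value * copies / 2**30


est = training_state_gib(125_000_000)  # ~125M parameters in float32
print(f"{est:.2f} GiB")  # prints "1.86 GiB"
```

Under these assumptions the fixed state takes roughly a third of a 6 GB card before any activations are stored, which is one reason batch size and training duration are tightly limited on this class of hardware.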
🛠️ Custom Optimized Turkish Tokenizer
Unlike the standard English-centric GPT-2 tokenizer, which tends to break Turkish words into many short, meaningless sub-tokens, this model uses a custom BPE (Byte Pair Encoding) tokenizer trained from scratch on our Turkish corpus.
- Efficiency: Because its merges were learned from Turkish text, it represents common agglutinative suffixes as single tokens, lowering the token-to-word ratio and fitting more text into the 512-token context window.
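The token-to-word ratio mentioned above can be measured with a few lines of code. The sketch below takes any tokenize function and a list of texts; the toy tokenizer is a hypothetical stand-in so the example runs without downloading anything, and it is not the AhıskaAI tokenizer.

```python
def tokens_per_word(tokenize, texts):
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def toy_tokenize(text):
    # Hypothetical stand-in: cut every word into chunks of at most 4 characters.
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


ratio = tokens_per_word(toy_tokenize, ["evlerimizden geldik"])
print(ratio)  # prints 2.5 (5 chunks over 2 words)
```

To compare real tokenizers, pass the `tokenize` method of a loaded Hugging Face tokenizer in place of `toy_tokenize` and run both candidates over the same held-out Turkish text; a lower ratio means more words fit in the same context window.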
📊 Model Architecture & Specifications
- Architecture: GPT2LMHeadModel (Causal Language Modeling)
- Parameters: ~125 Million
- Hidden Size (n_embd): 768
- Layers (n_layer): 12
- Attention Heads (n_head): 12
- Context Length (n_ctx): 512 tokens
- Precision: float32
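As a sanity check on the ~125 Million figure, the parameter count of a GPT-2 style model can be worked out from the listed dimensions. The card does not state the vocabulary size, so the value 50,257 below (the original GPT-2 vocabulary size) is purely an assumption for illustration. The sketch also assumes the output head shares weights with the token embedding, which is the default for GPT2LMHeadModel.

```python
def gpt2_param_count(vocab_size, n_embd=768, n_layer=12, n_ctx=512):
    """Parameter count for a GPT-2 style decoder with a tied LM head."""
    token_emb = vocab_size * n_embd  # also serves as the LM head (weight tying)
    pos_emb = n_ctx * n_embd         # learned position embeddings
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, each with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)   # two LayerNorms per block (scale + shift)
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * n_embd
    return token_emb + pos_emb + n_layer * per_layer + final_ln


# vocab_size is an assumed value; the real tokenizer's size may differ.
total = gpt2_param_count(vocab_size=50_257)
print(f"{total:,}")  # prints "124,046,592"
```

The number of attention heads does not appear in the formula because the heads split the same 768-wide projections rather than adding parameters. With a vocabulary of this order, the total lands close to the stated ~125 Million.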
🛠️ Quickstart Usage
You can easily load and run inference or continue pre-training using the Hugging Face transformers library:
from transformers import GPT2LMHeadModel, AutoTokenizer
model_name = "AhiskaAI/ahiska-gpt2-base-125m"
# Load model and custom Turkish tokenizer
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Sample generation
inputs = tokenizer("Ahıska Türkleri,", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))