Islam Kathat — 350M Parameter Language Model

A 350 million parameter GPT-style language model trained from scratch by Islam Kathat, an AI/ML Engineer and Full Stack Developer.

This model was built entirely from scratch (architecture design, data pipeline, pretraining, and instruction tuning) as an independent AI/ML research project.

Model Details

Property	Value
Parameters	353.6M
Architecture	GPT (decoder-only transformer)
Layers	24
Attention heads	16
Hidden size	1024
Context length	2048 tokens
Vocabulary	50,304 (GPT-2 tokenizer, padded to multiple of 64)
Normalization	RMSNorm
Attention	Scaled dot-product attention (PyTorch SDPA, FlashAttention kernel where available)
Positional encoding	Learned positional embeddings
Weight tying	Input embedding ↔ output projection
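
The parameter count can be reproduced from the values in the table. The sketch below assumes bias-free linear layers (as in the checkpoint config) and tied input/output embeddings. Under those assumptions the reported 353.6M matches the count that excludes the learned positional embeddings; that reading is an inference from the arithmetic, not something the source states.

```python
# Parameter count for the configuration in the table above.
# Assumes bias=False everywhere and tied input/output embeddings.

def count_parameters(vocab_size, block_size, n_layer, n_embd):
    token_emb = vocab_size * n_embd                 # shared with the output projection
    pos_emb = block_size * n_embd                   # learned positional embeddings
    attn = n_embd * 3 * n_embd + n_embd * n_embd    # fused QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                 # up- and down-projection
    norms = 2 * n_embd                              # two RMSNorm weight vectors per block
    per_block = attn + mlp + norms
    final_norm = n_embd
    total = token_emb + pos_emb + n_layer * per_block + final_norm
    return {"total": total, "non_positional": total - pos_emb}


counts = count_parameters(vocab_size=50304, block_size=2048, n_layer=24, n_embd=1024)
print(f"total:          {counts['total'] / 1e6:.1f}M")
print(f"non-positional: {counts['non_positional'] / 1e6:.1f}M")
```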

Training

Phase 1 — Pretraining

Property	Value
Dataset	FineWeb-Edu (sample-10BT)
Tokens trained	6.83 billion
Steps	6,314
Batch size	512 sequences × 2048 tokens = 1,048,576 (~1M) tokens/step
Optimizer	AdamW (fused), β₁=0.9, β₂=0.95, weight decay=0.1
Learning rate	3e-4 → 3e-5 (cosine decay with linear warmup)
Precision	BF16
Hardware	NVIDIA A100 SXM 80GB
Final val loss	~3.04
Final perplexity	~20.9
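
The learning-rate row describes a cosine decay with linear warmup. A minimal sketch of such a schedule follows; the warmup length is not given in the source, so `warmup_steps=200` is a hypothetical placeholder, and the function name is illustrative rather than taken from the training code.

```python
import math


def lr_at_step(step, max_steps=6314, warmup_steps=200, max_lr=3e-4, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    warmup_steps=200 is a hypothetical value; the source does not state it.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


for step in (0, 199, 200, 3257, 6314):
    print(step, f"{lr_at_step(step):.2e}")
```

The same function could describe the instruction-tuning schedule by passing `max_lr=1e-5`, `min_lr=1e-6`, `max_steps=8000`, and a warmup of zero, assuming that phase used no warmup.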

Phase 2 — Instruction Tuning

Property	Value
Dataset	OpenHermes 2.5
Samples	746,250 instruction-response pairs
Tokens	288M
Steps	8,000
Learning rate	1e-5 → 1e-6 (cosine)
Format	`### Human: {question}\n### Assistant: {answer}<|endoftext|>`
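
To make the format row concrete, here is a small sketch of how one instruction-response pair might be rendered and turned into a training pair. The helper names are hypothetical, and masking the prompt tokens with -100 is an assumption: the source only shows that the loss uses `ignore_index=-100`, not which positions were masked. Token ids in the usage lines are toy values, not real tokenizer output.

```python
IGNORE_INDEX = -100  # same value as ignore_index in the model's cross-entropy call


def format_example(question, answer):
    """Split one pair into prompt and completion text in the format above."""
    prompt = f"### Human: {question}\n### Assistant:"
    completion = f" {answer}<|endoftext|>"
    return prompt, completion


def build_training_pair(prompt_ids, completion_ids):
    """Return (inputs, targets) shifted by one position for next-token prediction.

    Prompt positions are masked from the loss (an assumption, see lead-in).
    """
    ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    # The model's forward pass does not shift internally, so shift here.
    return ids[:-1], labels[1:]


prompt, completion = format_example("Hi", "Hello")
print(repr(prompt + completion))

inputs, targets = build_training_pair([10, 11, 12], [20, 21, 99])
print(inputs)
print(targets)
```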

Phase 3 - Identity Tuning

Property	Value
Dataset	Custom identity dataset
Examples	~936
Tokens	~172K
Purpose	Teach model information about Islam Kathat and personal projects
Learning rate	1e-6
Steps	200

Examples

Prompt: what is Machine Learning?
Model: Machine learning is a subset of artificial intelligence with the objective of improving, verifying or optimizing systems for specific tasks using algorithms that generate unstructured, meaningful output.

Prompt: what is the future of AI?
Model: The future of AI is uncertain. We are seeing great advances in machine learning, artificial intelligence, robotics, and personalized medicine. And we will continue to learn, innovate, and adapt as our skills improve, but I don't want to tell you that AI isn't here yet — because it has been around for a long time.

Quick Start

# ============================================================
# Load and Run Islam Kathat's 350M LLM from Hugging Face
# ============================================================

# !pip install -q torch tiktoken huggingface_hub

import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken
from huggingface_hub import hf_hub_download

# ------------------------------------------------------------
# Download checkpoint from Hugging Face
# ------------------------------------------------------------

MODEL_PATH = hf_hub_download(
    repo_id="FazeFlynn/my-350M-LLM",
    filename="llm-350m.pt"
)

# ------------------------------------------------------------
# Device
# ------------------------------------------------------------

device = "cuda" if torch.cuda.is_available() else "cpu"

# ------------------------------------------------------------
# Load checkpoint
# ------------------------------------------------------------

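# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load(
    MODEL_PATH,
    map_location=device,
    weights_only=False
)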

print("Checkpoint Keys:")
print(ckpt.keys())

config = ckpt["model_config"]

# ------------------------------------------------------------
# Model Definition
# ------------------------------------------------------------

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(
            x.pow(2).mean(-1, keepdim=True) + self.eps
        ) * self.weight


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.n_head = config["n_head"]
        self.n_embd = config["n_embd"]
        self.head_dim = self.n_embd // self.n_head

        self.c_attn = nn.Linear(
            self.n_embd,
            3 * self.n_embd,
            bias=config["bias"]
        )

        self.c_proj = nn.Linear(
            self.n_embd,
            self.n_embd,
            bias=config["bias"]
        )

    def forward(self, x):
        B, T, C = x.shape

        q, k, v = self.c_attn(x).split(
            self.n_embd,
            dim=2
        )

        q = q.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)

        k = k.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)

        v = v.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)

        y = F.scaled_dot_product_attention(
            q,
            k,
            v,
            is_causal=True
        )

        y = (
            y.transpose(1, 2)
            .contiguous()
            .view(B, T, C)
        )

        return self.c_proj(y)


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.c_fc = nn.Linear(
            config["n_embd"],
            4 * config["n_embd"],
            bias=config["bias"]
        )

        self.c_proj = nn.Linear(
            4 * config["n_embd"],
            config["n_embd"],
            bias=config["bias"]
        )

        self.act = nn.GELU()

    def forward(self, x):
        return self.c_proj(
            self.act(self.c_fc(x))
        )


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.ln1 = RMSNorm(config["n_embd"])
        self.attn = CausalSelfAttention(config)

        self.ln2 = RMSNorm(config["n_embd"])
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.wte = nn.Embedding(
            config["vocab_size"],
            config["n_embd"]
        )

        self.wpe = nn.Embedding(
            config["block_size"],
            config["n_embd"]
        )

        self.blocks = nn.ModuleList([
            Block(config)
            for _ in range(config["n_layer"])
        ])

        self.ln_f = RMSNorm(config["n_embd"])

        self.lm_head = nn.Linear(
            config["n_embd"],
            config["vocab_size"],
            bias=False
        )

        self.wte.weight = self.lm_head.weight

    def forward(self, idx):
        B, T = idx.shape

        pos = torch.arange(
            T,
            device=idx.device
        )

        x = self.wte(idx) + self.wpe(pos)

        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)

        logits = self.lm_head(x)

        return logits


# ------------------------------------------------------------
# Create model
# ------------------------------------------------------------

model = GPT(config).to(device)

# ------------------------------------------------------------
# Load weights
# ------------------------------------------------------------

if "model_state_dict" in ckpt:
    model.load_state_dict(ckpt["model_state_dict"])

elif "model" in ckpt:
    model.load_state_dict(ckpt["model"])

else:
    raise ValueError(
        f"Unknown checkpoint keys: {ckpt.keys()}"
    )

model.eval()

# ------------------------------------------------------------
# Tokenizer
# ------------------------------------------------------------

enc = tiktoken.get_encoding("gpt2")

# ------------------------------------------------------------
# Generate Function
# ------------------------------------------------------------

@torch.no_grad()
def generate(
    prompt,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50
):
    formatted = (
        f"### Human: {prompt}\n"
        f"### Assistant:"
    )

    ids = enc.encode(formatted)

    x = torch.tensor(
        [ids],
        dtype=torch.long,
        device=device
    )

    for _ in range(max_new_tokens):

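        # Crop the context to the model's maximum sequence length.
        logits = model(x[:, -config["block_size"]:])

        logits = logits[:, -1, :] / temperature

        # Mask the padded vocabulary entries (ids >= enc.n_vocab),
        # which the tokenizer cannot decode.
        logits[:, enc.n_vocab:] = float("-inf")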

        v, _ = torch.topk(logits, top_k)

        logits[
            logits < v[:, [-1]]
        ] = float("-inf")

        probs = F.softmax(
            logits,
            dim=-1
        )

        next_token = torch.multinomial(
            probs,
            num_samples=1
        )

        x = torch.cat(
            [x, next_token],
            dim=1
        )

        if next_token.item() == enc.eot_token:
            break

    text = enc.decode(
        x[0].tolist()
    )

    return (
        text.split("### Assistant:")[-1]
            .split("<|endoftext|>")[0]
            .strip()
    )

# ------------------------------------------------------------
# Test
# ------------------------------------------------------------

print(generate("Who created you?"))
print()
print(generate("What is machine learning?"))
print()
print(generate("what is the future of Ai?"))
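
The sampling step inside `generate` (temperature scaling, top-k filtering, softmax, then a random draw) can be illustrated without PyTorch. The function below is a plain-Python sketch of that idea, not the code the model runs; `sample_top_k` is a hypothetical name, and it breaks ties at the k-th logit by index, whereas the tensor version above keeps every tied entry.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one index from logits after temperature scaling and top-k filtering."""
    scaled = [value / temperature for value in logits]
    k = min(top_k, len(scaled))
    # Indices of the k highest-scoring entries, best first.
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for stability).
    peak = scaled[kept[0]]
    weights = [math.exp(scaled[i] - peak) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# With top_k=2 only the two largest logits (indices 1 and 3) can be drawn.
print(sample_top_k([0.1, 2.0, -1.0, 0.5], top_k=2))
```

Lower temperatures sharpen the distribution toward the largest logit, and `top_k=1` reduces to greedy decoding.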

Model Architecture

The model uses a standard GPT decoder-only transformer with a few modern changes: RMSNorm in place of LayerNorm, fused scaled dot-product attention, and bias-free linear layers:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm instead of LayerNorm — faster, no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config["n_head"]
        self.n_embd = config["n_embd"]
        self.head_dim = self.n_embd // self.n_head
        self.c_attn = nn.Linear(self.n_embd, 3 * self.n_embd, bias=config["bias"])
        self.c_proj = nn.Linear(self.n_embd, self.n_embd, bias=config["bias"])
    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config["n_embd"], 4 * config["n_embd"], bias=config["bias"])
        self.c_proj = nn.Linear(4 * config["n_embd"], config["n_embd"], bias=config["bias"])
        self.act = nn.GELU()
    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = RMSNorm(config["n_embd"])
        self.attn = CausalSelfAttention(config)
        self.ln2 = RMSNorm(config["n_embd"])
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config["vocab_size"], config["n_embd"])
        self.wpe = nn.Embedding(config["block_size"], config["n_embd"])
        self.blocks = nn.ModuleList([Block(config) for _ in range(config["n_layer"])])
        self.ln_f = RMSNorm(config["n_embd"])
        self.lm_head = nn.Linear(config["n_embd"], config["vocab_size"], bias=False)
        self.wte.weight = self.lm_head.weight  # weight tying
    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
        return logits, loss
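
As a numeric check on what `RMSNorm` computes, the following plain-Python sketch applies the same formula to a two-element vector. It is an illustration only; `rms_norm` is a hypothetical helper, and the real module operates on tensors along the last dimension.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(y)
# With unit weights the output has a mean square close to 1.
print(sum(v * v for v in y) / len(y))
```

Unlike LayerNorm, no mean is subtracted, so the direction of the input vector is preserved and only its scale changes.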

Checkpoint Format

The .pt file is a standard PyTorch checkpoint:

{
    "model_state_dict":     ...,   # model weights
    "optimizer_state_dict": ...,   # AdamW states
    "model_config": {
        "vocab_size":  50304,
        "n_layer":     24,
        "n_head":      16,
        "n_embd":      1024,
        "block_size":  2048,
        "dropout":     0.0,
        "bias":        False,
    },
    "step":          ...,
    "best_val_loss": ...,
}
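
A checkpoint with this layout can be summarized with a few lines of Python. The sketch below runs on a stand-in dict rather than the real file: the `step` and `best_val_loss` values are copied from the pretraining table as placeholders and may differ from what the published checkpoint stores. Perplexity is computed as the exponential of the loss, which assumes the loss is a mean cross-entropy in nats (the PyTorch default).

```python
import math


def summarize_checkpoint(ckpt):
    """Return a few derived quantities from a checkpoint-style dict."""
    cfg = ckpt["model_config"]
    return {
        "head_dim": cfg["n_embd"] // cfg["n_head"],
        "step": ckpt["step"],
        "val_perplexity": math.exp(ckpt["best_val_loss"]),
    }


# Stand-in values; not read from the real .pt file.
example = {
    "model_config": {
        "vocab_size": 50304, "n_layer": 24, "n_head": 16,
        "n_embd": 1024, "block_size": 2048, "dropout": 0.0, "bias": False,
    },
    "step": 6314,
    "best_val_loss": 3.04,
}
summary = summarize_checkpoint(example)
print(summary)
```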

Limitations

Not aligned for safety — no RLHF or DPO alignment has been applied
Factual accuracy — may hallucinate facts, especially specific numbers and dates
English only — trained primarily on English web text
Context length — limited to 2048 tokens
Small scale — 350M parameters is capable but limited compared to larger models

Training Infrastructure

Pretraining: RunPod A100 SXM 80GB (~$35 total compute cost)
Data pipeline: Google Colab (free tier) for download and tokenization
Framework: PyTorch 2.4, custom training loop (no HuggingFace Trainer)
Monitoring: Weights & Biases

About the Author

Islam Kathat — AI/ML Engineer & Full Stack Developer

GitHub: FazeFlynn
LinkedIn: islam-khan
Email: faiz.14a@gmail.com

This model was built as an independent research project to deeply understand the full LLM training pipeline — from raw data to a conversational model.

Citation

@misc{kathat2026llm350m,
  author    = {Islam Kathat},
  title     = {350M Parameter GPT Language Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/FazeFlynn/my-350M-LLM}
}

Intended Use

This model is intended for:

Research
Education
Learning about LLM training
Text generation
Conversational AI

This model is not intended for:

Medical advice
Legal advice
Financial decisions
Safety-critical systems

License

MIT License — free to use, modify, and distribute with attribution.
