Islam Kathat: 350M Parameter Language Model

A 350 million parameter GPT-style language model trained from scratch by Islam Kathat, an AI/ML Engineer and Full Stack Developer.

This model was built entirely from scratch, covering architecture design, data pipeline, pretraining, and instruction tuning, as an independent AI/ML research project.


Model Details

Property             Value
Parameters           353.6M
Architecture         GPT (decoder-only transformer)
Layers               24
Attention heads      16
Hidden size          1024
Context length       2048 tokens
Vocabulary           50,304 (GPT-2 tokenizer, padded to a multiple of 64)
Normalization        RMSNorm
Attention            Flash Attention (scaled dot-product)
Positional encoding  Learned positional embeddings
Weight tying         Input embedding ↔ output projection
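
Two of the values in the table follow from simple arithmetic. The sketch below assumes the unpadded GPT-2 vocabulary size of 50,257; the helper name pad_to_multiple is illustrative and is not taken from the training code.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


GPT2_VOCAB = 50_257           # assumption: unpadded GPT-2 BPE vocabulary size
padded_vocab = pad_to_multiple(GPT2_VOCAB, 64)

n_embd, n_head = 1024, 16     # hidden size and attention heads from the table
head_dim = n_embd // n_head   # width of each attention head

print(padded_vocab, head_dim)  # prints: 50304 64
```

Padding the vocabulary to a multiple of 64 is a common way to keep the embedding and output matrices friendly to GPU kernels; the extra ids are never produced by the tokenizer.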

Training

Phase 1 - Pretraining

Property          Value
Dataset           FineWeb-Edu (sample-10BT)
Tokens trained    6.83 billion
Steps             6,314
Batch size        512 sequences × 2048 tokens = ~1M tokens/step
Optimizer         AdamW (fused), β₁=0.9, β₂=0.95, weight decay=0.1
Learning rate     3e-4 → 3e-5 (cosine decay with linear warmup)
Precision         BF16
Hardware          NVIDIA A100 SXM 80GB
Final val loss    ~3.04
Final perplexity  ~20.9
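
The learning-rate schedule and two derived numbers from the table can be sketched in a few lines. This is a minimal illustration, not the actual training code: the warmup length is not stated in this model card, so warmup_steps=200 is a placeholder assumption, and the function name lr_at is hypothetical.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, max_steps=6314):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    warmup_steps is an assumed value; the other defaults come from the table.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


# Tokens consumed per optimizer step: sequences x tokens per sequence
tokens_per_step = 512 * 2048

# Perplexity is the exponential of the (natural-log) cross-entropy loss
perplexity = math.exp(3.04)

print(tokens_per_step, round(perplexity, 1))
print(lr_at(0), lr_at(200), lr_at(6314))
```

The batch-size row corresponds to 1,048,576 tokens per step, and exp(3.04) reproduces the reported perplexity of about 20.9.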

Phase 2 - Instruction Tuning

Property       Value
Dataset        OpenHermes 2.5
Samples        746,250 instruction-response pairs
Tokens         288M
Steps          8,000
Learning rate  1e-5 → 1e-6 (cosine)
Format         ### Human: {question}\n### Assistant: {answer}<|endoftext|>
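
As an illustration of the format row, the template can be written as two small helpers: one for a full training example and one for the prompt used at inference time. The function names are hypothetical and this is only a sketch of the string layout, not the author's data pipeline.

```python
EOT = "<|endoftext|>"  # end-of-text marker from the GPT-2 tokenizer


def format_example(question: str, answer: str) -> str:
    """Full training string: question, answer, then the end-of-text marker."""
    return f"### Human: {question}\n### Assistant: {answer}{EOT}"


def format_prompt(question: str) -> str:
    """Inference prompt: stops right after the assistant tag, no trailing space."""
    return f"### Human: {question}\n### Assistant:"


example = format_example("What is 2 + 2?", "4")
prompt = format_prompt("What is 2 + 2?")
print(repr(example))
print(repr(prompt))
```

The inference prompt is a strict prefix of the training string, so the model continues from the same position it saw during tuning.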

Phase 3 - Identity Tuning

Property       Value
Dataset        Custom identity dataset
Examples       ~936
Tokens         ~172K
Purpose        Teach model information about Islam Kathat and personal projects
Learning rate  1e-6
Steps          200

Examples

Prompt: what is Machine Learning?
Model: Machine learning is a subset of artificial intelligence with the objective of improving, verifying or optimizing systems for specific tasks using algorithms that generate unstructured, meaningful output.

Prompt: what is the future of AI?
Model: The future of AI is uncertain. We are seeing great advances in machine learning, artificial intelligence, robotics, and personalized medicine. And we will continue to learn, innovate, and adapt as our skills improve, but I don't want to tell you that AI isn't here yet - because it has been around for a long time.


Quick Start

# ============================================================
# Load and Run Islam Kathat's 350M LLM from Hugging Face
# ============================================================

# !pip install -q torch tiktoken huggingface_hub

import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken
from huggingface_hub import hf_hub_download

# ------------------------------------------------------------
# Download checkpoint from Hugging Face
# ------------------------------------------------------------

MODEL_PATH = hf_hub_download(
    repo_id="FazeFlynn/my-350M-LLM",
    filename="llm-350m.pt"
)

# ------------------------------------------------------------
# Device
# ------------------------------------------------------------

device = "cuda" if torch.cuda.is_available() else "cpu"

# ------------------------------------------------------------
# Load checkpoint
# ------------------------------------------------------------

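# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints from a source you trust
ckpt = torch.load(
    MODEL_PATH,
    map_location=device,
    weights_only=False
)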

print("Checkpoint Keys:")
print(ckpt.keys())

config = ckpt["model_config"]

# ------------------------------------------------------------
# Model Definition
# ------------------------------------------------------------

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(
            x.pow(2).mean(-1, keepdim=True) + self.eps
        ) * self.weight


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.n_head = config["n_head"]
        self.n_embd = config["n_embd"]
        self.head_dim = self.n_embd // self.n_head

        self.c_attn = nn.Linear(
            self.n_embd,
            3 * self.n_embd,
            bias=config["bias"]
        )

        self.c_proj = nn.Linear(
            self.n_embd,
            self.n_embd,
            bias=config["bias"]
        )

    def forward(self, x):
        B, T, C = x.shape

        q, k, v = self.c_attn(x).split(
            self.n_embd,
            dim=2
        )

        q = q.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)

        k = k.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)

        v = v.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)

        y = F.scaled_dot_product_attention(
            q,
            k,
            v,
            is_causal=True
        )

        y = (
            y.transpose(1, 2)
            .contiguous()
            .view(B, T, C)
        )

        return self.c_proj(y)


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.c_fc = nn.Linear(
            config["n_embd"],
            4 * config["n_embd"],
            bias=config["bias"]
        )

        self.c_proj = nn.Linear(
            4 * config["n_embd"],
            config["n_embd"],
            bias=config["bias"]
        )

        self.act = nn.GELU()

    def forward(self, x):
        return self.c_proj(
            self.act(self.c_fc(x))
        )


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.ln1 = RMSNorm(config["n_embd"])
        self.attn = CausalSelfAttention(config)

        self.ln2 = RMSNorm(config["n_embd"])
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.wte = nn.Embedding(
            config["vocab_size"],
            config["n_embd"]
        )

        self.wpe = nn.Embedding(
            config["block_size"],
            config["n_embd"]
        )

        self.blocks = nn.ModuleList([
            Block(config)
            for _ in range(config["n_layer"])
        ])

        self.ln_f = RMSNorm(config["n_embd"])

        self.lm_head = nn.Linear(
            config["n_embd"],
            config["vocab_size"],
            bias=False
        )

        self.wte.weight = self.lm_head.weight

    def forward(self, idx):
        B, T = idx.shape

        pos = torch.arange(
            T,
            device=idx.device
        )

        x = self.wte(idx) + self.wpe(pos)

        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)

        logits = self.lm_head(x)

        return logits


# ------------------------------------------------------------
# Create model
# ------------------------------------------------------------

model = GPT(config).to(device)

# ------------------------------------------------------------
# Load weights
# ------------------------------------------------------------

if "model_state_dict" in ckpt:
    model.load_state_dict(ckpt["model_state_dict"])

elif "model" in ckpt:
    model.load_state_dict(ckpt["model"])

else:
    raise ValueError(
        f"Unknown checkpoint keys: {ckpt.keys()}"
    )

model.eval()

# ------------------------------------------------------------
# Tokenizer
# ------------------------------------------------------------

enc = tiktoken.get_encoding("gpt2")

# ------------------------------------------------------------
# Generate Function
# ------------------------------------------------------------

@torch.no_grad()
def generate(
    prompt,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50
):
    formatted = (
        f"### Human: {prompt}\n"
        f"### Assistant:"
    )

    ids = enc.encode(formatted)

    x = torch.tensor(
        [ids],
        dtype=torch.long,
        device=device
    )

    for _ in range(max_new_tokens):

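        # Crop the context to the model's block size (2048 tokens)
        logits = model(x[:, -config["block_size"]:])

        # Keep only the last position and apply temperature
        logits = logits[:, -1, :] / temperature

        # The vocabulary is padded to 50,304; ids at or above enc.n_vocab
        # have no tokenizer entry, so mask them before sampling
        logits[:, enc.n_vocab:] = float("-inf")

        # Keep only the top_k most likely tokens
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))

        logits[
            logits < v[:, [-1]]
        ] = float("-inf")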

        probs = F.softmax(
            logits,
            dim=-1
        )

        next_token = torch.multinomial(
            probs,
            num_samples=1
        )

        x = torch.cat(
            [x, next_token],
            dim=1
        )

        if next_token.item() == enc.eot_token:
            break

    text = enc.decode(
        x[0].tolist()
    )

    return (
        text.split("### Assistant:")[-1]
            .split("<|endoftext|>")[0]
            .strip()
    )

# ------------------------------------------------------------
# Test
# ------------------------------------------------------------

print(generate("Who created you?"))
print()
print(generate("What is machine learning?"))
print()
print(generate("what is the future of Ai?"))
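
The sampling step inside generate (temperature scaling, top-k filtering, then a random draw from the remaining tokens) can be illustrated without PyTorch. The following is a simplified sketch on a plain list of logits; sample_top_k is an illustrative name and is not part of the script above.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one index from logits after temperature scaling and top-k filtering."""
    scaled = [value / temperature for value in logits]

    # Threshold is the k-th largest value; ties with it are kept
    k = min(top_k, len(scaled))
    kth_largest = sorted(scaled, reverse=True)[k - 1]
    kept = [(i, value) for i, value in enumerate(scaled) if value >= kth_largest]

    # Softmax over the surviving logits (subtract the max for stability)
    peak = max(value for _, value in kept)
    ids = [i for i, _ in kept]
    weights = [math.exp(value - peak) for _, value in kept]

    return rng.choices(ids, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same sequence
toy_logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(toy_logits, top_k=2, rng=rng))
```

Lower temperatures sharpen the distribution toward the highest logit, and top_k=1 reduces to greedy decoding.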

Model Architecture

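The model uses a standard GPT decoder-only transformer with a few modern refinements: RMSNorm in place of LayerNorm, attention through PyTorch's fused scaled_dot_product_attention, bias-free linear layers, and tied input/output embeddings: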

class RMSNorm(nn.Module):
    """RMSNorm instead of LayerNorm: faster, no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config["n_head"]
        self.n_embd = config["n_embd"]
        self.head_dim = self.n_embd // self.n_head
        self.c_attn = nn.Linear(self.n_embd, 3 * self.n_embd, bias=config["bias"])
        self.c_proj = nn.Linear(self.n_embd, self.n_embd, bias=config["bias"])
    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config["n_embd"], 4 * config["n_embd"], bias=config["bias"])
        self.c_proj = nn.Linear(4 * config["n_embd"], config["n_embd"], bias=config["bias"])
        self.act = nn.GELU()
    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = RMSNorm(config["n_embd"])
        self.attn = CausalSelfAttention(config)
        self.ln2 = RMSNorm(config["n_embd"])
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config["vocab_size"], config["n_embd"])
        self.wpe = nn.Embedding(config["block_size"], config["n_embd"])
        self.blocks = nn.ModuleList([Block(config) for _ in range(config["n_layer"])])
        self.ln_f = RMSNorm(config["n_embd"])
        self.lm_head = nn.Linear(config["n_embd"], config["vocab_size"], bias=False)
        self.wte.weight = self.lm_head.weight  # weight tying
    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
        return logits, loss
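
To make the "no mean subtraction" remark on RMSNorm concrete, here is a dependency-free sketch of the same computation on a plain Python list. It mirrors the formula in the class above but is only an illustration; the function name rms_norm is not part of the model code.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])

# After normalization the root mean square is ~1, but the mean is not zero
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
mean_after = sum(y) / len(y)
print(y, rms_after, mean_after)
```

Unlike LayerNorm, the output keeps a non-zero mean because the input is only rescaled, never re-centered.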

Checkpoint Format

The .pt file is a standard PyTorch checkpoint:

{
    "model_state_dict":     ...,   # model weights
    "optimizer_state_dict": ...,   # AdamW states
    "model_config": {
        "vocab_size":  50304,
        "n_layer":     24,
        "n_head":      16,
        "n_embd":      1024,
        "block_size":  2048,
        "dropout":     0.0,
        "bias":        False,
    },
    "step":          ...,
    "best_val_loss": ...,
}
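
The model_config values are enough to reproduce the parameter count by hand. The sketch below assumes bias=False and tied input/output embeddings, as shown in the config and architecture above; count_parameters is a hypothetical helper. The 353.6M figure in Model Details appears to correspond to the count without the learned positional embeddings; that reading is an inference from the arithmetic, not something the model card states.

```python
def count_parameters(cfg: dict) -> dict:
    """Count weights for the GPT layout above (bias-free, tied embeddings)."""
    d, n_layer = cfg["n_embd"], cfg["n_layer"]

    wte = cfg["vocab_size"] * d   # token embedding, shared with lm_head
    wpe = cfg["block_size"] * d   # learned positional embedding
    attn = d * 3 * d + d * d      # c_attn + c_proj
    mlp = d * 4 * d + 4 * d * d   # c_fc + c_proj
    norms = 2 * d                 # two RMSNorm gains per block
    per_block = attn + mlp + norms

    total = wte + wpe + n_layer * per_block + d  # + final RMSNorm gain
    return {"total": total, "without_positional": total - wpe}


config = {
    "vocab_size": 50304,
    "n_layer": 24,
    "n_head": 16,
    "n_embd": 1024,
    "block_size": 2048,
    "bias": False,
}
counts = count_parameters(config)
print(counts)  # {'total': 355648512, 'without_positional': 353551360}
```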

Limitations

  • Not aligned for safety: no RLHF or DPO alignment has been applied
  • Factual accuracy: may hallucinate facts, especially specific numbers and dates
  • English only: trained primarily on English web text
  • Context length: limited to 2048 tokens
  • Small scale: 350M parameters is capable but limited compared to larger models

Training Infrastructure

  • Pretraining: RunPod A100 SXM 80GB (~$35 total compute cost)
  • Data pipeline: Google Colab (free tier) for download and tokenization
  • Framework: PyTorch 2.4, custom training loop (no HuggingFace Trainer)
  • Monitoring: Weights & Biases

About the Author

Islam Kathat, AI/ML Engineer & Full Stack Developer

This model was built as an independent research project to deeply understand the full LLM training pipeline, from raw data to a conversational model.


Citation

@misc{kathat2026llm350m,
  author    = {Islam Kathat},
  title     = {350M Parameter GPT Language Model},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/FazeFlynn/my-350M-LLM}
}

Intended Use

This model is intended for:

  • Research
  • Education
  • Learning about LLM training
  • Text generation
  • Conversational AI

This model is not intended for:

  • Medical advice
  • Legal advice
  • Financial decisions
  • Safety-critical systems

License

MIT License: free to use, modify, and distribute with attribution.
