Islam Kathat - 350M Parameter Language Model
A 350 million parameter GPT-style language model trained from scratch by Islam Kathat, an AI/ML Engineer and Full Stack Developer.
This model was built entirely from scratch (architecture design, data pipeline, pretraining, and instruction tuning) as an independent AI/ML research project.
Model Details
| Property | Value |
|---|---|
| Parameters | 353.6M |
| Architecture | GPT (decoder-only transformer) |
| Layers | 24 |
| Attention heads | 16 |
| Hidden size | 1024 |
| Context length | 2048 tokens |
| Vocabulary | 50,304 (GPT-2 tokenizer, padded to multiple of 64) |
| Normalization | RMSNorm |
| Attention | Flash Attention (scaled dot-product) |
| Positional encoding | Learned positional embeddings |
| Weight tying | Input embedding ↔ output projection |
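As a rough cross-check of the table, the parameter count can be derived from the listed dimensions. The sketch below is not the author's code: `count_params` is a hypothetical helper, and it assumes bias-free linear layers and a 4x MLP expansion, as in the model definition and checkpoint config given later in this card. Under those assumptions the arithmetic lands on 353.6M only when the learned positional embeddings are left out of the count (a common nanoGPT-style convention); that reading is an inference from the numbers, not something the card states.

```python
# Hypothetical helper: counts the parameters implied by the Model Details table.
# Assumes bias-free linear layers, a 4x MLP expansion, and tied embeddings.

def count_params(vocab_size=50304, n_layer=24, n_embd=1024, block_size=2048):
    token_emb = vocab_size * n_embd                # shared with the output projection
    pos_emb = block_size * n_embd                  # learned positional embeddings
    attn = 3 * n_embd * n_embd + n_embd * n_embd   # fused QKV + output projection
    mlp = 2 * 4 * n_embd * n_embd                  # up- and down-projection
    norms = 2 * n_embd                             # two RMSNorm weight vectors per block
    final_norm = n_embd                            # RMSNorm before the output head
    total = token_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm
    return {"total": total, "without_pos_emb": total - pos_emb}

counts = count_params()
print(f"{counts['total'] / 1e6:.1f}M")            # 355.6M
print(f"{counts['without_pos_emb'] / 1e6:.1f}M")  # 353.6M
```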
Training
Phase 1 - Pretraining
| Property | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens trained | 6.83 billion |
| Steps | 6,314 |
| Batch size | 512 sequences × 2048 tokens = ~1M tokens/step |
| Optimizer | AdamW (fused), β₁=0.9, β₂=0.95, weight decay=0.1 |
| Learning rate | 3e-4 → 3e-5 (cosine decay with linear warmup) |
| Precision | BF16 |
| Hardware | NVIDIA A100 SXM 80GB |
| Final val loss | ~3.04 |
| Final perplexity | ~20.9 |
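The learning-rate schedule and the loss-to-perplexity relationship in the table can be sketched in a few lines. This is a minimal illustration, not the training code: `lr_at` is a hypothetical function, and `WARMUP_STEPS` is an assumed value because the card does not give the warmup length. The peak and final learning rates, the step count, the batch shape, and the validation loss are taken from the table.

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5   # from the table
TOTAL_STEPS = 6314            # from the table
WARMUP_STEPS = 200            # assumption: the warmup length is not stated

def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)

tokens_per_step = 512 * 2048    # 1,048,576 tokens, i.e. ~1M
perplexity = math.exp(3.04)     # perplexity is exp(cross-entropy loss), about 20.9

print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))
print(tokens_per_step, round(perplexity, 1))
```

The perplexity line is the only relationship here that is fixed by definition; the shape of the warmup is one plausible reading of "cosine decay with linear warmup".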
Phase 2 - Instruction Tuning
| Property | Value |
|---|---|
| Dataset | OpenHermes 2.5 |
| Samples | 746,250 instruction-response pairs |
| Tokens | 288M |
| Steps | 8,000 |
| Learning rate | 1e-5 → 1e-6 (cosine) |
| Format | `### Human: {question}\n### Assistant: {answer}<\|endoftext\|>` |
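The Format row can be expressed as a small pair of helpers. This is a sketch under assumptions: `format_example` and `format_prompt` are hypothetical names, and the card does not say how the training pipeline actually assembled its samples.

```python
EOT = "<|endoftext|>"  # GPT-2 end-of-text marker, as written in the Format row

def format_example(question: str, answer: str) -> str:
    """One training sample: prompt, response, then the end-of-text marker."""
    return f"### Human: {question}\n### Assistant: {answer}{EOT}"

def format_prompt(question: str) -> str:
    """Inference-time prompt: stop after the assistant tag so the model continues."""
    return f"### Human: {question}\n### Assistant:"

print(format_example("What is 2 + 2?", "4"))
```

When such a string is encoded with tiktoken, the literal `<|endoftext|>` has to be permitted explicitly, for example `enc.encode(text, allowed_special={"<|endoftext|>"})`; alternatively, encode the text without the marker and append `enc.eot_token`.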
Phase 3 - Identity Tuning
| Property | Value |
|---|---|
| Dataset | Custom identity dataset |
| Examples | ~936 |
| Tokens | ~172K |
| Purpose | Teach model information about Islam Kathat and personal projects |
| Learning rate | 1e-6 |
| Steps | 200 |
Examples
Prompt: what is Machine Learning?
Model: Machine learning is a subset of artificial intelligence with the objective of improving, verifying or optimizing systems for specific tasks using algorithms that generate unstructured, meaningful output.
Prompt: what is the future of AI?
Model: The future of AI is uncertain. We are seeing great advances in machine learning, artificial intelligence, robotics, and personalized medicine. And we will continue to learn, innovate, and adapt as our skills improve, but I don't want to tell you that AI isn't here yet - because it has been around for a long time.
Quick Start
# ============================================================
# Load and Run Islam Kathat's 350M LLM from Hugging Face
# ============================================================
# !pip install -q torch tiktoken huggingface_hub
import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken
from huggingface_hub import hf_hub_download

# ------------------------------------------------------------
# Download checkpoint from Hugging Face
# ------------------------------------------------------------
MODEL_PATH = hf_hub_download(
    repo_id="FazeFlynn/my-350M-LLM",
    filename="llm-350m.pt"
)

# ------------------------------------------------------------
# Device
# ------------------------------------------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"

# ------------------------------------------------------------
# Load checkpoint
# ------------------------------------------------------------
# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load(
    MODEL_PATH,
    map_location=device,
    weights_only=False
)
print("Checkpoint Keys:")
print(ckpt.keys())
config = ckpt["model_config"]

# ------------------------------------------------------------
# Model Definition
# ------------------------------------------------------------
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(
            x.pow(2).mean(-1, keepdim=True) + self.eps
        ) * self.weight


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config["n_head"]
        self.n_embd = config["n_embd"]
        self.head_dim = self.n_embd // self.n_head
        # Fused projection producing Q, K and V in one matmul
        self.c_attn = nn.Linear(
            self.n_embd,
            3 * self.n_embd,
            bias=config["bias"]
        )
        self.c_proj = nn.Linear(
            self.n_embd,
            self.n_embd,
            bias=config["bias"]
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(
            self.n_embd,
            dim=2
        )
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)
        k = k.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)
        v = v.view(
            B, T, self.n_head, self.head_dim
        ).transpose(1, 2)
        y = F.scaled_dot_product_attention(
            q,
            k,
            v,
            is_causal=True
        )
        # Merge heads back into (B, T, C)
        y = (
            y.transpose(1, 2)
            .contiguous()
            .view(B, T, C)
        )
        return self.c_proj(y)


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(
            config["n_embd"],
            4 * config["n_embd"],
            bias=config["bias"]
        )
        self.c_proj = nn.Linear(
            4 * config["n_embd"],
            config["n_embd"],
            bias=config["bias"]
        )
        self.act = nn.GELU()

    def forward(self, x):
        return self.c_proj(
            self.act(self.c_fc(x))
        )


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = RMSNorm(config["n_embd"])
        self.attn = CausalSelfAttention(config)
        self.ln2 = RMSNorm(config["n_embd"])
        self.mlp = MLP(config)

    def forward(self, x):
        # Pre-norm residual connections
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(
            config["vocab_size"],
            config["n_embd"]
        )
        self.wpe = nn.Embedding(
            config["block_size"],
            config["n_embd"]
        )
        self.blocks = nn.ModuleList([
            Block(config)
            for _ in range(config["n_layer"])
        ])
        self.ln_f = RMSNorm(config["n_embd"])
        self.lm_head = nn.Linear(
            config["n_embd"],
            config["vocab_size"],
            bias=False
        )
        # Weight tying: token embedding shares the output projection matrix
        self.wte.weight = self.lm_head.weight

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(
            T,
            device=idx.device
        )
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        return logits

# ------------------------------------------------------------
# Create model
# ------------------------------------------------------------
model = GPT(config).to(device)

# ------------------------------------------------------------
# Load weights
# ------------------------------------------------------------
if "model_state_dict" in ckpt:
    model.load_state_dict(ckpt["model_state_dict"])
elif "model" in ckpt:
    model.load_state_dict(ckpt["model"])
else:
    raise ValueError(
        f"Unknown checkpoint keys: {ckpt.keys()}"
    )
model.eval()

# ------------------------------------------------------------
# Tokenizer
# ------------------------------------------------------------
enc = tiktoken.get_encoding("gpt2")

# ------------------------------------------------------------
# Generate Function
# ------------------------------------------------------------
@torch.no_grad()
def generate(
    prompt,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50
):
    # Same prompt template as used for instruction tuning
    formatted = (
        f"### Human: {prompt}\n"
        f"### Assistant:"
    )
    ids = enc.encode(formatted)
    x = torch.tensor(
        [ids],
        dtype=torch.long,
        device=device
    )
    for _ in range(max_new_tokens):
        # Crop to the 2048-token context window
        logits = model(x[:, -2048:])
        logits = logits[:, -1, :] / temperature
        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, top_k)
        logits[
            logits < v[:, [-1]]
        ] = float("-inf")
        probs = F.softmax(
            logits,
            dim=-1
        )
        next_token = torch.multinomial(
            probs,
            num_samples=1
        )
        x = torch.cat(
            [x, next_token],
            dim=1
        )
        # Stop once the model emits <|endoftext|>
        if next_token.item() == enc.eot_token:
            break
    text = enc.decode(
        x[0].tolist()
    )
    return (
        text.split("### Assistant:")[-1]
        .split("<|endoftext|>")[0]
        .strip()
    )

# ------------------------------------------------------------
# Test
# ------------------------------------------------------------
print(generate("Who created you?"))
print()
print(generate("What is machine learning?"))
print()
print(generate("what is the future of Ai?"))
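
The sampling step inside `generate` (temperature scaling, top-k filtering, softmax, then a random draw) can be illustrated without PyTorch. The sketch below is a plain-Python approximation for explanation only; `top_k_probs` and `sample` are hypothetical names, and unlike the Quick Start code it clamps `top_k` to the number of logits.

```python
import math
import random

def top_k_probs(logits, temperature=0.8, top_k=50):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    scaled = [l / temperature for l in logits]
    k = min(top_k, len(scaled))
    threshold = sorted(scaled, reverse=True)[k - 1]   # k-th largest logit
    kept = [s if s >= threshold else float("-inf") for s in scaled]
    peak = max(kept)                                  # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in kept]         # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=0.8, top_k=50, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = top_k_probs([1.0, 2.0, 3.0, 4.0], temperature=1.0, top_k=2)
print(probs)  # only the two largest logits keep non-zero probability
```

Lower temperatures sharpen the distribution toward the largest logit, while a smaller `top_k` removes the long tail of unlikely tokens before sampling.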
Model Architecture
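The model uses a standard GPT decoder-only transformer (pre-norm blocks, tied embeddings) with two modern changes: RMSNorm in place of LayerNorm and PyTorch's fused scaled dot-product attention: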
class RMSNorm(nn.Module):
    """RMSNorm instead of LayerNorm: faster, no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config["n_head"]
        self.n_embd = config["n_embd"]
        self.head_dim = self.n_embd // self.n_head
        self.c_attn = nn.Linear(self.n_embd, 3 * self.n_embd, bias=config["bias"])
        self.c_proj = nn.Linear(self.n_embd, self.n_embd, bias=config["bias"])

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config["n_embd"], 4 * config["n_embd"], bias=config["bias"])
        self.c_proj = nn.Linear(4 * config["n_embd"], config["n_embd"], bias=config["bias"])
        self.act = nn.GELU()

    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = RMSNorm(config["n_embd"])
        self.attn = CausalSelfAttention(config)
        self.ln2 = RMSNorm(config["n_embd"])
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config["vocab_size"], config["n_embd"])
        self.wpe = nn.Embedding(config["block_size"], config["n_embd"])
        self.blocks = nn.ModuleList([Block(config) for _ in range(config["n_layer"])])
        self.ln_f = RMSNorm(config["n_embd"])
        self.lm_head = nn.Linear(config["n_embd"], config["vocab_size"], bias=False)
        self.wte.weight = self.lm_head.weight  # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
        return logits, loss
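
To make the "no mean subtraction" point concrete, the RMSNorm computation can be written out on a plain list of numbers. This is an illustrative re-statement of the `RMSNorm.forward` formula above, not part of the model code; `rms_norm` is a hypothetical name.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(out)                  # approximately [0.8485, 1.1314]
print(sum(out) / len(out))  # the mean is not forced to zero, unlike LayerNorm
```

The output has a root-mean-square of about 1 but keeps its original offset, which is the difference from LayerNorm that the docstring refers to.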
Checkpoint Format
The .pt file is a standard PyTorch checkpoint:
{
    "model_state_dict": ...,      # model weights
    "optimizer_state_dict": ...,  # AdamW states
    "model_config": {
        "vocab_size": 50304,
        "n_layer": 24,
        "n_head": 16,
        "n_embd": 1024,
        "block_size": 2048,
        "dropout": 0.0,
        "bias": False,
    },
    "step": ...,
    "best_val_loss": ...,
}
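
Because the checkpoint also carries the optimizer state, a smaller inference-only file can be derived from it. The sketch below is an assumption about how one might do that, not something the repository provides; `strip_for_inference` is a hypothetical helper, and it is shown on a stand-in dictionary so it runs without downloading the model.

```python
def strip_for_inference(ckpt: dict) -> dict:
    """Keep only the entries needed to rebuild and run the model."""
    return {
        "model_state_dict": ckpt["model_state_dict"],
        "model_config": ckpt["model_config"],
    }

# Stand-in checkpoint with the same top-level keys as described above
fake_ckpt = {
    "model_state_dict": {"wte.weight": "..."},
    "optimizer_state_dict": {"state": "..."},
    "model_config": {"n_layer": 24, "n_head": 16, "n_embd": 1024},
    "step": 0,
    "best_val_loss": 0.0,
}

slim = strip_for_inference(fake_ckpt)
print(sorted(slim))  # ['model_config', 'model_state_dict']
```

With the real file, the same function could be applied to the result of `torch.load` and written back with `torch.save`. AdamW keeps two moment estimates per parameter, so dropping its state should shrink the file substantially.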
Limitations
- Not aligned for safety: no RLHF or DPO alignment has been applied
- Factual accuracy: may hallucinate facts, especially specific numbers and dates
- English only: trained primarily on English web text
- Context length: limited to 2048 tokens
- Small scale: 350M parameters is capable but limited compared to larger models
Training Infrastructure
- Pretraining: RunPod A100 SXM 80GB (~$35 total compute cost)
- Data pipeline: Google Colab (free tier) for download and tokenization
- Framework: PyTorch 2.4, custom training loop (no HuggingFace Trainer)
- Monitoring: Weights & Biases
About the Author
Islam Kathat, AI/ML Engineer & Full Stack Developer
- GitHub: FazeFlynn
- LinkedIn: islam-khan
- Email: faiz.14a@gmail.com
This model was built as an independent research project to deeply understand the full LLM training pipeline, from raw data to a conversational model.
Citation
@misc{kathat2026llm350m,
author = {Islam Kathat},
title = {350M Parameter GPT Language Model},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/FazeFlynn/my-350M-LLM}
}
Intended Use
This model is intended for:
- Research
- Education
- Learning about LLM training
- Text generation
- Conversational AI
This model is not intended for:
- Medical advice
- Legal advice
- Financial decisions
- Safety-critical systems
License
MIT License: free to use, modify, and distribute with attribution.