DL From Scratch
Implement mainstream deep learning models from scratch.
Project Structure
βββ main.py
βββ pyproject.toml
βββ .gitignore
βββ README.md
βββ resnet18/
β βββ README.md # Original ResNet18 implementation
β βββ __init__.py
β βββ data.py # CelebA via HF datasets (eurecom-ds/celeba)
β βββ model.py # ResNet18 from scratch
β βββ train.py # Training script (MPS + AMP)
β βββ eval.py # Evaluation script (per-attribute accuracy)
β βββ resnet18_celeba.pt # (local, not tracked)
βββ resnet34/
β βββ __init__.py
β βββ data.py # Full CelebA (40 attrs, 200K), data augmentation
β βββ model.py # ResNet34 via resnet18.model.ResNet
β βββ train.py # SGD+Momentum + CosineAnnealingLR + grad accum + early stopping
β βββ eval.py # Per-attribute ROC AUC, F1, test split
βββ resnet50/
β βββ __init__.py
β βββ config.yaml # ResNet50 hyperparameters
β βββ model.py # Bottleneck block (1Γ1β3Γ3β1Γ1) β resnet50()
β βββ data.py # Reuses resnet34.data (CelebA)
β βββ train.py # Reuses resnet34 training pattern
β βββ eval.py # Per-attribute ROC AUC, F1
βββ vae/
β βββ __init__.py
β βββ config.yaml # VAE hyperparameters
β βββ model.py # Encoder β ΞΌ,logΟΒ² β reparameterize β Decoder
β βββ data.py # CelebA images (64Γ64)
β βββ train.py # VAE training (recon + KL loss)
β βββ generate.py # Sample generation + latent interpolation
βββ ddpm/
β βββ __init__.py
β βββ config.yaml # DDPM hyperparameters
β βββ model.py # UNet + timestep embedding + DDPM forward/sample
β βββ data.py # CIFAR-10 via HF datasets
β βββ train.py # Noise prediction training
β βββ generate.py # Reverse diffusion sampling
βββ gcn/
β βββ __init__.py
β βββ config.yaml # GCN hyperparameters
β βββ model.py # Graph Convolution layers + 2-layer GCN
β βββ data.py # Cora citation network loader
β βββ train.py # Semi-supervised node classification
β βββ eval.py # Test accuracy evaluation
βββ dqn/
β βββ __init__.py
β βββ config.yaml # DQN hyperparameters
β βββ dqn.py # DQN, ReplayBuffer, train_episode helpers
β βββ train.py # CartPole RL training loop
βββ simclr/
β βββ __init__.py
β βββ config.yaml # SimCLR hyperparameters
β βββ model.py # ResNet18 encoder + Projector + NT-Xent loss
β βββ data.py # CIFAR-10 with dual augmentation
β βββ train.py # Contrastive learning training
βββ yolo/
β βββ __init__.py
β βββ config.yaml # YOLO hyperparameters
β βββ model.py # CNN backbone + detection head
β βββ loss.py # YOLO loss + NMS
β βββ data.py # Pascal VOC dataset
β βββ train.py # Object detection training
βββ dcgan/
β βββ __init__.py
β βββ config.yaml # DCGAN hyperparameters
β βββ model.py # Generator + Discriminator
β βββ data.py # CelebA images (64Γ64, no labels)
β βββ train.py # Adversarial training loop (G/D alternating)
β βββ generate.py # Generate sample grid from trained model
βββ vit/
β βββ __init__.py
β βββ config.yaml # ViT hyperparameters (patch_size, d_model, n_layers, etc.)
β βββ model.py # ViT: PatchEmbed β Transformer encoder (reused from BERT) β CLS head
β βββ data.py # CIFAR-10 via HF datasets
β βββ train.py # Training loop
β βββ eval.py # Per-class accuracy on test split
βββ unet/
β βββ __init__.py
β βββ config.yaml # UNet hyperparameters
β βββ model.py # U-Net: encoderβdecoder with skip connections
β βββ data.py # Oxford-IIIT Pet (image + mask) with augmentation
β βββ train.py # Training loop (pixel-wise CrossEntropy)
β βββ eval.py # IoU and pixel accuracy
βββ cnn/
β βββ __init__.py
β βββ data.py # CIFAR-10 via HF datasets (uoft-cs/cifar10)
β βββ model.py # Plain CNN (ConvΓ3 + PoolΓ3 + FCΓ2)
β βββ train.py # Training script (Adam + CosineAnnealingLR)
β βββ eval.py # Test evaluation + confusion matrix
βββ mlp/
β βββ __init__.py
β βββ data.py # MNIST via HF datasets (ylecun/mnist)
β βββ model.py # MLP β pure NumPy (Linear, ReLU, SoftmaxCrossEntropy, SGD)
β βββ train.py # Training script
β βββ eval.py # Test evaluation (per-digit accuracy)
βββ utils/
β βββ __init__.py
β βββ config.py # YAML config loading/saving (load_config / save_config)
β βββ seed.py # set_seed() β lock torch + numpy + random + cudnn
βββ nlp/
β βββ bert/
β βββ word2vec/
β βββ lstm/
β βββ gpt/
β βββ seq2seq/
β βββ __init__.py
β βββ tokenizer.py # Word-level tokenizer (5000 vocab, from text8)
β βββ model.py # Decoder-only Transformer (Causal Attention + KV Cache)
β βββ train.py # Autoregressive LM on text8
β βββ generate.py # Text generation (temperature + top-k + [SEP] blocked)
β βββ seq2seq/
β βββ __init__.py
β βββ config.yaml # Transformer hyperparameters
β βββ model.py # Encoder (from BERT) + Decoder (cross-attention) β Seq2Seq
β βββ data.py # Multi30k ENβDE, word-level tokenizer
β βββ train.py # Teacher forcing training
β βββ generate.py # Greedy decoding translation demo
βββ basics/
β βββ __init__.py
β βββ logistic_regression.py # Single Linear layer + Softmax (92.3% on MNIST)
β βββ linear_regression.py # California Housing (Normal Equation + GD, RΒ²=0.583)
β βββ k_means.py # Unsupervised clustering (pure NumPy)
β βββ svm.py # SVM β GD (primal) + SMO (dual, Linear/RBF kernels)
β βββ decision_tree.py # ID3/CART on Iris (ASCII tree, ~93% acc)
β βββ naive_bayes.py # Gaussian NB on MNIST (generative classifier)
β βββ pca.py # SVD-based dimensionality reduction (MNIST 2D visualisation)
β βββ knn.py # k-Nearest Neighbors (instance-based, MNIST)
β βββ perceptron.py # Single neuron (Rosenblatt 1958, step activation)
βββ .gitattributes # LFS: *.zip *.pt
βββ uv.lock
Infrastructure
| Feature | Description |
|---|---|
| Config system | Each model directory has a config.yaml with its hyperparameters (seed, lr, batch_size, epochs, etc.). Edit the YAML to change training params without touching code. |
| TensorBoard | Every PyTorch training script logs loss/accuracy per epoch to runs/{model_name}/. Run tensorboard --logdir runs to visualize all experiments. |
| Reproducibility | utils/seed.py provides set_seed() that locks torch + numpy + random + cudnn. Called at the start of every train script. Config is saved alongside model weights (_config.yaml). |
Usage
# View training curves (all models)
tensorboard --logdir runs
# Edit hyperparameters in YAML instead of code
vim resnet18/config.yaml
# then train as usual:
uv run python -m resnet18.train
ResNet18
| Item | Value |
|---|---|
| Model | ResNet18 (11.2M params) |
| Dataset | CelebA via HF datasets β 1,000 images |
| Attributes | 15 binary (Smiling, Male, Young, Eyeglasses, etc.) |
| Split | 800 train / 200 val |
| Val Accuracy | 91.2% |
| Training | MPS (Mac M4) + AMP |
ResNet34
| Item | Value |
|---|---|
| Model | ResNet34 (~21M params, [3,4,6,3] BasicBlock) |
| Dataset | CelebA via HF datasets β full 200K |
| Attributes | All 40 binary attributes |
| Optimizer | SGD + Momentum (0.9, weight_decay=1e-4) |
| Training | CosineAnnealingLR + Gradient Accumulation + Early Stopping + Loss Weighting |
ResNet50
| Item | Value |
|---|---|
| Model | ResNet50 (~23.6M params, [3,4,6,3] Bottleneck) |
| Dataset | CelebA via HF datasets β full 200K |
| Attributes | All 40 binary attributes |
| Optimizer | SGD + Momentum (0.9, weight_decay=1e-4) |
| Architecture | Bottleneck block: 1Γ1 β 3Γ3 β 1Γ1 (contrast with BasicBlock's two 3Γ3) |
VAE
| Item | Value |
|---|---|
| Model | Variational Autoencoder (2.6M params) |
| Dataset | CelebA via HF datasets β 10K images (64Γ64) |
| Architecture | Conv Encoder β ΞΌ,logΟΒ² β reparameterize β Deconv Decoder β Sigmoid |
| Loss | Reconstruction (BCE) + KL divergence |
| Training | Adam(lr=2e-4), 50 epoch |
Seq2Seq Transformer
| Item | Value |
|---|---|
| Model | Encoder-Decoder Transformer (1M params) |
| Dataset | Multi30k ENβDE β 29K train / 1K test |
| Architecture | Encoder (from BERT) + Decoder (causal + cross-attention) |
| Training | Teacher forcing, weight-tying, Adam(lr=1e-4) |
DDPM
| Item | Value |
|---|---|
| Model | Denoising Diffusion (16.1M params) |
| Dataset | CIFAR-10 via HF datasets β 50K images (32Γ32) |
| Architecture | UNet + timestep embedding + sinusoid positional encoding |
| Training | Noise prediction (MSE), T=1000, linear Ξ² schedule |
| Sampling | Reverse diffusion (x_T β x_0), 1000 steps |
GCN
| Item | Value |
|---|---|
| Model | 2-layer Graph Convolutional Network (23K params) |
| Dataset | Cora via URL β 2708 nodes, 1433 features, 7 classes |
| Architecture | GraphConv Γ 2: Γ @ H @ W (spectral graph convolution) |
| Training | Semi-supervised (20 labels/class), CrossEntropyLoss |
DQN
| Item | Value |
|---|---|
| Model | Deep Q-Network (17K params) |
| Environment | CartPole-v1 via Gymnasium β 4-dim state, 2 actions |
| Architecture | 3-layer MLP (4β128β128β2) |
| Training | Experience replay, target network, Ξ΅-greedy decay |
SimCLR
| Item | Value |
|---|---|
| Model | SimCLR (11M params: ResNet18 encoder + MLP projector) |
| Dataset | CIFAR-10 via HF datasets β self-supervised (no labels) |
| Architecture | ResNet18 β Projector(512β256β128) β NT-Xent loss |
| Training | 100 epoch, temperature=0.5, dual random augmentation |
YOLO
| Item | Value |
|---|---|
| Model | Simplified YOLO (59M params) |
| Dataset | Pascal VOC via HF datasets β 20 classes |
| Architecture | CNN backbone β FC detection head β 7Γ7Γ30 output |
| Training | YOLO loss (coord + obj + noobj + class), NMS at inference |
DCGAN
| Item | Value |
|---|---|
| Model | Generator (3.5M params) + Discriminator (2.8M params) |
| Dataset | CelebA via HF datasets β 10K images (64Γ64) |
| Architecture | Transposed conv G / Conv D, BN, LeakyReLU |
| Optimizer | Adam(lr=2e-4, Ξ²β=0.5) β separate for G and D |
| Training | BCELoss, label smoothing, fixed noise grid for monitoring |
ViT
| Item | Value |
|---|---|
| Model | Vision Transformer (807K params, 4 layers, 4 heads, 128-dim) |
| Dataset | CIFAR-10 via HF datasets β 50K train / 10K test |
| Architecture | PatchEmbed(4Γ4) β [CLS] β Transformer Encoder (from BERT) β CLS head |
| Key concept | Self-attention for vision, no convolutions, patch embeddings |
UNet
| Item | Value |
|---|---|
| Model | U-Net (31M params, 5 encoder/decoder stages) |
| Dataset | Oxford-IIIT Pet via HF datasets β image + segmentation mask |
| Architecture | Encoder: Conv+MaxPool Γ 4, Decoder: UpConv+skip Γ 4, output: pixel-wise logits |
| Loss | CrossEntropy (ignore_index=0 for unlabeled) |
| Metrics | Pixel accuracy, mean IoU |
CNN
| Item | Value |
|---|---|
| Model | SimpleCNN (620K params) |
| Dataset | CIFAR-10 via HF datasets β 50K images |
| Classes | 10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) |
| Test Accuracy | 82.4% (30 epochs) |
| Training | Adam + CosineAnnealingLR |
MLP
| Item | Value |
|---|---|
| Model | MLP (235K params, pure NumPy) |
| Dataset | MNIST via HF datasets β 60K images |
| Classes | 10 digits (0-9) |
| Test Accuracy | 97.9% (20 epochs) |
| Framework | NumPy only (hand-written backward pass) |
BERT
| Item | Value |
|---|---|
| Model | BERT mini (834K params, 4 layers, 4 heads, 128-dim) |
| Pre-training | MLM on text8 (90M chars, HuggingFace) |
| Fine-tuning | Sentiment classification on IMDB (HuggingFace) |
| Test Accuracy | ~50% (character-level; word-level would be higher with subword tokenization) |
| Core components | Self-Attention (semantic aggregation) + MLM (entropy increase noise reduction) |
Word2Vec
| Item | Value |
|---|---|
| Model | Word2Vec (50-dim embeddings, 97K vocab) |
| Architectures | CBOW + Skip-gram with Negative Sampling |
| Dataset | text8 via HF datasets (~90M chars) |
| Training | Adam, 5 epochs, k=5 negative samples |
| Evaluation | Cosine similarity search in embedding space |
| Key concept | Static word embeddings from distributional semantics |
LSTM
| Item | Value |
|---|---|
| Model | LSTM (145K params, hand-written gates) |
| Dataset | IMDB via HuggingFace (9K train / 1K test) |
| Architecture | Embedding(128) β LSTM(128β128) β FC(128β2) |
| Test Accuracy | ~50-60% (character-level, harder than word-level) |
| Key concepts | Input/forget/output gates, cell state, gradient flow through gating |
GPT
| Item | Value |
|---|---|
| Model | Decoder-only Transformer (5.7M params, word-level) |
| Dataset | text8 via HuggingFace (15M words, 20K chunks) |
| Training | Autoregressive (predict next token), PPL 4.63 |
| Generation | Temperature + top-k sampling with KV Cache, [SEP] blocked |
| Key concepts | Causal Self-Attention, KV Cache, autoregressive generation, word-level tokenization |
Basics
| Algorithm | File | Datasets | Metric |
|---|---|---|---|
| Logistic Regression | basics/logistic_regression.py |
MNIST | 92.3% test accuracy |
| Linear Regression | basics/linear_regression.py |
California Housing | RΒ²=0.583 |
| K-Means | basics/k_means.py |
MNIST | 57.8% cluster purity |
| SVM (GD + SMO) | basics/svm.py |
MNIST 3v5 | 93.3% (RBF kernel) |
| Decision Tree | basics/decision_tree.py |
Iris | 93.3% test acc |
| Naive Bayes | basics/naive_bayes.py |
MNIST | 53.0% test acc |
| PCA | basics/pca.py |
MNIST | 17.3% variance in 2 components |
| k-NN | basics/knn.py |
MNIST | ~87% (k=5, 2000 train) |
| Perceptron | basics/perceptron.py |
MNIST 0v1 | 100% (linearly separable) |
SVM implementations
| Method | Type | Kernel | Notes |
|---|---|---|---|
SVM_GD |
Primal GD | Linear only | Fast, robust, ~80 lines |
SVM_SMO |
Dual SMO | Linear + RBF | Platt SMO, ~150 lines, supports kernel trick |
See resnet18/README.md for details.
Core Concepts
Every model in this project was written from scratch to teach a specific set of ML/DL concepts. The table below maps each model to the key ideas it demonstrates.
| Module | Model | Key concepts |
|---|---|---|
basics/ |
Logistic Regression | Linear decision boundary, Softmax, Cross-Entropy, closed-form vs gradient descent |
basics/ |
Linear Regression | Normal Equation, MSE, RΒ² score, feature standardisation |
basics/ |
K-Means | Unsupervised learning, Euclidean distance, iterative centroid refinement, cluster purity |
basics/ |
SVM (GD) | Hinge loss, max-margin classification, L2 regularisation, primal gradient descent |
basics/ |
SVM (SMO) | Dual formulation, Lagrange multipliers, KKT conditions, kernel trick (RBF) |
basics/ |
Decision Tree | Entropy, Information Gain, recursive partitioning, interpretable ASCII tree |
basics/ |
Naive Bayes | Bayes' theorem, generative vs discriminative models, Gaussian likelihood, log-space prediction |
basics/ |
PCA | Singular Value Decomposition (SVD), eigenvalue, dimensionality reduction, variance explained |
basics/ |
k-NN | Instance-based learning, distance metrics, curse of dimensionality, bias-variance tradeoff |
basics/ |
Perceptron | Single neuron, step activation, online learning, Perceptron Convergence Theorem |
mlp/ |
MLP (NumPy) | Manual backpropagation, chain rule, gradient descent without autograd, softmax cross-entropy |
cnn/ |
SimpleCNN | Convolution, max-pooling, BatchNorm, Dropout, CosineAnnealing LR schedule |
resnet18/ |
ResNet18 | Residual connections (skip connections), BatchNorm in deep networks, bottleneck design, AMP |
resnet34/ |
ResNet34 | SGD+Momentum, CosineAnnealingLR, gradient accumulation, early stopping, ROC AUC, F1 |
resnet50/ |
ResNet50 | Bottleneck block (1Γ1β3Γ3β1Γ1), deeper residual networks |
vae/ |
VAE | Reparameterization trick, KL divergence, latent space interpolation |
nlp/seq2seq/ |
Seq2Seq Transformer | Encoder-decoder, cross-attention, teacher forcing, weight-tying |
ddpm/ |
DDPM | Denoising Diffusion, UNet + timestep embedding, noise prediction |
dcgan/ |
DCGAN | Transposed convolution, adversarial training, generator/discriminator dynamics |
vit/ |
Vision Transformer (ViT) | Patch embedding, self-attention for vision, Transformer without convolutions |
unet/ |
U-Net | Encoder-decoder, skip connections, pixel-wise classification, IoU metric |
nlp/bert/ |
BERT mini | Self-Attention (semantic aggregation), Masked Language Model (entropy increase + denoising), LayerNorm, positional encoding |
nlp/word2vec/ |
Word2Vec | Embedding lookup tables, Negative Sampling, CBOW vs Skip-gram, subsampling frequent words, cosine similarity |
nlp/lstm/ |
LSTM | Input/forget/output gates, cell state, gradient flow through gating, sequential processing vs parallel attention |
gcn/ |
GCN | Graph convolution, message passing, semi-supervised node classification |
dqn/ |
DQN | Q-Learning, experience replay, target network, Ξ΅-greedy |
simclr/ |
SimCLR | Contrastive learning, NT-Xent loss, data augmentation |
yolo/ |
YOLO | Single-stage object detection, grid-based regression, NMS |
nlp/gpt/ |
GPT | Causal Self-Attention, KV Cache, autoregressive generation, word-level tokenizer, temperature + top-k sampling, bad-token blocking |
Setup & Run
uv sync
# Train / Evaluate ResNet18
uv run python -m resnet18.train
uv run python -m resnet18.eval
# Train / Evaluate ResNet34
uv run python -m resnet34.train
uv run python -m resnet34.eval
# Train / Evaluate ResNet50
uv run python -m resnet50.train
uv run python -m resnet50.eval
# Train / Generate VAE
uv run python -m vae.train
uv run python -m vae.generate
# Train / Translate Seq2Seq
uv run python -m nlp.seq2seq.train
uv run python -m nlp.seq2seq.generate
# Train / Evaluate GCN
uv run python -m gcn.train
uv run python -m gcn.eval
# Train DQN
uv run python -m dqn.train
# Train SimCLR
uv run python -m simclr.train
# Train YOLO
uv run python -m yolo.train
# Train / Generate DDPM
uv run python -m ddpm.train
uv run python -m ddpm.generate
# Train / Generate DCGAN
uv run python -m dcgan.train
uv run python -m dcgan.generate
# Train / Evaluate ViT
uv run python -m vit.train
uv run python -m vit.eval
# Train / Evaluate UNet
uv run python -m unet.train
uv run python -m unet.eval
# Train / Evaluate CNN
uv run python -m cnn.train
uv run python -m cnn.eval
# Train / Evaluate MLP (pure NumPy)
uv run python -m mlp.train
uv run python -m mlp.eval
# Basics
uv run python -m basics.logistic_regression
uv run python -m basics.k_means
uv run python -m basics.linear_regression
uv run python -m basics.svm
uv run python -m basics.decision_tree
uv run python -m basics.naive_bayes
uv run python -m basics.pca
uv run python -m basics.knn
uv run python -m basics.perceptron
# NLP
uv run python -m nlp.bert.pretrain
uv run python -m nlp.bert.finetune
uv run python -m nlp.bert.eval
# Word2Vec
uv run python -m nlp.word2vec.train
uv run python -m nlp.word2vec.eval
# LSTM
uv run python -m nlp.lstm.train
uv run python -m nlp.lstm.eval
# GPT
uv run python -m nlp.gpt.train
uv run python -m nlp.gpt.generate
Models
Trained weights are not tracked in git (.gitignore'ed). Each model saves its weights
locally after training; paths are shown below for reference.
| Model | Local path | Size |
|---|---|---|
| ResNet18 (15 attrs, 1K samples) | resnet18/resnet18_celeba.pt |
45 MB |
| ResNet34 (40 attrs, 200K samples) | resnet34/resnet34_celeba.pt |
~80 MB |
| ResNet50 (40 attrs, 200K samples) | resnet50/resnet50_celeba.pt |
~90 MB |
| VAE (CelebA, 64Γ64) | vae/vae_celeba.pt |
10 MB |
| Seq2Seq Transformer (Multi30k) | nlp/seq2seq/seq2seq_multi30k.pt |
4 MB |
| GCN (Cora) | gcn/gcn_cora.pt |
0.1 MB |
| DQN (CartPole) | dqn/dqn_cartpole.pt |
0.07 MB |
| SimCLR (CIFAR-10) | simclr/simclr_cifar10.pt |
22 MB |
| YOLO (Pascal VOC) | yolo/yolo_voc.pt |
226 MB |
| DDPM (CIFAR-10, 32Γ32) | ddpm/ddpm_cifar10.pt |
62 MB |
| DCGAN (CelebA, 64Γ64) | dcgan/dcgan_celeba.pt |
~23 MB (G+D) |
| ViT (CIFAR-10, 32Γ32) | vit/vit_cifar10.pt |
3.2 MB |
| UNet (Oxford-Pet, 128Γ128) | unet/unet_oxford_pet.pt |
119 MB |
| SimpleCNN (CIFAR-10) | cnn/simple_cnn_cifar10.pt |
2.4 MB |
| MLP (MNIST, NumPy) | mlp/mlp_mnist.npz |
0.9 MB |
| Logistic Regression | basics/logistic_regression.npz |
63 KB |
| K-Means centers | basics/kmeans_centers.npz |
32 KB |
| Linear Regression | basics/linear_regression.npz |
2 KB |
| SVM | basics/svm.npz |
45 KB |
| Decision Tree | β | N/A (no weights) |
| Naive Bayes | β | N/A (no weights) |
| PCA | β | N/A (data-dependent) |
| k-NN | β | N/A (no training) |
| Perceptron | β | N/A (no weights) |
| BERT (MLM) | nlp/bert/bert_mlm.pt |
3.2 MB |
| BERT (finetuned) | nlp/bert/bert_finetuned.pt |
3.2 MB |
| Word2Vec (SG) | nlp/word2vec/skipgram.pt |
19 MB |
| Word2Vec (CBOW) | nlp/word2vec/cbow.pt |
19 MB |
| LSTM | nlp/lstm/lstm_sentiment.pt |
0.6 MB |
| GPT | nlp/gpt/gpt_text8.pt |
3.3 MB |
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support