Jaward Sesay

Jaward

AI & ML interests

I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy

Posts 23

mlx_micrograd - an MLX port of Karpathy's micrograd: a tiny scalar-valued autograd engine with a small PyTorch-like neural network library on top.

https://github.com/Jaykef/mlx_micrograd
Installation
pip install mlx_micrograd

Example usage
An example showing a number of the supported operations:
from mlx_micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data}') # prints array(24.7041, dtype=float32), the outcome of this forward pass
g.backward()
print(f'{a.grad}') # prints array(138.834, dtype=float32), i.e. the numerical value of dg/da
print(f'{b.grad}') # prints array(645.577, dtype=float32), i.e. the numerical value of dg/db
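
The repo also includes the small PyTorch-like neural network library mentioned above. As a rough sketch only - I am assuming here that the API mirrors micrograd's nn.MLP (the module path and names may differ, so check the repository) - a single training step could look like:

from mlx_micrograd import nn   # assumed module path, mirroring micrograd's nn

model = nn.MLP(3, [4, 4, 1])   # 3 inputs -> two hidden layers of 4 -> 1 output
x = [2.0, 3.0, -1.0]
y = model(x)                   # forward pass builds the autograd graph
loss = (y - 1.0)**2            # squared error against a target of 1.0

model.zero_grad()              # clear stale gradients
loss.backward()                # backpropagate through the graph
for p in model.parameters():   # one plain SGD step
    p.data = p.data - 0.01 * p.grad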

# Thoughts on Neural Scaling Laws
When you take a zoomed-out view of what drives the success of neural networks, you see it all revolves around the Scaling Laws - empirical observations that performance improves with increased model size, dataset size, and compute.

The specifics of how these laws apply vary across modalities and architectures, which shows up in the empirical equations used to fit them.

Yet they all rest heavily on three main factors - data, size, and compute. Each factor has its own sub-dependencies: data size & quality, model size & architecture, and the number of GPUs & the code for compute kernels, respectively.
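
To make the "empirical equations" above concrete: the language-model scaling laws in the first reference below fit test loss (in nats per token) as a power law in each factor when the other two are not bottlenecks. A minimal sketch of that functional form - the exponents and constants are the approximate values reported in Kaplan et al., used here purely for illustration:

def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    # L(N) = (N_c / N)**alpha_N: loss vs. non-embedding parameter count,
    # assuming data and compute are not bottlenecks.
    return (n_c / n_params) ** alpha_n

def loss_from_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    # L(D) = (D_c / D)**alpha_D: loss vs. dataset size in tokens.
    return (d_c / n_tokens) ** alpha_d

# Doubling model size multiplies loss by 2**-0.076 ≈ 0.95 - the
# "performance improves smoothly with scale" observation in log-log form.
print(loss_from_params(1e9))   # ≈ 2.4 for a 1B-parameter model
print(loss_from_params(1e11))  # ≈ 1.7 for a 100B-parameter model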

As research on these laws progresses, we are beginning to see new scaling behaviors emerge that apply in quite different ways than usual. This is typical of recent local LLMs (Phi-3, Gemma 2B, LLMs in a flash), where small models trained on small but rich, high-quality data beat much larger models.

I look forward to the singularity moment - when these laws come full circle and meet where it all began :)

References:
- Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361
- Scaling Laws for Autoregressive Generative Modeling: https://arxiv.org/abs/2010.14701
- LLMs in a flash: https://arxiv.org/abs/2312.11514
- Phi-3 Technical Report: https://arxiv.org/abs/2404.14219
- Gemma 2B: https://arxiv.org/pdf/2403.08295
