Jaward Sesay

AI & ML interests

I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy

Jaward's activity

posted an update 4 days ago
When untrained tokens play "catch me if you can", the Fishing for Magikarp paper is the detective:)
The playbook:
- Inspect the token vocab & study encode/decode patterns.
- Brute-force architecture-dependent indicators (e.g. the same matrix being used for token embeddings and the final layer) to identify untrained tokens.
- Then verify that the identified tokens are out of distribution by prompting the target LLM (with no tied threshold).

Quite a bait huh, Cohere:)

Paper: https://arxiv.org/pdf/2405.05417
Code: https://github.com/cohere-ai/magikarp
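
A rough sketch of the idea (my own toy version, not the paper's exact indicators) - under-trained tokens tend to have unusually small input-embedding norms, so you can flag suspects like this and then verify by prompting:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # any causal LM; swap in the model you want to probe
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

emb = model.get_input_embeddings().weight.detach()   # (vocab_size, d_model)
norms = emb.norm(dim=-1)

# tokens whose embedding norm sits far below the vocab average are suspects
threshold = norms.mean() - 2 * norms.std()
suspects = (norms < threshold).nonzero().flatten().tolist()
print([tok.convert_ids_to_tokens(i) for i in suspects[:20]])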
replied to their post 6 days ago

Okay, GPT-4o just helped me beat karpathy's minbpe train speed by 1.2x in one shot - can finally agree on the "o" meaning "omni":)

Improvements

  • efficient merging and get_stats: got rid of redundant computation in merge and get_stats (rough sketch below)
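
Here's my guess at the kind of redundancy removal involved (a sketch of the idea, not the actual suggested code): fuse merge() and get_stats() into a single pass, so the ids list isn't scanned twice per merge step.

def merge_and_count(ids, pair, idx):
    # build the merged sequence and count its consecutive pairs in the same pass
    out, counts, i = [], {}, 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            tok, i = idx, i + 2          # replace the matched pair with its new token id
        else:
            tok, i = ids[i], i + 1
        if out:
            p = (out[-1], tok)
            counts[p] = counts.get(p, 0) + 1
        out.append(tok)
    return out, counts

# in the train loop, a full get_stats scan is only needed once before the first merge;
# afterwards each call returns both the merged ids and fresh pair counts:
# ids, stats = merge_and_count(ids, best_pair, new_id)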

posted an update 6 days ago
Build your own GPT-4 Tokenizer! - @karpathy's minbpe exercise.
Step 1: BasicTokenizer
Got "close" to beating minbpe's train speed :(
Step 2: RegexTokenizer coming soon.

Notes on lessons learned:
- tokenization is the assembly language of LLMs:)
It's not a healthy choice to code it lol.
- encoding can literally drive you mad.
- merging is where sh*t gets real - moment of truth:)
- training requires precision.
- decoding is trivial.
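
For anyone attempting the exercise, the skeleton of a BasicTokenizer looks roughly like this (my own condensed sketch of the idea, not minbpe's code):

class BasicTokenizer:
    def __init__(self):
        self.merges = {}                                  # (int, int) -> new token id
        self.vocab = {i: bytes([i]) for i in range(256)}  # start from raw bytes

    def train(self, text, vocab_size):
        ids = list(text.encode("utf-8"))
        for new_id in range(256, vocab_size):
            # count consecutive pairs
            stats = {}
            for pair in zip(ids, ids[1:]):
                stats[pair] = stats.get(pair, 0) + 1
            if not stats:
                break
            best = max(stats, key=stats.get)              # most frequent pair wins
            # merge it everywhere - the "moment of truth" part
            out, i = [], 0
            while i < len(ids):
                if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                    out.append(new_id); i += 2
                else:
                    out.append(ids[i]); i += 1
            ids = out
            self.merges[best] = new_id
            self.vocab[new_id] = self.vocab[best[0]] + self.vocab[best[1]]

    def decode(self, ids):
        # decoding really is trivial: concatenate bytes and decode utf-8
        return b"".join(self.vocab[i] for i in ids).decode("utf-8", errors="replace")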
posted an update 14 days ago
mlx_micrograd - mlx port of Karpathy's micrograd - a tiny scalar-valued autograd engine with a small PyTorch-like neural network library on top.

https://github.com/Jaykef/mlx_micrograd
Installation
pip install mlx_micrograd

Example usage
An example showing a number of the supported operations:
from mlx_micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data}') # prints array(24.7041, dtype=float32), the outcome of this forward pass
g.backward()
print(f'{a.grad}') # prints array(138.834, dtype=float32), i.e. the numerical value of dg/da
print(f'{b.grad}') # prints array(645.577, dtype=float32), i.e. the numerical value of dg/db

posted an update 16 days ago
# Thoughts on Neural Scaling Laws
When you take a zoomed-out view of what drives the success of neural networks, you see it all revolves around the Scaling Laws - empirical observations that performance improves with increased model size, dataset size, and compute.

The specifics of how these laws apply vary across modalities and architectures, which shows up in the empirical equations used to fit them.
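
For reference, the single-variable power laws from the original Kaplan et al. paper take (as best I recall) the form

$L(N) = (N_c / N)^{\alpha_N}$,   $L(D) = (D_c / D)^{\alpha_D}$,   $L(C) = (C_c / C)^{\alpha_C}$

i.e. test loss falls off as a power law in parameter count N, dataset size D, and compute C, whenever the other two factors are not the bottleneck.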

Yet they all heavily rely on three main factors - data, size, and computation. These factors themselves have sub-dependencies - data size & quality, model size & architecture, and number of GPUs & efficiency of compute kernels, respectively.

As research on these laws progresses, we are beginning to see new scaling behaviors emerge that apply in rather different ways than usual. This is typical of recent local LLMs (Phi-3, Gemma 2B, LLMs in a flash), which show small models trained on smaller but higher-quality data beating much larger models.

I look forward to the singularity moment - when these laws come full circle and meet where it all began:)

References:
- Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361
- Scaling Laws for Autoregressive Generative Modeling: https://arxiv.org/abs/2010.14701
- LLMs in a flash: https://arxiv.org/abs/2312.11514
- Phi-3 Technical Report: https://arxiv.org/abs/2404.14219
- Gemma 2B: https://arxiv.org/pdf/2403.08295
posted an update 18 days ago
When I read the KAN paper, I see physicists casually making fun of the uncertainties in MLPs or Neural nets as a whole:

- "The philosophy here is close to the mindset of physicists, who often care more about typical cases rather than worst cases" lol this went hard on NNs

- "Finite grid size can approximate the function well with a residue rate independent of the dimension, hence beating curse of dimensionality!" haha.

- "Neural scaling laws are the phenomenon where test loss decreases with more model parameters"

- "Our approach, which assumes the existence of smooth Kolmogorov Arnold representations, decomposes the high-dimensional function into several 1D functions"

Key Differences With MLPs:
- Activation Functions: Unlike MLPs that use fixed activation functions at the nodes, KANs utilize learnable activation functions located on the edges between nodes.
- Weight Parameters: In KANs, traditional linear weight matrices are absent. Instead, each weight parameter is replaced by a learnable univariate function, specifically a spline.
- Summation Nodes: Nodes in KANs perform simple summation of incoming signals without applying non-linear transformations.

Advantages Over MLPs:
- Accuracy: KANs achieve higher accuracy with smaller network sizes than larger MLPs in tasks like data fitting and solving partial differential equations (PDEs).
- Interpretability: Due to their unique structure, KANs are more interpretable than MLPs.

Technical Innovations:
- Learnable Edges: placing learnable functions on network edges is a novel approach to network design, providing greater flexibility in modeling complex relationships in data.
- No Linear Weights: eliminating linear weights reduces the parameter count and potentially simplifies the learning process, focusing optimization on the univariate function representations.

Applications and Practical Use:
- Scientific Collaboration: KANs have been applied in scientific settings as tools to help discover or rediscover math
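
To make the edge-function idea concrete, here's a toy KAN-style layer in PyTorch (my own simplification: the paper uses B-splines, while I stand in a small Gaussian basis plus a SiLU term, and nodes just sum):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyKANLayer(nn.Module):
    # one learnable 1D function per edge (out_dim x in_dim of them); nodes only sum
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)  # basis weights per edge
        self.silu_w = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)           # residual-like term

    def forward(self, x):                                             # x: (batch, in_dim)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2))   # (batch, in_dim, num_basis)
        edge = torch.einsum("bik,oik->boi", basis, self.coef)         # phi_{o,i}(x_i) for every edge
        edge = edge + self.silu_w * F.silu(x).unsqueeze(1)
        return edge.sum(dim=-1)                                       # summation nodes -> (batch, out_dim)

# e.g. a [2, 5, 1] KAN: net = nn.Sequential(ToyKANLayer(2, 5), ToyKANLayer(5, 1))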
posted an update 19 days ago
It’s exciting to see Apple’s commitment to open-source AI research lately: an awesome new machine learning framework (mlx), a family of fully open models (OpenELM), and incredibly visionary papers (LLMs in a flash, MM1), not to mention the vibrant OSS community behind mlx - all alpha signs of something huge dropping at this year’s #AppleEvent & #WWDC
replied to their post 22 days ago

Over 400 downloads already🎉

  • small yet very capable, lightweight, runs at light speed with mlx/llama.cpp


replied to their post 24 days ago

Yeah, too bad it was unable to run tests since mlx is Apple-silicon-only and Devin’s dev environment is Linux. It wrote the port code though; will have to test it on my Mac:)

posted an update 24 days ago
Today’s most difficult task for Devin:
build a port of our AutoAgents framework in mlx and develop a demo using a gguf weight - it got close to nailing it (with guidance).

It was magical to witness. I had to take the wheel and help fix some subtle bugs. That said, there was still the need for a human software engineer to keep it aligned with the overall goal. Most of my work involved reviewing code, checking shell sessions, and alignment chats.

full demo coming soon.

AutoAgents: LinkSoul/AutoAgents
replied to their post 25 days ago

Haven't completed it yet; need to do some refactoring. I will share when it's ready.

posted an update 25 days ago
Got access to Devin today and boy has it been rocking it - a 10x engineer on pure software dev tasks, albeit it falls short on ML/AI tasks. Still a promising and daring feat of engineering; wishing all the best to the team @cognition_labs
replied to their post 26 days ago

The paper mentions the 4-bit quantized model occupies ~1.8GB on the iPhone, so it will probably be less than 2GB.

posted an update 26 days ago
All You need To Know About Phi-3 (Technical Report Walkthrough)

Summary of Summaries:
Phi-3-mini
- Architecture specs: decoder-only transformer, 3.8 billion parameters, LongRope (128K context length), vocab size 32064, trained on 3.3 trillion tokens in bfloat16.
- Rivals the performance of larger models like Mixtral 8x7B and GPT-3.5, while being capable of running locally on a smartphone.
- Utilizes a high-quality training dataset of heavily filtered web data and LLM-generated synthetic data.
- Can be quantized to 4 bits, occupying ≈ 1.8GB of memory (quick sanity check below).
- Ran natively on an iPhone 14 (A16 Bionic chip) with inference speeds of up to 12 tokens per second.
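
That 4-bit figure checks out with quick back-of-envelope arithmetic (my own estimate, ignoring overhead for embeddings/outliers):

params = 3.8e9
print(f"{params * 4 / 8 / 1e9:.2f} GB")   # 4 bits per weight -> ~1.90 GB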

Phi-3-small
- Architecture specs: also decoder-only, 7B parameters, vocab size 100352, default context length 8K, hidden dimension 4096, number of heads and layers follows the 7B class structure.
- Uses tiktoken tokenizer (for enhanced multilingual tokenization)

Phi-3-medium:
- Architecture specs: also decoder-only, hidden dimension 5120, 40 heads, 40 layers, tokenization consistent with the other models, trained on 4.8 trillion tokens.

Training Methodology:
- Focuses on high-quality training data deviating from standard scaling laws.
- The models undergo two-phase pre-training using a mix of web sources and synthetic data for general knowledge and logical reasoning skills.

Performance:
- Phi-3-mini achieves competitive scores on standard benchmarks like MMLU and MT-Bench, indicating strong reasoning capabilities.
- Higher variants show even better performance, suggesting effective scaling with increased model size.

Limitations:
- phi-3-mini: limited by its smaller size on tasks requiring extensive factual knowledge; primarily supports English.
- phi-3-small: limited multilingual support.

Hosting LLMs locally is a big win for OSS - private, secure inferencing on the go😎
posted an update 30 days ago
# On Coding Your First Attention

While you don’t necessarily have to code the attention block of a transformer from scratch to understand how it works, it sure is the closest you can get to a first-principles understanding of why and how transformers behave the way they do.

@karpathy covered attention in detail in his nanoGPT video (strongly recommend watching). Now I would like to share some thoughts and experience in writing my first attention.

First let’s zoom out quickly and explain what attention is in transformers: Attention in transformers is a communication mechanism that allows the model to focus on different parts of the input sequence when making predictions.

It assigns weights to each input token based on its relevance to the current context, enabling the model to weigh information selectively. This mechanism helps transformers capture long-range dependencies and contextual information effectively.

The original "Attention Is All You Need" paper introduced the two commonly used forms of attention: Scaled Dot-Product Attention and Multi-Head Attention, which runs several scaled dot-product heads in parallel.

# The Code

Now, attention, as with most deep learning algorithms, boils down to a math equation, so writing the code is fairly trivial, especially with a deep learning framework like PyTorch. Below is what's called Single-Head Attention.

(image 2)

The code defines single-head attention in PyTorch - it transforms input vectors, computes attention scores and weights, and then calculates the weighted sum of values based on these weights (as per the attention equation)
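
Since the image isn't rendered here, this is roughly what that single-head snippet looks like (my own reconstruction, not the exact code in the image):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):                                   # x: (batch, seq_len, embed_dim)
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product: how much each token attends to every other token
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        weights = F.softmax(scores, dim=-1)
        return weights @ v                                   # weighted sum of values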

When you have several of those running in parallel (with their outputs concatenated), you get what's called Multi-Head Attention. The code stays simple if you build on the SingleHeadAttention class:

(image 3)
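
Again my reconstruction rather than the exact code from the image, reusing the SingleHeadAttention sketch above:

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [SingleHeadAttention(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(num_heads * head_dim, embed_dim)   # mix the concatenated heads

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))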

Full Article here: https://huggingface.co/blog/Jaward/coding-your-first-attention
replied to their post about 1 month ago

Closest is SadTalker: https://github.com/OpenTalker/SadTalker
Its holistic facial dynamics generation is limited to lip sync, head movement, and eye blinks.

I don't think Microsoft will release the VASA code; they will probably commercialize it.

replied to their post about 1 month ago

The magic: a training pipeline that can “extract facial dynamics and head movements from real-life talking face videos”

posted an update about 1 month ago
Let's break down the technical details of Microsoft's mind-blowing lifelike audio-driven talking-faces framework, VASA, and its model VASA-1:

Summary of Summaries
- The paper introduces VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) from a single image and speech audio.
- Core innovations include a diffusion-based model for holistic generation of facial dynamics and head movements in an expressive, disentangled face latent space developed using video data.
- VASA-1 generates high-quality 512x512 videos at up to 40 FPS with low latency.
- Supports real-time generation of lifelike, emotive talking faces.

Summary of Overall Framework:
- Instead of directly generating video frames, VASA generates holistic facial dynamics and head motion in a latent space, conditioned on audio and optional signals.
- To achieve this, the framework uses a face encoder-decoder to extract appearance and identity features, and trains a Diffusion Transformer model to generate motion latent codes.

Technical Method Details:
Expressive and Disentangled Face Latent Space Construction:
- Based on a 3D-aided face reenactment framework.
- Decomposes the face into a 3D appearance volume, identity code, head pose, and facial dynamics latents.
- Uses encoders to extract these latent factors from face images.
- Applies additional losses to improve disentanglement:
  - Pairwise head pose and facial dynamics transfer loss
  - Face identity similarity loss for cross-identity pose/dynamics transfer

Holistic Facial Dynamics Generation with Diffusion Transformer:
- Represents all facial movements (lip motion, expression, gaze, etc.) as a single latent sequence.
- Applies a Diffusion Transformer model to generate the facial dynamics sequence.
- The Diffusion Transformer is trained with a simplified denoising score matching objective (see below).
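
For context, the "simplified denoising score matching objective" is, as far as I can tell, the standard simplified diffusion loss

$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t, C) \rVert^2 \,\big]$

where $x_t$ is the noised motion-latent sequence and $C$ stands for the audio and optional control conditioning (my notation, not the paper's).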
posted an update about 1 month ago
Excited to share that our paper "AutoAgents: A Framework for Automatic Agent Generation" got accepted at this year's IJCAI 🎉🥳

As a young aspiring AI researcher, this one means a lot, as it is the first ever paper I was blessed to contribute to. Thanks to the incredibly brilliant minds I got to work with (Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Börje F. Karlsson, Jie Fu, Yemin Shi) - you’re heroes of mine 🫡

@karpathy I hope this is worth a mutual follow/response haha, your lessons helped shape my understanding of this field and they still do. Thank you 🙏🏼

Demo: LinkSoul/AutoAgents
Code: https://github.com/Link-AGI/AutoAgents
Paper: https://arxiv.org/abs/2309.17288
posted an update about 1 month ago
On Agentic AI: Autonomy Is All You Need!

There is a remarkable beauty in witnessing an AI system autonomously complete complex tasks with a level of brilliance that surpasses our reasoning capabilities and expectations - it is the holy grail of creation.

Giving your AI agents autonomy is analogous to us having "free will" and everything else thereafter is a cascade of possibilities and potentials waiting to unfold.

As brilliantly said by @AndrewNg "It’s a beautiful thing when you see an agent autonomously decide to do things in ways that you had not anticipated, and succeed as a result!"

Autonomy in Agentic AI
- augments agents' decision-making capabilities.
- enables adaptation to diverse environments.
- facilitates real-time learning and improvement.
- fosters dynamic multi-agent collaboration.
- promotes efficient and independent task execution.
- drives innovation in dynamic and unpredictable scenarios.

This is what the AutoAgents paper conveys - a fully autonomous agentic framework that basically gives agents the free will for authentic, compelling creativity.

With just three predefined agents (Planner, Agent Observer, and Plan Observer) acting collaboratively, the framework can dynamically generate task-specific autonomous agents, each defined as:

Agent (A) = {
    Prompt (P) - defines the agent's identity fully,
    Description (D) - adds a specific role identity,
    Toolset (T) - equips the agent with tools,
    Suggestion (S) - offers task execution tips
}
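
For intuition, that four-field spec might look like this in code (an illustrative sketch, not the actual AutoAgents API):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GeneratedAgent:
    prompt: str                                             # P - defines the agent's identity fully
    description: str                                        # D - adds a specific role identity
    toolset: List[Callable] = field(default_factory=list)   # T - tools the agent can call
    suggestion: str = ""                                     # S - tips for executing its tasks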


Demo: LinkSoul/AutoAgents
Code: https://github.com/Link-AGI/AutoAgents
Paper: https://arxiv.org/abs/2309.17288

posted an update about 1 month ago
After giving GPU Programming a hands-on try, I have come to appreciate the level of complexity in AI compute:

- Existing/leading frameworks (CUDA, OpenCL, DSLs, even Triton) still fall at the mercy of low-level compute details that require deeper understanding and experience.
- Ambiguous optimization methods that will literally drive you mad 🤯
- Triton is cool but not cool enough (high level abstractions that fall back to low level compute issues as you build more specialized kernels)
- As for CUDA, optimization requires considering all major components of the GPU (DRAM, SRAM, ALUs) 🤕
- Models today require stallion-grade hand-written GPU kernels to reduce storage and compute cost.
- GPTQ was a big save 👍🏼

@karpathy is right: expertise in this area is scarce, and the reason is quite obvious - uncertainty: we are still struggling to get peak performance from multi-connected GPUs while maintaining precision and reducing cost.

May the Scaling Laws favor us lol.
posted an update about 2 months ago
This is the closest I’ve seen to a scalable AI/LLM Operating System - it has all the major ingredients of a feasible v1 AI OS architecture:

- Extends classical OS functionalities with an LLM Kernel.
- Multi agent-centric approach.
- Optimized resource allocation system that allows for LLM-based tasks and Classical OS tasks to coexist.
- An Agent Scheduler that can perform classical OS scheduling policies (FIFO, RR) - see the toy sketch below.
- A Context Manager to improve alignment.
- Lazy Memory Manager for agents (ensures data is stored and accessible only while the agent is active)
- An Enhanced security module for the AI-driven environment.

It does hit all the checkpoints, doesn’t it? An upscaled version of @karpathy’s LLM OS idea.

Code: https://github.com/agiresearch/AIOS
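
For the scheduler part, here's a toy FIFO/round-robin sketch (purely illustrative, not the AIOS code):

from collections import deque

class ToyAgentScheduler:
    def __init__(self, policy="fifo", time_slice=1):
        self.policy = policy
        self.time_slice = time_slice
        self.queue = deque()

    def submit(self, name, steps):
        # `steps` = how many scheduling quanta the agent task needs
        self.queue.append([name, steps])

    def run(self):
        order = []
        while self.queue:
            name, steps = self.queue.popleft()
            if self.policy == "fifo":
                order.extend([name] * steps)            # run to completion
            else:                                       # round robin with a fixed time slice
                run = min(self.time_slice, steps)
                order.extend([name] * run)
                if steps - run > 0:
                    self.queue.append([name, steps - run])
        return order

sched = ToyAgentScheduler(policy="rr", time_slice=1)
sched.submit("travel_agent", 2); sched.submit("math_agent", 3)
print(sched.run())   # ['travel_agent', 'math_agent', 'travel_agent', 'math_agent', 'math_agent']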
posted an update about 2 months ago
MLX RAG with GGUF Models
Minimal, clean code implementation of RAG with mlx inferencing for GGUF models.

Code: https://github.com/Jaykef/mlx-rag-gguf

The code builds on vegaluisjose's example; it has been optimized to support RAG-based inferencing for .gguf models. I am using BAAI/bge-small-en as the embedding model, tinyllama-1.1b-chat-v1.0.Q4_0.gguf as the base model, and a custom vector database script for indexing text in a PDF file. Inference speeds can go up to ~413 tokens/sec for prompts and ~36 tokens/sec for generation on my M2 Air.

Queries make use of both the .gguf (base model) and .npz (retrieval model) simultaneously, resulting in much higher inferencing speeds.
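
The gist of the retrieval step looks something like this (my own sketch in plain numpy, not the repo's exact code):

import numpy as np

def retrieve(query_vec, doc_vecs, chunks, k=3):
    # rank the PDF chunks by cosine similarity to the query embedding, return top-k
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

# the retrieved chunks then get prepended to the prompt for the gguf base model:
# context = "\n".join(retrieve(q_emb, embs, chunks))
# prompt  = f"Context:\n{context}\n\nQuestion: {question}"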
posted an update about 2 months ago
Here it is: an Adaptive AI Assistant (AAA), made possible only through open source. Incredibly inspiring - well done to the Open Interpreter team👏🏼👏🏼

Glad this one did not start with big tech :)

Code: https://github.com/OpenInterpreter/01
posted an update 2 months ago
What’s missing in today’s AI?
Adaptive AI Assistant(s) (AAA)

AAA - an ai assistant that gradually reflects an intelligently amplified version of you.

All I see in today's advanced AI systems are tools that are heavily engineered to do all the work while we sit back and watch.

Shouldn't it be:

- tools that augment our information processing capabilities? (as once proposed by @karpathy )
- tools that can adapt to each person's needs?
- tools that reflect a more intelligent version of ourselves?

Because think about it: if we continue to build AI systems that we end up relying on heavily for even the most complex tasks, without being actively involved in the process, we risk losing out on the opportunity for personal growth and development.

Just saying.

Triple A is a more compelling path:)

And I'm writing a paper on this.
posted an update 2 months ago
Prompt Engineering: Playing A Game of Chance With LLMs.

It's obvious these days that trying to get the best out of LLMs resembles playing a game of chance: your choice of prompts acts as your moves in shaping the model's responses as you iteratively seek the best one.

Each prompt you craft carries the potential to lead the LLM down different paths, influencing the quality and relevance of its outputs. By experimenting with various prompts and observing how the model responds, you can uncover new insights into the inner workings of these complex systems and push the boundaries of what they can achieve.

Not long ago, this craftsmanship was termed "Prompt Engineering" - it's a job now. To better understand the "Engineering" part of it, let's go through the Google Brain paper that shed light on it: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.

The paper starts off with a clear definition of Chain-of-Thought — a coherent series of intermediate natural language reasoning steps that lead to the final answer for a problem.

The researchers explored how generating a series of intermediate reasoning steps significantly improves the ability of large language models to perform complex reasoning. They found that such reasoning abilities "emerge naturally" in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.

Experiments on three large language models showed that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
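
To make that concrete, a chain-of-thought exemplar looks roughly like this (paraphrasing the style of the paper's arithmetic examples):

few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""
# the intermediate reasoning in the answer is the "chain of thought";
# the model is expected to imitate it when answering the new {question}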

ReadMore: https://x.com/jaykef_/status/1767173517345485232?s=46&t=V2mWOpm9AdMX0spmmr0yNQ
posted an update 2 months ago
You gotta love what Apple’s mlx team cooked:

- A unified memory model that literally does compute-magic: parallel operations with automatic dependency insertions.
- Supports off-the-shelf use of all the fun stuff in composable function transformations (differentiation, vectorization, computation graph optimization) - quick taste below.
- Houses simplified forms of all the APIs we love, in the languages we adore (Python, C++, C) - sorry Swift :)
- mlx.nn is a stallion 🔥 simple to use.
- Open-source friendly (who would have thought lol).
- Dynamic graph construction👍🏼
- Supports both CPU and GPU🤖
- Beginner Friendly 👌🏼
- Great examples (clean code💯)
- Good documentation
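
That composable-transformations bit in action (a tiny sketch; mlx arrays are lazy and only get evaluated on demand):

import mlx.core as mx

def f(x):
    return mx.sum(x ** 2)

x = mx.array([1.0, 2.0, 3.0])
print(mx.grad(f)(x))   # gradient of sum(x^2) is 2x -> [2, 4, 6]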

Well done Awni Hannun et al 👏🏼

Could this be The Transformer of ml frameworks? Well at least for us mac users 😂

Repo: https://github.com/ml-explore/mlx
Examples: https://github.com/ml-explore/mlx-examples
Documentation: https://ml-explore.github.io/mlx/build/html/python/nn.html
posted an update 2 months ago
Some papers deserve a standing ovation after reading, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” is one such paper:

One major drawback of LLMs is the lack of precise control over their behavior, which makes it very difficult to align them with desired outcomes. The existing approach to mitigating this involves gathering human-labeled data over model generations and fine-tuning the unsupervised LLM to align with those preferences - this is known as Reinforcement Learning From Human Feedback (RLHF).

RLHF is an incredibly complex, usually unstable, and computationally costly method. It involves first fitting a suitable reward model that captures human preferences, then fine-tuning the language model with RL to maximize the estimated reward while staying close to the original model.

This paper introduces a new algorithm called Direct Preference Optimization (DPO) that simplifies the whole process. In short, it directly optimizes the LM without explicit reward modeling or reinforcement learning. This is achieved by leveraging a mapping between reward functions and optimal policies, allowing the constrained reward maximization problem to be optimized exactly with a single stage of policy training.

DPO’s genius lies in its ability to intuitively increase the relative log probability of preferred to "unpreferred" responses.
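
For reference, the DPO objective (as I recall it from the paper) is

$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \tfrac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\Big)\Big]$

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ controls how far the policy may drift from it.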

The amazing thing about this paper is how fundamentally self-proven it is - from clearly stating the problem to explicitly explaining the underlying theory backed with mathematical proofs, it’s just genius.

In my opinion, every academic research paper should follow this approach. It won the 2023 NeurIPS Outstanding paper award (Category: Outstanding Main Track Runner-Ups).
posted an update 2 months ago
Retrieval-Augmented Generation (RAG)
Redeemer of the "hallucination problem"

It is fair enough to argue that "hallucinations" in LLMs are mere reflections of what we humans occasionally do - well, it gets worse as we get older - but these models are brain-inspired, thus such behaviors are likely inherently unavoidable. After all, we are just dreamers trying to make sense of this life.

The best we can do is minimize and control it - but humanly how? By first feeding on relevant facts and then developing a habit that allows us to easily access those facts when needed. This is what RAG is all about - it's just a control mechanism that keeps the LLM aligned with reality and fact.

But How Does RAG Work?

Well, to some extent it is domain-specific but the overall workflow boils down to the following:

1. It makes use of a retrieval mechanism that hunts for facts relevant to a query - in the original formulation this is trained end-to-end, with a retriever (Query Encoder + Document Index, the source of truth) paired with a pre-trained generative model.

2. The generative model then uses the retrieved facts, performing some verification, to give a more accurate response.

To summarize, the RAG architecture houses a pre-existing knowledge source model (termed parametric memory), which then utilizes a Source-of-Truth model or vector indexed data (termed non-parametric memory) that is accessed by a pre-trained neural retriever, in order to produce more informed, contextually appropriate and factually correct responses.
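
Roughly, in the original RAG formulation (Lewis et al., as best I recall), the generator's output is marginalized over the top-k retrieved documents $z$:

$p(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z)$

with $p_\eta$ the retriever (the non-parametric memory lookup) and $p_\theta$ the pre-trained generator (the parametric memory).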

Sort of a "Genius Engine" if you might say. If only we humans could harness such, AGI would be much much sooner lol.

In the meantime, I have been Jaward Sesay (Chinese name 苏杰, Sujie) - a young Sierra Leonean and aspiring AI researcher. I like to read, share, and try implementing AI research papers. I also like dunking on big tech while rooting for open source. My mentor is @karpathy - I dream of him following me back on X lol. Thanks.
posted an update 2 months ago
Speaking of the missing piece in today’s generative AI: Reasoning (or more appropriately, the proper use of Common-Sense)

Human Intelligence is hinged on the brain’s ability to learn vast amounts of background knowledge about the world just by passively observing it. Such common-sense information is believed to be the enabler of intelligent behavior (planning, reasoning and grounding).

Unusual question: how do we actually learn common-sense knowledge?

Unusual opinion: I personally believe we haven’t fully understood how the brain learns, and thus cannot get machines to mimic how we learn.

Well, so far AI godfather Prof. Yann LeCun has quite a promising vision of how machines can learn world models like we humans do. Excited to share his vision after giving I-JEPA a read.

I-JEPA (Image-based Joint-Embedding Predictive Architecture) is a novel approach to self-supervised learning from images. The method focuses on learning semantic image features without relying on hand-crafted data augmentations. Instead, I-JEPA predicts the representations of multiple target blocks within a single image from a single context block.

The I-JEPA architecture consists of a context encoder, a target encoder, and a predictor. The context encoder extracts context features from a context block, while the target encoder extracts target features from the target blocks. The predictor then uses the context features to predict the target features.
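
A very simplified sketch of that training step (the real encoders are ViTs, the predictor is conditioned on target-block positions, and the target encoder is an EMA copy of the context encoder - none of which is shown here):

import torch
import torch.nn as nn
import torch.nn.functional as F

context_encoder = nn.Linear(256, 128)               # stand-in for a ViT over the context block
target_encoder  = nn.Linear(256, 128)               # EMA copy of the context encoder in the paper
predictor       = nn.Linear(128, 128)               # predicts target features from context features

context_patches = torch.randn(8, 256)               # fake context block features
target_patches  = torch.randn(8, 256)               # fake target block features

with torch.no_grad():                                # targets come from the frozen/EMA encoder
    target_feats = target_encoder(target_patches)

pred = predictor(context_encoder(context_patches))   # predict target representations
loss = F.smooth_l1_loss(pred, target_feats)          # loss in representation space, no pixel reconstruction
loss.backward()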

One of the main advantages of I-JEPA is that it is non-generative: it predicts in representation space rather than reconstructing pixels, and it does not rely on hand-crafted data augmentations. It also uses a multi-block masking strategy, which allows it to learn semantic representations more effectively.

This is very promising, hopefully we can look back this one day and amuse at how we got it right.

Paper: https://arxiv.org/abs/2301.08243
Code: https://github.com/facebookresearch/ijepa
posted an update 2 months ago
LLM “Patchnization”

Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder.

Code: https://github.com/Jaykef/min-patchnizer

The code first extracts still images (frames) from a video, splits the image frames into smaller fixed-size patches, linearly embeds each of them, adds position embeddings, and then saves the resulting sequence of vectors for use in a Vision Transformer encoder. I tried training the resulting sequence vectors with Karpathy's minbpe and it took ~2173s per frame to tokenize. The whole "patchnization" took ~77.4s for a 20s video on my M2 Air.

The files in the repo work as follows:

1. patchnizer.py: Holds code for a simple implementation of the three stages involved (extract_image_frames from the video, reduce image_frames_to_patches of fixed size 16x16 pixels, then linearly_embed_patches into a 1D vector sequence with additional position embeddings).

2. patchnize.py: performs the whole process with custom configs (patch_size, created dirs, video - I am using the "dogs playing in snow" video by sora).

3. train.py: Trains the resulting one-dimensional vector sequence (linear_patch_embeddings + position_embeddings) on Karpathy's minbpe (a minimal implementation of the byte-pair encoding algorithm).

4. check.py: Checks if the patch embeddings match the original image patches by recovering the image frames from their corresponding image patches.

The Patchnizer class has three stubs:
- extract_image_frames() chops the video (20 sec) into 60 frames (i.e. each frame is ~0.33 sec), each of size 1280x720 pixels (the original video dims).
- image_frames_to_patches() grids each image frame into 16x16-pixel tiles, giving each frame a total of 3600 image patches (i.e. 45 rows by 80 columns).
- linearly_embed_patches() turns the image patches into patch embeddings (a long string of integers for each image patch) then adds a position embedding for each patch.
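
For the patch step, the core reshape looks something like this (my own sketch, not the repo's exact code):

import numpy as np

def image_to_patches(frame, patch_size=16):
    # grid an HxWxC frame into (H/ps * W/ps) flat patches of length ps*ps*C
    H, W, C = frame.shape
    ph, pw = H // patch_size, W // patch_size
    patches = frame[:ph * patch_size, :pw * patch_size].reshape(
        ph, patch_size, pw, patch_size, C
    ).transpose(0, 2, 1, 3, 4).reshape(ph * pw, patch_size * patch_size * C)
    return patches

frame = np.zeros((720, 1280, 3), dtype=np.uint8)     # one 1280x720 frame
print(image_to_patches(frame).shape)                  # (3600, 768) -> 45 x 80 = 3600 patches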