Jaward posted an update Mar 7
LLM “Patchnization”

Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder.

Code: https://github.com/Jaykef/min-patchnizer

The code first extracts still images (frames) from a video, splits the frames into smaller fixed-size patches, linearly embeds each patch, adds position embeddings, and then saves the resulting sequence of vectors for use in a Vision Transformer encoder. I tried training the resulting sequence vectors with Karpathy's minbpe, and tokenizing took ~2173s per frame. The whole "patchnization" took ~77.4s for a 20s video on my M2 Air.
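The pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration with assumed shapes (1280x720 RGB frames, 16x16 patches, a randomly initialized projection and position table), not the repo's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# One stand-in video frame (720 high x 1280 wide, RGB).
frame = rng.integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
patch = 16
rows, cols = 720 // patch, 1280 // patch  # 45 x 80 grid of tiles

# Grid the frame into 16x16 tiles, then flatten each tile to one vector.
tiles = (frame.reshape(rows, patch, cols, patch, 3)
              .swapaxes(1, 2)
              .reshape(rows * cols, patch * patch * 3))  # (3600, 768)

# Linear projection plus learned position embeddings
# (both random here, just to show the shapes).
d_model = 64
proj = rng.standard_normal((tiles.shape[1], d_model))
pos = rng.standard_normal((tiles.shape[0], d_model))
embeddings = tiles.astype(np.float32) @ proj + pos  # (3600, 64)
print(embeddings.shape)  # (3600, 64)
```

Each frame thus becomes a sequence of 3600 patch vectors, which is the input format a ViT-style encoder expects.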

The files in the repo work as follows:

1. patchnizer.py: Holds a simple implementation of the three stages involved (extract_image_frames from the video, reduce image_frames_to_patches of a fixed 16x16-pixel size, then linearly_embed_patches into a 1D vector sequence with additional position embeddings).

2. patchnize.py: Performs the whole process with custom configs (patch_size, created dirs, and the input video - I am using the "dogs playing in snow" video by Sora).

3. train.py: Trains the resulting one-dimensional vector sequence (linear_patch_embeddings + position_embeddings) on Karpathy's minbpe (a minimal implementation of the byte-pair encoding algorithm).

4. check.py: Checks if the patch embeddings match the original image patches by recovering the image frames from their corresponding image patches.
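The round-trip idea behind check.py can be sketched as follows: patchify a frame, reassemble it from its patches, and verify nothing was lost. Shapes are assumed from the post (720x1280 frames, 16x16 patches); this is not the repo's code:

```python
import numpy as np

def to_patches(frame, p=16):
    """Grid an (H, W, C) frame into a (H/p * W/p, p, p, C) stack of tiles."""
    h, w, c = frame.shape
    return (frame.reshape(h // p, p, w // p, p, c)
                 .swapaxes(1, 2)
                 .reshape(-1, p, p, c))

def from_patches(patches, h, w):
    """Inverse of to_patches: reassemble tiles back into an (H, W, C) frame."""
    p, c = patches.shape[1], patches.shape[3]
    return (patches.reshape(h // p, w // p, p, p, c)
                   .swapaxes(1, 2)
                   .reshape(h, w, c))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
recovered = from_patches(to_patches(frame), 720, 1280)
print(np.array_equal(frame, recovered))  # True - the round trip is lossless
```

Because patchifying is just a reshape, the check passes exactly; any mismatch would indicate a bug in the patch ordering.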

The Patchnizer class has three stubs:
- extract_image_frames() which chops the 20s video into 60 frames (i.e. each frame covers ~0.33s), each of size 1280x720 pixels (the original video dimensions).
- image_frames_to_patches() which grids each frame into 16x16-pixel tiles. This gives each frame a total of 3600 image patches (i.e. 80 columns by 45 rows).
- linearly_embed_patches() which turns the image patches into patch embeddings (a long string of integers for each image patch), then adds a position embedding for each patch.
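The numbers above check out with some quick arithmetic (values taken from the post, not the repo):

```python
# 20s video split into 60 frames, 1280x720 each, 16x16 patches.
video_secs, n_frames, patch = 20, 60, 16
w, h = 1280, 720
print(video_secs / n_frames)        # ~0.33 s per frame
print(w // patch, h // patch)       # 80 columns x 45 rows
print((w // patch) * (h // patch))  # 3600 patches per frame
```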