Papers I Like
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Paper • 2402.15627 • Published • 34
Note: Megatron: > Much of our model parallel approach can be characterized as techniques aimed at reducing communication and keeping the GPUs compute bound
MegaScale:
1. Architectural changes to improve pipelining / performance (PTB, SWA)
2. Overlap communication (ZeRO, fuse collectives w/ weight-proj, triple-buffering b/w layers)
3. Other "micro"-optimizations (FlashAttention, fuse all the things, LAMB optimizer)
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88
Note: Argues that Sora likely:
1. Encodes video into a discrete tokenized latent space (e.g. vq-ViViT style) with space-time latent patches
2. Adds noise
3. Feeds it into a standard DiT with conditioning cross-attending visual tokens
4. Autoregressively generates/removes noise over the whole video or the next frame
5. Runs the decoder on the cleaned-up latent patches (reassembled in the correct aspect ratio) to get back to video pixel space
I think they encode then patch (like WALT), since 3D convs preserve aspect ratio. A toy sketch of this hypothesized pipeline is below.
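A pseudocode-level sketch of the pipeline hypothesized above, with everything stubbed out so it runs: nothing here is confirmed about Sora, `patchify`/`dit`/`decode` are stand-ins, and the denoising update is a placeholder rather than a real DDPM/DDIM sampler.

```python
import torch

# Toy stubs so the sketch executes end to end; a real system would use a learned
# video VAE/tokenizer, an actual DiT, and a proper diffusion sampler.
B, C, T, H, W, P, STEPS = 1, 8, 4, 16, 16, 4, 10
D = C * P * P                                   # per-patch token dimension

def patchify(z):                                # (B, C, T, H, W) -> (B, N, D) space-time patches
    return (z.reshape(B, C, T, H // P, P, W // P, P)
             .permute(0, 2, 3, 5, 1, 4, 6).reshape(B, -1, D))

def unpatchify(tokens):                         # inverse of patchify
    return (tokens.reshape(B, T, H // P, W // P, C, P, P)
                  .permute(0, 4, 1, 2, 5, 3, 6).reshape(B, C, T, H, W))

def dit(tokens, text_emb, t):                   # stub "DiT": a real one cross-attends to text_emb
    return 0.1 * tokens

def decode(z):                                  # stub decoder back to pixel space (keeps aspect ratio)
    return z[:, :3].repeat_interleave(2, dim=-2).repeat_interleave(2, dim=-1)

def generate(text_emb):
    z = torch.randn(B, C, T, H, W)              # 2. start from noise in the video latent space
    tokens = patchify(z)                        # 1. space-time latent patches
    for t in reversed(range(STEPS)):            # 4. iteratively denoise the whole clip
        tokens = tokens - dit(tokens, text_emb, t)   # 3. DiT noise prediction (toy update rule)
    return decode(unpatchify(tokens))           # 5. reassemble patches, decode to pixels

video = generate(text_emb=torch.randn(B, 1, D))      # -> (B, 3, T, 2H, 2W)
```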
Beyond Language Models: Byte Models are Digital World Simulators
Paper • 2402.19155 • Published • 49
Note: CPU instructions: > In this example, bGPT flawlessly executed all 251 consecutive instructions, achieving a perfect performance in modelling CPU states by predicting the next state from the current state and an instruction. For clarity, we translate byte sequences into a readable format, with the original binary file accessible here.
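A minimal sketch of how the CPU-state task could be framed as byte-level prediction, i.e. next-state bytes conditioned on current-state plus instruction bytes; the register layout, encoding, and separator byte below are assumptions for illustration, not bGPT's actual data format.

```python
import struct

def encode_state(registers):
    """Serialize a few (assumed 32-bit) registers into raw little-endian bytes."""
    return b"".join(struct.pack("<I", r) for r in registers)

def make_example(cur_state, instruction, next_state, sep=b"\xff"):
    """Byte-level (input, target) pair: the model reads state + instruction bytes
    and must predict the bytes of the next state."""
    inp = encode_state(cur_state) + sep + instruction + sep
    tgt = encode_state(next_state)
    return inp, tgt

inp, tgt = make_example(cur_state=[0, 7, 42],
                        instruction=b"\x01\x02",   # made-up opcode + operand
                        next_state=[7, 7, 42])
```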
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 19
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Paper • 2006.16236 • Published • 3
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Paper • 2309.12288 • Published • 3
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Paper • 2403.03507 • Published • 183
Note:
1. Grab the gradient from backpropping the loss:
   - G_t = -∇_W φ_t(W_t), where G_t is the gradient matrix at timestep t
2. Gradient projection:
   - SVD for P and Q: [U, S, V] = SVD(G_t), then P_t = U[:, :r] and Q_t = V[:, :r]. Only recomputed every T steps.
   - Low-rank projection R_t = P_t^T G_t Q_t (note R_t could be diagonal)
3. Weight update:
   - N_t = ρ_t(R_t), where ρ_t is the optimizer update (e.g. Adam)
   - G̃_t = P_t N_t Q_t^T (unproject)
   - W_{t+1} = W_t + η · G̃_t, where η is the learning rate
A minimal code sketch of one step is below.
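A minimal PyTorch sketch of a single GaLore-style step following the note's recipe, assuming `W` is a 2D weight, `G` is the plain backprop gradient ∇_W φ (so the update subtracts, equivalent to the note's sign convention), and a bare Adam update on the projected r×r matrix. Function names and the Adam bookkeeping are illustrative, not the authors' implementation.

```python
import torch

def update_projectors(G, r):
    """Recompute rank-r projectors from an SVD of the gradient (done every T steps)."""
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U[:, :r], Vh.T[:, :r]                          # P_t (m x r), Q_t (n x r)

def galore_step(W, G, P, Q, exp_avg, exp_avg_sq, step,
                lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    R = P.T @ G @ Q                                       # project: R_t = P^T G Q  (r x r)
    exp_avg.mul_(betas[0]).add_(R, alpha=1 - betas[0])            # Adam first moment
    exp_avg_sq.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])  # Adam second moment
    m_hat = exp_avg / (1 - betas[0] ** step)
    v_hat = exp_avg_sq / (1 - betas[1] ** step)
    N = m_hat / (v_hat.sqrt() + eps)                      # N_t = rho_t(R_t)
    W.add_(P @ N @ Q.T, alpha=-lr)                        # unproject and step: W <- W - lr * G~_t
    return W

# Usage: Adam state lives in r x r; projectors refreshed every T steps.
m, n, r, T = 64, 32, 4, 200
W = torch.randn(m, n)
exp_avg, exp_avg_sq = torch.zeros(r, r), torch.zeros(r, r)
for step in range(1, 11):
    G = torch.randn(m, n)                 # stand-in for the real backprop gradient
    if (step - 1) % T == 0:
        P, Q = update_projectors(G, r)
    galore_step(W, G, P, Q, exp_avg, exp_avg_sq, step)
```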
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Paper • 2304.08818 • Published • 7
Note: Extra interesting mention in the WALT paper I missed the first few times: > However, similar to Blattmann et al. [4], we can also potentially leverage pretrained image LDMs with transformer backbones by simply interleaving STW layers.
They do share very similar ViT designs, e.g. the interleaved spatial layers (frozen for Blattmann, windowed for WALT) and (spatio)temporal layers (full space-time attention + Conv3D for Blattmann, windowed for WALT). A rough sketch of the interleaving pattern is below.
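A rough PyTorch sketch of the shared interleaving idea, assuming plain (non-windowed) spatial and temporal self-attention over a (B, T, N, D) token grid; the module names are illustrative and omit both WALT's windowing and Blattmann et al.'s Conv3D path.

```python
import torch
import torch.nn as nn

class AttnBlock(nn.Module):
    """Pre-norm multi-head self-attention over whichever axis is folded into the sequence."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (batch, seq, dim)
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class InterleavedViT(nn.Module):
    """Alternating spatial / temporal attention over video tokens of shape (B, T, N, D).

    Spatial blocks attend across the N patches of each frame (these could carry frozen,
    image-pretrained weights); temporal blocks attend across the T frames at each patch
    location, which is where the video-specific capacity lives."""
    def __init__(self, dim, depth=4):
        super().__init__()
        self.spatial = nn.ModuleList(AttnBlock(dim) for _ in range(depth))
        self.temporal = nn.ModuleList(AttnBlock(dim) for _ in range(depth))

    def forward(self, x):                     # x: (B, T, N, D)
        B, T, N, D = x.shape
        for s_blk, t_blk in zip(self.spatial, self.temporal):
            x = s_blk(x.reshape(B * T, N, D)).reshape(B, T, N, D)     # spatial attention per frame
            x = t_blk(x.transpose(1, 2).reshape(B * N, T, D))         # temporal attention per location
            x = x.reshape(B, N, T, D).transpose(1, 2)
        return x

tokens = torch.randn(2, 8, 16, 64)            # (B, T, N, D)
out = InterleavedViT(dim=64)(tokens)
```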
Zoology: Measuring and Improving Recall in Efficient Language Models
Paper • 2312.04927 • Published • 2
Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Paper • 2403.05612 • Published • 3
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 74
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
What Algorithms can Transformers Learn? A Study in Length Generalization
Paper • 2310.16028 • Published • 2
Function Vectors in Large Language Models
Paper • 2310.15213 • Published • 1
In-Context Learning Creates Task Vectors
Paper • 2310.15916 • Published • 42
Scattered Mixture-of-Experts Implementation
Paper • 2403.08245 • Published • 1
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Paper • 2403.09636 • Published • 2