Diffusion models are widely used for image and video generation but remain underexplored in text generation, where autoregressive models (ARMs) dominate. Unlike ARMs, which produce tokens one at a time, diffusion models iteratively refine a noised sequence over a series of denoising steps, updating many tokens in parallel; this offers greater flexibility and speed. Recent work shows a shift toward using diffusion models in place of, or alongside, ARMs, combining the strengths of both approaches and bringing autoregressive ideas into diffusion.
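To make the iterative refinement concrete, here is a minimal sketch (in PyTorch, with a hypothetical `model` callable and toy sizes) of confidence-based parallel decoding, a common sampling pattern for text diffusion: start from an all-mask sequence, predict every position at once, keep the most confident tokens, and re-mask the rest for the next step.

```python
import torch

def diffusion_decode(model, seq_len, mask_id, steps=8):
    """Toy masked-diffusion sampler. `model` is any callable mapping
    token ids [seq_len] -> logits [seq_len, vocab]."""
    x = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(x)                   # predict all positions at once
        logits[:, mask_id] = float("-inf")  # never emit the mask token
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        x = pred                            # commit every prediction...
        n_remask = int(seq_len * (1 - (step + 1) / steps))
        if n_remask > 0:                    # ...then re-mask the least
            _, idx = conf.topk(n_remask, largest=False)  # confident ones
            x[idx] = mask_id
    return x

# Exercise the loop with a stand-in "model" that returns random logits:
toy = lambda ids: torch.randn(ids.shape[0], 1000)
print(diffusion_decode(toy, seq_len=16, mask_id=999))
```

The remask schedule shrinks to zero, so the final pass leaves no masks; the exact schedule and confidence rule vary from model to model.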
Here are 5 new implementations of diffusion models:
1. Mercury family of diffusion LLMs (dLLMs) by Inception Labs -> https://www.inceptionlabs.ai/news Mercury applies diffusion to text and code, generating sequences 10x faster than today's top LLMs. Mercury Coder, now available, runs at over 1,000 tokens/sec on NVIDIA H100s.
3. LLaDA -> Large Language Diffusion Models (2502.09992) Shows diffusion models' potential to replace ARMs. Trained via pre-training and SFT, LLaDA masks tokens, predicts them with a Transformer, and optimizes a likelihood bound. It matches key LLM capabilities and even surpasses GPT-4o on reversal poetry (see the training-loss sketch after this list).
5. Generalized Interpolating Discrete Diffusion (GIDD) -> Generalized Interpolating Discrete Diffusion (2503.04482) A flexible noising process with a novel diffusion ELBO combines masking and uniform noise, letting diffusion models correct their own mistakes, something ARMs struggle with (see the noising sketch below).
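For LLaDA's objective, here is a minimal sketch of a masked-diffusion likelihood bound, under assumed details (the function name, the per-sequence noise level, and the normalization are illustrative choices, not the authors' code): sample a noise level t, mask each token independently with probability t, and score the reconstruction of the masked positions with a 1/t weight.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """Sketch of a LLaDA-style likelihood bound (illustrative, not the
    authors' code). `model` maps ids [B, L] -> logits [B, L, vocab]."""
    B, L = x0.shape
    t = torch.rand(B, 1).clamp_min(1e-3)   # noise level per sequence
    masked = torch.rand(B, L) < t          # mask each token w.p. t
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)
    ce = F.cross_entropy(model(xt).transpose(1, 2), x0, reduction="none")
    # Only masked positions are scored; the 1/t weight turns the masked
    # cross-entropy into an upper bound on negative log-likelihood.
    return (ce * masked.float() / t).sum() / (B * L)
```

At inference, a model trained this way can be sampled with a parallel-decoding loop like the one sketched near the top of this post.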
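And for GIDD's combined noise, a sketch of an interpolating corruption step, assuming a simple mixing parameterization (`p_uniform` and the function name are illustrative; the paper derives a general interpolation schedule and a matching ELBO): each corrupted token becomes either the mask token or a uniformly random vocabulary token.

```python
import torch

def gidd_style_corrupt(x0, t, mask_id, vocab_size, p_uniform=0.2):
    """Illustrative interpolating noising step: corrupt each token with
    probability t; a corrupted token becomes MASK or a uniform sample."""
    corrupt = torch.rand(x0.shape) < t
    uniform = torch.rand(x0.shape) < p_uniform
    random_tok = torch.randint(0, vocab_size, x0.shape)
    noise = torch.where(uniform, random_tok, torch.full_like(x0, mask_id))
    return torch.where(corrupt, noise, x0)
```

Because training inputs now contain plausible-but-wrong tokens rather than only masks, the model learns to overwrite errors during denoising, which is the self-correction ability the paper highlights.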