new

Get trending papers in your email inbox!

Subscribe

Trending Papers

byAK and the research community

Trending Papers
Submitted by Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI Tongyi-MAI · Nov 27, 2025
Submitted by Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI Tongyi-MAI · Nov 27, 2025
Submitted by nuojohnchen

PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

PaperDebugger is an in-editor academic writing assistant that integrates large language models, enabling direct interaction within LaTeX editors for document state management, revision, and literature search.

Submitted by zengyh1900

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

The paper introduces Reward Forcing, which enhances video generation by updating sink tokens with EMA-Sink and using Rewarded Distribution Matching Distillation to prioritize dynamic content.

  • 12 authors
· Dec 4, 2025
Submitted by taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

facebook AI at Meta · Nov 20, 2025

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

  • 4 authors
· Dec 28, 2024
Submitted by RuoyuFeng

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Semantic-First Diffusion (SFD) enhances image generation by asynchronously denoising semantic and texture latents, improving convergence and quality.

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

  • 5 authors
· Oct 8, 2024
Submitted by taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle PaddlePaddle · Oct 16, 2025
Submitted by taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

  • 61 authors
· Sep 26, 2025
Submitted by wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

  • 18 authors
· Sep 27, 2024
Submitted by akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

  • 5 authors
· Mar 20, 2024

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

  • 9 authors
· Oct 23, 2024
Submitted by taesiri

SAM 3D: 3Dfy Anything in Images

SAM 3D is a generative model that reconstructs 3D objects from single images using a multi-stage training framework that includes synthetic pretraining and real-world alignment, achieving high performance in human preference tests.

facebook AI at Meta · Nov 20, 2025
Submitted by zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab RUC-DataLab · Oct 19, 2025
Submitted by akhaliq

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

WizardCoder, a Code LLM fine-tuned with complex instructions using Evol-Instruct, outperforms other open-source and closed LLMs on several code generation benchmarks.

microsoft Microsoft · Jun 14, 2023
Submitted by probejie

CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

CLaRa enhances retrieval-augmented generation by introducing unified embedding-based compression and joint optimization, achieving state-of-the-art performance in QA benchmarks.

apple Apple · Nov 24, 2025
Submitted by daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training LLMs in various agents, using a hierarchical RL algorithm and decoupling execution from training to handle complex interactions.

  • 8 authors
· Aug 5, 2025

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

  • 5 authors
· Feb 8, 2025
Submitted by KaituoFeng

OneThinker: All-in-one Reasoning Model for Image and Video

OneThinker, an all-in-one multimodal reasoning model, unifies image and video understanding across various tasks using RL and demonstrates strong performance and knowledge transfer.

  • 14 authors
· Dec 2, 2025
Submitted by ChrisDing1105

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

ARM-Thinker, an agentic reward model, uses external tools for verification, improving accuracy and interpretability in complex multimodal reasoning tasks.

internlm Intern Large Models · Dec 4, 2025
Submitted by thomagram

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

STARFlow-V, a normalizing flow-based video generator, offers end-to-end learning, robust causal prediction, and high-quality video generation with practical sampling efficiency.

apple Apple · Nov 25, 2025
Submitted by taesiri

HunyuanOCR Technical Report

HunyuanOCR, a lightweight Vision-Language Model, achieves state-of-the-art performance in OCR tasks through a unified end-to-end architecture combining Vision Transformer and lightweight LLM, supported by data-driven and RL strategies.

Tencent-Hunyuan Tencent Hunyuan · Nov 24, 2025
Submitted by fengerhu

MobiAgent: A Systematic Framework for Customizable Mobile Agents

MobiAgent, a comprehensive mobile agent system, achieves state-of-the-art performance in real-world mobile scenarios through its MobiMind-series models, AgentRR framework, and MobiFlow benchmarking suite, while also reducing data annotation costs.

  • 10 authors
· Aug 30, 2025
Submitted by dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

  • 6 authors
· Oct 26, 2025

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

  • 5 authors
· Jan 20, 2025
Submitted by Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

  • 47 authors
· Apr 14, 2025
Submitted by Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

open-gigaai GigaAI · Oct 22, 2025
Submitted by akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

  • 5 authors
· Apr 28, 2025
Submitted by jiaruz2

Latent Collaboration in Multi-Agent Systems

LatentMAS enables efficient, lossless collaboration among LLM agents in latent space, improving performance and reducing computational costs compared to text-based methods.

Gen-Verse Princeton-AI · Nov 25, 2025
Submitted by jiamingZ

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

SteadyDancer, an Image-to-Video framework, ensures first-frame identity preservation and precise motion control through harmonized conditions, adaptive pose representation, and hierarchical training objectives.

Submitted by taesiri

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld-0 is a unified world model framework that integrates video generation and 3D modeling to produce high-quality, diverse, and physically plausible VLA data, enabling strong real-world performance in embodied AI without real-world training.

  • 25 authors
· Nov 25, 2025
Submitted by AdinaY

Depth Anything 3: Recovering the Visual Space from Any Views

Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.

ByteDance-Seed ByteDance Seed · Nov 13, 2025
Submitted by Forceless

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

PPTAgent, a two-stage approach, improves presentation generation by analyzing reference presentations and ensuring structural and content consistency, outperforming traditional methods across content, design, and coherence.

  • 9 authors
· Jan 7, 2025
Submitted by tqliu

Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Light-X is a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control, outperforming existing methods in joint camera-illumination control and relighting.

  • 11 authors
· Dec 4, 2025
Submitted by xxiaoyali

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

CUDA-L2, a system combining large language models and reinforcement learning, optimizes Half-precision General Matrix Multiply CUDA kernels, achieving significant speedups over existing baselines in both offline and server modes.

deepreinforce-ai DeepReinforce · Dec 2, 2025
Submitted by Paranioar

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

NEO, a novel family of native Vision-Language Models, addresses fundamental constraints and integrates vision and language within a unified framework, achieving competitive performance with limited data.

SenseTime SenseTime · Oct 16, 2025
Submitted by wymanCV

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Stable Video Infinity generates infinite-length videos with high temporal consistency and controllable storylines by using Error-Recycling Fine-Tuning on the Diffusion Transformer.

epfl-vita EPFL VITA Lab · Oct 10, 2025
Submitted by taesiri

Code2Video: A Code-centric Paradigm for Educational Video Generation

Code2Video generates educational videos using a code-centric agent framework, improving coherence and interpretability compared to direct code generation.

showlab Show Lab · Oct 1, 2025

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

  • 11 authors
· Jun 28, 2020
Submitted by wchengad

Step1X-Edit: A Practical Framework for General Image Editing

Step1X-Edit, an image editing model that uses multimodal LLM and diffusion image decoder, outperforms open-source models and approaches proprietary models in quality.

  • 24 authors
· Apr 24, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

  • 16 authors
· Apr 21, 2023
Submitted by Owen777

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

LucidFlux, a caption-free UIR framework using a diffusion transformer, achieves robust image restoration through adaptive conditioning and SigLIP features without text prompts.

W2GenAI Lab · Sep 26, 2025
Submitted by akhaliq

PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

PosterCopilot enhances professional graphic design through a three-stage training strategy for LMMs, enabling geometrically accurate and aesthetically superior layouts with controllable iterative editing.

  • 7 authors
· Dec 3, 2025
Submitted by AlexiaJM

Less is More: Recursive Reasoning with Tiny Networks

Tiny Recursive Model (TRM) achieves high generalization on complex puzzle tasks using a small, two-layer network with minimal parameters, outperforming larger language models.

Submitted by Seongyun

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

PaperCoder is a multi-agent LLM framework that converts machine learning papers into functional code repositories through planning, analysis, and generation stages.

  • 4 authors
· Apr 24, 2025
Submitted by nielsr

Real-Time Object Detection Meets DINOv3

DEIMv2, an extended version of DEIM with DINOv3 features, achieves superior performance-cost trade-offs across diverse deployment scenarios, including GPU, edge, and mobile, by using Spatial Tuning Adapter and HGNetv2 with pruning.

  • 5 authors
· Sep 25, 2025
Submitted by SteveZeyuZhang

EvoVLA: Self-Evolving Vision-Language-Action Model

EvoVLA, a self-supervised VLA framework, enhances long-horizon robotic manipulation by addressing stage hallucination through triplet contrastive learning, pose-based exploration, and long-horizon memory, achieving improved success rates and sample efficiency on both simulated and real-world tasks.

PekingUniversity Peking University · Nov 20, 2025
Submitted by Sakits

VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

VLASH is an asynchronous inference framework for Vision-Language-Action models that achieves high speed and low latency without sacrificing accuracy, enabling precise robotic tasks like ping-pong and whack-a-mole.

mit-han-lab MIT HAN Lab · Nov 30, 2025
Submitted by haodongli

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

A two-stage deterministic framework, Lotus-2, leverages diffusion models' world priors for high-quality geometric inference, achieving state-of-the-art results in monocular depth estimation and competitive surface normal prediction with limited training data.

  • 4 authors
· Nov 30, 2025