Memento: Reconstruct to Remember for Consistent Long Video Generation
ERNIE Team, Baidu Inc.
Abstract
Long-form video generation requires recurring subjects to remain consistent across shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly optimize for plausible next-shot continuations and do not verify whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, the identity of recurring subjects may be diluted, overwritten, or forgotten.
In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions.
Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.
Model Card
| Property | Value |
|---|---|
| Base model | Wan2.2-A14B (T2V + I2V) |
| Architecture | Dual DiT (low-noise + high-noise) with flow matching, boundary at t=0.9 |
| LoRA rank | 128 |
| LoRA targets | Self-attention (Q/K/V/O) + Cross-attention (Q/K/V/O) + FFN |
| Memory module | KeyframeQuery (10 local + 2 global learnable queries) |
| Precision | bfloat16 |
| Resolution | 832×480, 81 frames per shot (~5s @ 16fps) |
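The boundary in the table can be read as a routing rule over the denoising timestep: the high-noise DiT handles t ≥ 0.9 and the low-noise DiT handles t < 0.9. The sketch below is a minimal illustration, assuming timesteps normalised to [0, 1]; `select_expert` and `shot_duration_seconds` are hypothetical helper names and are not taken from the Memento code base.

```python
BOUNDARY = 0.9  # switching point between the two DiT experts (from the model card)


def select_expert(t: float, boundary: float = BOUNDARY) -> str:
    """Return which expert would denoise at timestep t.

    Assumption: t is normalised to [0, 1]. Timesteps at or above the
    boundary go to the high-noise expert, the rest to the low-noise one.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError(f"timestep must lie in [0, 1], got {t}")
    return "high_noise" if t >= boundary else "low_noise"


def shot_duration_seconds(num_frames: int = 81, fps: int = 16) -> float:
    """Duration of one shot: 81 frames at 16 fps is 5.0625 s (the '~5s' above)."""
    return num_frames / fps
```

With these defaults, `select_expert(0.9)` selects the high-noise expert and any smaller timestep selects the low-noise one.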
Files
backbone_high_noise.safetensors # High-noise DiT: LoRA + KeyframeQuery (t ≥ 0.9)
backbone_low_noise.safetensors # Low-noise DiT: LoRA + KeyframeQuery (t < 0.9)
config.json # Model & inference configuration
README.md
Each weight file contains 824 tensors: 800 LoRA parameters + 24 KeyframeQuery parameters.
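One way to check the tensor counts after downloading is to list the keys of a weight file and group them by name. The sketch below is hedged: the substrings `"lora"` and `"query"` used for grouping are assumptions about how the tensors are named, and `count_tensor_groups` and `list_tensor_names` are hypothetical helpers. Only `safetensors.safe_open` and its `keys()` method are real library calls.

```python
from collections import Counter
from typing import Iterable, List


def count_tensor_groups(names: Iterable[str]) -> Counter:
    """Group tensor names into LoRA, KeyframeQuery, and everything else.

    Assumption: LoRA tensors contain "lora" in their name and
    KeyframeQuery tensors contain "query". Adjust to the real key names.
    """
    counts: Counter = Counter()
    for name in names:
        lowered = name.lower()
        if "lora" in lowered:
            counts["lora"] += 1
        elif "query" in lowered:
            counts["keyframe_query"] += 1
        else:
            counts["other"] += 1
    return counts


def list_tensor_names(path: str) -> List[str]:
    """Read only the tensor names from a .safetensors file (no weights loaded)."""
    from safetensors import safe_open  # imported lazily; requires `pip install safetensors`

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

For example, `count_tensor_groups(list_tensor_names("./models/memento/backbone_high_noise.safetensors"))` should report 800 LoRA and 24 KeyframeQuery tensors if the naming assumptions hold. A non-zero `other` count would indicate that the substrings need adjusting.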
Usage
Clone the inference code and download the base Wan2.2 models:
git clone https://github.com/ernie-research/Memento.git
cd Memento
# Download base models (Wan2.2-T2V-A14B and Wan2.2-I2V-A14B)
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./models/Wan2.2-T2V-A14B
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./models/Wan2.2-I2V-A14B
# Download Memento weights
huggingface-cli download ernie-research/Memento --local-dir ./models/memento
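For scripted setups, the same three downloads can be issued from Python through `huggingface_hub.snapshot_download`. This is a minimal sketch mirroring the CLI commands above; `DOWNLOADS` and `download_all` are hypothetical names introduced here for illustration.

```python
from typing import Dict, List

# Repository IDs and target directories, copied from the CLI commands above.
DOWNLOADS: Dict[str, str] = {
    "Wan-AI/Wan2.2-T2V-A14B": "./models/Wan2.2-T2V-A14B",
    "Wan-AI/Wan2.2-I2V-A14B": "./models/Wan2.2-I2V-A14B",
    "ernie-research/Memento": "./models/memento",
}


def download_all(downloads: Dict[str, str] = DOWNLOADS) -> List[str]:
    """Download every repository into its local directory and return the paths."""
    from huggingface_hub import snapshot_download  # requires `pip install huggingface_hub`

    return [
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in downloads.items()
    ]
```

Calling `download_all()` fetches the two base models and the Memento weights. Nothing is downloaded on import.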
Single-story inference (8× A100 80GB recommended):
bash run_inference.sh \
./story_rewritten_aligned/showcase/astronaut.json \
./results/astronaut \
./models/memento
See the GitHub repository for full inference and training instructions, story script formats, and example data.
Citation
@article{memento2026,
title = {Memento: Reconstruct to Remember for Consistent Long Video Generation},
author = {Wei, Xuan and Ji, Longbin and Wang, Guan and Liu, Xiangrui and Zhang, Zhenyu and Wang, Shuohuan and Sun, Yu and Hong, Qingqi},
year = {2026}
}
License
Apache 2.0. The base Wan2.2 models are subject to their own licenses.