---
license: openrail
---

# This repo contains the VPLM Dataset and pretrained checkpoints for RACCooN


See also: https://github.com/jaehong31/RACCooN

<br>
<img width="800" src="assets/raccoon_teaser.png"/>
<br>
<br>

RACCooN is a versatile, user-friendly video-to-paragraph-to-video generative framework
that supports multiple video editing capabilities, such as removal, addition, and modification,
through a unified pipeline. RACCooN consists of two principal stages:
Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V).

In the V2P stage, RACCooN uses a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions that capture both the broad context and object details without requiring complex human annotations. This lets users make precise, text-based edits to video content. In the P2V stage, the video generative model incorporates auto-generated narratives or instructions to improve the quality and accuracy of the generated content. It supports video object addition, inpainting, and attribute modification within a unified framework, and outperforms prior methods on existing video editing and inpainting benchmarks.

<br>
<img width="800" src="assets/raccoon_method.png"/>
<br>

<br>
<br>

# Description of VPLM Dataset

Multi-Objects Description
- Train: RACCooN/VPLM/gt_train.json
- Test: RACCooN/VPLM/gt_test.json

Single-Object Layout Prediction
- Train: RACCooN/VPLM/gt_train_layouts.json
- Test: RACCooN/VPLM/gt_test_layouts.json
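
The annotation files listed above are JSON, so they can be inspected with the Python standard library alone. The following is a minimal sketch, not an official loader: it assumes the files have already been downloaded to a local `RACCooN/VPLM` directory, and the helper name `describe_json` is hypothetical. The record schema is not documented here, so the code reports the top-level structure instead of assuming specific keys.

```python
import json
from pathlib import Path


def describe_json(path):
    """Load a JSON file and summarize its top-level structure.

    Makes no assumption about the VPLM schema; it only reports the
    container type, its size, and a small sample of keys.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["size"] = len(data)
    if isinstance(data, dict):
        # Top-level object: sample its own keys.
        summary["sample_keys"] = sorted(data)[:5]
    elif isinstance(data, list) and data and isinstance(data[0], dict):
        # Top-level array of records: sample the keys of the first record.
        summary["sample_keys"] = sorted(data[0])[:5]
    return data, summary


if __name__ == "__main__":
    # Assumed local root; adjust to wherever the repository was downloaded.
    root = Path("RACCooN/VPLM")
    for name in ("gt_train.json", "gt_test.json",
                 "gt_train_layouts.json", "gt_test_layouts.json"):
        path = root / name
        if path.exists():
            _, summary = describe_json(path)
            print(name, summary)
        else:
            print(name, "not found under", root)
```

Running the script once against each split is a quick way to see which fields the description and layout files expose before writing a task-specific data loader.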


# Description of Model Checkpoints

## V2P

Multi-Objects Description
- RACCooN/mllm_finetuned/multi_obj_projector.bin

Single-Object Description
- RACCooN/mllm_finetuned/single_obj_projector.bin

Single-Object Layout Prediction
- RACCooN/mllm_finetuned/layout_pred_projector.bin


## P2V
- RACCooN/unet_finetuned/diffusion_pytorch_model.safetensors
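
To check what the P2V checkpoint contains without loading any weights into memory, the `.safetensors` header can be read directly. The sketch below relies only on the published safetensors file layout: an 8-byte little-endian header length followed by a JSON header. The function names `read_safetensors_header` and `summarize_tensors` and the local path are assumptions for illustration; this is not part of the RACCooN codebase.

```python
import json
import os
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors.

    The file starts with an 8-byte little-endian unsigned integer giving
    the header length, followed by that many bytes of UTF-8 JSON.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header


def summarize_tensors(header):
    """Map each tensor name to (dtype, shape), skipping the metadata entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the repository was downloaded.
    path = "RACCooN/unet_finetuned/diffusion_pytorch_model.safetensors"
    if os.path.exists(path):
        tensors = summarize_tensors(read_safetensors_header(path))
        print(len(tensors), "tensors")
        # Print only the first few entries to keep the output short.
        for name, (dtype, shape) in list(tensors.items())[:5]:
            print(name, dtype, shape)
    else:
        print("checkpoint not found at", path)
```

To load the weights themselves, `safetensors.torch.load_file` returns a name-to-tensor dictionary. How that dictionary is wired into the editing pipeline is defined by the code in the GitHub repository linked above.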