nielsr (HF Staff) committed
Commit b03f31b · verified · 1 Parent(s): e6130f4

Improve model card: add metadata, abstract, and setup instructions


This PR enhances the model card for "Pulp Motion: Framing-aware multimodal camera and human motion generation" by:

- Adding `license: mit`, `pipeline_tag: text-to-video`, and `library_name: diffusers` to the YAML metadata for better discoverability and integration on the Hugging Face Hub.
- Integrating the paper's abstract directly into the model card content.
- Linking the paper to the Hugging Face Papers page: [Pulp Motion: Framing-aware multimodal camera and human motion generation](https://huggingface.co/papers/2510.05097).
- Incorporating the setup instructions from the GitHub README.

These changes give users a comprehensive overview of the model, its capabilities, and how it integrates with the Hugging Face ecosystem.
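As a rough illustration of what the added metadata enables, here is a minimal, untested sketch using `huggingface_hub`; the repository id below is a placeholder assumption, not taken from this PR:

```python
# Minimal sketch (assumptions: placeholder repo id, huggingface_hub installed).
from huggingface_hub import HfApi, snapshot_download

api = HfApi()

# The new `pipeline_tag: text-to-video` lets the model surface in task-filtered searches.
for model in api.list_models(filter="text-to-video", search="pulp motion", limit=5):
    print(model.id)

# Download the full repository (weights, SMPL assets, etc.) for local use.
# "<namespace>/pulpmotion-models" is a hypothetical repo id.
local_dir = snapshot_download(repo_id="<namespace>/pulpmotion-models")
print("Snapshot downloaded to:", local_dir)
```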

Files changed (1)
  1. README.md +11 -1
README.md CHANGED
@@ -1,3 +1,9 @@
+---
+license: mit
+pipeline_tag: text-to-video
+library_name: diffusers
+---
+
 <div align="center">
 
 # Pulp Motion: Framing-aware multimodal camera and human motion generation
@@ -16,6 +22,10 @@
 
 </div>
 
+This model was presented in the paper [Pulp Motion: Framing-aware multimodal camera and human motion generation](https://huggingface.co/papers/2510.05097).
+
+## Abstract
+Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions, and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains on textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings setting the new state of the art for this task.
 
 <div align="center">
   <a href="https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/" class="button"><b>[Webpage]</b></a> &nbsp;&nbsp;&nbsp;&nbsp;
@@ -43,4 +53,4 @@ Prepare the dataset (untar archives):
 ```
 cd pulpmotion-models
 sh download_smpl
-```
+```
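For reviewers who want to double-check that the front matter added in this PR parses as intended, here is a small sketch using `huggingface_hub.ModelCard`; it assumes the edited `README.md` is checked out locally at the given path:

```python
# Sketch: verify the YAML front matter added in this PR parses as expected.
# Assumes the edited README.md is available locally at this path.
from huggingface_hub import ModelCard

card = ModelCard.load("README.md")
assert card.data.license == "mit"
assert card.data.pipeline_tag == "text-to-video"
assert card.data.library_name == "diffusers"
print(card.data)
```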