---
license: mit
---
🌊 CascadeFormer: Two-stage Cascading Transformer for Human Action Recognition
News
- [August 31, 2025] Our paper is available on arXiv!
- [July 19, 2025] Model checkpoints are publicly available on Hugging Face for further analysis and application!
CascadeFormer
Overview of the masked pretraining component in CascadeFormer. A fixed percentage of joints are randomly masked across all frames in each video. The partially masked skeleton sequence is passed through a feature extraction module to produce frame-level embeddings, which are then input into a temporal transformer (T1). A lightweight linear decoder is applied to reconstruct the masked joints, and the model is optimized using mean squared error over the masked positions. This stage enables the model to learn generalizable spatiotemporal representations prior to supervised finetuning.
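The PyTorch sketch below illustrates how this masked pretraining stage could be wired together. It is a minimal illustration under assumed hyperparameters (joint count, embedding size, mask ratio) and illustrative module names, not the released implementation.

```python
import torch
import torch.nn as nn

class MaskedPretrainer(nn.Module):
    """Minimal sketch of the masked pretraining stage (names and sizes are assumptions)."""

    def __init__(self, num_joints=25, coord_dim=3, d_model=256, mask_ratio=0.4):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Feature extraction: project flattened per-frame joints to frame-level embeddings.
        self.feature_extractor = nn.Linear(num_joints * coord_dim, d_model)
        # Temporal transformer T1 over the frame sequence.
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.t1 = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Lightweight linear decoder reconstructing joint coordinates per frame.
        self.decoder = nn.Linear(d_model, num_joints * coord_dim)

    def forward(self, skeletons):
        # skeletons: (batch, frames, joints, coords)
        B, T, J, C = skeletons.shape
        # Randomly mask a fixed percentage of joints across all frames of each video.
        joint_mask = torch.rand(B, 1, J, 1, device=skeletons.device) < self.mask_ratio
        joint_mask = joint_mask.expand(B, T, J, C)
        masked_input = skeletons.masked_fill(joint_mask, 0.0)

        # Frame embeddings from the partially masked skeleton sequence.
        frames = self.feature_extractor(masked_input.reshape(B, T, J * C))
        frames = self.t1(frames)
        recon = self.decoder(frames).reshape(B, T, J, C)

        # Mean squared error over the masked joint positions only.
        loss = ((recon - skeletons) ** 2)[joint_mask].mean()
        return loss
```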
Overview of the cascading finetuning component in CascadeFormer. The frame embeddings produced by the pretrained temporal transformer backbone (T1) are passed into a task-specific transformer (T2) for hierarchical refinement. The output of T2 is fused with the original embeddings via a cross-attention module. The resulting fused representations are aggregated through frame-level average pooling and passed to a lightweight classification head. The entire model, including T1, T2, and the classification head, is optimized using cross-entropy loss on action labels during finetuning.
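A companion sketch of the cascading finetuning stage is shown below, again with assumed names and shapes. It reuses the pretrained feature extractor and T1 from the sketch above, adds the task-specific transformer T2, fuses the two streams with cross-attention, average-pools over frames, and classifies with cross-entropy.

```python
import torch
import torch.nn as nn

class CascadeFinetuner(nn.Module):
    """Minimal sketch of the cascading finetuning stage (names and sizes are assumptions)."""

    def __init__(self, pretrained_backbone, d_model=256, num_classes=60):
        super().__init__()
        # Pretrained feature extractor and temporal transformer T1 from stage one.
        self.backbone = pretrained_backbone
        # Task-specific transformer T2 for hierarchical refinement.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.t2 = nn.TransformerEncoder(layer, num_layers=2)
        # Cross-attention fusing T2 outputs (queries) with the T1 embeddings (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Lightweight classification head over pooled frame representations.
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, skeletons):
        B, T, J, C = skeletons.shape
        # Frame embeddings from the pretrained backbone (no masking at finetuning time).
        frames = self.backbone.feature_extractor(skeletons.reshape(B, T, J * C))
        t1_out = self.backbone.t1(frames)
        # Hierarchical refinement, then cross-attention fusion with the T1 embeddings.
        t2_out = self.t2(t1_out)
        fused, _ = self.cross_attn(query=t2_out, key=t1_out, value=t1_out)
        # Frame-level average pooling followed by classification.
        logits = self.head(fused.mean(dim=1))
        return logits

# Training sketch: the entire model (T1, T2, and the head) is updated with cross-entropy.
# criterion = nn.CrossEntropyLoss()
# loss = criterion(model(skeletons), action_labels)
```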
Evaluation
Overall accuracy of CascadeFormer variants on three datasets. CascadeFormer 1.0 consistently achieves the highest accuracy on Penn Action and both NTU60 splits, while CascadeFormer 1.1 excels on N-UCLA. All checkpoints are open-sourced for reproducibility.
Citation
Please cite our work if you find it useful:
@misc{peng2025cascadeformerfamilytwostagecascading,
  title={CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition},
  author={Yusen Peng and Alper Yilmaz},
  year={2025},
  eprint={2509.00692},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.00692},
}
Contacts
If you have any questions or suggestions, feel free to contact:
- Yusen Peng (peng.1007@osu.edu)
- Alper Yilmaz (yilmaz.15@osu.edu)
Alternatively, open an issue in this repository.