---
license: apache-2.0
pipeline_tag: video-audio-to-text
library_name: transformers
---

This repository contains the pretrained and fine-tuned weights for the model introduced in the paper "Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation".

[Paper](https://huggingface.co/papers/2503.13068)

GitHub: https://github.com/GeWu-Lab/Crab