---
license: apache-2.0
language:
- en
tags:
- audio-captioning
- audiocaps
- clotho
- dcase-challenge
- icassp-24
---

## Summary

This repo contains the config & pretrained weights of the model described in the following paper:

- **Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation**
  Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
  Int. Conf. on Acoustics, Speech, and Signal Processing (**ICASSP**) 2024
  [[arXiv page](https://arxiv.org/abs/2309.17352)]

## GitHub Repository

To use this model, please refer to our code published at:

- https://github.com/slSeanWU/beats-conformer-bart-audio-captioner

## Training Data

- Pretrain
  - **AudioCaps**: https://github.com/cdjkim/audiocaps/tree/master
  - **ChatGPT mix-ups from Clotho**: https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K
- Finetune
  - **Clotho (V2)**: https://zenodo.org/records/4783391

## BibTex

If you find our model useful, please consider citing our paper. Thanks!

```
@inproceedings{wu2024improving,
  title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
  author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
  booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}
```