---
language: en
license: mit
tags:
- audio
- captioning
- text
- audio-captioning
- automated-audio-captioning
task_categories:
- audio-captioning
---

# CoNeTTE (ConvNext-Transformer with Task Embedding) for Automated Audio Captioning

<font color='red'>This model is currently in development, and not all of the required files are available yet.</font>

This model generates a short textual description of any audio file.

## Installation
```bash
pip install conette
```

## Usage
```py
from conette import CoNeTTEConfig, CoNeTTEModel

# Load the configuration and the pretrained weights from the Hugging Face Hub.
config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

# Caption a single audio file.
path = "/my/path/to/audio.wav"
outputs = model(path)
# "cands" holds the generated captions; index 0 is the caption for this file.
cands = outputs["cands"][0]
print(cands)
```
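
To caption several files, one option is to repeat the single-path call shown above in a loop. The sketch below is only an illustration: `caption_files` is a hypothetical helper that is not part of `conette`, and it assumes nothing beyond what the usage example shows, namely that the model can be called on one path and returns a dictionary whose `"cands"` entry is a list of captions.

```python
from typing import Any, Callable, Dict, Iterable


def caption_files(
    captioner: Callable[[str], Dict[str, Any]],
    paths: Iterable[str],
) -> Dict[str, str]:
    """Hypothetical helper (not part of conette): caption files one by one.

    `captioner` is any callable that takes a single audio path and returns
    a dict with a "cands" list, as in the usage example above.
    """
    captions: Dict[str, str] = {}
    for path in paths:
        outputs = captioner(str(path))
        # Keep the first candidate, i.e. the caption for this single input.
        captions[str(path)] = outputs["cands"][0]
    return captions
```

With the model loaded as in the usage example, `caption_files(model, ["first.wav", "second.wav"])` would return a dictionary mapping each path to its caption. The file names here are placeholders.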

## Single model performance
| Dataset | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
| ------------- | ------------- | ------------- | ------------- |
| AudioCaps | 44.14 | 43.98 | 60.81 |
| Clotho | 30.97 | 30.87 | 51.72 |
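
For readers unfamiliar with the metrics in the table, the sketch below shows how SPIDEr and SPIDEr-FL are commonly defined: SPIDEr is the mean of CIDEr-D and SPICE, and SPIDEr-FL scales the SPIDEr score of a caption down when a fluency error is detected in it. This is not the evaluation code used to produce the table. The function names are hypothetical, the default penalty of 0.9 is an assumption based on the DCASE 2023 task definition, and the input values are made up for illustration.

```python
from typing import Sequence


def spider(cider_d: float, spice: float) -> float:
    """SPIDEr as the arithmetic mean of CIDEr-D and SPICE."""
    return (cider_d + spice) / 2.0


def spider_fl(spider_score: float, has_fluency_error: bool, penalty: float = 0.9) -> float:
    """Penalized SPIDEr for one caption.

    Assumption: a detected fluency error multiplies the score by (1 - penalty).
    """
    if has_fluency_error:
        return spider_score * (1.0 - penalty)
    return spider_score


def corpus_spider_fl(scores: Sequence[float], errors: Sequence[bool]) -> float:
    """Average the per-caption penalized scores over a corpus."""
    penalized = [spider_fl(s, e) for s, e in zip(scores, errors)]
    return sum(penalized) / len(penalized)


# Illustrative values only, not taken from any dataset.
example = spider(cider_d=0.5, spice=0.25)
```

The table reports these scores as percentages, so a score of 0.4414 appears as 44.14.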

## Citation
The preprint version of the paper describing CoNeTTE is available on arXiv: https://arxiv.org/pdf/2309.00454.pdf

```

```

## Additional information

The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
The encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth" and are available on Zenodo: https://zenodo.org/record/8020843.

CoNeTTE was created by [@Labbeti](https://hf.co/Labbeti).