---
language: en
license: mit
tags:
  - audio
  - captioning
  - text
  - audio-captioning
  - automated-audio-captioning
task_categories:
  - audio-captioning
---

# CoNeTTE (ConvNext-Transformer with Task Embedding) for Automated Audio Captioning

This model is currently in development, and not all of the required files are available yet.

This model generates a short textual description of any audio file.

## Installation

```bash
pip install conette
```

## Usage

```python
from conette import CoNeTTEConfig, CoNeTTEModel

config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

path = "/my/path/to/audio.wav"
outputs = model(path)
cands = outputs["cands"][0]
print(cands)
```
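The snippet above captions a single file. To caption a whole directory, you could gather the `.wav` paths first and loop over them. A minimal sketch — the directory path and the `list_wav_files` helper are illustrative, not part of the `conette` API:

```python
from pathlib import Path

def list_wav_files(root: str) -> list[str]:
    # Recursively collect .wav files under `root`, sorted for a stable order.
    return sorted(str(p) for p in Path(root).rglob("*.wav"))

# Hypothetical batch loop, assuming `model` was loaded as shown above:
# for path in list_wav_files("/my/audio/dir"):
#     outputs = model(path)
#     print(path, outputs["cands"][0])
```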

## Single model performance

| Dataset   | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
| --------- | ---------- | ------------- | --------- |
| AudioCaps | 44.14      | 43.98         | 60.81     |
| Clotho    | 30.97      | 30.87         | 51.72     |

## Citation

The preprint version of the paper describing CoNeTTE is available on arXiv: https://arxiv.org/pdf/2309.00454.pdf


## Additional information

The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT. The encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.

This model was created by @Labbeti.