
Model Card for wav2vec2bert_2_diffusion

This is a Wav2Vec2-BERT model that can be used as an audio conditioning mechanism for Stable Diffusion instead of the CLIP text encoder.
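The card does not publish the adapter's architecture, but the conditioning swap can be sketched as follows. Stable Diffusion v1.x expects CLIP-style hidden states of shape (77 tokens, 768 dims); the projection and nearest-neighbour resampling below are illustrative assumptions, not the model's actual implementation:

```python
import numpy as np

CLIP_SEQ_LEN = 77   # number of conditioning tokens Stable Diffusion's UNet expects
CLIP_DIM = 768      # hidden size of the CLIP text encoder in SD v1.x
AUDIO_DIM = 1024    # hidden size of Wav2Vec2-BERT frame outputs

def adapt_audio_embeddings(audio_hidden, proj):
    """Project variable-length audio hidden states to a fixed,
    CLIP-shaped conditioning tensor of shape (CLIP_SEQ_LEN, CLIP_DIM).

    audio_hidden: (T, AUDIO_DIM) frame-level features from the audio encoder
    proj:         (AUDIO_DIM, CLIP_DIM) learned projection matrix (hypothetical)
    """
    projected = audio_hidden @ proj                                   # (T, CLIP_DIM)
    # Resample the time axis onto the 77 token slots (nearest neighbour)
    idx = np.linspace(0, len(projected) - 1, CLIP_SEQ_LEN).round().astype(int)
    return projected[idx]                                             # (77, CLIP_DIM)

rng = np.random.default_rng(0)
audio_hidden = rng.standard_normal((212, AUDIO_DIM))   # ~4 s of audio frames
proj = rng.standard_normal((AUDIO_DIM, CLIP_DIM)) * 0.02
cond = adapt_audio_embeddings(audio_hidden, proj)
print(cond.shape)  # (77, 768)
```

A tensor of this shape can be passed to the UNet in place of the CLIP text encoder's output, which is what makes the audio encoder a drop-in conditioning mechanism.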

Model Details

Model Description

  • Developed by: Youssef Kardous
  • Model type: Enhanced Wav2Vec2-BERT with a custom loss function and adapter
  • License: apache-2.0
  • Fine-tuned from model: facebook/w2v-bert-2.0


Uses

The audio2img project aims to enhance generative modeling by integrating audio embeddings into the conditioning process of models like Stable Diffusion. This integration allows for the exploration of new creative possibilities by leveraging the rich semantic information contained in audio data.

Potential Users:

  • Researchers and Developers

  • Artists and Creatives

  • Content Creators

Training Details

Training Data

The model was trained on FSD50K, an open dataset of human-labeled sound events: https://huggingface.co/datasets/nateraw/fsd50k

Training Procedure

The core idea behind our training process is to achieve cross-modal alignment between audio and text embeddings using a two-stream architecture. We use the CLIPTextModel to generate text embeddings that serve as training targets for the audio embeddings produced by our Wav2Vec2-BERT model.
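The card refers to a custom loss but does not publish its exact form. A common choice for this kind of cross-modal alignment is a cosine-similarity objective between paired audio and text embeddings; the pooled-embedding shapes and the loss form below are assumptions, shown only as a sketch:

```python
import numpy as np

def cosine_alignment_loss(audio_emb, text_emb):
    """Mean (1 - cosine similarity) over paired embeddings.

    audio_emb: (batch, dim) pooled outputs of the audio encoder
    text_emb:  (batch, dim) pooled CLIP text embeddings used as targets
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cos = (a * t).sum(axis=1)          # per-pair cosine similarity
    return float((1.0 - cos).mean())   # 0 when perfectly aligned

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((4, 768))
perfect = cosine_alignment_loss(text_emb.copy(), text_emb)          # identical pairs
random = cosine_alignment_loss(rng.standard_normal((4, 768)), text_emb)
print(perfect, random)  # near 0.0 for aligned pairs, near 1.0 for random ones
```

Minimizing such a loss pulls each audio embedding toward the CLIP text embedding of its caption, which is what lets the audio stream later stand in for the text encoder at inference time.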
