
Model Card for wav2vec2bert_2_diffusion

This is a Wav2Vec2-BERT model that can be used as an audio conditioning mechanism for Stable Diffusion instead of the CLIP text encoder.
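The card does not publish the adapter's architecture, but the conditioning swap can be sketched as follows. Stable Diffusion v1.x expects CLIP-style hidden states of shape (77 tokens, 768 dims); the projection and nearest-neighbour resampling below are illustrative assumptions, not the model's actual implementation:

```python
import numpy as np

CLIP_SEQ_LEN = 77   # number of conditioning tokens Stable Diffusion's UNet expects
CLIP_DIM = 768      # hidden size of the CLIP text encoder in SD v1.x
AUDIO_DIM = 1024    # hidden size of Wav2Vec2-BERT frame outputs

def adapt_audio_embeddings(audio_hidden, proj):
    """Project variable-length audio hidden states to a fixed,
    CLIP-shaped conditioning tensor of shape (CLIP_SEQ_LEN, CLIP_DIM).

    audio_hidden: (T, AUDIO_DIM) frame-level features from the audio encoder
    proj:         (AUDIO_DIM, CLIP_DIM) learned projection matrix (hypothetical)
    """
    projected = audio_hidden @ proj                                   # (T, CLIP_DIM)
    # Resample the time axis onto the 77 token slots (nearest neighbour)
    idx = np.linspace(0, len(projected) - 1, CLIP_SEQ_LEN).round().astype(int)
    return projected[idx]                                             # (77, CLIP_DIM)

rng = np.random.default_rng(0)
audio_hidden = rng.standard_normal((212, AUDIO_DIM))   # ~4 s of audio frames
proj = rng.standard_normal((AUDIO_DIM, CLIP_DIM)) * 0.02
cond = adapt_audio_embeddings(audio_hidden, proj)
print(cond.shape)  # (77, 768)
```

A tensor of this shape can be passed to the UNet in place of the CLIP text encoder's output, which is what makes the audio encoder a drop-in conditioning mechanism.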

Model Details

Model Description

  • Developed by: Youssef Kardous
  • Model type: Enhanced Wav2Vec2-BERT with a custom loss function and adapter
  • License: apache-2.0
  • Fine-tuned from model: facebook/w2v-bert-2.0


Uses

The audio2img project aims to enhance generative modeling by integrating audio embeddings into the conditioning process of models like Stable Diffusion. This integration allows for the exploration of new creative possibilities by leveraging the rich semantic information contained in audio data.

Potential Users:

  • Researchers and Developers

  • Artists and Creatives

  • Content Creators

Training Details

Training Data

The model was trained on FSD50K, an open dataset of human-labeled sound events: https://huggingface.co/datasets/nateraw/fsd50k

Training Procedure

The core idea behind our training process is to achieve cross-modal alignment between audio and text embeddings using a two-stream architecture. We use the CLIPTextModel to generate text embeddings that serve as training targets for the audio embeddings produced by our Wav2Vec2-BERT model.
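The card refers to a custom loss but does not publish its exact form. A common choice for this kind of cross-modal alignment is a cosine-similarity objective between paired audio and text embeddings; the pooled-embedding shapes and the loss form below are assumptions, shown only as a sketch:

```python
import numpy as np

def cosine_alignment_loss(audio_emb, text_emb):
    """Mean (1 - cosine similarity) over paired embeddings.

    audio_emb: (batch, dim) pooled outputs of the audio encoder
    text_emb:  (batch, dim) pooled CLIP text embeddings used as targets
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cos = (a * t).sum(axis=1)          # per-pair cosine similarity
    return float((1.0 - cos).mean())   # 0 when perfectly aligned

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((4, 768))
perfect = cosine_alignment_loss(text_emb.copy(), text_emb)          # identical pairs
random = cosine_alignment_loss(rng.standard_normal((4, 768)), text_emb)
print(perfect, random)  # near 0.0 for aligned pairs, near 1.0 for random ones
```

Minimizing such a loss pulls each audio embedding toward the CLIP text embedding of its caption, which is what lets the audio stream later stand in for the text encoder at inference time.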
