---
license: ms-pl
tags:
- contrastive audio language pretraining
- audio
- music
- emotion
- sound events
- bioacoustics
- retrieval
- captioning
- zero-shot
- audio-text
- CLAP
---
###### [Overview](#CLAP) | [Setup](#Setup) | [CLAP weights](#CLAP-weights) | [Usage](#Usage) | [Examples](#Examples) | [Citation](#Citation)
# CLAP
CLAP (Contrastive Language-Audio Pretraining) is a model that learns acoustic concepts from natural language supervision and enables “Zero-Shot” inference. The model has been extensively evaluated on 26 downstream audio tasks, achieving state-of-the-art (SoTA) results in several of them, including classification, retrieval, and captioning.
<img width="832" alt="clap_diagrams" src="docs/clap2_diagram.png">
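To make the "contrastive" part of the name concrete, here is a minimal sketch of a generic symmetric contrastive objective over paired audio and text embeddings. It is not taken from the CLAP code: the function names, the toy 2-dimensional embeddings, and the temperature of `0.07` are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio, text, temperature: float = 0.07) -> float:
    """Average of audio-to-text and text-to-audio cross-entropy (hypothetical helper)."""
    audio = [normalize(a) for a in audio]
    text = [normalize(t) for t in text]
    logits = [[dot(a, t) / temperature for t in text] for a in audio]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy embeddings (made up): pair i of audio should match pair i of text.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_contrastive_loss(audio_emb, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(audio_emb, [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)
```

The loss is low when each audio embedding is closest to its own text embedding and high when the pairs are mismatched, which is the property that later makes comparing audio against arbitrary text labels meaningful.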
## Setup
First, install Python 3.8 or higher (3.11 recommended). Then install CLAP using either of the following:
```shell
# Install the PyPI package
pip install msclap
# Or install the latest (unstable) git source
pip install git+https://github.com/microsoft/CLAP.git
```
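As a quick sanity check after installation, the following sketch verifies the Python version requirement stated above and whether the `msclap` package can be found. The helper names `python_is_supported` and `package_is_installed` are hypothetical and exist only for this illustration; they use the standard library only.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the setup instructions

def python_is_supported(version=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON

def package_is_installed(name: str = "msclap") -> bool:
    """Return True if a top-level package with this name can be located."""
    return importlib.util.find_spec(name) is not None

print("Python version supported:", python_is_supported())
print("msclap installed:", package_is_installed())
```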
## CLAP weights
CLAP weights are available in three versions: _2022_, _2023_, and _clapcap_.
_clapcap_ is the audio captioning model that uses the 2023 encoders.
## Usage
The CLAP code is available at https://github.com/microsoft/CLAP.
- Zero-Shot Classification and Retrieval
```python
from msclap import CLAP
# Placeholder inputs; replace with your own labels and audio files
class_labels = ["dog barking", "rain", "siren"]  # List[str]
file_paths = ["<PATH TO AUDIO FILE>"]  # List[str]
# Load model (choose between versions '2022' or '2023')
clap_model = CLAP("<PATH TO WEIGHTS>", version='2023', use_cuda=False)
# Extract text embeddings
text_embeddings = clap_model.get_text_embeddings(class_labels)
# Extract audio embeddings
audio_embeddings = clap_model.get_audio_embeddings(file_paths)
# Compute similarity between audio and text embeddings
similarities = clap_model.compute_similarity(audio_embeddings, text_embeddings)
```
- Audio Captioning
```python
from msclap import CLAP
# Placeholder input; replace with your own audio files
file_paths = ["<PATH TO AUDIO FILE>"]  # List[str]
# Load model (choose version 'clapcap')
clap_model = CLAP("<PATH TO WEIGHTS>", version='clapcap', use_cuda=False)
# Generate audio captions
captions = clap_model.generate_caption(file_paths)
```
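For zero-shot classification, the similarity scores from the first example still need to be turned into a predicted label per audio file. The sketch below shows one common way to do that: apply a softmax to each row and take the highest-scoring label. It assumes the scores have been converted to nested Python lists (for a PyTorch tensor, `.tolist()` does this) and that each row corresponds to one audio file and each column to one label. The numbers and the helper names `softmax` and `predict_labels` are made up for illustration.

```python
import math
from typing import List, Sequence

def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(similarities: Sequence[Sequence[float]],
                   class_labels: Sequence[str]) -> List[str]:
    """For each audio clip (row), return the label with the highest probability."""
    predictions = []
    for row in similarities:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(class_labels[best])
    return predictions

# Hypothetical scores: 2 audio clips x 3 labels (not real model output).
class_labels = ["dog barking", "rain", "siren"]
similarities = [[2.1, 0.3, -1.0], [0.2, 0.1, 3.4]]
print(predict_labels(similarities, class_labels))  # ['dog barking', 'siren']
```

For retrieval, the same scores can instead be sorted along a row (text ranked for a given audio file) or along a column (audio files ranked for a given text query).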
## Citation
Kindly cite our work if you find it useful.
[CLAP: Learning Audio Concepts from Natural Language Supervision](https://ieeexplore.ieee.org/abstract/document/10095889)
```
@inproceedings{CLAP2022,
  title={{CLAP}: Learning Audio Concepts from Natural Language Supervision},
author={Elizalde, Benjamin and Deshmukh, Soham and Al Ismail, Mahmoud and Wang, Huaming},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}
```
[Natural Language Supervision for General-Purpose Audio Representations](https://arxiv.org/abs/2309.05767)
```
@misc{CLAP2023,
title={Natural Language Supervision for General-Purpose Audio Representations},
author={Benjamin Elizalde and Soham Deshmukh and Huaming Wang},
year={2023},
eprint={2309.05767},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2309.05767}
}
```
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.