---
license: ms-pl
---

###### [Overview](#CLAP) | [Setup](#Setup) | [CLAP weights](#CLAP-weights) | [Usage](#Usage) | [Examples](#Examples) | [Citation](#Citation)

# CLAP

CLAP (Contrastive Language-Audio Pretraining) is a model that learns acoustic concepts from natural language supervision and enables "Zero-Shot" inference. The model has been extensively evaluated on 26 downstream audio tasks, achieving state-of-the-art results in several of them, including classification, retrieval, and captioning.
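
At a high level, contrastive pretraining aligns audio and text in a shared embedding space, where similarity between a clip and a caption is the cosine of the angle between their embeddings. A minimal sketch of that similarity computation, with random vectors standing in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: 2 audio clips and 3 text prompts, 8-dim embeddings
audio_emb = rng.normal(size=(2, 8))
text_emb = rng.normal(size=(3, 8))

# L2-normalize so the dot product becomes cosine similarity
audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Pairwise cosine similarity: one row per audio clip, one column per text prompt
similarity = audio_emb @ text_emb.T
print(similarity.shape)  # (2, 3)
```

During training, the matched (audio, text) pairs on the diagonal of this matrix are pulled together while mismatched pairs are pushed apart.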

<img width="832" alt="clap_diagrams" src="docs/clap2_diagram.png">

## Setup

First, install Python 3.8 or higher (3.11 recommended). Then, install CLAP using either of the following:

```shell
# Install the PyPI package
pip install msclap

# Or install the latest (unstable) source from GitHub
pip install git+https://github.com/microsoft/CLAP.git
```

## CLAP weights

CLAP weights come in three versions: _2022_, _2023_, and _clapcap_.

_clapcap_ is the audio captioning model that uses the 2023 encoders.

## Usage

The CLAP code is available at https://github.com/microsoft/CLAP.

- Zero-Shot Classification and Retrieval

```python
from msclap import CLAP

# Load model (choose between versions '2022' or '2023')
clap_model = CLAP("<PATH TO WEIGHTS>", version='2023', use_cuda=False)

# Extract text embeddings from a list of class label strings
text_embeddings = clap_model.get_text_embeddings(class_labels)

# Extract audio embeddings from a list of audio file paths
audio_embeddings = clap_model.get_audio_embeddings(file_paths)

# Compute similarity between audio and text embeddings
similarities = clap_model.compute_similarity(audio_embeddings, text_embeddings)
```
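
The similarity matrix from `compute_similarity` can be turned into zero-shot class predictions by taking a softmax over the class axis and picking the highest-probability label per clip. A minimal sketch, with a hypothetical hand-written similarity matrix standing in for real model output:

```python
import numpy as np

# Stand-in for clap_model.compute_similarity(audio_embeddings, text_embeddings):
# one row per audio clip, one column per class label (illustrative values)
class_labels = ["dog barking", "rain", "siren"]
similarity = np.array([
    [0.9, 0.1, 0.2],   # audio clip 0
    [0.2, 0.8, 0.3],   # audio clip 1
])

# Softmax over the class axis turns similarities into per-class probabilities
probs = np.exp(similarity) / np.exp(similarity).sum(axis=1, keepdims=True)

# The highest-probability label is the zero-shot prediction for each clip
predictions = [class_labels[i] for i in probs.argmax(axis=1)]
print(predictions)  # ['dog barking', 'rain']
```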

- Audio Captioning

```python
from msclap import CLAP

# Load model (choose version 'clapcap')
clap_model = CLAP("<PATH TO WEIGHTS>", version='clapcap', use_cuda=False)

# Generate captions for a list of audio file paths
captions = clap_model.generate_caption(file_paths)
```
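
`generate_caption` returns one caption string per input file, in the same order as `file_paths`, so the results can be paired back up with a simple `zip`. The file names and captions below are illustrative, not real model output:

```python
# Hypothetical inputs and the captions the model might return for them;
# real captions come from clap_model.generate_caption(file_paths)
file_paths = ["audio/dog.wav", "audio/rain.wav"]
captions = ["a dog barks repeatedly", "rain falls on a surface"]

# One caption per input file, in order
for path, caption in zip(file_paths, captions):
    print(f"{path}: {caption}")
```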

## Citation

Kindly cite our work if you find it useful.

[CLAP: Learning Audio Concepts from Natural Language Supervision](https://ieeexplore.ieee.org/abstract/document/10095889)
|
68 |
+
```
|
69 |
+
@inproceedings{CLAP2022,
|
70 |
+
title={Clap learning audio concepts from natural language supervision},
|
71 |
+
author={Elizalde, Benjamin and Deshmukh, Soham and Al Ismail, Mahmoud and Wang, Huaming},
|
72 |
+
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
|
73 |
+
pages={1--5},
|
74 |
+
year={2023},
|
75 |
+
organization={IEEE}
|
76 |
+
}
|
77 |
+
```

[Natural Language Supervision for General-Purpose Audio Representations](https://arxiv.org/abs/2309.05767)
```bibtex
@misc{CLAP2023,
  title={Natural Language Supervision for General-Purpose Audio Representations},
  author={Benjamin Elizalde and Soham Deshmukh and Huaming Wang},
  year={2023},
  eprint={2309.05767},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2309.05767}
}
```

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.