---
license: apache-2.0
---

# Model

## TL;DR

CLAP is to audio what CLIP is to image. This is an improved CLAP checkpoint, specifically trained on music and speech.

## Description

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a Swin Transformer to extract audio features from a log-Mel spectrogram input, and a RoBERTa model to extract text features. Both the text and audio features are then projected into a latent space of identical dimension. The dot product between the projected audio and text features is used as a similarity score.
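
To make the similarity computation concrete, here is a minimal sketch (the candidate text prompts are made up for illustration) that projects an audio clip and a few text prompts into the shared latent space and compares them with a dot product over L2-normalized embeddings:

```python
import torch
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# load an example audio clip
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]

model = ClapModel.from_pretrained("ylacombe/larger_clap_music_and_speech")
processor = ClapProcessor.from_pretrained("ylacombe/larger_clap_music_and_speech")

# hypothetical text prompts, chosen only to illustrate the comparison
texts = ["a person speaking", "a piano melody"]

# project audio and text into the shared latent space
audio_inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt")
text_inputs = processor(text=texts, return_tensors="pt", padding=True)

with torch.no_grad():
    audio_embed = model.get_audio_features(**audio_inputs)  # (1, projection_dim)
    text_embed = model.get_text_features(**text_inputs)     # (num_texts, projection_dim)

# dot product of the normalized embeddings gives the similarity scores
similarity = torch.nn.functional.normalize(audio_embed, dim=-1) @ torch.nn.functional.normalize(text_embed, dim=-1).T
print(similarity)
```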

# Usage

You can use this model for zero-shot audio classification or for extracting audio and/or text features.

## Perform zero-shot audio classification

### Using `pipeline`

```python
from datasets import load_dataset
from transformers import pipeline

# load an example audio clip from the ESC-50 dataset
dataset = load_dataset("ashraq/esc50")
audio = dataset["train"]["audio"][-1]["array"]

# zero-shot classification: the model scores each candidate label against the audio
audio_classifier = pipeline(task="zero-shot-audio-classification", model="ylacombe/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Sound of a dog", "Sound of vacuum cleaner"])
print(output)
# >>> [{"score": 0.999, "label": "Sound of a dog"}, {"score": 0.001, "label": "Sound of vacuum cleaner"}]
```

## Run the model

You can also get the audio and text embeddings using `ClapModel` (a text-embedding sketch follows the GPU example below).

### Run the model on CPU

```python
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# load an example audio clip
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]

model = ClapModel.from_pretrained("ylacombe/larger_clap_music_and_speech")
processor = ClapProcessor.from_pretrained("ylacombe/larger_clap_music_and_speech")

# preprocess the waveform and compute the projected audio embedding
inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt")
audio_embed = model.get_audio_features(**inputs)
```

### Run the model on GPU

```python
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# load an example audio clip
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]

# move the model and the inputs to the first CUDA device
model = ClapModel.from_pretrained("ylacombe/larger_clap_music_and_speech").to(0)
processor = ClapProcessor.from_pretrained("ylacombe/larger_clap_music_and_speech")

inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt").to(0)
audio_embed = model.get_audio_features(**inputs)
```
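
Text embeddings can be obtained the same way with `get_text_features`; here is a minimal sketch (the prompt is made up for illustration):

```python
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("ylacombe/larger_clap_music_and_speech")
processor = ClapProcessor.from_pretrained("ylacombe/larger_clap_music_and_speech")

# tokenize a free-form text prompt and project it into the shared latent space
inputs = processor(text=["a solo piano recording"], return_tensors="pt", padding=True)
text_embed = model.get_text_features(**inputs)
```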

# Citation

If you are using this model for your work, please consider citing the original paper:
```
@misc{https://doi.org/10.48550/arxiv.2211.06687,
  doi = {10.48550/ARXIV.2211.06687},
  url = {https://arxiv.org/abs/2211.06687},
  author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  keywords = {Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```