jimbozhang committed
Commit 9cd3efa
1 Parent(s): 09a7319

Update README.md

Files changed (1)
  1. README.md +5 -8
README.md CHANGED
@@ -8,31 +8,25 @@ pipeline_tag: audio-classification
  ---

  # CED-Base Model
-
  CED are simple ViT-Transformer-based models for audio tagging. Notable differences from other available models include:
  1. Simplification for finetuning: Batchnormalization of Mel-Spectrograms. During finetuning one does not need to first compute mean/variance over the dataset, which is common for AST.
  1. Support for variable length inputs. Most other models use a static time-frequency position embedding, which hinders the model's generalization to segments shorter than 10s. Many previous transformers simply pad their input to 10s in order to avoid the performance impact, which in turn slows down training/inference drastically.
  1. Training/Inference speedup: 64-dimensional mel-filterbanks and 16x16 patches without overlap, leading to 248 patches from a 10s spectrogram. In comparison, AST uses 128 mel-filterbanks with 16x16 (10x10 overlap) convolution, leading to 1212 patches during training/inference. CED-Tiny runs on a common CPU as fast as a comparable MobileNetV3.
  1. Performance: CED with 10M parameters outperforms the majority of previous approaches (~80M).
 
- The abstract from the paper is the following:
-
- Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature, meaning that only the stored logits are used for the optimization of a student model only requiring 0.3\% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS.
-
  ### Model Sources
-
  - **Original Repository:** https://github.com/RicherMans/CED
  - **Repository:** https://github.com/jimbozhang/hf_transformers_custom_model_ced
  - **Paper:** [CED: Consistent ensemble distillation for audio tagging](https://arxiv.org/abs/2308.11957)
  - **Demo:** https://huggingface.co/spaces/mispeech/ced-base

- ## Uses
-
  ```bash
  git clone https://github.com/jimbozhang/hf_transformers_custom_model_ced.git
  pip install -r requirements.txt
  ```

  ```python
  >>> from ced_model.feature_extraction_ced import CedFeatureExtractor
  >>> from ced_model.modeling_ced import CedForAudioClassification
@@ -53,3 +47,6 @@ pip install -r requirements.txt
  >>> model.config.id2label[predicted_class_ids]
  'Finger snapping'
  ```
 
 
 
 
@@ -8,31 +8,25 @@ pipeline_tag: audio-classification
  ---

  # CED-Base Model
  CED are simple ViT-Transformer-based models for audio tagging. Notable differences from other available models include:
  1. Simplification for finetuning: Batchnormalization of Mel-Spectrograms. During finetuning one does not need to first compute mean/variance over the dataset, which is common for AST.
  1. Support for variable length inputs. Most other models use a static time-frequency position embedding, which hinders the model's generalization to segments shorter than 10s. Many previous transformers simply pad their input to 10s in order to avoid the performance impact, which in turn slows down training/inference drastically.
  1. Training/Inference speedup: 64-dimensional mel-filterbanks and 16x16 patches without overlap, leading to 248 patches from a 10s spectrogram. In comparison, AST uses 128 mel-filterbanks with 16x16 (10x10 overlap) convolution, leading to 1212 patches during training/inference. CED-Tiny runs on a common CPU as fast as a comparable MobileNetV3.
  1. Performance: CED with 10M parameters outperforms the majority of previous approaches (~80M).
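
The patch counts quoted in the third item can be checked with a little arithmetic. The sketch below makes several assumptions that the README does not state: a patch embedding without padding, roughly 1000 frames for a 10s CED input (a 10 ms hop), 1024 frames for AST, and "10x10" read as a stride of 10 in both dimensions. `num_patches` is a hypothetical helper, not part of either codebase.

```python
def num_patches(n_mels, n_frames, patch=16, stride=16):
    """Count the patches a strided 2D patch embedding yields (no padding)."""
    freq = (n_mels - patch) // stride + 1    # patches along the mel axis
    time = (n_frames - patch) // stride + 1  # patches along the time axis
    return freq * time


# CED (assumed ~1000 frames): 64 mel bins, non-overlapping 16x16 patches.
ced_patches = num_patches(n_mels=64, n_frames=1000, patch=16, stride=16)

# AST (assumed 1024 frames): 128 mel bins, 16x16 patches with stride 10.
ast_patches = num_patches(n_mels=128, n_frames=1024, patch=16, stride=10)

print(ced_patches, ast_patches)  # 248 1212
```

Under these assumptions the sequence fed to the transformer is roughly five times shorter for CED than for AST, which is where the speedup claimed in that item would come from.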
 
 
 
 
 
  ### Model Sources
  - **Original Repository:** https://github.com/RicherMans/CED
  - **Repository:** https://github.com/jimbozhang/hf_transformers_custom_model_ced
  - **Paper:** [CED: Consistent ensemble distillation for audio tagging](https://arxiv.org/abs/2308.11957)
  - **Demo:** https://huggingface.co/spaces/mispeech/ced-base
 
+ ## Install
  ```bash
  git clone https://github.com/jimbozhang/hf_transformers_custom_model_ced.git
  pip install -r requirements.txt
  ```
 
+ ## Inference
  ```python
  >>> from ced_model.feature_extraction_ced import CedFeatureExtractor
  >>> from ced_model.modeling_ced import CedForAudioClassification
@@ -53,3 +47,6 @@ pip install -r requirements.txt
  >>> model.config.id2label[predicted_class_ids]
  'Finger snapping'
  ```
+
+ ## Fine-tuning
+ [`example_finetune_esc50.ipynb`](https://github.com/jimbozhang/hf_transformers_custom_model_ced/blob/main/example_finetune_esc50.ipynb) demonstrates how to train a linear head on the ESC-50 dataset with the CED encoder frozen.
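
As a rough illustration of what training a linear head with the encoder frozen involves, here is a minimal PyTorch sketch. It is not the notebook's code: `DummyEncoder` is a hypothetical stand-in for the CED encoder, and random tensors stand in for ESC-50 mel-spectrograms and labels (ESC-50 has 50 classes).

```python
import torch
from torch import nn


class DummyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained audio encoder (not the CED API)."""

    def __init__(self, n_mels=64, embed_dim=32):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, mel):  # mel: (batch, frames, n_mels)
        return self.proj(mel).mean(dim=1)  # mean-pool over time -> (batch, embed_dim)


torch.manual_seed(0)
encoder = DummyEncoder()
for p in encoder.parameters():
    p.requires_grad = False  # freeze: no gradients reach the encoder
encoder.eval()

head = nn.Linear(32, 50)  # linear head, one output per ESC-50 class
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head is optimized
loss_fn = nn.CrossEntropyLoss()

# Random data in place of a real batch (assumption for illustration only).
mel = torch.randn(8, 100, 64)
labels = torch.randint(0, 50, (8,))

frozen_before = [p.clone() for p in encoder.parameters()]
for step in range(5):
    with torch.no_grad():  # encoder runs in inference mode
        features = encoder(mel)
    logits = head(features)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The encoder weights are identical before and after training.
encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(frozen_before, encoder.parameters())
)
```

Because only `head.parameters()` is handed to the optimizer and the encoder runs under `torch.no_grad()`, each step updates the 50-way classifier alone; with a real encoder the features could also be computed once and cached.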