---
language:
- en
license: mit
tags:
- embeddings
- Speaker
- Verification
- Identification
- NAS
- TDNN
- pytorch
datasets:
- voxceleb1
- voxceleb2
metrics:
- EER
- minDCF:
  - p_target: 0.01
---


# EfficientTDNN

This repository provides all the necessary tools to perform speaker verification with a NAS-based alternative named EfficientTDNN.
The system can be used to extract speaker embeddings at different model sizes.
It is trained on the VoxCeleb2 training data using data augmentation.
Model performance on the VoxCeleb1 test set (cleaned) / Vox1-O is reported as follows.

| Supernet Stage | Subnet | MACs (3s) | Params | EER(%) | minDCF |
|:-------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| depth | Base | 1.45G | 5.79M | 0.94 | 0.089 |
| width 1 | Mobile | 570.98M | 2.42M | 1.41 | 0.124 |
| width 2 | Small | 204.07M | 899.20K | 2.20 | 0.219 |

The details of the three subnets, each given as (depth, channel widths, kernel sizes, width of the final layer), are:

- Base: (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
- Mobile: (3, [384, 256, 256, 256], [5, 3, 3, 3], 768)
- Small: (2, [256, 256, 256], [3, 3, 3], 400)

## Compute your speaker embeddings

```python
import torch
from sugar.models import WrappedModel

# A batch of one 16 kHz waveform; replace with a real utterance of shape (1, num_samples).
wav_input_16khz = torch.randn(1, 10000).cuda()

# Supernet weights plus the batch-norm file of the chosen subnet (here: the Base subnet).
repo_id = "mechanicalsea/efficient-tdnn"
supernet_filename = "depth/depth.torchparams"
subnet_filename = "depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar"
subnet, info = WrappedModel.from_pretrained(repo_id=repo_id, supernet_filename=supernet_filename, subnet_filename=subnet_filename)
subnet = subnet.cuda()
subnet = subnet.eval()

# Extract the speaker embedding.
embedding = subnet(wav_input_16khz)
```
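
For verification, embeddings from two utterances can be compared with cosine similarity. The following is a minimal sketch that reuses the `subnet` loaded above; the decision threshold is a placeholder, not a value from the paper, and should be tuned on a development set.

```python
import torch
import torch.nn.functional as F

# Two utterances; replace the random tensors with real 16 kHz waveforms.
wav_a = torch.randn(1, 32000).cuda()
wav_b = torch.randn(1, 32000).cuda()

with torch.no_grad():
    emb_a = subnet(wav_a)
    emb_b = subnet(wav_b)

# Cosine similarity between the two embeddings; higher scores indicate the same speaker.
score = F.cosine_similarity(emb_a.reshape(1, -1), emb_b.reshape(1, -1)).item()
same_speaker = score > 0.5  # placeholder threshold, tune on held-out trials
```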

## Inference on GPU

To perform inference on the GPU, move the model with `subnet = subnet.to(device)` after calling `from_pretrained`, and place the input waveform on the same device.
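
For example, a device-agnostic version of the loading code above might look like the following sketch (same repository and filenames as in the previous section; `device` falls back to the CPU when no GPU is available):

```python
import torch
from sugar.models import WrappedModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

subnet, info = WrappedModel.from_pretrained(
    repo_id="mechanicalsea/efficient-tdnn",
    supernet_filename="depth/depth.torchparams",
    subnet_filename="depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar",
)
subnet = subnet.to(device).eval()

wav_input_16khz = torch.randn(1, 10000).to(device)
with torch.no_grad():
    embedding = subnet(wav_input_16khz)
```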

## Model Description

Models are listed as follows.

- **Dynamic Kernel**: The model enables variable kernel sizes in {1, 3, 5}, `kernel/kernel.torchparams`.
- **Dynamic Depth**: The model additionally enables variable depth in {2, 3, 4} on top of the **Dynamic Kernel** version, `depth/depth.torchparams`.
- **Dynamic Width 1**: The model additionally enables variable width in [0.5, 1.0] on top of the **Dynamic Depth** version, `width1/width1.torchparams`.
- **Dynamic Width 2**: The model additionally enables variable width in [0.25, 0.5] on top of the **Dynamic Width 1** version, `width2/width2.torchparams`.

Furthermore, several subnets are provided as batch-norm weights corresponding to their trained supernets, as listed below; an example of loading one of them follows the list.

- **Dynamic Kernel**
  1. `kernel/kernel.max.bn.tar`
  2. `kernel/kernel.Kmin.bn.tar`
- **Dynamic Depth**
  1. `depth/depth.max.bn.tar`
  2. `depth/depth.Kmin.bn.tar`
  3. `depth/depth.Dmin.bn.tar`
  4. `depth/depth.3.512.5.5.3.3.1536.bn.tar`
  5. `depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar`
- **Dynamic Width 1**
  1. `width1/width1.torchparams`
  2. `width1/width1.max.bn.tar`
  3. `width1/width1.Kmin.bn.tar`
  4. `width1/width1.Dmin.bn.tar`
  5. `width1/width1.C1min.bn.tar`
  6. `width1/width1.3.383.256.256.256.5.3.3.3.768.bn.tar`
- **Dynamic Width 2**
  1. `width2/width2.max.bn.tar`
  2. `width2/width2.Kmin.bn.tar`
  3. `width2/width2.Dmin.bn.tar`
  4. `width2/width2.C1min.bn.tar`
  5. `width2/width2.C2min.bn.tar`
  6. `width2/width2.3.384.3.1152.bn.tar`
  7. `width2/width2.3.256.256.384.384.1.3.5.3.1152.bn.tar`
  8. `width2/width2.2.256.256.256.3.3.3.400.bn.tar`
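
As an example, the Small subnet from the results table can be loaded by pairing the width 2 supernet weights with its batch-norm file. This is a sketch that follows the file naming shown above:

```python
from sugar.models import WrappedModel

repo_id = "mechanicalsea/efficient-tdnn"
# Small subnet (2, [256, 256, 256], [3, 3, 3], 400) from the width 2 supernet.
supernet_filename = "width2/width2.torchparams"
subnet_filename = "width2/width2.2.256.256.256.3.3.3.400.bn.tar"

subnet, info = WrappedModel.from_pretrained(repo_id=repo_id, supernet_filename=supernet_filename, subnet_filename=subnet_filename)
subnet = subnet.eval()
```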

The tags denote the following architectures, in the same (depth, channel widths, kernel sizes, final-layer width) format:

- max: (4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536)
- Kmin: (4, [512, 512, 512, 512, 512], [1, 1, 1, 1, 1], 1536)
- Dmin: (2, [512, 512, 512], [1, 1, 1], 1536)
- C1min: (2, [256, 256, 256], [1, 1, 1], 768)
- C2min: (2, [128, 128, 128], [1, 1, 1], 384)

More details about EfficientTDNN can be found in the paper [EfficientTDNN](https://arxiv.org/abs/2103.13581).

## Citing EfficientTDNN

Please cite EfficientTDNN if you use it for your research or business.

```bibtex
@article{wr-efficienttdnn-2022,
  author={Wang, Rui and Wei, Zhihua and Duan, Haoran and Ji, Shouling and Long, Yang and Hong, Zhen},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={EfficientTDNN: Efficient Architecture Search for Speaker Recognition}, 
  year={2022},
  volume={30},
  number={},
  pages={2267-2279},
  doi={10.1109/TASLP.2022.3182856}}
```