gcoderw committed
Commit 7a716db · verified · 1 Parent(s): a88faf1

Publish AS-20M with MAEB audio-only comparison

Files changed (5):
  1. AS-20M.safetensors +3 -0
  2. README.md +145 -0
  3. config.json +35 -0
  4. manifest.json +79 -0
  5. preprocessor_config.json +12 -0
AS-20M.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:77152ea3cd4a9f841cb88230b829cce3fe68afa3d1eef7b41e01f32537859ca3
size 79578032
README.md ADDED
@@ -0,0 +1,145 @@
---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- embedding
- retrieval
- feature-extraction
- efficientat
- matryoshka
- memory-augmentation
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AS-20M

`AS-20M` is a standalone audio + speech embedding encoder for human-memory augmentation workloads. It uses a native `mn20_as` EfficientAT backbone with the speech/audio LoRA training merged into the released weights, so inference does not require loading a separate adapter.

Canonical name:

- `AS` = audio + speech
- `20M` = 19,837,720 loaded parameters, rounded to integer millions

## Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the EfficientAT mel frontend used during training:

- sample rate: `32000`
- FFT: `1024`
- window length: `800`
- hop size: `320`
- mel bins: `128`

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles, truncate and renormalize:

```text
z1280 = l2norm(model(audio))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
```
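A minimal NumPy sketch of the rule above; the random vector stands in for `model(audio)`, which is not loaded here:

```python
import numpy as np

def l2norm(v: np.ndarray) -> np.ndarray:
    """L2-normalize a vector (tiny epsilon avoids division by zero)."""
    return v / (np.linalg.norm(v) + 1e-12)

# Stand-in for model(audio): any 1280-d output vector behaves the same way.
rng = np.random.default_rng(0)
z1280 = l2norm(rng.standard_normal(1280))

# Matryoshka profiles: keep the prefix, then renormalize.
profiles = {d: l2norm(z1280[:d]) for d in (1280, 768, 512, 256, 128)}

for d, z in profiles.items():
    assert z.shape == (d,)
    assert abs(np.linalg.norm(z) - 1.0) < 1e-6
```

Truncation must come before renormalization: the prefix of a unit vector is generally not unit-length itself.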

## Artifacts

- `AS-20M.safetensors`: standalone native EfficientAT embedding model
- `config.json`: release and architecture metadata
- `preprocessor_config.json`: waveform and mel frontend contract
- `manifest.json`: file hashes and source checkpoint lineage
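Because `manifest.json` records a SHA-256 digest per artifact, a download can be verified locally with only the standard library. A sketch, where `verify_artifact` is an illustrative helper rather than part of this release:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts never sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(repo_dir: Path) -> bool:
    """Compare AS-20M.safetensors against the digest stored in manifest.json."""
    manifest = json.loads((repo_dir / "manifest.json").read_text())
    entry = manifest["artifacts"]["safetensors"]
    return sha256_of(repo_dir / entry["path"]) == entry["sha256"]
```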

## Training Summary

This checkpoint was continued from the balanced native `mn20_as` student and trained on an audio-heavy mix of synthetic speech/audio alignment data. The published artifact contains merged weights, not a runtime LoRA adapter.

Source checkpoint:

```text
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

Merged LoRA source:

```text
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

## Local Gate Metrics

The checkpoint-local heldout gate reported:

| Metric | Score |
|---|---:|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
| audio-to-text R@1, 1280d | 0.3216 |
| text-to-audio R@1, 1280d | 0.3028 |
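The R@1 rows are retrieval recall@1. Assuming the usual definition (the paired item must rank first under cosine similarity), a toy NumPy sketch looks like this; the arrays below are synthetic, not the gate data:

```python
import numpy as np

def recall_at_1(queries: np.ndarray, targets: np.ndarray) -> float:
    """Fraction of queries whose paired target (same row index) ranks first
    by cosine similarity over all targets."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = q @ t.T                      # (N, N) cosine similarity matrix
    top1 = sims.argmax(axis=1)          # best-matching target per query
    return float((top1 == np.arange(len(q))).mean())

# Toy aligned pairs: targets are noisy copies of queries, so recall is high.
rng = np.random.default_rng(1)
audio = rng.standard_normal((50, 1280))
text = audio + 0.1 * rng.standard_normal((50, 1280))
r1 = recall_at_1(audio, text)
```

The same similarity matrix read row-wise vs. column-wise gives the audio-to-text vs. text-to-audio directions.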
90
+
91
+ ## MAEB Audio-Only Comparison
92
+
93
+ This comparison uses the same 20 MAEB audio-only tasks for all three
94
+ standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
95
+ because base `mn20_as` and Whisper-Tiny do not include a compatible text
96
+ encoder; no text adapters were invented for those baselines.
97
+
98
+ Validation: each run completed 20/20 tasks with `exception_count=0`.
99
+
100
+ | Model | Params | Native output | Mean primary |
101
+ |---|---:|---:|---:|
102
+ | base `mn20_as` | 17.9M | 1920d audio feature | 0.3977 |
103
+ | Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
104
+ | `AS-20M` | 19.8M | 1280d embedding | 0.4083 |
105
+
106
+ | Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
107
+ |---|---:|---:|---:|
108
+ | BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
109
+ | BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
110
+ | CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
111
+ | CREMA_D | 0.2804 | 0.2995 | 0.3351 |
112
+ | CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
113
+ | CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
114
+ | FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
115
+ | GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
116
+ | GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
117
+ | IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
118
+ | JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
119
+ | MInDS14 | 0.0818 | 0.1057 | 0.0967 |
120
+ | MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
121
+ | NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
122
+ | SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
123
+ | VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
124
+ | VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
125
+ | VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
126
+ | VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
127
+ | VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |
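The mean-primary column is an unweighted mean over the 20 task scores; recomputing it for `AS-20M` from the table reproduces the summary value:

```python
# Per-task primary scores for AS-20M, copied from the table above.
as20m_scores = [
    0.8349, 0.1730, 0.5475, 0.3351, 0.0943, 0.1799, 0.6230, 0.7747, 0.7300,
    0.7712, 0.8490, 0.0967, 0.3450, 0.5875, 0.1456, 0.0162, 0.2601, 0.5235,
    0.0014, 0.2780,
]
mean_primary = sum(as20m_scores) / len(as20m_scores)
assert round(mean_primary, 4) == 0.4083  # matches the summary table
```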

Interpretation: `AS-20M` is slightly ahead on the 20-task audio-only mean, while base `mn20_as` remains stronger on several music/general-audio tasks. Whisper-Tiny is competitive on some speech/language-adjacent tasks, but it is not a general audio embedding model and is weaker on broad environmental-audio coverage in this comparison.

Artifacts:

- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`

## Limitations

`AS-20M` is an embedding model only. It does not transcribe speech, classify audio events directly, or include a text encoder in this standalone release artifact. Text-audio retrieval evaluations use a separate compatible text encoder/head to score cross-modal alignment.
config.json ADDED
@@ -0,0 +1,35 @@
{
  "architectures": [
    "EfficientATMn20ASNativeEmbedding"
  ],
  "embedding_dim": 1280,
  "matryoshka_dims": [
    1280,
    768,
    512,
    256,
    128
  ],
  "mel": {
    "freqm": 0,
    "hopsize": 320,
    "n_fft": 1024,
    "n_mels": 128,
    "sample_rate": 32000,
    "timem": 0,
    "win_length": 800
  },
  "merged_lora_source": "triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "modalities": [
    "audio",
    "speech"
  ],
  "model_id": "AS-20M",
  "model_type": "native_efficientat_audio_embedding",
  "normalize_embeddings": true,
  "parameter_count": 19837720,
  "sample_rate": 32000,
  "source_checkpoint_sha256": "f43003f4d8dbc1eaa0095e1f3cab608ecca3309e77f579e5078c269c899ade52",
  "state_tensor_elements": 19886566,
  "student_model": "mn20_as"
}
manifest.json ADDED
@@ -0,0 +1,79 @@
{
  "artifacts": {
    "safetensors": {
      "bytes": 79578032,
      "parameter_count": 19837720,
      "path": "AS-20M.safetensors",
      "sha256": "77152ea3cd4a9f841cb88230b829cce3fe68afa3d1eef7b41e01f32537859ca3",
      "state_tensor_count": 312,
      "state_tensor_elements": 19886566
    }
  },
  "canonical_name_rule": "<modality>-<size>, modalities sorted alphabetically",
  "checkpoint_epoch": 4,
  "checkpoint_metrics": {
    "audio_cos": 0.8108276128768921,
    "embed_pearson": 0.7953315377235413,
    "sim_pearson": 0.88530433177948,
    "student_at_r10_128": 0.6377999782562256,
    "student_at_r10_1280": 0.6571999788284302,
    "student_at_r10_256": 0.6527999639511108,
    "student_at_r10_512": 0.6563999652862549,
    "student_at_r10_768": 0.6570000052452087,
    "student_at_r1_128": 0.29739999771118164,
    "student_at_r1_1280": 0.3215999901294708,
    "student_at_r1_256": 0.31439998745918274,
    "student_at_r1_512": 0.3203999996185303,
    "student_at_r1_768": 0.3215999901294708,
    "student_at_r5_128": 0.5307999849319458,
    "student_at_r5_1280": 0.5541999936103821,
    "student_at_r5_256": 0.550000011920929,
    "student_at_r5_512": 0.5527999997138977,
    "student_at_r5_768": 0.5533999800682068,
    "student_ta_r10_128": 0.649399995803833,
    "student_ta_r10_1280": 0.6651999950408936,
    "student_ta_r10_256": 0.6615999937057495,
    "student_ta_r10_512": 0.6625999808311462,
    "student_ta_r10_768": 0.663599967956543,
    "student_ta_r1_128": 0.2793999910354614,
    "student_ta_r1_1280": 0.3027999997138977,
    "student_ta_r1_256": 0.29919999837875366,
    "student_ta_r1_512": 0.30140000581741333,
    "student_ta_r1_768": 0.30140000581741333,
    "student_ta_r5_128": 0.5397999882698059,
    "student_ta_r5_1280": 0.551800012588501,
    "student_ta_r5_256": 0.5529999732971191,
    "student_ta_r5_512": 0.5523999929428101,
    "student_ta_r5_768": 0.5532000064849854,
    "teacher_at_r10_128": 0.7107999920845032,
    "teacher_at_r10_1280": 0.7335999608039856,
    "teacher_at_r10_256": 0.7277999520301819,
    "teacher_at_r10_512": 0.7299999594688416,
    "teacher_at_r10_768": 0.7333999872207642,
    "teacher_at_r1_128": 0.35819998383522034,
    "teacher_at_r1_1280": 0.3946000039577484,
    "teacher_at_r1_256": 0.3929999768733978,
    "teacher_at_r1_512": 0.3953999876976013,
    "teacher_at_r1_768": 0.3951999843120575,
    "teacher_at_r5_128": 0.6187999844551086,
    "teacher_at_r5_1280": 0.642799973487854,
    "teacher_at_r5_256": 0.6367999911308289,
    "teacher_at_r5_512": 0.640999972820282,
    "teacher_at_r5_768": 0.6407999992370605
  },
  "lora": {
    "alpha": 16.0,
    "dropout": 0.0,
    "rank": 0,
    "targets": []
  },
  "merged_lora_source": "triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "modalities": [
    "audio",
    "speech"
  ],
  "model_id": "AS-20M",
  "size_millions_rounded": 20,
  "source_checkpoint": "triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "source_checkpoint_sha256": "f43003f4d8dbc1eaa0095e1f3cab608ecca3309e77f579e5078c269c899ade52"
}
preprocessor_config.json ADDED
@@ -0,0 +1,12 @@
{
  "do_convert_mono": true,
  "do_resample": true,
  "feature_extractor_type": "EfficientATAugmentMelSTFT",
  "freqm": 0,
  "hopsize": 320,
  "n_fft": 1024,
  "n_mels": 128,
  "sample_rate": 32000,
  "timem": 0,
  "win_length": 800
}