Publish AS-20M with MAEB audio-only comparison

- AS-20M.safetensors (+3 lines)
- README.md (+145 lines)
- config.json (+35 lines)
- manifest.json (+79 lines)
- preprocessor_config.json (+12 lines)
AS-20M.safetensors (ADDED, Git LFS pointer)

```text
version https://git-lfs.github.com/spec/v1
oid sha256:77152ea3cd4a9f841cb88230b829cce3fe68afa3d1eef7b41e01f32537859ca3
size 79578032
```
README.md (ADDED)
---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- embedding
- retrieval
- feature-extraction
- efficientat
- matryoshka
- memory-augmentation
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AS-20M

`AS-20M` is a standalone audio + speech embedding encoder for human-memory
augmentation workloads. It uses a native `mn20_as` EfficientAT backbone with
the speech/audio LoRA training merged into the released weights, so inference
does not require loading a separate adapter.

Canonical name:

- `AS` = audio + speech
- `20M` = 19,837,720 loaded parameters, rounded to integer millions

## Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the
EfficientAT mel frontend used during training:

- sample rate: `32000`
- FFT size: `1024`
- window length: `800`
- hop size: `320`
- mel bins: `128`

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles,
truncate and renormalize:

```text
z1280 = l2norm(model(audio))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
```
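
The truncate-and-renormalize rule above can be sketched in plain Python. The
`z` vector here is a stand-in for a real model output (only its length
matters); the function names are illustrative, not this release's API:

```python
import math

def l2norm(v):
    """Scale a vector to unit L2 length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matryoshka_profiles(z1280, dims=(1280, 768, 512, 256, 128)):
    """Truncate the full embedding to each profile width, then renormalize."""
    return {d: l2norm(z1280[:d]) for d in dims}

# Stand-in for a real 1280-d model output; any nonzero vector shows the shapes.
z = l2norm([math.sin(i) + 2.0 for i in range(1280)])
profiles = matryoshka_profiles(z)
```

Note that each truncated profile must be renormalized: a prefix of a
unit-length vector is generally not unit length itself.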

## Artifacts

- `AS-20M.safetensors`: standalone native EfficientAT embedding model
- `config.json`: release and architecture metadata
- `preprocessor_config.json`: waveform and mel frontend contract
- `manifest.json`: file hashes and source checkpoint lineage

## Training Summary

This checkpoint was continued from the balanced native `mn20_as` student and
trained on an audio-heavy mix of synthetic speech/audio alignment data. The
published artifact contains merged weights, not a runtime LoRA adapter.

Source checkpoint:

```text
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

Merged LoRA source:

```text
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```
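
For context, merging a LoRA adapter into base weights generally means folding
the low-rank update into the dense matrix, `W' = W + (alpha / rank) * (B @ A)`.
The sketch below illustrates that identity with plain lists; it is a generic
illustration of the technique, not this project's actual merge code:

```python
def merge_lora(W, A, B, alpha, rank):
    """Generic LoRA merge sketch: W' = W + (alpha / rank) * (B @ A).

    W is d_out x d_in, B is d_out x rank, A is rank x d_in.
    Plain-list arithmetic keeps the sketch dependency-free.
    """
    scale = alpha / rank
    merged = [row[:] for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged

# Tiny worked example: rank-1 update added onto a 2x2 identity weight.
merged = merge_lora(
    W=[[1.0, 0.0], [0.0, 1.0]],
    A=[[0.0, 1.0]],           # rank x d_in
    B=[[1.0], [0.0]],         # d_out x rank
    alpha=2.0,
    rank=1,
)
```

After merging, inference uses only `W'`, which is why the release ships no
runtime adapter.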

## Local Gate Metrics

The checkpoint-local heldout gate reported:

| Metric | Score |
|---|---:|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
| audio-to-text R@1, 1280d | 0.3216 |
| text-to-audio R@1, 1280d | 0.3028 |
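
The R@1 numbers can be read as: for each query embedding, is the
cosine-nearest candidate the paired item? A dependency-free sketch of that
metric (illustrative only; the gate's actual implementation may differ):

```python
def recall_at_1(queries, candidates):
    """Fraction of queries whose cosine-nearest candidate is its pair.

    Assumes queries[i] is paired with candidates[i]; vectors are plain lists.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cos(q, candidates[j]))
        hits += best == i
    return hits / len(queries)
```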

## MAEB Audio-Only Comparison

This comparison uses the same 20 MAEB audio-only tasks for all three
standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
because base `mn20_as` and Whisper-Tiny do not include a compatible text
encoder; no text adapters were invented for those baselines.

Validation: each run completed 20/20 tasks with `exception_count=0`.

| Model | Params | Native output | Mean primary |
|---|---:|---:|---:|
| base `mn20_as` | 17.9M | 1920d audio feature | 0.3977 |
| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
| `AS-20M` | 19.8M | 1280d embedding | 0.4083 |

| Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
|---|---:|---:|---:|
| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |

Interpretation: `AS-20M` is slightly ahead on the 20-task audio-only mean,
while base `mn20_as` remains stronger on several music/general-audio tasks.
Whisper-Tiny is competitive on some speech/language-adjacent tasks, but it is
not a general audio embedding model and is weaker on broad environmental-audio
coverage in this comparison.
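
The "Mean primary" column is the unweighted mean of the 20 per-task primary
scores; for example, the `AS-20M` column checks out:

```python
# Per-task primary scores for AS-20M, in the table's row order.
as20m_scores = [
    0.8349, 0.1730, 0.5475, 0.3351, 0.0943, 0.1799, 0.6230, 0.7747,
    0.7300, 0.7712, 0.8490, 0.0967, 0.3450, 0.5875, 0.1456, 0.0162,
    0.2601, 0.5235, 0.0014, 0.2780,
]
mean_primary = sum(as20m_scores) / len(as20m_scores)
print(round(mean_primary, 4))  # 0.4083, matching the summary table
```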

Artifacts:

- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`

## Limitations

`AS-20M` is an embedding model only. It does not transcribe speech, classify
audio events directly, or include a text encoder in this standalone release
artifact. Text-audio retrieval evaluations use a separate compatible text
encoder/head to score cross-modal alignment.
config.json (ADDED)

```json
{
  "architectures": [
    "EfficientATMn20ASNativeEmbedding"
  ],
  "embedding_dim": 1280,
  "matryoshka_dims": [
    1280,
    768,
    512,
    256,
    128
  ],
  "mel": {
    "freqm": 0,
    "hopsize": 320,
    "n_fft": 1024,
    "n_mels": 128,
    "sample_rate": 32000,
    "timem": 0,
    "win_length": 800
  },
  "merged_lora_source": "triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "modalities": [
    "audio",
    "speech"
  ],
  "model_id": "AS-20M",
  "model_type": "native_efficientat_audio_embedding",
  "normalize_embeddings": true,
  "parameter_count": 19837720,
  "sample_rate": 32000,
  "source_checkpoint_sha256": "f43003f4d8dbc1eaa0095e1f3cab608ecca3309e77f579e5078c269c899ade52",
  "state_tensor_elements": 19886566,
  "student_model": "mn20_as"
}
```
manifest.json (ADDED)

```json
{
  "artifacts": {
    "safetensors": {
      "bytes": 79578032,
      "parameter_count": 19837720,
      "path": "AS-20M.safetensors",
      "sha256": "77152ea3cd4a9f841cb88230b829cce3fe68afa3d1eef7b41e01f32537859ca3",
      "state_tensor_count": 312,
      "state_tensor_elements": 19886566
    }
  },
  "canonical_name_rule": "<modality>-<size>, modalities sorted alphabetically",
  "checkpoint_epoch": 4,
  "checkpoint_metrics": {
    "audio_cos": 0.8108276128768921,
    "embed_pearson": 0.7953315377235413,
    "sim_pearson": 0.88530433177948,
    "student_at_r10_128": 0.6377999782562256,
    "student_at_r10_1280": 0.6571999788284302,
    "student_at_r10_256": 0.6527999639511108,
    "student_at_r10_512": 0.6563999652862549,
    "student_at_r10_768": 0.6570000052452087,
    "student_at_r1_128": 0.29739999771118164,
    "student_at_r1_1280": 0.3215999901294708,
    "student_at_r1_256": 0.31439998745918274,
    "student_at_r1_512": 0.3203999996185303,
    "student_at_r1_768": 0.3215999901294708,
    "student_at_r5_128": 0.5307999849319458,
    "student_at_r5_1280": 0.5541999936103821,
    "student_at_r5_256": 0.550000011920929,
    "student_at_r5_512": 0.5527999997138977,
    "student_at_r5_768": 0.5533999800682068,
    "student_ta_r10_128": 0.649399995803833,
    "student_ta_r10_1280": 0.6651999950408936,
    "student_ta_r10_256": 0.6615999937057495,
    "student_ta_r10_512": 0.6625999808311462,
    "student_ta_r10_768": 0.663599967956543,
    "student_ta_r1_128": 0.2793999910354614,
    "student_ta_r1_1280": 0.3027999997138977,
    "student_ta_r1_256": 0.29919999837875366,
    "student_ta_r1_512": 0.30140000581741333,
    "student_ta_r1_768": 0.30140000581741333,
    "student_ta_r5_128": 0.5397999882698059,
    "student_ta_r5_1280": 0.551800012588501,
    "student_ta_r5_256": 0.5529999732971191,
    "student_ta_r5_512": 0.5523999929428101,
    "student_ta_r5_768": 0.5532000064849854,
    "teacher_at_r10_128": 0.7107999920845032,
    "teacher_at_r10_1280": 0.7335999608039856,
    "teacher_at_r10_256": 0.7277999520301819,
    "teacher_at_r10_512": 0.7299999594688416,
    "teacher_at_r10_768": 0.7333999872207642,
    "teacher_at_r1_128": 0.35819998383522034,
    "teacher_at_r1_1280": 0.3946000039577484,
    "teacher_at_r1_256": 0.3929999768733978,
    "teacher_at_r1_512": 0.3953999876976013,
    "teacher_at_r1_768": 0.3951999843120575,
    "teacher_at_r5_128": 0.6187999844551086,
    "teacher_at_r5_1280": 0.642799973487854,
    "teacher_at_r5_256": 0.6367999911308289,
    "teacher_at_r5_512": 0.640999972820282,
    "teacher_at_r5_768": 0.6407999992370605
  },
  "lora": {
    "alpha": 16.0,
    "dropout": 0.0,
    "rank": 0,
    "targets": []
  },
  "merged_lora_source": "triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "modalities": [
    "audio",
    "speech"
  ],
  "model_id": "AS-20M",
  "size_millions_rounded": 20,
  "source_checkpoint": "triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "source_checkpoint_sha256": "f43003f4d8dbc1eaa0095e1f3cab608ecca3309e77f579e5078c269c899ade52"
}
```
preprocessor_config.json (ADDED)

```json
{
  "do_convert_mono": true,
  "do_resample": true,
  "feature_extractor_type": "EfficientATAugmentMelSTFT",
  "freqm": 0,
  "hopsize": 320,
  "n_fft": 1024,
  "n_mels": 128,
  "sample_rate": 32000,
  "timem": 0,
  "win_length": 800
}
```
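
Under these settings, a mono clip maps to a `(n_mels, n_frames)` spectrogram
with roughly one frame per hop. The sketch below assumes a center-padded STFT
(the common convention, e.g. torchaudio's default; whether the EfficientAT
frontend pads this way is an assumption), so treat the frame count as
approximate:

```python
def mel_output_shape(n_samples, hopsize=320, n_mels=128):
    """Approximate mel-spectrogram shape (n_mels, n_frames) for a mono clip,
    assuming a center-padded STFT: one frame per hop, plus one."""
    n_frames = 1 + n_samples // hopsize
    return (n_mels, n_frames)

# A 10-second clip at the 32 kHz contract rate -> roughly 1001 frames.
shape = mel_output_shape(10 * 32000)
```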