Publish AS-20M with MAEB audio-only comparison

- AS-20M.safetensors (+3 lines)
- README.md (+145 lines)
- config.json (+35 lines)
- manifest.json (+79 lines)
- preprocessor_config.json (+12 lines)
AS-20M.safetensors (ADDED, Git LFS pointer)

```text
version https://git-lfs.github.com/spec/v1
oid sha256:77152ea3cd4a9f841cb88230b829cce3fe68afa3d1eef7b41e01f32537859ca3
size 79578032
```
README.md (ADDED)
---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- embedding
- retrieval
- feature-extraction
- efficientat
- matryoshka
- memory-augmentation
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AS-20M

`AS-20M` is a standalone audio + speech embedding encoder for human-memory
augmentation workloads. It uses a native `mn20_as` EfficientAT backbone with
the speech/audio LoRA training merged into the released weights, so inference
does not require loading a separate adapter.

Canonical name:

- `AS` = audio + speech
- `20M` = 19,837,720 loaded parameters, rounded to integer millions

## Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the
EfficientAT mel frontend used during training:

- sample rate: `32000`
- FFT size: `1024`
- window length: `800`
- hop size: `320`
- mel bins: `128`

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles,
truncate and renormalize:

```text
z1280 = l2norm(model(audio))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
```
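
The truncate-and-renormalize rule above can be sketched in plain Python. The
`z` vector here is a stand-in for a real model output (only its length
matters); the function names are illustrative, not this release's API:

```python
import math

def l2norm(v):
    """Scale a vector to unit L2 length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matryoshka_profiles(z1280, dims=(1280, 768, 512, 256, 128)):
    """Truncate the full embedding to each profile width, then renormalize."""
    return {d: l2norm(z1280[:d]) for d in dims}

# Stand-in for a real 1280-d model output; any nonzero vector shows the shapes.
z = l2norm([math.sin(i) + 2.0 for i in range(1280)])
profiles = matryoshka_profiles(z)
```

Note that each truncated profile must be renormalized: a prefix of a
unit-length vector is generally not unit length itself.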

## Artifacts

- `AS-20M.safetensors`: standalone native EfficientAT embedding model
- `config.json`: release and architecture metadata
- `preprocessor_config.json`: waveform and mel frontend contract
- `manifest.json`: file hashes and source checkpoint lineage

## Training Summary

This checkpoint was continued from the balanced native `mn20_as` student and
trained on an audio-heavy mix of synthetic speech/audio alignment data. The
published artifact contains merged weights, not a runtime LoRA adapter.

Source checkpoint:

```text
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

Merged LoRA source:

```text
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```
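
For context, merging a LoRA adapter into base weights generally means folding
the low-rank update into the dense matrix, `W' = W + (alpha / rank) * (B @ A)`.
The sketch below illustrates that identity with plain lists; it is a generic
illustration of the technique, not this project's actual merge code:

```python
def merge_lora(W, A, B, alpha, rank):
    """Generic LoRA merge sketch: W' = W + (alpha / rank) * (B @ A).

    W is d_out x d_in, B is d_out x rank, A is rank x d_in.
    Plain-list arithmetic keeps the sketch dependency-free.
    """
    scale = alpha / rank
    merged = [row[:] for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged

# Tiny worked example: rank-1 update added onto a 2x2 identity weight.
merged = merge_lora(
    W=[[1.0, 0.0], [0.0, 1.0]],
    A=[[0.0, 1.0]],           # rank x d_in
    B=[[1.0], [0.0]],         # d_out x rank
    alpha=2.0,
    rank=1,
)
```

After merging, inference uses only `W'`, which is why the release ships no
runtime adapter.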

## Local Gate Metrics

The checkpoint-local heldout gate reported:

| Metric | Score |
|---|---:|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
| audio-to-text R@1, 1280d | 0.3216 |
| text-to-audio R@1, 1280d | 0.3028 |
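
The R@1 numbers can be read as: for each query embedding, is the
cosine-nearest candidate the paired item? A dependency-free sketch of that
metric (illustrative only; the gate's actual implementation may differ):

```python
def recall_at_1(queries, candidates):
    """Fraction of queries whose cosine-nearest candidate is its pair.

    Assumes queries[i] is paired with candidates[i]; vectors are plain lists.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cos(q, candidates[j]))
        hits += best == i
    return hits / len(queries)
```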

## MAEB Audio-Only Comparison

This comparison uses the same 20 MAEB audio-only tasks for all three
standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
because base `mn20_as` and Whisper-Tiny do not include a compatible text
encoder; no text adapters were invented for those baselines.

Validation: each run completed 20/20 tasks with `exception_count=0`.

| Model | Params | Native output | Mean primary |
|---|---:|---:|---:|
| base `mn20_as` | 17.9M | 1920d audio feature | 0.3977 |
| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
| `AS-20M` | 19.8M | 1280d embedding | 0.4083 |

| Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
|---|---:|---:|---:|
| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |

Interpretation: `AS-20M` is slightly ahead on the 20-task audio-only mean,
while base `mn20_as` remains stronger on several music/general-audio tasks.
Whisper-Tiny is competitive on some speech/language-adjacent tasks, but it is
not a general audio embedding model and is weaker on broad environmental-audio
coverage in this comparison.
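
The "Mean primary" column is the unweighted mean of the 20 per-task primary
scores; for example, the `AS-20M` column checks out:

```python
# Per-task primary scores for AS-20M, in the table's row order.
as20m_scores = [
    0.8349, 0.1730, 0.5475, 0.3351, 0.0943, 0.1799, 0.6230, 0.7747,
    0.7300, 0.7712, 0.8490, 0.0967, 0.3450, 0.5875, 0.1456, 0.0162,
    0.2601, 0.5235, 0.0014, 0.2780,
]
mean_primary = sum(as20m_scores) / len(as20m_scores)
print(round(mean_primary, 4))  # 0.4083, matching the summary table
```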

Artifacts:

- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`

## Limitations

`AS-20M` is an embedding model only. It does not transcribe speech, classify
audio events directly, or include a text encoder in this standalone release
artifact. Text-audio retrieval evaluations use a separate compatible text
encoder/head to score cross-modal alignment.
config.json (ADDED)

```json
{
  "architectures": [
    "EfficientATMn20ASNativeEmbedding"
  ],
  "embedding_dim": 1280,
  "matryoshka_dims": [
    1280,
    768,
    512,
    256,
    128
  ],
  "mel": {
    "freqm": 0,
    "hopsize": 320,
    "n_fft": 1024,
    "n_mels": 128,
    "sample_rate": 32000,
    "timem": 0,
    "win_length": 800
  },
  "merged_lora_source": "triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "modalities": [
    "audio",
    "speech"
  ],
  "model_id": "AS-20M",
  "model_type": "native_efficientat_audio_embedding",
  "normalize_embeddings": true,
  "parameter_count": 19837720,
  "sample_rate": 32000,
  "source_checkpoint_sha256": "f43003f4d8dbc1eaa0095e1f3cab608ecca3309e77f579e5078c269c899ade52",
  "state_tensor_elements": 19886566,
  "student_model": "mn20_as"
}
```
manifest.json (ADDED)

```json
{
  "artifacts": {
    "safetensors": {
      "bytes": 79578032,
      "parameter_count": 19837720,
      "path": "AS-20M.safetensors",
      "sha256": "77152ea3cd4a9f841cb88230b829cce3fe68afa3d1eef7b41e01f32537859ca3",
      "state_tensor_count": 312,
      "state_tensor_elements": 19886566
    }
  },
  "canonical_name_rule": "<modality>-<size>, modalities sorted alphabetically",
  "checkpoint_epoch": 4,
  "checkpoint_metrics": {
    "audio_cos": 0.8108276128768921,
    "embed_pearson": 0.7953315377235413,
    "sim_pearson": 0.88530433177948,
    "student_at_r10_128": 0.6377999782562256,
    "student_at_r10_1280": 0.6571999788284302,
    "student_at_r10_256": 0.6527999639511108,
    "student_at_r10_512": 0.6563999652862549,
    "student_at_r10_768": 0.6570000052452087,
    "student_at_r1_128": 0.29739999771118164,
    "student_at_r1_1280": 0.3215999901294708,
    "student_at_r1_256": 0.31439998745918274,
    "student_at_r1_512": 0.3203999996185303,
    "student_at_r1_768": 0.3215999901294708,
    "student_at_r5_128": 0.5307999849319458,
    "student_at_r5_1280": 0.5541999936103821,
    "student_at_r5_256": 0.550000011920929,
    "student_at_r5_512": 0.5527999997138977,
    "student_at_r5_768": 0.5533999800682068,
    "student_ta_r10_128": 0.649399995803833,
    "student_ta_r10_1280": 0.6651999950408936,
    "student_ta_r10_256": 0.6615999937057495,
    "student_ta_r10_512": 0.6625999808311462,
    "student_ta_r10_768": 0.663599967956543,
    "student_ta_r1_128": 0.2793999910354614,
    "student_ta_r1_1280": 0.3027999997138977,
    "student_ta_r1_256": 0.29919999837875366,
    "student_ta_r1_512": 0.30140000581741333,
    "student_ta_r1_768": 0.30140000581741333,
    "student_ta_r5_128": 0.5397999882698059,
    "student_ta_r5_1280": 0.551800012588501,
    "student_ta_r5_256": 0.5529999732971191,
    "student_ta_r5_512": 0.5523999929428101,
    "student_ta_r5_768": 0.5532000064849854,
    "teacher_at_r10_128": 0.7107999920845032,
    "teacher_at_r10_1280": 0.7335999608039856,
    "teacher_at_r10_256": 0.7277999520301819,
    "teacher_at_r10_512": 0.7299999594688416,
    "teacher_at_r10_768": 0.7333999872207642,
    "teacher_at_r1_128": 0.35819998383522034,
    "teacher_at_r1_1280": 0.3946000039577484,
    "teacher_at_r1_256": 0.3929999768733978,
    "teacher_at_r1_512": 0.3953999876976013,
    "teacher_at_r1_768": 0.3951999843120575,
    "teacher_at_r5_128": 0.6187999844551086,
    "teacher_at_r5_1280": 0.642799973487854,
    "teacher_at_r5_256": 0.6367999911308289,
    "teacher_at_r5_512": 0.640999972820282,
    "teacher_at_r5_768": 0.6407999992370605
  },
  "lora": {
    "alpha": 16.0,
    "dropout": 0.0,
    "rank": 0,
    "targets": []
  },
  "merged_lora_source": "triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "modalities": [
    "audio",
    "speech"
  ],
  "model_id": "AS-20M",
  "size_millions_rounded": 20,
  "source_checkpoint": "triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt",
  "source_checkpoint_sha256": "f43003f4d8dbc1eaa0095e1f3cab608ecca3309e77f579e5078c269c899ade52"
}
```
preprocessor_config.json (ADDED)

```json
{
  "do_convert_mono": true,
  "do_resample": true,
  "feature_extractor_type": "EfficientATAugmentMelSTFT",
  "freqm": 0,
  "hopsize": 320,
  "n_fft": 1024,
  "n_mels": 128,
  "sample_rate": 32000,
  "timem": 0,
  "win_length": 800
}
```
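
Under these settings, a mono clip maps to a `(n_mels, n_frames)` spectrogram
with roughly one frame per hop. The sketch below assumes a center-padded STFT
(the common convention, e.g. torchaudio's default; whether the EfficientAT
frontend pads this way is an assumption), so treat the frame count as
approximate:

```python
def mel_output_shape(n_samples, hopsize=320, n_mels=128):
    """Approximate mel-spectrogram shape (n_mels, n_frames) for a mono clip,
    assuming a center-padded STFT: one frame per hop, plus one."""
    n_frames = 1 + n_samples // hopsize
    return (n_mels, n_frames)

# A 10-second clip at the 32 kHz contract rate -> roughly 1001 frames.
shape = mel_output_shape(10 * 32000)
```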