gcoderw committed on
Commit c85bf8a · verified · 1 Parent(s): 184a76c

Publish AIST-87M human-memory embedding model

.gitattributes CHANGED
@@ -1,35 +1,2 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
  *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
AIST-87M.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c93839ee0875e75b4dbb91ff510bb736122d666631f50902f6e587f158ebd7ec
+ size 348855664
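The three added lines are a Git LFS pointer, not the weights themselves: they record the spec version, a SHA-256 object ID, and the byte size of the real file. As a minimal sketch, a downloaded copy of `AIST-87M.safetensors` could be checked against this pointer as follows. The helper names `parse_lfs_pointer` and `file_matches_pointer` are hypothetical and not part of this repository or of any Git LFS tooling.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c93839ee0875e75b4dbb91ff510bb736122d666631f50902f6e587f158ebd7ec
size 348855664
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file once and compare its byte count and digest to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

Calling `file_matches_pointer("AIST-87M.safetensors", pointer)` on a local download (the path is an assumption) would return `True` only if both the size and the digest agree with the pointer.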
README.md ADDED
@@ -0,0 +1,155 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - multimodal
+ - embedding
+ - trimodal
+ - retrieval
+ - image-text-audio
+ - audio
+ - speech
+ - memory-augmentation
+ - feature-extraction
+ library_name: pytorch
+ pipeline_tag: feature-extraction
+ datasets:
+ - custom
+ ---
+
+ # AIST-87M
+
+ `AIST-87M` is a compact audio + image + speech + text embedding model for
+ human-memory augmentation workloads.
+
+ It is the single-audio evolution of the earlier dual-audio tower line: the
+ runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead
+ of a separate EfficientAT + Whisper dual branch. The LoRA training weights are
+ merged into the native audio encoder in this release artifact, so there is no
+ separate LoRA pass at inference time.
+
+ Core stack:
+
+ - text: `MongoDB/mdbr-leaf-ir`
+ - image: `mobilenetv4_conv_medium.e180_r384_in12k`
+ - audio: native merged `mn20_as` EfficientAT encoder
+ - projection output: `1280d`
+ - Matryoshka slices: `[1280, 768, 512, 256, 128]`
+ - exact loaded params: `87,118,774`
+
+ The canonical name follows the Augmem naming standard:
+
+ - `AIST` = audio + image + speech + text
+ - `87M` = exact loaded parameter count rounded to integer millions
+
+ ## Runtime Contract
+
+ This model returns L2-normalized embeddings in a shared 1280-dimensional space.
+ For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:
+
+ ```text
+ z1280 = l2norm(model(input))
+ z768 = l2norm(z1280[0:768])
+ z512 = l2norm(z1280[0:512])
+ ```
+
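The truncate-and-renormalize rule in the Runtime Contract can be sketched in plain Python. This is an illustration only, not the release's inference code: `l2norm` mirrors the name used in the README pseudocode, while `matryoshka_slice`, `cosine`, and the synthetic stand-in vector are assumptions made for the example (the real model call is not shown in this commit).

```python
import math

# Slice widths listed in the README's "Matryoshka slices" entry.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def matryoshka_slice(z_full, dim):
    """Keep the first `dim` coordinates of a full embedding, then renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice width: {dim}")
    if len(z_full) < dim:
        raise ValueError("embedding is shorter than the requested slice")
    return l2norm(z_full[:dim])


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-in for l2norm(model(input)); a real embedding would come
# from the released model.
z1280 = l2norm([math.sin(i + 1) for i in range(1280)])
z768 = matryoshka_slice(z1280, 768)
z512 = matryoshka_slice(z1280, 512)
```

Renormalizing after truncation matters because the leading coordinates of a unit vector generally have length below one; without it, dot products between sliced embeddings would no longer be cosine similarities. Queries and stored items should be sliced to the same width before comparison.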
+ The release safetensors file is self-contained and includes the text encoder,
+ image encoder, merged native audio encoder, and the three projection heads.
+
+ ## Evaluation Scope
+
+ This release uses a human-memory evaluation slice rather than a broad
+ leaderboard sweep. The slice is chosen to match practical memory augmentation
+ surfaces:
+
+ - text continuity: duplicate-question and semantic textual similarity tasks
+ - image recall: Flickr30k text-image and image-text retrieval
+ - audio recall: speech/general-audio text-audio retrieval tasks
+
+ Primary metrics:
+
+ - text continuity: `main_score`
+ - image recall: `NDCG@10`
+ - audio recall: `NDCG@10`
+
+ ## Human-Memory Slice
+
+ Source: `aist87m_memory_slice_release_report.md` and
+ `aist87m_memory_slice_release_report.json`.
+
+ | Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
+ |---:|---:|---:|---:|---:|---:|
+ | 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
+ | 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
+ | 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
+
+ Selected 1280d task scores:
+
+ | Task | Family | Metric | Score | R@1 | R@10 |
+ |---|---|---|---:|---:|---:|
+ | SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
+ | STSBenchmark | Text continuity | main_score | 0.651 | - | - |
+ | Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
+ | Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
+ | CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
+ | MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
+ | UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
+ | ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
+
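The family and overall columns of the Human-Memory Slice table can be reproduced from the per-task primary scores. The sketch below uses the unrounded 1280d `primary` values from `aist87m_memory_slice_release_report.json` in this commit; the `summarize` helper is hypothetical. It suggests that the reported overall score is an unweighted mean over the eight tasks rather than a mean of the three family means, which would come out near 0.43 because the four audio tasks would then count for less.

```python
from collections import defaultdict
from statistics import fmean

# (task, family, primary score) for the AIST-87M 1280d run, copied from the
# JSON report added in this commit.
ROWS_1280 = [
    ("SprintDuplicateQuestions", "Text continuity", 0.875145),
    ("STSBenchmark", "Text continuity", 0.651212),
    ("Flickr30kT2IRetrieval", "Image recall", 0.4685),
    ("Flickr30kI2TRetrieval", "Image recall", 0.38149),
    ("CommonVoiceMini21T2ARetrieval", "Audio recall", 0.028403675213675213),
    ("MACST2ARetrieval", "Audio recall", 0.11037),
    ("UrbanSound8KT2ARetrieval", "Audio recall", 0.00851),
    ("ClothoT2ARetrieval", "Audio recall", 0.26944),
]


def summarize(rows):
    """Return (per-family means, unweighted mean over all tasks)."""
    by_family = defaultdict(list)
    for _task, family, score in rows:
        by_family[family].append(score)
    family_means = {family: fmean(scores) for family, scores in by_family.items()}
    overall = fmean(score for _task, _family, score in rows)
    return family_means, overall


family_means, overall = summarize(ROWS_1280)

# Alternative aggregation for comparison: the mean of the three family means.
family_macro = fmean(family_means.values())
```

Rounded to three decimals, `family_means` and `overall` match the 1280 row of the table above.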
+ ## Task-Aligned Comparisons
+
+ Comparisons below are only for locally available, task-aligned runs.
+
+ | Comparison | Dim | Paired tasks | Summary |
+ |---|---:|---:|---|
+ | vs `ES-AIST-81M` | 768 | 8 | lower text continuity; stronger Flickr and selected audio recall |
+ | vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
+ | vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
+ | vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |
+
+ This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
+ Broad diagnostic runs contain many task families that are not part of this
+ release gate.
+
+ ## Architecture
+
+ ```text
+ Text -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
+ Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
+ Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
+ ```
+
+ The audio encoder in this artifact is the merged native checkpoint:
+
+ `mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`
+
+ ## Parameter Count
+
+ | Component | Params |
+ |---|---:|
+ | Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
+ | Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
+ | Audio encoder (merged native `mn20_as`) | 19,886,566 |
+ | Image projection head | 12,306,560 |
+ | Audio projection head | 12,306,560 |
+ | Text projection head | 11,323,520 |
+ | **Total exact loaded params** | **87,118,774** |
+
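As a quick arithmetic check, the component counts in the Parameter Count table sum to the stated total, and rounding that total to integer millions gives the `87M` suffix described by the naming rule. The dictionary keys and the `rounded_millions` helper below are labels chosen for this sketch, not identifiers from the release.

```python
# Component counts copied from the Parameter Count table.
PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}


def rounded_millions(count):
    """Format a parameter count as integer millions, e.g. 87118774 -> '87M'."""
    return f"{round(count / 1_000_000)}M"


total = sum(PARAMS.values())
model_name = f"AIST-{rounded_millions(total)}"

# Parameters held by the three projection heads versus the three encoders.
head_params = sum(v for k, v in PARAMS.items() if k.endswith("projection_head"))
encoder_params = total - head_params
```

By these figures the three projection heads together hold 35,936,640 parameters, about 41% of the model.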
+ ## Files
+
+ | File | Purpose |
+ |---|---|
+ | `AIST-87M.safetensors` | Self-contained release artifact |
+ | `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
+ | `parameter_breakdown.json` | Exact parameter accounting |
+ | `aist87m_memory_slice_release_report.md` | Human-memory slice report |
+ | `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
+
+ ## Caveats
+
+ - The model is optimized and reported for memory-relevant embedding surfaces,
+ not broad leaderboard coverage.
+ - The single-audio path is smaller and simpler than the dual-audio tower, but
+ it does not dominate the dual-audio tower on paired diagnostic scores.
+ - 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.
aist87m_memory_slice_release_report.json ADDED
@@ -0,0 +1,1275 @@
+ {
+ "tasks": [
+ "SprintDuplicateQuestions",
+ "STSBenchmark",
+ "Flickr30kT2IRetrieval",
+ "Flickr30kI2TRetrieval",
+ "CommonVoiceMini21T2ARetrieval",
+ "MACST2ARetrieval",
+ "UrbanSound8KT2ARetrieval",
+ "ClothoT2ARetrieval"
+ ],
+ "primary_metric_policy": {
+ "text": "main_score",
+ "image_text_retrieval": "ndcg_at_10",
+ "audio_text_retrieval": "ndcg_at_10"
+ },
+ "runs": [
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "results_dir": "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_default/dim1280/results/triembed__te-1280d/best_model",
+ "completed_tasks": 8,
+ "missing_tasks": [],
+ "overall_mean": 0.3491338344017094,
+ "family_means": {
+ "Audio recall": 0.1041809188034188,
+ "Image recall": 0.424995,
+ "Text continuity": 0.7631785
+ },
+ "rows": [
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.875145,
+ "metrics": {
+ "main_score": 0.875145
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.651212,
+ "metrics": {
+ "main_score": 0.651212,
+ "cosine_spearman": 0.651212,
+ "spearman": 0.651212
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.4685,
+ "metrics": {
+ "main_score": 0.4685,
+ "ndcg_at_10": 0.4685,
+ "recall_at_1": 0.2956,
+ "recall_at_10": 0.6718,
+ "mrr_at_10": 0.405197
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.38149,
+ "metrics": {
+ "main_score": 0.38149,
+ "ndcg_at_10": 0.38149,
+ "recall_at_1": 0.0816,
+ "recall_at_10": 0.4072,
+ "mrr_at_10": 0.533862
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.028403675213675213,
+ "metrics": {
+ "main_score": 0.03276290598290598,
+ "ndcg_at_10": 0.028403675213675213,
+ "recall_at_1": 0.005908376068376069,
+ "recall_at_10": 0.061962393162393166,
+ "mrr_at_10": 0.01842434188034188
+ },
+ "subsets": 117
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "MACST2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.11037,
+ "metrics": {
+ "main_score": 0.13995,
+ "ndcg_at_10": 0.11037,
+ "recall_at_1": 0.03308,
+ "recall_at_10": 0.21374,
+ "mrr_at_10": 0.079078
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "UrbanSound8KT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.00851,
+ "metrics": {
+ "main_score": 0.00963,
+ "ndcg_at_10": 0.00851,
+ "recall_at_1": 0.00196,
+ "recall_at_10": 0.01847,
+ "mrr_at_10": 0.00556
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 1280",
+ "dimension": 1280,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.26944,
+ "metrics": {
+ "main_score": 0.3325,
+ "ndcg_at_10": 0.26944,
+ "recall_at_1": 0.1282,
+ "recall_at_10": 0.44315,
+ "mrr_at_10": 0.215861
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_default/dim1280/results/triembed__te-1280d/best_model"
+ ]
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "results_dir": "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_default/dim768/results/triembed__te-768d/best_model",
+ "completed_tasks": 8,
+ "missing_tasks": [],
+ "overall_mean": 0.34871195512820513,
+ "family_means": {
+ "Audio recall": 0.10426891025641025,
+ "Image recall": 0.423815,
+ "Text continuity": 0.7624949999999999
+ },
+ "rows": [
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.874231,
+ "metrics": {
+ "main_score": 0.874231
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.650759,
+ "metrics": {
+ "main_score": 0.650759,
+ "cosine_spearman": 0.650759,
+ "spearman": 0.650759
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.46701,
+ "metrics": {
+ "main_score": 0.46701,
+ "ndcg_at_10": 0.46701,
+ "recall_at_1": 0.2922,
+ "recall_at_10": 0.6712,
+ "mrr_at_10": 0.403385
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.38062,
+ "metrics": {
+ "main_score": 0.38062,
+ "ndcg_at_10": 0.38062,
+ "recall_at_1": 0.0814,
+ "recall_at_10": 0.4058,
+ "mrr_at_10": 0.532687
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.028395641025641027,
+ "metrics": {
+ "main_score": 0.03299991452991453,
+ "ndcg_at_10": 0.028395641025641027,
+ "recall_at_1": 0.005907350427350427,
+ "recall_at_10": 0.062035897435897436,
+ "mrr_at_10": 0.01839460683760684
+ },
+ "subsets": 117
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "MACST2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.11149,
+ "metrics": {
+ "main_score": 0.14249,
+ "ndcg_at_10": 0.11149,
+ "recall_at_1": 0.03308,
+ "recall_at_10": 0.21628,
+ "mrr_at_10": 0.079723
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "UrbanSound8KT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.00851,
+ "metrics": {
+ "main_score": 0.00963,
+ "ndcg_at_10": 0.00851,
+ "recall_at_1": 0.00196,
+ "recall_at_10": 0.01847,
+ "mrr_at_10": 0.005562
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 768",
+ "dimension": 768,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.26868,
+ "metrics": {
+ "main_score": 0.33178,
+ "ndcg_at_10": 0.26868,
+ "recall_at_1": 0.12695,
+ "recall_at_10": 0.44208,
+ "mrr_at_10": 0.21516
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_default/dim768/results/triembed__te-768d/best_model",
+ "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_dim768_fill/dim768/results/triembed__te-768d/best_model"
+ ]
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "results_dir": "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_dim512/dim512/results/triembed__te-512d/best_model",
+ "completed_tasks": 8,
+ "missing_tasks": [],
+ "overall_mean": 0.3488224732905983,
+ "family_means": {
+ "Audio recall": 0.10438869658119658,
+ "Image recall": 0.42417499999999997,
+ "Text continuity": 0.7623375
+ },
+ "rows": [
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.873508,
+ "metrics": {
+ "main_score": 0.873508
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.651167,
+ "metrics": {
+ "main_score": 0.651167,
+ "cosine_spearman": 0.651167,
+ "spearman": 0.651167
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.4676,
+ "metrics": {
+ "main_score": 0.4676,
+ "ndcg_at_10": 0.4676,
+ "recall_at_1": 0.2954,
+ "recall_at_10": 0.6702,
+ "mrr_at_10": 0.404515
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.38075,
+ "metrics": {
+ "main_score": 0.38075,
+ "ndcg_at_10": 0.38075,
+ "recall_at_1": 0.0824,
+ "recall_at_10": 0.4052,
+ "mrr_at_10": 0.535146
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.028264786324786326,
+ "metrics": {
+ "main_score": 0.03229504273504274,
+ "ndcg_at_10": 0.028264786324786326,
+ "recall_at_1": 0.006467948717948718,
+ "recall_at_10": 0.060837521367521366,
+ "mrr_at_10": 0.018573598290598292
+ },
+ "subsets": 117
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "MACST2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.11287,
+ "metrics": {
+ "main_score": 0.13486,
+ "ndcg_at_10": 0.11287,
+ "recall_at_1": 0.03308,
+ "recall_at_10": 0.22137,
+ "mrr_at_10": 0.080181
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "UrbanSound8KT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.0085,
+ "metrics": {
+ "main_score": 0.00923,
+ "ndcg_at_10": 0.0085,
+ "recall_at_1": 0.00196,
+ "recall_at_10": 0.01847,
+ "mrr_at_10": 0.005544
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-87M 512",
+ "dimension": 512,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.26792,
+ "metrics": {
+ "main_score": 0.33107,
+ "ndcg_at_10": 0.26792,
+ "recall_at_1": 0.1248,
+ "recall_at_10": 0.44261,
+ "mrr_at_10": 0.213985
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist81m_raw1280_mn20_merged_teacher_20260503T0125Z_memory_slice_dim512/dim512/results/triembed__te-512d/best_model"
+ ]
+ },
+ {
+ "label": "AIST-95M 1280 Flickr",
+ "dimension": 1280,
+ "results_dir": "/shared/augmem/triembed/results/aist95m_1280_mieb_flickr_20260502T0217Z/dim1280/results/triembed__te-1280d/best_model",
+ "completed_tasks": 2,
+ "missing_tasks": [
+ "ClothoT2ARetrieval",
+ "CommonVoiceMini21T2ARetrieval",
+ "MACST2ARetrieval",
+ "STSBenchmark",
+ "SprintDuplicateQuestions",
+ "UrbanSound8KT2ARetrieval"
+ ],
+ "overall_mean": 0.485,
+ "family_means": {
+ "Image recall": 0.485
+ },
+ "rows": [
+ {
+ "label": "AIST-95M 1280 Flickr",
+ "dimension": 1280,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.50216,
+ "metrics": {
+ "main_score": 0.50216,
+ "ndcg_at_10": 0.50216,
+ "recall_at_1": 0.3254,
+ "recall_at_10": 0.7004,
+ "mrr_at_10": 0.439975
+ },
+ "subsets": 1
+ },
+ {
+ "label": "AIST-95M 1280 Flickr",
+ "dimension": 1280,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.46784,
+ "metrics": {
+ "main_score": 0.46784,
+ "ndcg_at_10": 0.46784,
+ "recall_at_1": 0.0958,
+ "recall_at_10": 0.5034,
+ "mrr_at_10": 0.598869
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist95m_1280_mieb_flickr_20260502T0217Z/dim1280/results/triembed__te-1280d/best_model"
+ ]
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "results_dir": "/shared/augmem/triembed/results/es_aist_memory_slice_default_20260501T1835Z/dim768/results/triembed__te-768d/best_model",
+ "completed_tasks": 8,
+ "missing_tasks": [],
+ "overall_mean": 0.30764677777777777,
+ "family_means": {
+ "Audio recall": 0.06462555555555556,
+ "Image recall": 0.271195,
+ "Text continuity": 0.830141
+ },
+ "rows": [
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.916128,
+ "metrics": {
+ "main_score": 0.916128
+ },
+ "subsets": 1
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.744154,
+ "metrics": {
+ "main_score": 0.744154,
+ "cosine_spearman": 0.744154,
+ "spearman": 0.744154
+ },
+ "subsets": 1
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.34676,
+ "metrics": {
+ "main_score": 0.34676,
+ "ndcg_at_10": 0.34676,
+ "recall_at_1": 0.1764,
+ "recall_at_10": 0.5528,
+ "mrr_at_10": 0.282987
+ },
+ "subsets": 1
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.19563,
+ "metrics": {
+ "main_score": 0.19563,
+ "ndcg_at_10": 0.19563,
+ "recall_at_1": 0.037,
+ "recall_at_10": 0.2208,
+ "mrr_at_10": 0.295532
+ },
+ "subsets": 1
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.024182222222222223,
+ "metrics": {
+ "main_score": 0.02774760683760684,
+ "ndcg_at_10": 0.024182222222222223,
+ "recall_at_1": 0.005472478632478632,
+ "recall_at_10": 0.052400170940170944,
+ "mrr_at_10": 0.01579925641025641
+ },
+ "subsets": 117
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "MACST2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.07729,
+ "metrics": {
+ "main_score": 0.08906,
+ "ndcg_at_10": 0.07729,
+ "recall_at_1": 0.0229,
+ "recall_at_10": 0.15013,
+ "mrr_at_10": 0.055219
+ },
+ "subsets": 1
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "UrbanSound8KT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.007,
+ "metrics": {
+ "main_score": 0.00747,
+ "ndcg_at_10": 0.007,
+ "recall_at_1": 0.00098,
+ "recall_at_10": 0.01631,
+ "mrr_at_10": 0.004257
+ },
+ "subsets": 1
+ },
+ {
+ "label": "ES-AIST-81M 768",
+ "dimension": 768,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.15003,
+ "metrics": {
+ "main_score": 0.17601,
+ "ndcg_at_10": 0.15003,
+ "recall_at_1": 0.05121,
+ "recall_at_10": 0.28612,
+ "mrr_at_10": 0.108968
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/es_aist_memory_slice_default_20260501T1835Z/dim768/results/triembed__te-768d/best_model"
+ ]
+ },
+ {
+ "label": "Native mn20 audio 768",
+ "dimension": 768,
+ "results_dir": "/shared/augmem/triembed/results/es_aist_memory_audio_native_default_20260501T1835Z/dim768/results/triembed__native-efficientat-768d/latest_model",
+ "completed_tasks": 4,
+ "missing_tasks": [
+ "Flickr30kI2TRetrieval",
+ "Flickr30kT2IRetrieval",
+ "STSBenchmark",
+ "SprintDuplicateQuestions"
+ ],
+ "overall_mean": 0.11513626068376069,
+ "family_means": {
+ "Audio recall": 0.11513626068376069
+ },
+ "rows": [
+ {
+ "label": "Native mn20 audio 768",
+ "dimension": 768,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.035825042735042736,
+ "metrics": {
+ "main_score": 0.04166820512820513,
+ "ndcg_at_10": 0.035825042735042736,
+ "recall_at_1": 0.009125726495726495,
+ "recall_at_10": 0.07585017094017094,
+ "mrr_at_10": 0.023907692307692307
+ },
+ "subsets": 117
+ },
+ {
+ "label": "Native mn20 audio 768",
+ "dimension": 768,
+ "task": "MACST2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.12746,
+ "metrics": {
+ "main_score": 0.13995,
+ "ndcg_at_10": 0.12746,
+ "recall_at_1": 0.05852,
+ "recall_at_10": 0.22392,
+ "mrr_at_10": 0.098715
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Native mn20 audio 768",
+ "dimension": 768,
+ "task": "UrbanSound8KT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.00849,
+ "metrics": {
+ "main_score": 0.00923,
+ "ndcg_at_10": 0.00849,
+ "recall_at_1": 0.00196,
+ "recall_at_10": 0.01866,
+ "mrr_at_10": 0.005487
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Native mn20 audio 768",
+ "dimension": 768,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.28877,
+ "metrics": {
+ "main_score": 0.3581,
+ "ndcg_at_10": 0.28877,
+ "recall_at_1": 0.14414,
+ "recall_at_10": 0.4641,
+ "mrr_at_10": 0.234475
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/es_aist_memory_audio_native_default_20260501T1835Z/dim768/results/triembed__native-efficientat-768d/latest_model"
+ ]
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "results_dir": "/shared/augmem/triembed/results/aist86m_full_mteb_mieb_maeb_1280_768_512_20260502T070609Z/dim1280/results/triembed__te-1280d/TE-86M-dual-audio-best_model",
+ "completed_tasks": 8,
+ "missing_tasks": [],
+ "overall_mean": 0.3973782852564103,
+ "family_means": {
+ "Audio recall": 0.11287532051282051,
+ "Image recall": 0.485,
+ "Text continuity": 0.8787625
+ },
+ "rows": [
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.953368,
+ "metrics": {
+ "main_score": 0.953368
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.804157,
+ "metrics": {
+ "main_score": 0.804157,
+ "cosine_spearman": 0.804157,
+ "spearman": 0.804154
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.50216,
+ "metrics": {
+ "main_score": 0.50216,
+ "ndcg_at_10": 0.50216,
+ "recall_at_1": 0.3254,
+ "recall_at_10": 0.7004,
+ "mrr_at_10": 0.439975
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.46784,
+ "metrics": {
+ "main_score": 0.46784,
+ "ndcg_at_10": 0.46784,
+ "recall_at_1": 0.0958,
+ "recall_at_10": 0.5034,
+ "mrr_at_10": 0.598869
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.03849128205128205,
+ "metrics": {
+ "main_score": 0.04426282051282051,
+ "ndcg_at_10": 0.03849128205128205,
+ "recall_at_1": 0.00971991452991453,
+ "recall_at_10": 0.08076905982905982,
+ "mrr_at_10": 0.02587371794871795
+ },
+ "subsets": 117
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "MACST2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.10964,
+ "metrics": {
+ "main_score": 0.15522,
+ "ndcg_at_10": 0.10964,
+ "recall_at_1": 0.04326,
+ "recall_at_10": 0.19338,
+ "mrr_at_10": 0.083683
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "UrbanSound8KT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.00823,
+ "metrics": {
+ "main_score": 0.00904,
+ "ndcg_at_10": 0.00823,
+ "recall_at_1": 0.00177,
+ "recall_at_10": 0.01807,
+ "mrr_at_10": 0.00531
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 1280",
+ "dimension": 1280,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.29514,
+ "metrics": {
+ "main_score": 0.36043,
+ "ndcg_at_10": 0.29514,
+ "recall_at_1": 0.14861,
+ "recall_at_10": 0.47395,
+ "mrr_at_10": 0.239903
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist86m_full_mteb_mieb_maeb_1280_768_512_20260502T070609Z/dim1280/results/triembed__te-1280d/TE-86M-dual-audio-best_model"
+ ]
+ },
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "results_dir": "/shared/augmem/triembed/results/aist86m_full_mteb_mieb_maeb_1280_768_512_20260502T070609Z/dim768/results/triembed__te-768d/TE-86M-dual-audio-best_model",
+ "completed_tasks": 6,
+ "missing_tasks": [
+ "MACST2ARetrieval",
+ "UrbanSound8KT2ARetrieval"
+ ],
+ "overall_mean": 0.5098147193732193,
+ "family_means": {
+ "Audio recall": 0.16678465811965812,
+ "Image recall": 0.48403999999999997,
+ "Text continuity": 0.8786195
+ },
+ "rows": [
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.953072,
+ "metrics": {
+ "main_score": 0.953072
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.804167,
+ "metrics": {
+ "main_score": 0.804167,
+ "cosine_spearman": 0.804167,
+ "spearman": 0.804167
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "task": "Flickr30kT2IRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.50179,
+ "metrics": {
+ "main_score": 0.50179,
+ "ndcg_at_10": 0.50179,
+ "recall_at_1": 0.3254,
+ "recall_at_10": 0.698,
+ "mrr_at_10": 0.440147
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "task": "Flickr30kI2TRetrieval",
+ "family": "Image recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.46629,
+ "metrics": {
+ "main_score": 0.46629,
+ "ndcg_at_10": 0.46629,
+ "recall_at_1": 0.0956,
+ "recall_at_10": 0.5022,
+ "mrr_at_10": 0.597365
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.03849931623931624,
+ "metrics": {
+ "main_score": 0.04466316239316239,
+ "ndcg_at_10": 0.03849931623931624,
+ "recall_at_1": 0.009814871794871794,
+ "recall_at_10": 0.08058384615384616,
+ "mrr_at_10": 0.025928871794871796
+ },
+ "subsets": 117
+ },
+ {
+ "label": "Dual-audio tower 768",
+ "dimension": 768,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.29507,
+ "metrics": {
+ "main_score": 0.3615,
+ "ndcg_at_10": 0.29507,
+ "recall_at_1": 0.14861,
+ "recall_at_10": 0.47359,
+ "mrr_at_10": 0.239883
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist86m_full_mteb_mieb_maeb_1280_768_512_20260502T070609Z/dim768/results/triembed__te-768d/TE-86M-dual-audio-best_model"
+ ]
+ },
+ {
+ "label": "Dual-audio tower 512",
+ "dimension": 512,
+ "results_dir": "/shared/augmem/triembed/results/aist86m_full_mteb_mieb_maeb_1280_768_512_20260502T070609Z/dim512/results/triembed__te-512d/TE-86M-dual-audio-best_model",
+ "completed_tasks": 4,
+ "missing_tasks": [
+ "Flickr30kI2TRetrieval",
+ "Flickr30kT2IRetrieval",
+ "MACST2ARetrieval",
+ "UrbanSound8KT2ARetrieval"
+ ],
+ "overall_mean": 0.5228179594017094,
+ "family_means": {
+ "Audio recall": 0.16697341880341882,
+ "Text continuity": 0.8786625
+ },
+ "rows": [
+ {
+ "label": "Dual-audio tower 512",
+ "dimension": 512,
+ "task": "SprintDuplicateQuestions",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.952893,
+ "metrics": {
+ "main_score": 0.952893
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 512",
+ "dimension": 512,
+ "task": "STSBenchmark",
+ "family": "Text continuity",
+ "primary_metric": "main_score",
+ "primary": 0.804432,
+ "metrics": {
+ "main_score": 0.804432,
+ "cosine_spearman": 0.804432,
+ "spearman": 0.804432
+ },
+ "subsets": 1
+ },
+ {
+ "label": "Dual-audio tower 512",
+ "dimension": 512,
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.03858683760683761,
+ "metrics": {
+ "main_score": 0.04408854700854701,
+ "ndcg_at_10": 0.03858683760683761,
+ "recall_at_1": 0.00959076923076923,
+ "recall_at_10": 0.08129623931623932,
+ "mrr_at_10": 0.025843299145299144
+ },
+ "subsets": 117
+ },
+ {
+ "label": "Dual-audio tower 512",
+ "dimension": 512,
+ "task": "ClothoT2ARetrieval",
+ "family": "Audio recall",
+ "primary_metric": "ndcg_at_10",
+ "primary": 0.29536,
+ "metrics": {
+ "main_score": 0.35882,
+ "ndcg_at_10": 0.29536,
+ "recall_at_1": 0.1513,
+ "recall_at_10": 0.47162,
+ "mrr_at_10": 0.240905
+ },
+ "subsets": 1
+ }
+ ],
+ "source_result_dirs": [
+ "/shared/augmem/triembed/results/aist86m_full_mteb_mieb_maeb_1280_768_512_20260502T070609Z/dim512/results/triembed__te-512d/TE-86M-dual-audio-best_model"
+ ]
+ }
+ ],
+ "comparisons": [
+ {
+ "baseline": "ES-AIST-81M 768",
+ "target": "AIST-87M 768",
+ "paired_tasks": 8,
+ "mean_absolute_delta": 0.041065177350427334,
+ "rows": [
+ {
+ "task": "SprintDuplicateQuestions",
+ "dimension": 768,
+ "family": "Text continuity",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.916128,
+ "target": "AIST-87M 768",
+ "target_primary": 0.874231,
+ "absolute_delta": -0.04189700000000007,
+ "relative_delta_pct": -4.573269237486473
+ },
+ {
+ "task": "STSBenchmark",
+ "dimension": 768,
+ "family": "Text continuity",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.744154,
+ "target": "AIST-87M 768",
+ "target_primary": 0.650759,
+ "absolute_delta": -0.093395,
+ "relative_delta_pct": -12.550493580629817
+ },
+ {
+ "task": "Flickr30kT2IRetrieval",
+ "dimension": 768,
+ "family": "Image recall",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.34676,
+ "target": "AIST-87M 768",
+ "target_primary": 0.46701,
+ "absolute_delta": 0.12024999999999997,
+ "relative_delta_pct": 34.67816357134617
+ },
+ {
+ "task": "Flickr30kI2TRetrieval",
+ "dimension": 768,
+ "family": "Image recall",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.19563,
+ "target": "AIST-87M 768",
+ "target_primary": 0.38062,
+ "absolute_delta": 0.18499000000000002,
+ "relative_delta_pct": 94.56116137606708
+ },
+ {
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.024182222222222223,
+ "target": "AIST-87M 768",
+ "target_primary": 0.028395641025641027,
+ "absolute_delta": 0.004213418803418804,
+ "relative_delta_pct": 17.423621223474193
+ },
+ {
+ "task": "MACST2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.07729,
+ "target": "AIST-87M 768",
+ "target_primary": 0.11149,
+ "absolute_delta": 0.03420000000000001,
+ "relative_delta_pct": 44.24893259153838
+ },
+ {
+ "task": "UrbanSound8KT2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.007,
+ "target": "AIST-87M 768",
+ "target_primary": 0.00851,
+ "absolute_delta": 0.00151,
+ "relative_delta_pct": 21.571428571428573
+ },
+ {
+ "task": "ClothoT2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "ES-AIST-81M 768",
+ "baseline_primary": 0.15003,
+ "target": "AIST-87M 768",
+ "target_primary": 0.26868,
+ "absolute_delta": 0.11864999999999998,
+ "relative_delta_pct": 79.08418316336731
+ }
+ ]
+ },
+ {
+ "baseline": "Native mn20 audio 768",
+ "target": "AIST-87M 768",
+ "paired_tasks": 4,
+ "mean_absolute_delta": -0.010867350427350436,
+ "rows": [
+ {
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "Native mn20 audio 768",
+ "baseline_primary": 0.035825042735042736,
+ "target": "AIST-87M 768",
+ "target_primary": 0.028395641025641027,
+ "absolute_delta": -0.00742940170940171,
+ "relative_delta_pct": -20.73801213399403
+ },
+ {
+ "task": "MACST2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "Native mn20 audio 768",
+ "baseline_primary": 0.12746,
+ "target": "AIST-87M 768",
+ "target_primary": 0.11149,
+ "absolute_delta": -0.015969999999999984,
+ "relative_delta_pct": -12.529420994821894
+ },
+ {
+ "task": "UrbanSound8KT2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "Native mn20 audio 768",
+ "baseline_primary": 0.00849,
+ "target": "AIST-87M 768",
+ "target_primary": 0.00851,
+ "absolute_delta": 2.000000000000092e-05,
+ "relative_delta_pct": 0.2355712603062535
+ },
+ {
+ "task": "ClothoT2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "Native mn20 audio 768",
+ "baseline_primary": 0.28877,
+ "target": "AIST-87M 768",
+ "target_primary": 0.26868,
+ "absolute_delta": -0.020090000000000052,
+ "relative_delta_pct": -6.9570938809433285
+ }
+ ]
+ },
+ {
+ "baseline": "Dual-audio tower 768",
+ "target": "AIST-87M 768",
+ "paired_tasks": 6,
+ "mean_absolute_delta": -0.06486544586894587,
+ "rows": [
+ {
+ "task": "SprintDuplicateQuestions",
+ "dimension": 768,
+ "family": "Text continuity",
+ "baseline": "Dual-audio tower 768",
+ "baseline_primary": 0.953072,
+ "target": "AIST-87M 768",
+ "target_primary": 0.874231,
+ "absolute_delta": -0.07884100000000005,
+ "relative_delta_pct": -8.272302617220948
+ },
+ {
+ "task": "STSBenchmark",
+ "dimension": 768,
+ "family": "Text continuity",
+ "baseline": "Dual-audio tower 768",
+ "baseline_primary": 0.804167,
+ "target": "AIST-87M 768",
+ "target_primary": 0.650759,
+ "absolute_delta": -0.153408,
+ "relative_delta_pct": -19.076634579633335
+ },
+ {
+ "task": "Flickr30kT2IRetrieval",
+ "dimension": 768,
+ "family": "Image recall",
+ "baseline": "Dual-audio tower 768",
+ "baseline_primary": 0.50179,
+ "target": "AIST-87M 768",
+ "target_primary": 0.46701,
+ "absolute_delta": -0.03477999999999998,
+ "relative_delta_pct": -6.931186352856769
+ },
+ {
+ "task": "Flickr30kI2TRetrieval",
+ "dimension": 768,
+ "family": "Image recall",
+ "baseline": "Dual-audio tower 768",
+ "baseline_primary": 0.46629,
+ "target": "AIST-87M 768",
+ "target_primary": 0.38062,
+ "absolute_delta": -0.08566999999999997,
+ "relative_delta_pct": -18.372686525552762
+ },
+ {
+ "task": "CommonVoiceMini21T2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "Dual-audio tower 768",
+ "baseline_primary": 0.03849931623931624,
+ "target": "AIST-87M 768",
+ "target_primary": 0.028395641025641027,
+ "absolute_delta": -0.010103675213675212,
+ "relative_delta_pct": -26.243778333281533
+ },
+ {
+ "task": "ClothoT2ARetrieval",
+ "dimension": 768,
+ "family": "Audio recall",
+ "baseline": "Dual-audio tower 768",
+ "baseline_primary": 0.29507,
+ "target": "AIST-87M 768",
+ "target_primary": 0.26868,
+ "absolute_delta": -0.026390000000000025,
+ "relative_delta_pct": -8.943640492086633
+ }
+ ]
+ }
+ ]
+ }
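
Editorial note: the `family_means` and `overall_mean` fields in the JSON above appear to be plain unweighted means of each row's `primary` value. The sketch below recomputes them for the "Dual-audio tower 1280" run from the numbers shown in the diff. The helper name `summarize` is hypothetical and is not taken from the repository; the real aggregation script is not part of this commit view.

```python
from collections import defaultdict
from statistics import fmean

# Primary scores copied from the "Dual-audio tower 1280" rows in the diff.
rows = [
    {"task": "SprintDuplicateQuestions", "family": "Text continuity", "primary": 0.953368},
    {"task": "STSBenchmark", "family": "Text continuity", "primary": 0.804157},
    {"task": "Flickr30kT2IRetrieval", "family": "Image recall", "primary": 0.50216},
    {"task": "Flickr30kI2TRetrieval", "family": "Image recall", "primary": 0.46784},
    {"task": "CommonVoiceMini21T2ARetrieval", "family": "Audio recall", "primary": 0.03849128205128205},
    {"task": "MACST2ARetrieval", "family": "Audio recall", "primary": 0.10964},
    {"task": "UrbanSound8KT2ARetrieval", "family": "Audio recall", "primary": 0.00823},
    {"task": "ClothoT2ARetrieval", "family": "Audio recall", "primary": 0.29514},
]


def summarize(rows):
    """Return (family_means, overall_mean) as unweighted means of `primary`.

    Hypothetical helper: it assumes every completed task counts equally,
    which matches the values reported in the JSON for this run.
    """
    by_family = defaultdict(list)
    for row in rows:
        by_family[row["family"]].append(row["primary"])
    family_means = {family: fmean(scores) for family, scores in by_family.items()}
    overall_mean = fmean([row["primary"] for row in rows])
    return family_means, overall_mean


family_means, overall_mean = summarize(rows)
print(family_means)
print(overall_mean)
```

Because the overall mean is taken over completed tasks only, a run that skips its weakest tasks (for example the 768 and 512 dual-audio runs, which omit two or four tasks) reports a higher `overall_mean` without being a better model.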
aist87m_memory_slice_release_report.md ADDED
@@ -0,0 +1,85 @@
+ # AIST-87M Human-Memory Evaluation Slice
+
+ Primary metrics are `main_score` for text continuity tasks and `NDCG@10` for image/audio retrieval tasks. Family columns are unweighted means of the primary metric over the completed tasks in that family, and Overall is the unweighted mean over all completed tasks, so Overall is not directly comparable between runs with different missing tasks.
+
+ | Run | Dim | Tasks | Text continuity | Image recall | Audio recall | Overall | Missing |
+ |---|---:|---:|---:|---:|---:|---:|---|
+ | AIST-87M 1280 | 1280 | 8 | 0.763 | 0.425 | 0.104 | 0.349 | none |
+ | AIST-87M 768 | 768 | 8 | 0.762 | 0.424 | 0.104 | 0.349 | none |
+ | AIST-87M 512 | 512 | 8 | 0.762 | 0.424 | 0.104 | 0.349 | none |
+ | AIST-95M 1280 Flickr | 1280 | 2 | - | 0.485 | - | 0.485 | ClothoT2ARetrieval, CommonVoiceMini21T2ARetrieval, MACST2ARetrieval, STSBenchmark, SprintDuplicateQuestions, UrbanSound8KT2ARetrieval |
+ | ES-AIST-81M 768 | 768 | 8 | 0.830 | 0.271 | 0.065 | 0.308 | none |
+ | Native mn20 audio 768 | 768 | 4 | - | - | 0.115 | 0.115 | Flickr30kI2TRetrieval, Flickr30kT2IRetrieval, STSBenchmark, SprintDuplicateQuestions |
+ | Dual-audio tower 1280 | 1280 | 8 | 0.879 | 0.485 | 0.113 | 0.397 | none |
+ | Dual-audio tower 768 | 768 | 6 | 0.879 | 0.484 | 0.167 | 0.510 | MACST2ARetrieval, UrbanSound8KT2ARetrieval |
+ | Dual-audio tower 512 | 512 | 4 | 0.879 | - | 0.167 | 0.523 | Flickr30kI2TRetrieval, Flickr30kT2IRetrieval, MACST2ARetrieval, UrbanSound8KT2ARetrieval |
+
+ ## AIST-87M Per-Task Scores
+
+ | Dim | Task | Family | Metric | Score | R@1 | R@10 | MRR@10 |
+ |---:|---|---|---|---:|---:|---:|---:|
+ | 1280 | SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - | - |
+ | 1280 | STSBenchmark | Text continuity | main_score | 0.651 | - | - | - |
+ | 1280 | Flickr30kT2IRetrieval | Image recall | ndcg_at_10 | 0.469 | 0.296 | 0.672 | 0.405 |
+ | 1280 | Flickr30kI2TRetrieval | Image recall | ndcg_at_10 | 0.381 | 0.082 | 0.407 | 0.534 |
+ | 1280 | CommonVoiceMini21T2ARetrieval | Audio recall | ndcg_at_10 | 0.028 | 0.006 | 0.062 | 0.018 |
+ | 1280 | MACST2ARetrieval | Audio recall | ndcg_at_10 | 0.110 | 0.033 | 0.214 | 0.079 |
+ | 1280 | UrbanSound8KT2ARetrieval | Audio recall | ndcg_at_10 | 0.009 | 0.002 | 0.018 | 0.006 |
+ | 1280 | ClothoT2ARetrieval | Audio recall | ndcg_at_10 | 0.269 | 0.128 | 0.443 | 0.216 |
+ | 768 | SprintDuplicateQuestions | Text continuity | main_score | 0.874 | - | - | - |
+ | 768 | STSBenchmark | Text continuity | main_score | 0.651 | - | - | - |
+ | 768 | Flickr30kT2IRetrieval | Image recall | ndcg_at_10 | 0.467 | 0.292 | 0.671 | 0.403 |
+ | 768 | Flickr30kI2TRetrieval | Image recall | ndcg_at_10 | 0.381 | 0.081 | 0.406 | 0.533 |
+ | 768 | CommonVoiceMini21T2ARetrieval | Audio recall | ndcg_at_10 | 0.028 | 0.006 | 0.062 | 0.018 |
+ | 768 | MACST2ARetrieval | Audio recall | ndcg_at_10 | 0.111 | 0.033 | 0.216 | 0.080 |
+ | 768 | UrbanSound8KT2ARetrieval | Audio recall | ndcg_at_10 | 0.009 | 0.002 | 0.018 | 0.006 |
+ | 768 | ClothoT2ARetrieval | Audio recall | ndcg_at_10 | 0.269 | 0.127 | 0.442 | 0.215 |
+ | 512 | SprintDuplicateQuestions | Text continuity | main_score | 0.874 | - | - | - |
+ | 512 | STSBenchmark | Text continuity | main_score | 0.651 | - | - | - |
+ | 512 | Flickr30kT2IRetrieval | Image recall | ndcg_at_10 | 0.468 | 0.295 | 0.670 | 0.405 |
+ | 512 | Flickr30kI2TRetrieval | Image recall | ndcg_at_10 | 0.381 | 0.082 | 0.405 | 0.535 |
+ | 512 | CommonVoiceMini21T2ARetrieval | Audio recall | ndcg_at_10 | 0.028 | 0.006 | 0.061 | 0.019 |
+ | 512 | MACST2ARetrieval | Audio recall | ndcg_at_10 | 0.113 | 0.033 | 0.221 | 0.080 |
+ | 512 | UrbanSound8KT2ARetrieval | Audio recall | ndcg_at_10 | 0.009 | 0.002 | 0.018 | 0.006 |
+ | 512 | ClothoT2ARetrieval | Audio recall | ndcg_at_10 | 0.268 | 0.125 | 0.443 | 0.214 |
+
+ ## Paired Comparisons
+
+ In each comparison the target is AIST-87M 768 and the baseline is the other run. Deltas are target minus baseline; "absolute" means in score units rather than percent, and the mean is taken over signed deltas.
+
+ ### AIST-87M 768 vs ES-AIST-81M 768
+
+ Mean signed absolute delta over 8 paired tasks: 0.041.
+
+ | Dim | Task | Baseline | Target | Absolute delta | Relative delta |
+ |---:|---|---:|---:|---:|---:|
+ | 768 | SprintDuplicateQuestions | 0.916 | 0.874 | -0.042 | -4.6% |
+ | 768 | STSBenchmark | 0.744 | 0.651 | -0.093 | -12.6% |
+ | 768 | Flickr30kT2IRetrieval | 0.347 | 0.467 | 0.120 | 34.7% |
+ | 768 | Flickr30kI2TRetrieval | 0.196 | 0.381 | 0.185 | 94.6% |
+ | 768 | CommonVoiceMini21T2ARetrieval | 0.024 | 0.028 | 0.004 | 17.4% |
+ | 768 | MACST2ARetrieval | 0.077 | 0.111 | 0.034 | 44.2% |
+ | 768 | UrbanSound8KT2ARetrieval | 0.007 | 0.009 | 0.002 | 21.6% |
+ | 768 | ClothoT2ARetrieval | 0.150 | 0.269 | 0.119 | 79.1% |
+
+ ### AIST-87M 768 vs Native mn20 audio 768
+
+ Mean signed absolute delta over 4 paired tasks: -0.011.
+
+ | Dim | Task | Baseline | Target | Absolute delta | Relative delta |
+ |---:|---|---:|---:|---:|---:|
+ | 768 | CommonVoiceMini21T2ARetrieval | 0.036 | 0.028 | -0.007 | -20.7% |
+ | 768 | MACST2ARetrieval | 0.127 | 0.111 | -0.016 | -12.5% |
+ | 768 | UrbanSound8KT2ARetrieval | 0.008 | 0.009 | 0.000 | 0.2% |
+ | 768 | ClothoT2ARetrieval | 0.289 | 0.269 | -0.020 | -7.0% |
+
+ ### AIST-87M 768 vs Dual-audio tower 768
+
+ Mean signed absolute delta over 6 paired tasks: -0.065.
+
+ | Dim | Task | Baseline | Target | Absolute delta | Relative delta |
+ |---:|---|---:|---:|---:|---:|
+ | 768 | SprintDuplicateQuestions | 0.953 | 0.874 | -0.079 | -8.3% |
+ | 768 | STSBenchmark | 0.804 | 0.651 | -0.153 | -19.1% |
+ | 768 | Flickr30kT2IRetrieval | 0.502 | 0.467 | -0.035 | -6.9% |
+ | 768 | Flickr30kI2TRetrieval | 0.466 | 0.381 | -0.086 | -18.4% |
+ | 768 | CommonVoiceMini21T2ARetrieval | 0.038 | 0.028 | -0.010 | -26.2% |
+ | 768 | ClothoT2ARetrieval | 0.295 | 0.269 | -0.026 | -8.9% |
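
Editorial note: the comparison tables in the report can be reproduced from the unrounded `baseline_primary` and `target_primary` values in the JSON. The sketch below does this for the "Native mn20 audio 768" comparison, assuming delta = target minus baseline and relative delta = delta divided by baseline, which is consistent with the reported numbers. The function name `compare` is hypothetical and does not come from the repository.

```python
from statistics import fmean

# (task, baseline_primary, target_primary), copied from the JSON comparison
# "Native mn20 audio 768" (baseline) vs "AIST-87M 768" (target).
pairs = [
    ("CommonVoiceMini21T2ARetrieval", 0.035825042735042736, 0.028395641025641027),
    ("MACST2ARetrieval", 0.12746, 0.11149),
    ("UrbanSound8KT2ARetrieval", 0.00849, 0.00851),
    ("ClothoT2ARetrieval", 0.28877, 0.26868),
]


def compare(pairs):
    """Return per-task (task, absolute_delta, relative_delta_pct) and the mean signed delta."""
    rows = []
    for task, baseline, target in pairs:
        absolute = target - baseline                   # in score units, sign preserved
        relative_pct = 100.0 * absolute / baseline     # percent of the baseline score
        rows.append((task, absolute, relative_pct))
    mean_delta = fmean([absolute for _, absolute, _ in rows])
    return rows, mean_delta


rows, mean_delta = compare(pairs)
for task, absolute, relative_pct in rows:
    print(f"{task}: {absolute:+.3f} ({relative_pct:+.1f}%)")
print(f"mean signed delta: {mean_delta:+.3f}")
```

Working from the unrounded values matters for rows such as UrbanSound8KT2ARetrieval, where the table shows 0.008 against 0.009 but the underlying difference is only 0.00002.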
aist_81m_raw_mn20_lora.yaml ADDED
@@ -0,0 +1,55 @@
+ # Raw AIST-81M baseline.
+ #
+ # Generic trimodal InfoNCE teacher recipe using the single native mn20_as LoRA
+ # audio path already used by ES-AIST-81M. This intentionally has no ES/entity
+ # signal layout or entity-specific corpus/loss. It is the full-data baseline;
+ # subset cache aliases are only for smoke tests.
+
+ dataset_dir: datasets
+ dataset_name: wordnet_2024_openai_validaudio
+ cache_dir: cache
+
+ encoder_name: mobilenetv4_conv_medium.e180_r384_in12k
+ encoder_dim: 1280
+ modality: trimodal
+ audio_encoder_dim: 1280
+ audio_finetune_last_n_stages: 0
+ projection_hidden_dim: 1920
+ projection_output_dim: 1280
+ projection_dropout: 0.30
+ signal_layout: raw
+
+ batch_size: 4096
+ max_epochs: 30
+ learning_rate: 0.0012
+ weight_decay: 0.0001
+ warmup_fraction: 0.05
+ grad_clip_norm: 1.0
+ gradient_accumulation_steps: 1
+
+ matryoshka_dims: [1280, 768, 512, 256, 128]
+ matryoshka_weights: [1.0, 1.0, 1.0, 1.0, 1.0]
+
+ loss_type: infonce
+ temperature: 0.07
+ temperature_min: 0.01
+ learn_temperature: true
+ hard_neg_k: 0
+ hard_neg_weight: 2.0
+ false_neg_threshold: 0.0
+
+ feature_noise_std: 0.0
+ feature_mask_ratio: 0.0
+ mixup_alpha: 0.0
+
+ num_workers: 8
+ pin_memory: true
+ prefetch_factor: 2
+ persistent_workers: true
+ mixed_precision: bf16
+
+ checkpoint_dir: checkpoints
+ save_every_n_epochs: 2
+ early_stopping_patience: 8
+ log_dir: runs
+ benchmark_eval_every_epochs: 1
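
Editorial note: the config lists `matryoshka_dims: [1280, 768, 512, 256, 128]`, and the report evaluates the same model at 1280, 768, and 512 dimensions. The sketch below shows the usual Matryoshka convention of keeping the leading coordinates and re-normalizing. This is an assumption about how the reduced-dimension embeddings are obtained; the commit view does not include the inference code, and `truncate_embedding` is a hypothetical helper.

```python
import math

# Copied from `matryoshka_dims` in the YAML config above.
MATRYOSHKA_DIMS = [1280, 768, 512, 256, 128]


def truncate_embedding(vector, dim):
    """Keep the first `dim` coordinates of `vector` and L2-normalize them.

    Assumes prefix truncation (the standard Matryoshka convention); the
    repository's own truncation code is not shown in this diff.
    """
    if dim > len(vector):
        raise ValueError(f"dim {dim} exceeds embedding size {len(vector)}")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # an all-zero prefix cannot be normalized
    return [x / norm for x in prefix]


# Toy 1280-d vector standing in for a model output.
full = [float((i % 7) - 3) for i in range(1280)]
nested = {dim: truncate_embedding(full, dim) for dim in MATRYOSHKA_DIMS}
print({dim: len(vec) for dim, vec in nested.items()})
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable at every size, which matters for retrieval metrics such as NDCG@10 that depend only on ranking.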
parameter_breakdown.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "text_encoder": 22861056,
+ "image_encoder": 8434512,
+ "audio_encoder_native_mn20_merged": 19886566,
+ "image_projection": 12306560,
+ "audio_projection": 12306560,
+ "text_projection": 11323520,
+ "total_exact_loaded_params": 87118774
+ }
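
Editorial note: the six component counts in `parameter_breakdown.json` add up exactly to `total_exact_loaded_params`, which is where the "87M" in the model name comes from. A minimal check, with the JSON inlined so the snippet is self-contained:

```python
import json

# Contents of parameter_breakdown.json as added in this commit.
breakdown = json.loads("""
{
  "text_encoder": 22861056,
  "image_encoder": 8434512,
  "audio_encoder_native_mn20_merged": 19886566,
  "image_projection": 12306560,
  "audio_projection": 12306560,
  "text_projection": 11323520,
  "total_exact_loaded_params": 87118774
}
""")

TOTAL_KEY = "total_exact_loaded_params"

# Sum every component except the reported total itself.
component_total = sum(count for name, count in breakdown.items() if name != TOTAL_KEY)
matches = component_total == breakdown[TOTAL_KEY]

print(component_total, matches)
for name, count in breakdown.items():
    if name != TOTAL_KEY:
        print(f"{name}: {100.0 * count / component_total:.1f}%")
```

The per-component percentages show that the three projection heads together account for a large share of the parameters, comparable to the encoders themselves.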