VisTrajQA → Eval Framework 数据适配指南

概述

convert_vistrajqa.py 将 VisTrajQA 的 sessions-*.jsonl 转换为 eval_framework 所需的 domain_a_v2 三文件格式，从而可以用 Mem-Gallery / A-Mem / MemoryOS 等 baseline 进行统一评测。

快速使用

# 转换所有数据源（text-only 模式，默认）
python -m eval_framework.datasets.convert_vistrajqa \
    --input data/generated/sessions-vab.jsonl \
           data/generated/sessions-eb-nav.jsonl \
           data/generated/sessions-arena.jsonl \
           data/generated/sessions-eb-alfred.jsonl \
           data/generated/sessions-infini-thor.jsonl \
    --output eval_framework/converted/all

# 只转换某个数据源
python -m eval_framework.datasets.convert_vistrajqa \
    --input data/generated/sessions-vab.jsonl \
    --output eval_framework/converted/vab

# multimodal 模式（image caption 作为 attachment 而非内联文本）
python -m eval_framework.datasets.convert_vistrajqa \
    --input data/generated/sessions-vab.jsonl \
    --output eval_framework/converted/vab-mm \
    --multimodal

# 转换后直接跑 eval
python -m eval_framework.cli \
    --dataset eval_framework/converted/all \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory

转换映射

数据结构映射

VisTrajQA session                →  eval_framework sample
├── session_id                   →  sample_id
├── step_plan[]                  →  sessions[].dialogue[] (user + assistant turns)
├── probes[]                     →  checkpoints[] (probe checkpoints)
├── post_trajectory_qa[]         →  checkpoints[-1] (post-trajectory checkpoint)
└── memory_points[]              →  gold memory points (S00 embedded + stage4)

Session 切分

一条 VisTrajQA 轨迹（如 30 步，4 个 probe 在 step 6/12/18/24）按 probe 边界切分为 5 个 session：

步骤 1-6   → S00   (probe 1 在此 session 结束后触发)
步骤 7-12  → S01   (probe 2)
步骤 13-18 → S02   (probe 3)
步骤 19-24 → S03   (probe 4)
步骤 25-30 → S04   (post-trajectory QA 在全部 session 结束后触发)

这样保证 eval_framework 的 runner 在每个 session 完成后恰好触发对应的 checkpoint。

Turn 构建

每个 step 生成 2 个 dialogue turn：

Turn	Role	内容
User turn	`user`	OBSERVATION + FEEDBACK + IMAGE caption（text-only 模式）
Assistant turn	`assistant`	THOUGHT + ACTION

text-only 模式（默认）：image caption 直接写入 user turn 文本，格式为 IMAGE: <caption>。适用于所有 text-only baseline。

multimodal 模式（--multimodal）：image caption 作为 attachment 附加，不写入正文。适用于 MMMemory 等多模态 baseline。

Memory Point 映射

VisTrajQA 字段	eval_framework 字段	说明
`mp_id`	`memory_id`	如 `mp_S04_1`
`content`	`memory_content`	一句话事实描述
`type`	`memory_type`	`event_memory` / `state_memory` / `spatial_memory`
`source`	`memory_source`	`primary` (文本) / `secondary` (推断)
`is_update`	`is_update`	是否为更新型记忆
`original_memories`	`original_memories`	被替换的旧内容列表
`importance`	`importance`	0.4 / 0.6 / 0.8 / 1.0
`update_type`	`update_type`	`status_update` / `location_change` / ...

Memory point 按 step_num 分配到对应 session：

S00 的 memory points 嵌入在 domain_a_v2.json 的 session 对象中
其他 session 的 memory points 写入 stage4_memory_points.jsonl

QA / Checkpoint 映射

Probe checkpoint：每个 probe 生成一个 checkpoint，covered_sessions 为该 probe 及之前所有 session。

Post-trajectory checkpoint：覆盖全部 session，包含 9 类 QA。

VisTrajQA QA type	eval_framework question_type	缩写
FR	factual_recall	FR
DU	dynamic_update	DU
MB	memory_boundary	MB
TR	temporal_reasoning	TR
KR	knowledge_reasoning	KR
VFR	visual_factual_recall	VFR
VS	visual_search	VS
VU	visual_update	VU
CMR	cross_modal_reasoning	CMR

Evidence 字段从 ["mp_S04_1"]（字符串列表）转换为 [{"memory_id": "mp_S04_1"}]（字典列表）以匹配 eval_framework 格式。

输出文件

eval_framework/converted/all/
├── domain_a_v2.json               # 主对话数据 (JSON array)
├── stage4_memory_points.jsonl     # 每 session 的 gold memory points
└── stage4b_qa_checkpoints.jsonl   # checkpoint QA 题目

评测维度与 VisTrajQA 的对应

eval_framework 维度	测量内容	对应 VisTrajQA 特性
Memory Recall	记忆系统存储了多少 gold points	直接对应，所有 MP 类型
Memory Correctness	存储的记忆是否正确	检测 hallucination
Update Handling	更新型记忆是否正确替换	对应 `is_update=true` 的 MP
Interference Rejection	干扰信息是否被过滤	VisTrajQA 无 interference 标注，此维度为空
QA Accuracy	问答正确率	对应 9 类 QA (FR/DU/MB/TR/KR/VFR/VS/VU/CMR)
Evidence Coverage	回答引用了多少 gold evidence	对应 evidence memory_point_ids

注意：VisTrajQA 没有 interference（干扰信息）标注，因此 eval_framework 的 Interference Rejection 维度在评测结果中会为空值。MB（Memory Boundary）类型的题目在 QA 层面测试了类似能力。

注意事项

text-only baseline（FU/ST/LT/GA/MG/RF）：使用默认 --text-only，image caption 内联到用户消息文本中
multimodal baseline（MM/MMFU/NG/AUGUSTUS）：使用 --multimodal，caption 作为 attachment
caption 质量：text-only baseline 对图像的理解完全依赖 caption 质量。如果 image_caption 为空，用户 turn 中不会有任何视觉信息
Arena 数据：observation 恒为空字符串，视觉信息完全来自 image_caption
转换器会自动验证：运行后会调用 load_domain_a_v2_academic 检验输出是否合法