anonymous-penguin committed on
Commit 9c60174 · verified · 1 Parent(s): 2455d57

Initial code release


Code for memory-retrieval experiments.

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. README.md +163 -0
  2. baselines/MemoChat/LICENSE +21 -0
  3. baselines/MemoChat/README.md +47 -0
  4. baselines/MemoChat/code/codes/api/gpt_2k.py +90 -0
  5. baselines/MemoChat/code/codes/api/gpt_memochat.py +189 -0
  6. baselines/MemoChat/code/codes/api/llm_judge.py +96 -0
  7. baselines/MemoChat/code/codes/eval/eval_instruction_tuning_tasks.py +169 -0
  8. baselines/MemoChat/code/codes/eval/get_model_infer_memochat.py +291 -0
  9. baselines/MemoChat/code/codes/eval/get_model_infer_simple.py +150 -0
  10. baselines/MemoChat/code/codes/train/data_preprocess.py +118 -0
  11. baselines/MemoChat/code/codes/train/train.py +150 -0
  12. baselines/MemoChat/code/configs/ds_config_13b.json +53 -0
  13. baselines/MemoChat/code/configs/ds_config_33b.json +57 -0
  14. baselines/MemoChat/code/configs/ds_config_3b.json +39 -0
  15. baselines/MemoChat/code/configs/ds_config_7b.json +49 -0
  16. baselines/MemoChat/code/scripts/llm_judge.sh +35 -0
  17. baselines/MemoChat/code/scripts/memochat.sh +34 -0
  18. baselines/MemoChat/code/scripts/memochat_gpt.sh +18 -0
  19. baselines/MemoChat/code/scripts/tuning.sh +110 -0
  20. baselines/MemoChat/core_requirement.txt +13 -0
  21. baselines/MemoChat/run_memochat_baseline.py +634 -0
  22. baselines/raptor/LICENSE.txt +21 -0
  23. baselines/raptor/README.md +204 -0
  24. baselines/raptor/raptor/EmbeddingModels.py +37 -0
  25. baselines/raptor/raptor/FaissRetriever.py +201 -0
  26. baselines/raptor/raptor/QAModels.py +185 -0
  27. baselines/raptor/raptor/RetrievalAugmentation.py +306 -0
  28. baselines/raptor/raptor/Retrievers.py +8 -0
  29. baselines/raptor/raptor/SummarizationModels.py +74 -0
  30. baselines/raptor/raptor/__init__.py +16 -0
  31. baselines/raptor/raptor/cluster_tree_builder.py +151 -0
  32. baselines/raptor/raptor/cluster_utils.py +185 -0
  33. baselines/raptor/raptor/tree_builder.py +369 -0
  34. baselines/raptor/raptor/tree_retriever.py +327 -0
  35. baselines/raptor/raptor/tree_structures.py +28 -0
  36. baselines/raptor/raptor/utils.py +208 -0
  37. baselines/raptor/requirements.txt +11 -0
  38. baselines/raptor/run_raptor_baseline.py +511 -0
  39. baselines/read-agent/read_agent_demo.ipynb +976 -0
  40. baselines/read-agent/run_readagent_baseline.py +424 -0
  41. evaluate_qa.py +916 -0
  42. main.py +1717 -0
  43. memory/__init__.py +2 -0
  44. memory/episodic_store.py +62 -0
  45. memory/semantic_store.py +87 -0
  46. model_zoo.py +31 -0
  47. prompts/agentic_retrieval_prompt.txt +226 -0
  48. prompts/agentic_retrieval_prompt_wo_profile.txt +203 -0
  49. prompts/keyword_search_prompt.txt +31 -0
  50. prompts/read_and_extract_prompt.txt +176 -0
README.md ADDED
@@ -0,0 +1,163 @@
# Long-Term Memory Retrieval Benchmark

Code release for the experiments described in the accompanying paper:
- **Hierarchical memory** organization (User Profile / Semantic / Episodic).
- **Plan-Act-Read agentic retrieval** that interleaves keyword, time-filter, and embedding search.
- **Flat / dense / oracle baselines** for comparison.

## Repository layout

```
.
├── main.py                      # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py               # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                 # Model registry
├── prompts/                     # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                      # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                  # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/              # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py          # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py                 # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py                  # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt
```

The benchmark dataset (`evolv_mem_v5.json`) is released separately; place it under `dataset/` along with the supporting files referenced by `main.py` (`all_sessions.json`, `all_session_summary.json`, etc.).

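For reference, the expected layout is roughly the following (a sketch inferred from the file names referenced above; your release may contain additional support files):

```
dataset/
├── evolv_mem_v5.json          # benchmark questions and references
├── all_sessions.json          # full conversation sessions (the memory corpus)
└── all_session_summary.json   # per-session summaries
```
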
## Setup

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

### API keys

The pipeline calls LLMs through four optional providers; set whichever you plan to use:

| Provider                         | Env var             | Flag         |
|----------------------------------|---------------------|--------------|
| OpenAI-compatible inference API  | `NV_API_KEY`        | `--nvidia`   |
| OpenAI-compatible LiteLLM proxy  | `LITELLM_API_KEY`   | `--tritonai` |
| Direct Anthropic API             | `ANTHROPIC_API_KEY` | (default)    |
| Azure OpenAI                     | `AZURE_OPENAI_KEY`  | (default)    |

Each `--<flag>` selects which client the pipeline uses; entries in `model_zoo.py` are tagged accordingly.

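For example, to run retrieval through the OpenAI-compatible inference API while keeping the default clients available, export the corresponding keys before launching (placeholder values; only set the providers you actually use):

```bash
export NV_API_KEY="sk-..."             # used with --nvidia
export ANTHROPIC_API_KEY="sk-ant-..."  # direct Anthropic access (default client)
export AZURE_OPENAI_KEY="..."          # Azure OpenAI (default client)
```
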
## Quick start

### 1. Build the per-question retrieval cache (one-time)

```bash
python scripts/build_retrieval_cache.py \
    --dataset dataset/evolv_mem_v5.json \
    --all_sessions dataset/all_sessions.json \
    --out_dir response_cache/retrieval/
```

### 2. Shard the dataset for parallel runs

```bash
python scripts/make_v5_shards.py \
    --dataset dataset/evolv_mem_v5.json \
    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
    --out_dir output/shards/v5_run_nchunks10/ \
    --num_shards 8
```

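After sharding, the output directory contains one dataset shard and one retrieval-cache shard per split, which step 3 consumes (a sketch assuming the default naming with `--num_shards 8`):

```
output/shards/v5_run_nchunks10/
├── dataset/shard_00.json    ... shard_07.json
└── ret_cache/shard_00.jsonl ... shard_07.jsonl
```
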
### 3. Run the QA pipeline

Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):

```bash
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic \
    --mode embed
```

Agentic retrieval over hierarchical memory:

```bash
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --hier_v2 --hier_union \
    --mode agent
```

To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit `scripts/slurm/example_dense_retrieval.slurm` or `scripts/slurm/example_agentic_retrieval.slurm`.

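Without SLURM, the same sweep is just a loop over the shards; a minimal sketch for the dense configuration above, assuming the 8-shard layout from step 2 (shards are independent, so they can also be run concurrently):

```bash
for i in 0{0..7}; do
  # per-shard retrieval cache, as in the single-shard example above
  export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_${i}.jsonl"
  python main.py \
      --in_file output/shards/v5_run_nchunks10/dataset/shard_${i}.json \
      --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_${i}.jsonl \
      --model_name gpt-5.5 --top_k 20 --n_chunks 10 --nvidia \
      --all_sessions_file dataset/all_sessions.json \
      --no_semantic --mode embed
done
```
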
### 4. Merge shards and evaluate

```bash
python scripts/merge_jsonl_by_dataset_order.py \
    --dataset dataset/evolv_mem_v5.json \
    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
    --out_file output/v5_run_dense_gte_topk20.jsonl

python evaluate_qa.py \
    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
    --ref_file dataset/evolv_mem_v5.json \
    --eval_model_name gpt-5.2 \
    --eval_mode both \
    --nvidia
```

The evaluator caches an atomic rubric for each question (`<dataset>.atomic-v1.rubric.json`) so that subsequent runs reuse it.

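To force the rubric to be rebuilt (for example after editing the reference answers), delete the cached file. Assuming it is written alongside the dataset following the pattern above (the exact expansion of `<dataset>` may differ), something like:

```bash
rm dataset/evolv_mem_v5*.atomic-v1.rubric.json
```
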
## Pipeline modes

`main.py --mode` selects how a question is answered:

- `embed`: top-k flat dense retrieval (GTE 7B), then a single LLM call to answer.
- `keyword`: LLM-generated keywords + lexical matching, then answer (see the sketch after this list).
- `agent`: Plan-Act-Read loop. Combine `--hier_v2` (semantic-summary stage) and `--hier_union` (union with flat top-k) for the hierarchical-memory variant.

`--no_semantic` disables the semantic-summary memory layer (flat memory).

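A keyword-mode run uses the same interface as the quick-start commands; a minimal sketch mirroring the dense-retrieval command (the `keyword/` output directory is just an illustrative name):

```bash
python main.py \
    --in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/keyword/part_00.jsonl \
    --model_name gpt-5.5 --top_k 20 --n_chunks 10 --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic --mode keyword
```
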
## Baselines

The three external baselines (MemoChat, RAPTOR, ReadAgent) live under `baselines/`, together with our thin wrappers (`run_<baseline>_baseline.py`). Each baseline's upstream LICENSE is preserved.

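Each wrapper is invoked directly; the flags differ per baseline, so inspect each script's CLI first (a sketch, assuming the wrappers expose a standard `--help` interface):

```bash
python baselines/MemoChat/run_memochat_baseline.py --help
python baselines/raptor/run_raptor_baseline.py --help
python baselines/read-agent/run_readagent_baseline.py --help
```
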
## License

This repository is released under the license stated in the corresponding LICENSE file (TBD prior to release). Upstream baselines retain their original licenses.
baselines/MemoChat/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Lu junru

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
baselines/MemoChat/README.md ADDED
@@ -0,0 +1,47 @@
# MemoChat
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-domain Conversation

## Environment
We provide [core_requirement.txt](core_requirement.txt) for your convenience.

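A minimal environment setup (a sketch; the pinned versions live in the file itself):

```
pip install -r core_requirement.txt
```
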
## Model Weights
The initial models we used are [fastchat models (v1.3)](https://lmsys.org/blog/2023-03-30-vicuna/). Below are the model weights of our fine-tuned versions. Our models are built upon FastChat models, so we adopt the same `cc-by-nc-sa-4.0` license.

| Name | Share Link |
| --- | --- |
| MemoChat-Fastchat-T5-3B | https://huggingface.co/Junrulu/MemoChat-Fastchat-T5-3B |
| MemoChat-Vicuna-7B | https://huggingface.co/Junrulu/MemoChat-Vicuna-7B |
| MemoChat-Vicuna-13B | https://huggingface.co/Junrulu/MemoChat-Vicuna-13B |
| MemoChat-Vicuna-33B | https://huggingface.co/Junrulu/MemoChat-Vicuna-33B |

## Workflow
`RootPath` is the absolute path of this repo. Download the initial models and put them in the [model](model) folder.
### Instruction Tuning
```
Run `bash code/scripts/tuning.sh RootPath`. Intermediate evaluations are included in this script as well.
```

### MemoChat Testing
```
Run `bash code/scripts/memochat.sh RootPath` for pipeline testing with fine-tuned models.
Run `bash code/scripts/memochat_gpt.sh RootPath` for pipeline testing with the GPT-3.5 API.
Run `bash code/scripts/llm_judge.sh RootPath` for the GPT-4 judge (an OpenAI API key is required).
```

### Our Results
We provide our prediction results [here](https://drive.google.com/file/d/1jGNhT3iPXEA8B2fXHZ2Einy1AMre-8xB/view?usp=sharing).

## Acknowledgement
We thank the [Vicuna project](https://github.com/lm-sys/FastChat/tree/main) for their great work.

## Citation
```
@misc{lu2023memochat,
      title={MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation},
      author={Junru Lu and Siyu An and Mingbao Lin and Gabriele Pergola and Yulan He and Di Yin and Xing Sun and Yunsheng Wu},
      year={2023},
      eprint={2308.08239},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
baselines/MemoChat/code/codes/api/gpt_2k.py ADDED
@@ -0,0 +1,90 @@
1
+ import re
2
+ import json
3
+ import openai
4
+ import time
5
+ import sys
6
+ import tiktoken
7
+
8
+ input_data = sys.argv[1]
9
+ openai_modelid = sys.argv[2]
10
+ openai.api_key = sys.argv[3]
11
+ output_path = sys.argv[4]
12
+ prompt_path = sys.argv[5]
13
+ encoding = tiktoken.encoding_for_model(openai_modelid)
14
+
15
+ q_pre = ""
16
+ qa_link = ""
17
+ MaxLen = 2048
18
+ TarLen = 512
19
+ TaskTarLen = {
20
+ "chatting_dialogsum": MaxLen,
21
+ "chatting_alpacagpt4": MaxLen,
22
+ "writing_topiocqa": TarLen // 2,
23
+ "writing_dialogsum": TarLen,
24
+ "retrieval_dialogsum": 32,
25
+ "retrieval_topiocqa": 32
26
+ }
27
+
28
+ prompts = json.load(open(prompt_path, "r"))
29
+
30
+ def normalize_chatting_outputs(model_outputs):
31
+ def white_space_fix(text):
32
+ lines = text.split("\n")
33
+ result = []
34
+ for line in lines:
35
+ result.append(' '.join(line.split()))
36
+ output = '\n'.join(result)
37
+ return output
38
+ return white_space_fix(model_outputs)
39
+
40
+ def gen_model_output(input_qs, task_type):
41
+ input_qs_token_l = len(encoding.encode(input_qs)) # token num
42
+ input_qs_word_l = len(input_qs.split(" ")) # word num
43
+ qs_w_t_ratio = input_qs_word_l / input_qs_token_l
44
+ max_word_num = int((MaxLen - TarLen) * qs_w_t_ratio)
45
+ input_qs = " ".join(input_qs.split(" ")[-max_word_num:])
46
+ target_len = TaskTarLen[task_type]
47
+ messages = [{"role": "system", "content": input_qs}]
48
+ for _ in range(5):
49
+ try:
50
+ chat = openai.ChatCompletion.create(
51
+ model=openai_modelid, messages=messages, max_tokens=target_len, temperature=0.2
52
+ )
53
+ break
54
+ except:
55
+ time.sleep(5)
56
+ model_outputs = chat.choices[0].message.content
57
+ return model_outputs
58
+
59
+ def run_eval():
60
+ data = json.load(open(input_data, "r"))
61
+ output_data = []
62
+ for d in data:
63
+ print("=" * 20 + "start of question {}".format(d["id"]) + "=" * 20)
64
+ new_d = d
65
+
66
+ history = []
67
+ for l_i in range(len(new_d["conversations"])):
68
+ if l_i % 2 == 1:
69
+ bot_thinking = {"retrieval": "", "summarization": ""}
70
+ print("=" * 20 + "start of turn {}".format(l_i // 2 + 1) + "=" * 20)
71
+ user = "user: " + new_d["conversations"][l_i - 1]["value"]
72
+
73
+ system_insturction = prompts["chatting"]["system"]
74
+ task_instruction = prompts["chatting"]["instruction"]
75
+ task_case = "```\nRecent Dialogs:\n" + " ### ".join([hrd.replace("\n", " ") for hrd in history]) + "\n```\n\nUser Input:\n" + user + " ### bot: "
76
+ qs = system_insturction + task_case + task_instruction
77
+ print(qs + "\n\n")
78
+ outputs = gen_model_output(qs, "chatting_dialogsum")
79
+ outputs = normalize_chatting_outputs(outputs)
80
+ history += [user, "bot: " + outputs]
81
+ print("bot: " + outputs + "\n")
82
+ print("=" * 20 + "end of turn {}".format(l_i // 2 + 1) + "=" * 20)
83
+ new_d["conversations"][l_i]["thinking"] = json.dumps(bot_thinking)
84
+ new_d["conversations"][l_i]["value"] = outputs
85
+
86
+ output_data.append(new_d)
87
+ json.dump(output_data, open(output_path, "w"), indent=2)
88
+
89
+ if __name__ == "__main__":
90
+ run_eval()
baselines/MemoChat/code/codes/api/gpt_memochat.py ADDED
@@ -0,0 +1,189 @@
1
+ import re
2
+ import json
3
+ import openai
4
+ import time
5
+ import sys
6
+ import tiktoken
7
+ from random import sample
8
+
9
+ input_data = sys.argv[1]
10
+ openai_modelid = sys.argv[2]
11
+ openai.api_key = sys.argv[3]
12
+ output_path = sys.argv[4]
13
+ prompt_path = sys.argv[5]
14
+ encoding = tiktoken.encoding_for_model(openai_modelid)
15
+
16
+ q_pre = ""
17
+ qa_link = ""
18
+ MaxLen = 2048
19
+ TarLen = 512
20
+ TaskTarLen = {
21
+ "chatting_dialogsum": MaxLen,
22
+ "chatting_alpacagpt4": MaxLen,
23
+ "writing_topiocqa": TarLen // 2,
24
+ "writing_dialogsum": TarLen,
25
+ "retrieval_dialogsum": 32,
26
+ "retrieval_topiocqa": 32
27
+ }
28
+
29
+ prompts = json.load(open(prompt_path, "r"))
30
+
31
+ def normalize_model_outputs(model_text):
32
+ extracted_elements = [re.sub(r'\s+', ' ', mt.replace('"', '').replace("'", "")) for mt in re.findall(r"'[^']*'|\"[^\"]*\"|\d+", model_text)]
33
+ model_outputs = []
34
+ ti = 0
35
+ while ti + 7 < len(extracted_elements):
36
+ if extracted_elements[ti] == "topic" and extracted_elements[ti + 2] == "summary" and extracted_elements[ti + 4] == "start" and extracted_elements[ti + 6] == "end":
37
+ try:
38
+ model_outputs.append({"topic": extracted_elements[ti + 1], "summary": extracted_elements[ti + 3], "start": int(extracted_elements[ti + 5]), "end": int(extracted_elements[ti + 7])})
39
+ except:
40
+ pass
41
+ ti += 1
42
+ return model_outputs
43
+
44
+ def normalize_chatting_outputs(model_outputs):
45
+ def white_space_fix(text):
46
+ lines = text.split("\n")
47
+ result = []
48
+ for line in lines:
49
+ result.append(' '.join(line.split()))
50
+ output = '\n'.join(result)
51
+ return output
52
+ return white_space_fix(model_outputs)
53
+
54
+ def gen_model_output(input_qs, task_type):
55
+ input_qs_token_l = len(encoding.encode(input_qs)) # token num
56
+ input_qs_word_l = len(input_qs.split(" ")) # word num
57
+ qs_w_t_ratio = input_qs_word_l / input_qs_token_l
58
+ max_word_num = int((MaxLen - TarLen) * qs_w_t_ratio)
59
+ input_qs = " ".join(input_qs.split(" ")[-max_word_num:])
60
+ target_len = TaskTarLen[task_type]
61
+ messages = [{"role": "system", "content": input_qs}]
62
+ for _ in range(5):
63
+ try:
64
+ chat = openai.ChatCompletion.create(
65
+ model=openai_modelid, messages=messages, max_tokens=target_len, temperature=0.2
66
+ )
67
+ break
68
+ except:
69
+ time.sleep(5)
70
+ model_outputs = chat.choices[0].message.content
71
+ return model_outputs
72
+
73
+ def run_summary(history, memo, bot_thinking):
74
+ system_insturction = prompts["writing_dialogsum"]["system"]
75
+ task_instruction = prompts["writing_dialogsum"]["instruction"]
76
+ history_log = "\n\n```\nTask Conversation:\n" + "\n".join(["(line {}) {}".format(h_i + 1, h.replace("\n", " ")) for h_i, h in enumerate(history["Recent Dialogs"][2:])])
77
+ qs = q_pre + system_insturction.replace("LINE", str(len(history["Recent Dialogs"]) - 2)) + history_log + "\n```" + task_instruction.replace("LINE", str(len(history["Recent Dialogs"]) - 2)) + qa_link
78
+ # print("-" * 20 + "summarizing" + "-" * 20)
79
+ # print(qs)
80
+ # print("-" * 20 + "summarizing" + "-" * 20)
81
+ sum_history = gen_model_output(qs, "writing_dialogsum")
82
+ sum_history = normalize_model_outputs(sum_history)
83
+ # print("-" * 20 + "summarization" + "-" * 20)
84
+ # print(sum_history)
85
+ # print("-" * 20 + "summarization" + "-" * 20)
86
+ for s in sum_history:
87
+ memo[s["topic"]] = memo.get(s["topic"], []) + [{"summary": s["summary"], "dialogs": history["Recent Dialogs"][2:][(s["start"] - 1):s["end"]]}]
88
+ if len(sum_history) == 0:
89
+ si_0, si_1 = sample(list(range(len(history["Recent Dialogs"][2:]))), 2)
90
+ memo["NOTO"].append({"summary": "Partial dialogs about: {} or {}.".format(history["Recent Dialogs"][2:][si_0], history["Recent Dialogs"][2:][si_1]), "dialogs": history["Recent Dialogs"][2:]})
91
+ history["Recent Dialogs"] = history["Recent Dialogs"][-2:]
92
+ bot_thinking["summarization"] = {"input": qs, "output": sum_history}
93
+ return history, memo, bot_thinking
94
+
95
+ def run_retrieval(history, memo, bot_thinking):
96
+ topics = []
97
+ for k, v in memo.items():
98
+ for vv in v:
99
+ topics.append((k, vv["summary"], vv["dialogs"]))
100
+ system_insturction = prompts["retrieval"]["system"]
101
+ task_instruction = prompts["retrieval"]["instruction"]
102
+ task_case = "```\nQuery Sentence:\n" + history["User Input"][6:] + "\nTopic Options:\n" + \
103
+ "\n".join(["({}) {}".format(v_i + 1, v[0] + ". " + v[1]) for v_i, v in enumerate(topics)]) + "\n```"
104
+ qs = q_pre + system_insturction.replace("OPTION", str(len(topics))) + task_case + task_instruction.replace("OPTION", str(len(topics))) + qa_link
105
+ # print("-" * 20 + "retrieving" + "-" * 20)
106
+ # print(qs)
107
+ # print("-" * 20 + "retrieving" + "-" * 20)
108
+ outputs = gen_model_output(qs, "retrieval_dialogsum")
109
+ # print("-" * 20 + "retrieval" + "-" * 20)
110
+ # print(outputs)
111
+ # print("-" * 20 + "retrieval" + "-" * 20)
112
+ outputs = outputs.split("#")
113
+ chosen_topics = []
114
+ for output in outputs:
115
+ try:
116
+ index_ = int(output) - 1
117
+ except:
118
+ continue
119
+ if index_ < len(topics) and "NOTO" not in topics[index_]:
120
+ chosen_topics.append(topics[index_])
121
+ if len(chosen_topics) > 0:
122
+ history["Related Topics"] = [ct[0] for ct in chosen_topics]
123
+ history["Related Summaries"] = [ct[1] for ct in chosen_topics]
124
+ history["Related Dialogs"] = [" ### ".join(ct[2]) for ct in chosen_topics]
125
+ else:
126
+ history["Related Topics"] = []
127
+ history["Related Summaries"] = []
128
+ history["Related Dialogs"] = []
129
+ bot_thinking["retrieval"] = {"input": qs, "output": outputs}
130
+ return history, bot_thinking
131
+
132
+ def run_eval():
133
+ data = json.load(open(input_data, "r"))
134
+ output_data = []
135
+ for d in data:
136
+ print("=" * 20 + "start of question {}".format(d["id"]) + "=" * 20)
137
+ new_d = d
138
+
139
+ history = {
140
+ "Recent Dialogs": ["user: Hi!", "bot: Hi! How can I help you today?"],
141
+ "Related Topics": [],
142
+ "Related Summaries": [],
143
+ "Related Dialogs": [],
144
+ "User Input": "",
145
+ }
146
+ memo = {
147
+ "NOTO": [{"summary": "None of the others.", "dialogs": []}]
148
+ }
149
+
150
+ for l_i in range(len(new_d["conversations"])):
151
+ if l_i % 2 == 1:
152
+ bot_thinking = {"retrieval": "", "summarization": ""}
153
+ print("=" * 20 + "start of turn {}".format(l_i // 2 + 1) + "=" * 20)
154
+ user = "user: " + new_d["conversations"][l_i - 1]["value"]
155
+ print(user + "\n\n")
156
+
157
+ # create summary if recent dialogs exceed threshold
158
+ if len(" ### ".join(history["Recent Dialogs"]).split(" ")) > (MaxLen // 2) or len(history["Recent Dialogs"]) >= 10:
159
+ history, memo, bot_thinking = run_summary(history, memo, bot_thinking)
160
+
161
+ # retrieve most related topics for every new user input
162
+ history["User Input"] = user
163
+ if len(memo.keys()) > 1:
164
+ history, bot_thinking = run_retrieval(history, memo, bot_thinking)
165
+
166
+ # generate bot response
167
+ system_insturction = prompts["chatting"]["system"]
168
+ task_instruction = prompts["chatting"]["instruction"]
169
+ task_case = "```\nRelated Evidences:\n" + "\n".join(["({}) {}".format(r_tsd_i + 1, {
170
+ "Related Topics": history["Related Topics"][r_tsd_i],
171
+ "Related Summaries": history["Related Summaries"][r_tsd_i],
172
+ "Related Dialogs": history["Related Dialogs"][r_tsd_i]
173
+ }) for r_tsd_i in range(len(history["Related Topics"]))]) + "\n\nRecent Dialogs:\n" + \
174
+ " ### ".join([hrd.replace("\n", " ") for hrd in history["Recent Dialogs"]]) + "\n```\n\nUser Input:\n" + history["User Input"] + " ### bot: "
175
+ qs = q_pre + system_insturction + task_case + task_instruction + qa_link
176
+ outputs = gen_model_output(qs, "chatting_dialogsum")
177
+ outputs = normalize_chatting_outputs(outputs)
178
+ history["Recent Dialogs"] += [user, "bot: " + outputs]
179
+ print("bot: " + outputs + "\n")
180
+ print("=" * 20 + "end of turn {}".format(l_i // 2 + 1) + "=" * 20)
181
+ # print("\n\n\n\n")
182
+ new_d["conversations"][l_i]["thinking"] = json.dumps(bot_thinking)
183
+ new_d["conversations"][l_i]["value"] = outputs
184
+
185
+ output_data.append(new_d)
186
+ json.dump(output_data, open(output_path, "w"), indent=2)
187
+
188
+ if __name__ == "__main__":
189
+ run_eval()
baselines/MemoChat/code/codes/api/llm_judge.py ADDED
@@ -0,0 +1,96 @@
1
+ import sys
2
+ import os
3
+ import json
4
+ import re
5
+ import openai
6
+ import time
7
+ import tiktoken
8
+
9
+ input_data = sys.argv[1]
10
+ openai_modelid = sys.argv[2]
11
+ openai.api_key = sys.argv[3]
12
+ output_path = sys.argv[4]
13
+ prompt_path = sys.argv[5]
14
+ encoding = tiktoken.encoding_for_model(openai_modelid)
15
+
16
+ prompts = json.load(open(prompt_path, "r"))
17
+ judge_prompt_raw = prompts["judge"]["system"]
18
+
19
+ def gen_model_output(input_qs):
20
+ input_qs_token_l = len(encoding.encode(input_qs)) # token num
21
+ input_qs_word_l = len(input_qs.split(" ")) # word num
22
+ qs_w_t_ratio = input_qs_word_l / input_qs_token_l
23
+ max_word_num = int(4096 * qs_w_t_ratio)
24
+ input_qs = " ".join(input_qs.split(" ")[-max_word_num:])
25
+ messages = [{"role": "system", "content": input_qs}]
26
+ chat = None
27
+ for _ in range(5):
28
+ try:
29
+ chat = openai.ChatCompletion.create(
30
+ model=openai_modelid, messages=messages
31
+ )
32
+ break
33
+ except:
34
+ time.sleep(5)
35
+ if chat is None:
36
+ return "Cannot generate output."
37
+ model_outputs = chat.choices[0].message.content
38
+ return model_outputs
39
+
40
+ data = json.load(open(input_data, "r"))
41
+
42
+ # do llm judge
43
+ output_ratings = []
44
+ for d in data:
45
+ print("=" * 20 + "Processing: " + d["id"] + "=" * 20)
46
+ judge_conversation = []
47
+ d_conversations = d['conversations']
48
+ last_q = d_conversations[-2]
49
+ turn_infos = last_q["turn-info"].split("-")
50
+ r_turns = [turn_infos[0] + "-" + turn_infos[1]]
51
+ if len(turn_infos) == 5:
52
+ r_turns.append(turn_infos[2] + "-" + turn_infos[3])
53
+ for l_i in range(len(d_conversations) // 2 - 1):
54
+ if d_conversations[l_i * 2]["turn-info"][:-2] in r_turns:
55
+ judge_conversation.append("user: " + d_conversations[l_i * 2]["value"])
56
+ judge_conversation.append("bot: " + d_conversations[l_i * 2 + 1]["value"])
57
+ judge_prompt = judge_prompt_raw.replace("RCH_0", "\n".join(judge_conversation)).replace("UQ_1", "user: " + last_q["value"]).replace("BR_2", "bot: " + d_conversations[-1]["value"])
58
+ print(judge_prompt)
59
+ print('-' * 20)
60
+ outputs = gen_model_output(judge_prompt)
61
+ print(outputs)
62
+ print("=" * 20 + "Processed: " + d["id"] + "=" * 20)
63
+ match = re.search(r'\[\[(\d+)\]\]', outputs)
64
+ try:
65
+ rating = int(match.group(1))
66
+ except:
67
+ rating = None
68
+ output_ratings.append({
69
+ "id": d["id"],
70
+ "type": d["type"],
71
+ "judge_prompt": judge_prompt,
72
+ "evaluation": outputs,
73
+ "rating": rating
74
+ })
75
+ json.dump(output_ratings, open(output_path, "w"), indent=2)
76
+
77
+ # compute score
78
+ count = {
79
+ "continuation": [],
80
+ "retrospection": [],
81
+ "conjunction": []
82
+ }
83
+ for d in output_ratings:
84
+ if d["type"] == "continuation":
85
+ count["continuation"].append(d["rating"])
86
+ elif d["type"] == "retrospection":
87
+ count["retrospection"].append(d["rating"])
88
+ elif d["type"] == "conjunction":
89
+ count["conjunction"].append(d["rating"])
90
+ print("Retrospection Score: {}, Continuation Score: {}, Conjunction Score: {}, Overall Score: {} of file {}".format(
91
+ round(sum(count["retrospection"]) / len(count["retrospection"]), 2),
92
+ round(sum(count["continuation"]) / len(count["continuation"]), 2),
93
+ round(sum(count["conjunction"]) / len(count["conjunction"]), 2),
94
+ round(sum(count["continuation"] + count["retrospection"] + count["conjunction"]) / len(count["continuation"] + count["retrospection"] + count["conjunction"]), 2),
95
+ input_data
96
+ ))
baselines/MemoChat/code/codes/eval/eval_instruction_tuning_tasks.py ADDED
@@ -0,0 +1,169 @@
1
+ import json
2
+ import re
3
+ import string
4
+ import sys
5
+ import random
6
+ from argparse import ArgumentParser
7
+ from collections import Counter
8
+
9
+ from evaluate import load
10
+ bertscore = load("bertscore")
11
+
12
+ refer_file_path = sys.argv[1]
13
+ input_file_path = sys.argv[2]
14
+
15
+ conversations = open(refer_file_path, "r").readlines()
16
+ conversations_dict = {}
17
+ for conversation in conversations:
18
+ conv_l = json.loads(conversation.strip())
19
+ conversations_dict[conv_l["question_id"]] = (conv_l["text"], conv_l["answer"], conv_l["type"])
20
+
21
+ class Metrics():
22
+ def __init__(self):
23
+ pass
24
+
25
+ def __normalize_text(self, s_text):
26
+ """Lower text and remove punctuation, articles and extra whitespace."""
27
+ def remove_articles(text):
28
+ regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
29
+ return re.sub(regex, ' ', text)
30
+
31
+ def white_space_fix(text):
32
+ return ' '.join(text.split())
33
+
34
+ def remove_punc(text):
35
+ exclude = set(string.punctuation)
36
+ return ''.join(ch for ch in text if ch not in exclude)
37
+
38
+ def lower(text):
39
+ return text.lower()
40
+
41
+ return white_space_fix(remove_articles(remove_punc(lower(s_text))))
42
+
43
+ def __normalize_model_outputs(self, model_text, type_category):
44
+ """post process of memo writing outputs"""
45
+ extracted_elements = [re.sub(r'\s+', ' ', mt.replace('"', '').replace("'", "")) for mt in re.findall(r"'[^']*'|\"[^\"]*\"|\d+", model_text)]
46
+ model_outputs = []
47
+ ti = 0
48
+
49
+ if "dialogsum" in type_category:
50
+ while ti + 7 < len(extracted_elements):
51
+ if extracted_elements[ti] == "topic" and extracted_elements[ti + 2] == "summary" and extracted_elements[ti + 4] == "start" and extracted_elements[ti + 6] == "end":
52
+ try:
53
+ model_outputs.append({"topic": extracted_elements[ti + 1], "summary": extracted_elements[ti + 3], "start": int(extracted_elements[ti + 5]), "end": int(extracted_elements[ti + 7])})
54
+ except:
55
+ pass
56
+ ti += 1
57
+ else:
58
+ while ti + 5 < len(extracted_elements):
59
+ if extracted_elements[ti] == "topic" and extracted_elements[ti + 2] == "start" and extracted_elements[ti + 4] == "end":
60
+ try:
61
+ model_outputs.append({"topic": extracted_elements[ti + 1], "start": int(extracted_elements[ti + 3]), "end": int(extracted_elements[ti + 5])})
62
+ except:
63
+ pass
64
+ ti += 1
65
+
66
+ return model_outputs
67
+
68
+ def __get_class_span_dict__(self, label, checkitem_k):
69
+ class_span = {}
70
+ for i in range(len(label)):
71
+ checkitem_i = self.__normalize_text(label[i][checkitem_k])
72
+ class_span[(label[i]['start'], label[i]['end'])] = class_span.get((label[i]['start'], label[i]['end']), []) + [checkitem_i]
73
+ return class_span
74
+
75
+ def __get_intersect_by_entity__(self, pred_class_span, label_class_span):
76
+ '''
77
+ return the count of correct entity
78
+ '''
79
+ cnt = 0
80
+ for label in label_class_span:
81
+ cnt += len(list(set(label_class_span[label]).intersection(set(pred_class_span.get(label,[])))))
82
+ return cnt
83
+
84
+ def __get_bertscore_by_entity__(self, pred_class_span, label_class_span):
85
+ '''
86
+ return the summed BERTScore precision over matched entity spans
87
+ '''
88
+ cnt = 0
89
+ for label in label_class_span:
90
+ if label in pred_class_span:
91
+ references = [label_class_span[label]]
92
+ prediction = [pred_class_span[label][0]]
93
+ result = bertscore.compute(predictions=prediction, references=references, model_type="microsoft/deberta-xlarge-mnli")["precision"][0]
94
+ cnt += result
95
+ return cnt
96
+
97
+ def __get_cnt__(self, label_class_span):
98
+ '''
99
+ return the count of entities
100
+ '''
101
+ cnt = 0
102
+ for label in label_class_span:
103
+ cnt += len(label_class_span[label])
104
+ # cnt += 1 # set as 1 if we have multiple references
105
+ return cnt
106
+
107
+ def metrics_by_entity_(self, pred, label, checkitem_k):
108
+ '''
109
+ return entity level count of total prediction, true labels, and correct prediction
110
+ '''
111
+ pred_class_span = self.__get_class_span_dict__(pred, checkitem_k)
112
+ label_class_span = self.__get_class_span_dict__(label, checkitem_k)
113
+ pred_cnt = self.__get_cnt__(pred_class_span)
114
+ label_cnt = self.__get_cnt__(label_class_span)
115
+ if checkitem_k == "topic":
116
+ correct_cnt = self.__get_intersect_by_entity__(pred_class_span, label_class_span)
117
+ elif checkitem_k == "summary":
118
+ correct_cnt = self.__get_bertscore_by_entity__(pred_class_span, label_class_span)
119
+ return pred_cnt, label_cnt, correct_cnt
120
+
121
+ def p_r_f1_by_entity(self, pc, lc, cc):
122
+ precision = cc / (pc + 1e-8)
123
+ recall = cc / (lc + 1e-8)
124
+ f1 = 2 * precision * recall / (precision + recall + 1e-8)
125
+ return round(precision * 100, 2), round(recall * 100, 2), round(f1 * 100, 2)
126
+
127
+ def metrics_by_entity_files(self, pred_file, checkitem_k, type_key):
128
+ pred_cnt = 0
129
+ label_cnt = 0
130
+ correct_cnt = 0
131
+ for l_i, line in enumerate(open(pred_file, "r").readlines()):
132
+ eles = json.loads(line.strip())
133
+
134
+ if (type_key not in conversations_dict[eles["question_id"]][2]) or (conversations_dict[eles["question_id"]][2] == "writing_topiocqa" and checkitem_k == "summary"):
135
+ continue
136
+ if type_key == "writing":
137
+ model_text = self.__normalize_model_outputs(eles["text"], conversations_dict[eles["question_id"]][2])
138
+ label_i = json.loads(conversations_dict[eles["question_id"]][1])
139
+ elif type_key == "retrieval":
140
+ model_text = [{"topic": v, "start": 0, "end": 0} for v in set(eles["text"].split("#"))]
141
+ label_i = [{"topic": v, "start": 0, "end": 0} for v in set(conversations_dict[eles["question_id"]][1].split("#"))]
142
+ else:
143
+ model_text = [{"summary": eles["text"], "start": 0, "end": 0}]
144
+ label_i = [{"summary": conversations_dict[eles["question_id"]][1], "start": 0, "end": 0}]
145
+
146
+ p_cnt, l_cnt, c_cnt = self.metrics_by_entity_(model_text, label_i, checkitem_k)
147
+ p_i, r_i, f_i = self.p_r_f1_by_entity(p_cnt, l_cnt, c_cnt)
148
+ # if p_i + r_i + f_i != 0:
149
+ # print("Q ID: " + str(eles["question_id"]) + "\n")
150
+ # print(conversations_dict[eles["question_id"]][0] + "\n")
151
+ # # print("Raw Ouput: " + eles["text"] + "\n")
152
+ # print("Model: {}".format(model_text) + "\n")
153
+ # print("Refer: {}".format(label_i) + "\n")
154
+ # print("Case P/R/F1: {}%, {}%, {}%".format(p_i, r_i, f_i))
155
+ # print("=" * 20)
156
+ pred_cnt += p_cnt
157
+ label_cnt += l_cnt
158
+ correct_cnt += c_cnt
159
+ return self.p_r_f1_by_entity(pred_cnt, label_cnt, correct_cnt)
160
+
161
+ calculate_metrics = Metrics()
162
+ p_a, r_a, f1_a = calculate_metrics.metrics_by_entity_files(input_file_path, 'topic', 'writing') # both
163
+ print("Overall P/R/F1 of topic: {}%, {}%, {}%".format(p_a, r_a, f1_a))
164
+ p_b, r_b, f1_b = calculate_metrics.metrics_by_entity_files(input_file_path, 'summary', 'writing') # dialogsum
165
+ print("Overall P/R/F1 of summary: {}%, {}%, {}%".format(p_b, r_b, f1_b))
166
+ _, _, f1 = calculate_metrics.metrics_by_entity_files(input_file_path, "topic", "retrieval") # both
167
+ print("Retrieval F1: {}%".format(f1))
168
+ p, _, _ = calculate_metrics.metrics_by_entity_files(input_file_path, "summary", "chatting") # dialogsum
169
+ print("Chatting similarity: {}%".format(p))
baselines/MemoChat/code/codes/eval/get_model_infer_memochat.py ADDED
@@ -0,0 +1,291 @@
1
+ import argparse
2
+ from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaForCausalLM, AutoModelForSeq2SeqLM
3
+ from optimum.bettertransformer import BetterTransformer
4
+ import torch
5
+ import os
6
+ import json
7
+ import re
8
+ import ray
9
+ import warnings
10
+ from random import sample
11
+ warnings.filterwarnings("ignore")
12
+
13
+ q_pre = "<s>\n"
14
+ qa_link = "\n"
15
+ MaxLen = 2048
16
+ TarLen = 512
17
+ TaskTarLen = {
18
+ "chatting_dialogsum": MaxLen,
19
+ "chatting_alpacagpt4": MaxLen,
20
+ "writing_topiocqa": TarLen // 2,
21
+ "writing_dialogsum": TarLen,
22
+ "retrieval_dialogsum": 32,
23
+ "retrieval_topiocqa": 32
24
+ }
25
+
26
+ def get_gpu_memory(num_gpus):
27
+ """Get available memory for each GPU."""
28
+ gpu_memory = []
29
+ for gpu_id in range(num_gpus):
30
+ with torch.cuda.device(gpu_id):
31
+ device = torch.cuda.current_device()
32
+ gpu_properties = torch.cuda.get_device_properties(device)
33
+ total_memory = gpu_properties.total_memory / (1024**3)
34
+ allocated_memory = torch.cuda.memory_allocated() / (1024**3)
35
+ available_memory = total_memory - allocated_memory
36
+ gpu_memory.append(available_memory)
37
+ return gpu_memory
38
+
39
+ def normalize_model_outputs(model_text):
40
+ """post processing function of memo writing task"""
41
+ extracted_elements = [re.sub(r'\s+', ' ', mt.replace('"', '').replace("'", "")) for mt in re.findall(r"'[^']*'|\"[^\"]*\"|\d+", model_text)]
42
+ model_outputs = []
43
+ ti = 0
44
+ while ti + 7 < len(extracted_elements):
45
+ if extracted_elements[ti] == "topic" and extracted_elements[ti + 2] == "summary" and extracted_elements[ti + 4] == "start" and extracted_elements[ti + 6] == "end":
46
+ try:
47
+ model_outputs.append({"topic": extracted_elements[ti + 1], "summary": extracted_elements[ti + 3], "start": int(extracted_elements[ti + 5]), "end": int(extracted_elements[ti + 7])})
48
+ except:
49
+ pass
50
+ ti += 1
51
+ return model_outputs
52
+
53
+ def normalize_chatting_outputs(model_outputs):
54
+ """post processing function of chatting response"""
55
+ def white_space_fix(text):
56
+ lines = text.split("\n")
57
+ result = []
58
+ for line in lines:
59
+ result.append(' '.join(line.split()))
60
+ output = '\n'.join(result)
61
+ return output
62
+ return white_space_fix(model_outputs)
63
+
64
+ def gen_model_output(model_path, model, tokenizer, input_qs, local_check, task_type):
65
+ if local_check:
66
+ from faker import Faker
67
+ fake = Faker(locale="en")
68
+ return fake.text(2000)
69
+ if "writing" in task_type:
70
+ eos_token_ids = [tokenizer.eos_token_id, tokenizer.encode("]", add_special_tokens=False)[0]]
71
+ elif "retrieval" in task_type:
72
+ eos_token_ids = [tokenizer.eos_token_id, tokenizer.encode("\n", add_special_tokens=False)[0], tokenizer.encode(" ", add_special_tokens=False)[0]]
73
+ else:
74
+ eos_token_ids = [tokenizer.eos_token_id]
75
+ if "t5" in model_path:
76
+ # t5 model may need larger repetition penalty value to help get generation stability
77
+ input_ids = tokenizer([input_qs], max_length=MaxLen, truncation=True, add_special_tokens=False).input_ids
78
+ target_len = TaskTarLen[task_type]
79
+ repetition_penalty_value = 1.0
80
+ else:
81
+ input_ids = tokenizer([input_qs], max_length=(MaxLen - TarLen), truncation=True, add_special_tokens=False).input_ids
82
+ target_len = min(len(input_ids[0]) + TaskTarLen[task_type], MaxLen)
83
+ repetition_penalty_value = 1.0
84
+ output_ids = model.generate(
85
+ torch.as_tensor(input_ids).cuda(),
86
+ do_sample=True,
87
+ temperature=0.2,
88
+ max_length=target_len,
89
+ eos_token_id=eos_token_ids,
90
+ repetition_penalty=repetition_penalty_value
91
+ )
92
+ if "t5" in model_path:
93
+ output_ids = output_ids[0]
94
+ else:
95
+ output_ids = output_ids[0][len(input_ids[0]):]
96
+ model_outputs = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
97
+ return model_outputs
98
+
99
+ def run_summary(history, model_path, model, tokenizer, memo, local_check, bot_thinking, prompts):
100
+ """We assume there is no overly long user input, e.g. over 1500 tokens"""
101
+ system_insturction = prompts["writing_dialogsum"]["system"]
102
+ task_instruction = prompts["writing_dialogsum"]["instruction"]
103
+ history_log = "\n\n```\nTask Conversation:\n" + "\n".join(["(line {}) {}".format(h_i + 1, h.replace("\n", " ")) for h_i, h in enumerate(history["Recent Dialogs"][2:])])
104
+ qs = q_pre + system_insturction.replace("LINE", str(len(history["Recent Dialogs"]) - 2)) + history_log + "\n```" + task_instruction.replace("LINE", str(len(history["Recent Dialogs"]) - 2)) + qa_link
105
+ print("#" * 20 + "summarizing" + "#" * 20)
106
+ print(qs)
107
+ print("#" * 20 + "summarizing" + "#" * 20)
108
+ sum_history = gen_model_output(model_path, model, tokenizer, qs, local_check, "writing_dialogsum")
109
+ sum_history = normalize_model_outputs(sum_history)
110
+ print("#" * 20 + "summarization" + "#" * 20)
111
+ print(sum_history)
112
+ print("#" * 20 + "summarization" + "#" * 20)
113
+ for s in sum_history:
114
+ memo[s["topic"]] = memo.get(s["topic"], []) + [{"summary": s["summary"], "dialogs": history["Recent Dialogs"][2:][(s["start"] - 1):s["end"]]}]
115
+ if local_check:
116
+ memo["test_topic{}".format(len(memo.keys()))] = [{"summary": "test_summary{}".format(len(memo.keys())), "dialogs": history["Recent Dialogs"][2:][2:4]}]
117
+ if len(sum_history) == 0:
118
+ si_0, si_1 = sample(list(range(len(history["Recent Dialogs"][2:]))), 2)
119
+ memo["NOTO"].append({"summary": "Partial dialogs about: {} or {}.".format(history["Recent Dialogs"][2:][si_0], history["Recent Dialogs"][2:][si_1]), "dialogs": history["Recent Dialogs"][2:]})
120
+ history["Recent Dialogs"] = history["Recent Dialogs"][-2:]
121
+ bot_thinking["summarization"] = {"input": qs, "output": sum_history}
122
+ return history, memo, bot_thinking
123
+
124
+ def run_retrieval(history, model_path, model, tokenizer, memo, local_check, bot_thinking, prompts):
125
+ topics = []
126
+ for k, v in memo.items():
127
+ for vv in v:
128
+ topics.append((k, vv["summary"], vv["dialogs"]))
129
+ system_insturction = prompts["retrieval"]["system"]
130
+ task_instruction = prompts["retrieval"]["instruction"]
131
+ task_case = "```\nQuery Sentence:\n" + history["User Input"][6:] + "\nTopic Options:\n" + \
132
+ "\n".join(["({}) {}".format(v_i + 1, v[0] + ". " + v[1]) for v_i, v in enumerate(topics)]) + "\n```"
133
+ qs = q_pre + system_insturction.replace("OPTION", str(len(topics))) + task_case + task_instruction.replace("OPTION", str(len(topics))) + qa_link
134
+ print("#" * 20 + "retrieving" + "#" * 20)
135
+ print(qs)
136
+ print("#" * 20 + "retrieving" + "#" * 20)
137
+ outputs = gen_model_output(model_path, model, tokenizer, qs, local_check, "retrieval_dialogsum")
138
+ print("#" * 20 + "retrieval" + "#" * 20)
139
+ print(outputs)
140
+ print("#" * 20 + "retrieval" + "#" * 20)
141
+ outputs = outputs.split("#")
142
+ chosen_topics = []
143
+ for output in outputs:
144
+ try:
145
+ index_ = int(output) - 1
146
+ except:
147
+ continue
148
+ if index_ < len(topics) and "NOTO" not in topics[index_][0]:
149
+ chosen_topics.append(topics[index_])
150
+ if local_check:
151
+ chosen_topics = sample(topics, min(len(topics) - 1, 2))
152
+ if len(chosen_topics) > 0:
153
+ history["Related Topics"] = [ct[0] for ct in chosen_topics]
154
+ history["Related Summaries"] = [ct[1] for ct in chosen_topics]
155
+ history["Related Dialogs"] = [" ### ".join(ct[2]) for ct in chosen_topics]
156
+ else:
157
+ history["Related Topics"] = []
158
+ history["Related Summaries"] = []
159
+ history["Related Dialogs"] = []
160
+ bot_thinking["retrieval"] = {"input": qs, "output": outputs}
161
+ return history, bot_thinking
162
+
163
+ @torch.inference_mode()
164
+ def get_model_answers(model_path, num_gpus, local_check, load_in_8bit, ques_jsons, prompts):
165
+ model_path = os.path.expanduser(model_path)
166
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, truncation_side='left')
167
+
168
+ if not local_check:
169
+ # We assume you have enough GPUs to load one model, but you can modify the gpu_memory_dict to allow CPU offloads
170
+ # We recommend using as few GPUs as possible to load one model for faster inference, e.g. 3 V100 32G or 2 A100 40G for the 33B model
171
+ available_gpu_memory = get_gpu_memory(num_gpus)
172
+ gpu_memory_dict = {i: str(int(available_gpu_memory[i] * 0.85)) + "GiB" for i in range(num_gpus)}
173
+ gpu_memory_dict["cpu"] = "0GiB"
174
+
175
+ if "t5" in model_path:
176
+ model = AutoModelForSeq2SeqLM.from_pretrained(
177
+ model_path, torch_dtype=torch.float16, device_map="auto", max_memory=gpu_memory_dict, load_in_8bit=load_in_8bit
178
+ )
179
+ else:
180
+ model = AutoModelForCausalLM.from_pretrained(
181
+ model_path, torch_dtype=torch.float16, device_map="auto", max_memory=gpu_memory_dict, load_in_8bit=load_in_8bit
182
+ )
183
+
184
+ # Initialize with BetterTransformer, injecting Flash-Attention
185
+ model = BetterTransformer.transform(model)
186
+
187
+ # turn on eval mode to disable batch normalization & dropout; works together with torch.inference_mode
188
+ model = model.eval()
189
+ else:
190
+ model = None
191
+
192
+ output_data = []
193
+ for d in ques_jsons:
194
+ new_d = d
195
+
196
+ history = {
197
+ "Recent Dialogs": ["user: Hi!", "bot: Hi! How can I help you today?"],
198
+ "Related Topics": [],
199
+ "Related Summaries": [],
200
+ "Related Dialogs": [],
201
+ "User Input": "",
202
+ }
203
+ memo = {
204
+ "NOTO": [{"summary": "None of the others.", "dialogs": []}]
205
+ }
206
+
207
+ for l_i in range(len(new_d["conversations"])):
208
+ if l_i % 2 == 1:
209
+ bot_thinking = {"retrieval": "", "summarization": ""}
210
+ print("=" * 20 + "start of turn {}".format(l_i // 2 + 1) + "=" * 20)
211
+ user = "user: " + new_d["conversations"][l_i - 1]["value"]
212
+ print(user + "\n\n")
213
+
214
+ # create summary if recent dialogs exceed threshold
215
+ if len(" ### ".join(history["Recent Dialogs"]).split(" ")) > (MaxLen // 2) or len(history["Recent Dialogs"]) >= 10:
216
+ history, memo, bot_thinking = run_summary(history, model_path, model, tokenizer, memo, local_check, bot_thinking, prompts)
217
+
218
+ # retrieve most related topics for every new user input
219
+ history["User Input"] = user
220
+ if len(memo.keys()) > 1:
221
+ history, bot_thinking = run_retrieval(history, model_path, model, tokenizer, memo, local_check, bot_thinking, prompts)
222
+
223
+ # generate bot response
224
+ system_insturction = prompts["chatting"]["system"]
225
+ task_instruction = prompts["chatting"]["instruction"]
226
+ task_case = "```\nRelated Evidences:\n" + "\n".join(["({}) {}".format(r_tsd_i + 1, {
227
+ "Related Topics": history["Related Topics"][r_tsd_i],
228
+ "Related Summaries": history["Related Summaries"][r_tsd_i],
229
+ "Related Dialogs": history["Related Dialogs"][r_tsd_i]
230
+ }) for r_tsd_i in range(len(history["Related Topics"]))]) + "\n\nRecent Dialogs:\n" + \
231
+ " ### ".join([hrd.replace("\n", " ") for hrd in history["Recent Dialogs"]]) + "\n```\n\nUser Input:\n" + history["User Input"] + " ### bot: "
232
+ qs = q_pre + system_insturction + task_case + task_instruction + qa_link
233
+ outputs = gen_model_output(model_path, model, tokenizer, qs, local_check, "chatting_dialogsum")
234
+ outputs = normalize_chatting_outputs(outputs)
235
+ history["Recent Dialogs"] += [user, "bot: " + outputs]
236
+ print("bot: " + outputs + "\n")
237
+ print("=" * 20 + "end of turn {}".format(l_i // 2 + 1) + "=" * 20)
238
+ print("\n\n\n\n")
239
+ new_d["conversations"][l_i]["thinking"] = json.dumps(bot_thinking)
240
+ new_d["conversations"][l_i]["value"] = outputs
241
+
242
+ output_data.append(d)
243
+ return output_data
244
+
245
+ def run_eval(model_path, num_gpus, local_check, load_in_8bit, question_file, ray_num_gpus, answer_file, prompt_path):
246
+ assert num_gpus % ray_num_gpus == 0
247
+ prompts = json.load(open(prompt_path, "r"))
248
+
249
+ # split question file into num_gpus files
250
+ ques_jsons = json.load(open(os.path.expanduser(question_file), "r"))
251
+
252
+ chunk_size = len(ques_jsons) // (num_gpus // ray_num_gpus)
253
+ ans_handles = []
254
+ for i in range(0, len(ques_jsons), chunk_size):
255
+ get_answers_func = ray.remote(num_gpus=ray_num_gpus)(
256
+ get_model_answers
257
+ ).remote
258
+ ans_handles.append(
259
+ get_answers_func(
260
+ model_path, ray_num_gpus, local_check, load_in_8bit, ques_jsons[i: i + chunk_size], prompts
261
+ )
262
+ )
263
+
264
+ ans_jsons = []
265
+ for ans_handle in ans_handles:
266
+ ans_jsons.extend(ray.get(ans_handle))
267
+
268
+ json.dump(ans_jsons, open(os.path.expanduser(answer_file), "w"), indent=2)
269
+
270
+ if __name__ == "__main__":
271
+ parser = argparse.ArgumentParser()
272
+ parser.add_argument("--model-path", type=str, required=True)
273
+ parser.add_argument("--num-gpus", type=int, required=True)
274
+ parser.add_argument("--ray-num-gpus", type=int, required=True)
275
+ parser.add_argument("--local-check", action="store_true", help="use faker to generate a fake response, used for pipeline prompt checking")
276
+ parser.add_argument("--load-in-8bit", action="store_true")
277
+ parser.add_argument("--question-file", type=str, required=True)
278
+ parser.add_argument("--answer-file", type=str, required=True)
279
+ parser.add_argument("--prompt-path", type=str, required=True)
280
+ args = parser.parse_args()
281
+
282
+ run_eval(
283
+ args.model_path,
284
+ args.num_gpus,
285
+ args.local_check,
286
+ args.load_in_8bit,
287
+ args.question_file,
288
+ args.ray_num_gpus,
289
+ args.answer_file,
290
+ args.prompt_path
291
+ )
baselines/MemoChat/code/codes/eval/get_model_infer_simple.py ADDED
@@ -0,0 +1,150 @@
1
+ import argparse
2
+ from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
3
+ from optimum.bettertransformer import BetterTransformer
4
+ import torch
5
+ import os
6
+ import json
7
+ from tqdm import tqdm
8
+ import ray
9
+
10
+ q_pre = "<s>\n"
11
+ qa_link = "\n"
12
+ a_pos = "\n</s>"
13
+ MaxLen = 2048
14
+ TarLen = 512
15
+ TaskTarLen = {
16
+ "chatting_dialogsum": MaxLen,
17
+ "chatting_alpacagpt4": MaxLen,
18
+ "writing_topiocqa": TarLen // 2,
19
+ "writing_dialogsum": TarLen,
20
+ "retrieval_dialogsum": 32,
21
+ "retrieval_topiocqa": 32
22
+ }
23
+
24
+ def get_gpu_memory(ray_num_gpus):
25
+ """Get available memory for each GPU."""
26
+ gpu_memory = []
27
+ for gpu_id in range(ray_num_gpus):
28
+ with torch.cuda.device(gpu_id):
29
+ device = torch.cuda.current_device()
30
+ gpu_properties = torch.cuda.get_device_properties(device)
31
+ total_memory = gpu_properties.total_memory / (1024**3)
32
+ allocated_memory = torch.cuda.memory_allocated() / (1024**3)
33
+ available_memory = total_memory - allocated_memory
34
+ gpu_memory.append(available_memory)
35
+ return gpu_memory
36
+
37
+ def run_eval(model_path, model_id, question_file, answer_file, num_gpus, load_in_8bit, ray_num_gpus):
38
+ assert num_gpus % ray_num_gpus == 0
39
+
40
+ # split question file into num_gpus files
41
+ ques_jsons = []
42
+ with open(os.path.expanduser(question_file), "r") as ques_file:
43
+ for line in ques_file:
44
+ ques_jsons.append(line)
45
+
46
+ chunk_size = len(ques_jsons) // (num_gpus // ray_num_gpus)
47
+ ans_handles = []
48
+ for i in range(0, len(ques_jsons), chunk_size):
49
+ get_answers_func = ray.remote(num_gpus=ray_num_gpus)(
50
+ get_model_answers
51
+ ).remote
52
+ ans_handles.append(
53
+ get_answers_func(
54
+ model_path, model_id, ques_jsons[i: i + chunk_size], ray_num_gpus, load_in_8bit
55
+ )
56
+ )
57
+
58
+ ans_jsons = []
59
+ for ans_handle in ans_handles:
60
+ ans_jsons.extend(ray.get(ans_handle))
61
+
62
+ with open(os.path.expanduser(answer_file), "w") as ans_file:
63
+ for line in ans_jsons:
64
+ ans_file.write(json.dumps(line) + "\n")
65
+
66
+ @torch.inference_mode()
67
+ def get_model_answers(model_path, model_id, question_jsons, ray_num_gpus, load_in_8bit):
68
+ model_path = os.path.expanduser(model_path)
69
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, truncation_side='left')
70
+
71
+ available_gpu_memory = get_gpu_memory(ray_num_gpus)
72
+ gpu_memory_dict = {i: str(int(available_gpu_memory[i] * 0.85)) + "GiB" for i in range(ray_num_gpus)}
73
+ gpu_memory_dict["cpu"] = "0GiB"
74
+
75
+ if "t5" in model_path:
76
+ model = AutoModelForSeq2SeqLM.from_pretrained(
77
+ model_path, torch_dtype=torch.float16, device_map="auto", max_memory=gpu_memory_dict, load_in_8bit=load_in_8bit
78
+ )
79
+ else:
80
+ model = AutoModelForCausalLM.from_pretrained(
81
+ model_path, torch_dtype=torch.float16, device_map="auto", max_memory=gpu_memory_dict, load_in_8bit=load_in_8bit
82
+ )
83
+
84
+ # Initialize with BetterTransformer, injecting Flash-Attention
85
+ model = BetterTransformer.transform(model)
86
+
87
+ # turn on eval mode to disable batch normalization & dropout; works together with torch.inference_mode
88
+ model = model.eval()
89
+
90
+ ans_jsons = []
91
+ for i, line in enumerate(tqdm(question_jsons)):
92
+ ques_json = json.loads(line)
93
+ idx = ques_json["question_id"]
94
+ qs = q_pre + ques_json["text"] + qa_link
95
+
96
+ task_type = ques_json["type"]
97
+ if "t5" in model_path:
98
+ input_ids = tokenizer([qs], max_length=MaxLen, truncation=True, add_special_tokens=False).input_ids
99
+ target_len = TaskTarLen[task_type]
100
+ else:
101
+ input_ids = tokenizer([qs], max_length=(MaxLen - TarLen), truncation=True, add_special_tokens=False).input_ids
102
+ target_len = min(len(input_ids[0]) + TaskTarLen[task_type], MaxLen)
103
+
104
+ output_ids = model.generate(
105
+ torch.as_tensor(input_ids).cuda(),
106
+ do_sample=True,
107
+ temperature=0.2,
108
+ max_length=target_len
109
+ )
110
+ if "t5" in model_path:
111
+ output_ids = output_ids[0]
112
+ else:
113
+ output_ids = output_ids[0][len(input_ids[0]):]
114
+ outputs = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
115
+
116
+ print(outputs)
117
+
118
+ ans_jsons.append(
119
+ {
120
+ "question_id": idx,
121
+ "text": outputs,
122
+ "model_id": model_id,
123
+ "metadata": {},
124
+ }
125
+ )
126
+ return ans_jsons
127
+
128
+ if __name__ == "__main__":
129
+ parser = argparse.ArgumentParser()
130
+ parser.add_argument("--model-path", type=str, required=True)
131
+ parser.add_argument("--model-id", type=str, required=True)
132
+ parser.add_argument("--question-file", type=str, required=True)
133
+ parser.add_argument("--answer-file", type=str, default="answer.jsonl")
134
+ parser.add_argument("--num-gpus", type=int, default=1)
135
+ parser.add_argument("--ray-num-gpus", type=int, default=1)
136
+ parser.add_argument("--load-in-8bit", action="store_true")
137
+ args = parser.parse_args()
138
+
139
+ os.environ["RAY_DEDUP_LOGS"] = "0"
140
+ ray.init(num_gpus=args.num_gpus)
141
+
142
+ run_eval(
143
+ args.model_path,
144
+ args.model_id,
145
+ args.question_file,
146
+ args.answer_file,
147
+ args.num_gpus,
148
+ args.load_in_8bit,
149
+ args.ray_num_gpus
150
+ )
baselines/MemoChat/code/codes/train/data_preprocess.py ADDED
@@ -0,0 +1,118 @@
1
+ import logging
2
+ import sys
3
+ from dataclasses import dataclass, field
4
+ from typing import Optional
5
+ import torch
6
+ import copy
7
+
8
+ import datasets
9
+ from datasets import load_dataset
10
+
11
+ import transformers
12
+ from transformers import (
13
+ HfArgumentParser,
14
+ T5Tokenizer,
15
+ LlamaTokenizer,
16
+ set_seed,
17
+ )
18
+
19
+ q_pre = "<s>\n"
20
+ qa_link = "\n"
21
+ a_pos = "\n</s>"
22
+
23
+ logger = logging.getLogger(__name__)
24
+
25
+ @dataclass
26
+ class ModelArguments:
27
+ model_name_or_path: str = field(
28
+ metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
29
+ )
30
+
31
+ @dataclass
32
+ class DataTrainingArguments:
33
+ data_path: Optional[str] = field(default=None, metadata={"help": "The input training data file (a jsonlines)."})
34
+ model_max_length: int = field(
35
+ default=2048,
36
+ metadata={
37
+ "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
38
+ },
39
+ )
40
+ preprocessed_path: str = field(
41
+ default=None, metadata={"help": "Path to the preprocessed training data."}
42
+ )
43
+ preprocessing_num_workers: Optional[int] = field(
44
+ default=None,
45
+ metadata={"help": "The number of processes to use for the preprocessing."},
46
+ )
47
+
48
+ def main():
49
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments))
50
+ model_args, data_args = parser.parse_args_into_dataclasses()
51
+
52
+ data_files = {}
53
+ data_files["train"] = data_args.data_path
54
+ raw_datasets = load_dataset(
55
+ "json",
56
+ data_files=data_files
57
+ )
58
+ column_names = raw_datasets["train"].column_names
59
+ print("load dataset finished")
60
+
61
+ if "t5" in model_args.model_name_or_path:
62
+ # use truncation_side='left' to preserve linking between end of prompt and target labels
63
+ tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path, truncation_side='left')
64
+
65
+ def preprocess_function(examples):
66
+ src_inputs = [q_pre + example[0]["value"] + qa_link for example in examples["conversations"]]
67
+ src_model_inputs = tokenizer(src_inputs, max_length=data_args.model_max_length, padding='longest', truncation=True, add_special_tokens=False)
68
+ trg_inputs = [example[1]["value"] + a_pos for example in examples["conversations"]]
69
+ trg_model_inputs = tokenizer(trg_inputs, max_length=data_args.model_max_length, padding='longest', truncation=True, add_special_tokens=False)
70
+ src_model_inputs["labels"] = [
71
+ [(l if l != tokenizer.pad_token_id else label_ignore_id) for l in label] for label in trg_model_inputs["input_ids"]
72
+ ]
73
+ return src_model_inputs
74
+ else:
75
+ # use truncation_side='left' to preserve linking between end of prompt and target labels
76
+ tokenizer = LlamaTokenizer.from_pretrained(model_args.model_name_or_path, truncation_side='left')
77
+
78
+ def preprocess_function(examples):
79
+ inputs = [q_pre + example[0]["value"] + qa_link + example[1]["value"] + a_pos for example in examples["conversations"]]
80
+ model_inputs = tokenizer(inputs, max_length=data_args.model_max_length, padding="longest", truncation=True, add_special_tokens=False)
81
+ model_inputs["labels"] = copy.deepcopy(model_inputs["input_ids"])
82
+ for e_i, example in enumerate(examples["conversations"]):
83
+ source_text = q_pre + example[0]["value"] + qa_link
84
+ target_text = example[1]["value"] + a_pos
85
+ source_ids = tokenizer.encode(source_text, add_special_tokens=False)
86
+ target_ids = tokenizer.encode(target_text, add_special_tokens=False)
87
+ if len(source_ids) >= data_args.model_max_length:
88
+ model_inputs["labels"][e_i] = [label_ignore_id] * data_args.model_max_length
89
+ continue
90
+ else:
91
+ model_inputs["labels"][e_i][:len(source_ids)] = [label_ignore_id] * len(source_ids)
92
+ if len(target_ids) + len(source_ids) >= len(model_inputs["input_ids"][e_i]):
93
+ continue
94
+ else:
95
+ model_inputs["labels"][e_i][(len(target_ids) + len(source_ids)):] = [label_ignore_id] * (len(model_inputs["input_ids"][e_i]) - len(target_ids) - len(source_ids))
96
+ model_inputs["input_ids"] = torch.tensor(model_inputs["input_ids"])
97
+ model_inputs["labels"] = torch.tensor(model_inputs["labels"])
98
+ model_inputs["attention_mask"] = model_inputs["input_ids"].ne(tokenizer.pad_token_id)
99
+ return model_inputs
100
+
101
+ label_ignore_id = -100
102
+
103
+ print("start data preprocess")
104
+ train_dataset = raw_datasets["train"]
105
+ train_dataset = train_dataset.map(
106
+ preprocess_function,
107
+ batched=True,
108
+ batch_size=len(train_dataset),
109
+ remove_columns=column_names,
110
+ num_proc=data_args.preprocessing_num_workers,
111
+ load_from_cache_file=False,
112
+ desc="Running tokenizer on train dataset"
113
+ )
114
+ train_dataset.save_to_disk(data_args.preprocessed_path)
115
+ print("data preprocess finished")
116
+
117
+ if __name__ == "__main__":
118
+ main()
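
The decoder-only branch above supervises only the response span: prompt tokens and trailing padding are overwritten with -100 in `labels`, so the cross-entropy loss ignores them. A toy, self-contained illustration of that masking (token ids are invented):

```python
# Toy illustration of the -100 masking used in preprocess_function above.
# Token ids are invented; only the masking pattern matters.
LABEL_IGNORE_ID = -100

def mask_prompt_labels(source_ids, target_ids, pad_id=0, max_len=12):
    input_ids = (source_ids + target_ids)[:max_len]
    n_pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    labels = list(input_ids)
    labels[:len(source_ids)] = [LABEL_IGNORE_ID] * min(len(source_ids), max_len)  # no loss on the prompt
    if n_pad:
        labels[-n_pad:] = [LABEL_IGNORE_ID] * n_pad                               # no loss on padding
    return input_ids, labels

inp, lab = mask_prompt_labels(source_ids=[11, 12, 13, 14], target_ids=[21, 22, 23])
print(inp)  # [11, 12, 13, 14, 21, 22, 23, 0, 0, 0, 0, 0]
print(lab)  # [-100, -100, -100, -100, 21, 22, 23, -100, -100, -100, -100, -100]
```
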
baselines/MemoChat/code/codes/train/train.py ADDED
@@ -0,0 +1,150 @@
1
+ import logging
2
+ import sys
3
+ from dataclasses import dataclass, field
4
+ from typing import Optional
5
+ import torch
6
+ import copy
7
+ import json
8
+
9
+ import datasets
10
+ from datasets import load_from_disk
11
+
12
+ import transformers
13
+ from transformers import (
14
+ HfArgumentParser,
15
+ T5ForConditionalGeneration,
16
+ T5Tokenizer,
17
+ T5Config,
18
+ LlamaForCausalLM,
19
+ LlamaTokenizer,
20
+ LlamaConfig,
21
+ Trainer,
22
+ TrainingArguments,
23
+ set_seed,
24
+ )
25
+
26
+ from optimum.bettertransformer import BetterTransformer
27
+
28
+ q_pre = "<s>\n"
29
+ qa_link = "\n"
30
+ a_pos = "\n</s>"
31
+
32
+ logger = logging.getLogger(__name__)
33
+
34
+ @dataclass
35
+ class ModelArguments:
36
+ model_name_or_path: str = field(
37
+ metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
38
+ )
39
+
40
+ @dataclass
41
+ class DataTrainingArguments:
42
+ model_max_length: int = field(
43
+ default=2048,
44
+ metadata={
45
+ "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
46
+ },
47
+ )
48
+ max_train_samples: Optional[int] = field(
49
+ default=None,
50
+ metadata={
51
+ "help": (
52
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
53
+ "value if set."
54
+ )
55
+ },
56
+ )
57
+ preprocessed_path: str = field(
58
+ default=None, metadata={"help": "Path to the preprocessed training data."}
59
+ )
60
+
61
+ def main():
62
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
63
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
64
+
65
+ # Setup logging
66
+ logging.basicConfig(
67
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
68
+ datefmt="%m/%d/%Y %H:%M:%S",
69
+ handlers=[logging.StreamHandler(sys.stdout)],
70
+ )
71
+
72
+ if training_args.should_log:
73
+ # The default of training_args.log_level is passive, so we set log level at info here to have that default.
74
+ transformers.utils.logging.set_verbosity_info()
75
+
76
+ log_level = training_args.get_process_log_level()
77
+ logger.setLevel(log_level)
78
+ datasets.utils.logging.set_verbosity(log_level)
79
+ transformers.utils.logging.set_verbosity(log_level)
80
+ transformers.utils.logging.enable_default_handler()
81
+ transformers.utils.logging.enable_explicit_format()
82
+ # Log on each process the small summary:
83
+ logger.warning(
84
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
85
+ + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}"
86
+ )
87
+ logger.info(f"Training/evaluation parameters {training_args}")
88
+
89
+ logger.info("start to load dataset")
90
+ train_dataset = load_from_disk(data_args.preprocessed_path)
91
+ column_names = train_dataset.column_names
92
+ if data_args.max_train_samples is not None:
93
+ max_train_samples = min(len(train_dataset), data_args.max_train_samples)
94
+ train_dataset = train_dataset.select(range(max_train_samples))
95
+ logger.info("load dataset finished")
96
+
97
+ if "t5" in model_args.model_name_or_path:
98
+ # load config and tokenziers
99
+ config = T5Config.from_pretrained(model_args.model_name_or_path)
100
+ config.use_cache=False
101
+ # use truncation_side='left' to preserve linking between end of prompt and target labels
102
+ tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path, truncation_side='left')
103
+ model = T5ForConditionalGeneration.from_pretrained(model_args.model_name_or_path, config=config)
104
+ else:
105
+ # load config and tokenziers
106
+ config = LlamaConfig.from_pretrained(model_args.model_name_or_path)
107
+ config.use_cache=False
108
+ # use truncation_side='left' to preserve linking between end of prompt and target labels
109
+ tokenizer = LlamaTokenizer.from_pretrained(model_args.model_name_or_path, truncation_side='left')
110
+ # initialize modules
111
+ model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path, config=config)
112
+
113
+ # convert normal model to bettertransformer
114
+ model = BetterTransformer.transform(model)
115
+
116
+ # Setup seed
117
+ set_seed(training_args.seed)
118
+ if len(tokenizer) > tokenizer.vocab_size:
119
+ model.resize_token_embeddings(len(tokenizer))
120
+
121
+ # Setup Trainer
122
+ trainer = Trainer(
123
+ model=model,
124
+ args=training_args,
125
+ train_dataset=train_dataset,
126
+ tokenizer=tokenizer
127
+ )
128
+
129
+ # Training
130
+ train_result = trainer.train()
131
+
132
+ # convert bettertransformer to normal model
133
+ trainer.model = BetterTransformer.reverse(trainer.model)
134
+ trainer.save_state()
135
+
136
+ # save fp16 model under deepspeed zero2 or zero3
137
+ c_stage = json.load(open(training_args.deepspeed, "r"))["zero_optimization"]["stage"]
138
+ if c_stage in [2, 3]:
139
+ if c_stage == 2:
140
+ w_state_dict = trainer.model.state_dict()
141
+ else:
142
+ w_state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
143
+ if trainer.is_world_process_zero():
144
+ state_dict = {key: value.half().cpu() for key, value in w_state_dict.items()}
145
+ trainer._save(training_args.output_dir, state_dict=state_dict)
146
+ else:
147
+ trainer.save_model()
148
+
149
+ if __name__ == "__main__":
150
+ main()
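
The final save path above depends on the ZeRO stage declared in the DeepSpeed config: stage 2 can take the state dict from the unwrapped model, stage 3 has to gather a consolidated 16-bit state dict from the parameter shards, and anything else falls back to `trainer.save_model()`. A minimal, self-contained sketch of that branching; the inline config is a stand-in for the `ds_config_*.json` files that follow:

```python
# Sketch of the stage-dependent checkpointing decision in main() above.
# The inline config is a stand-in for the ds_config_*.json files below.
import json

ds_config = json.loads('{"zero_optimization": {"stage": 3}}')
stage = ds_config["zero_optimization"]["stage"]

if stage == 2:
    print("stage 2: read trainer.model.state_dict() and save it as fp16")
elif stage == 3:
    print("stage 3: consolidate a 16-bit state dict from the ZeRO-3 shards, then save")
else:
    print(f"stage {stage}: plain trainer.save_model() is sufficient")
```
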
baselines/MemoChat/code/configs/ds_config_13b.json ADDED
@@ -0,0 +1,53 @@
1
+ {
2
+ "fp16": {
3
+ "enabled": "auto",
4
+ "loss_scale": 0,
5
+ "loss_scale_window": 1000,
6
+ "initial_scale_power": 16,
7
+ "hysteresis": 2,
8
+ "min_loss_scale": 1
9
+ },
10
+ "bf16": {
11
+ "enabled": "auto"
12
+ },
13
+ "optimizer": {
14
+ "type": "AdamW",
15
+ "params": {
16
+ "lr": "auto",
17
+ "betas": "auto",
18
+ "eps": "auto",
19
+ "weight_decay": "auto"
20
+ }
21
+ },
22
+ "scheduler": {
23
+ "type": "WarmupDecayLR",
24
+ "params": {
25
+ "total_num_steps" : "auto",
26
+ "warmup_min_lr": "auto",
27
+ "warmup_max_lr": "auto",
28
+ "warmup_num_steps": "auto"
29
+ }
30
+ },
31
+ "zero_optimization": {
32
+ "stage": 3,
33
+ "offload_optimizer": {
34
+ "device": "cpu",
35
+ "pin_memory": true
36
+ },
37
+ "overlap_comm": true,
38
+ "contiguous_gradients": true,
39
+ "sub_group_size": 1e9,
40
+ "reduce_bucket_size": "auto",
41
+ "stage3_prefetch_bucket_size": "auto",
42
+ "stage3_param_persistence_threshold": "auto",
43
+ "stage3_max_live_parameters": 1e9,
44
+ "stage3_max_reuse_distance": 1e9,
45
+ "stage3_gather_16bit_weights_on_model_save": true
46
+ },
47
+ "gradient_accumulation_steps": "auto",
48
+ "gradient_clipping": "auto",
49
+ "steps_per_print": 2000,
50
+ "train_batch_size": "auto",
51
+ "train_micro_batch_size_per_gpu": "auto",
52
+ "wall_clock_breakdown": false
53
+ }
baselines/MemoChat/code/configs/ds_config_33b.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "fp16": {
3
+ "enabled": "auto",
4
+ "loss_scale": 0,
5
+ "loss_scale_window": 1000,
6
+ "initial_scale_power": 16,
7
+ "hysteresis": 2,
8
+ "min_loss_scale": 1
9
+ },
10
+ "bf16": {
11
+ "enabled": "auto"
12
+ },
13
+ "optimizer": {
14
+ "type": "AdamW",
15
+ "params": {
16
+ "lr": "auto",
17
+ "betas": "auto",
18
+ "eps": "auto",
19
+ "weight_decay": "auto"
20
+ }
21
+ },
22
+ "scheduler": {
23
+ "type": "WarmupDecayLR",
24
+ "params": {
25
+ "total_num_steps" : "auto",
26
+ "warmup_min_lr": "auto",
27
+ "warmup_max_lr": "auto",
28
+ "warmup_num_steps": "auto"
29
+ }
30
+ },
31
+ "zero_optimization": {
32
+ "stage": 3,
33
+ "offload_optimizer": {
34
+ "device": "cpu",
35
+ "pin_memory": true
36
+ },
37
+ "offload_param": {
38
+ "device": "cpu",
39
+ "pin_memory": true
40
+ },
41
+ "overlap_comm": true,
42
+ "contiguous_gradients": true,
43
+ "sub_group_size": 1e9,
44
+ "reduce_bucket_size": "auto",
45
+ "stage3_prefetch_bucket_size": "auto",
46
+ "stage3_param_persistence_threshold": "auto",
47
+ "stage3_max_live_parameters": 1e9,
48
+ "stage3_max_reuse_distance": 1e9,
49
+ "stage3_gather_16bit_weights_on_model_save": true
50
+ },
51
+ "gradient_accumulation_steps": "auto",
52
+ "gradient_clipping": "auto",
53
+ "steps_per_print": 2000,
54
+ "train_batch_size": "auto",
55
+ "train_micro_batch_size_per_gpu": "auto",
56
+ "wall_clock_breakdown": false
57
+ }
baselines/MemoChat/code/configs/ds_config_3b.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "fp16": {
3
+ "enabled": "auto",
4
+ "loss_scale": 0,
5
+ "loss_scale_window": 1000,
6
+ "initial_scale_power": 16,
7
+ "hysteresis": 2,
8
+ "min_loss_scale": 1
9
+ },
10
+ "bf16": {
11
+ "enabled": "auto"
12
+ },
13
+ "optimizer": {
14
+ "type": "AdamW",
15
+ "params": {
16
+ "lr": "auto",
17
+ "betas": "auto",
18
+ "eps": "auto",
19
+ "weight_decay": "auto"
20
+ }
21
+ },
22
+ "scheduler": {
23
+ "type": "WarmupLR",
24
+ "params": {
25
+ "warmup_min_lr": "auto",
26
+ "warmup_max_lr": "auto",
27
+ "warmup_num_steps": "auto"
28
+ }
29
+ },
30
+ "zero_optimization": {
31
+ "stage": 1
32
+ },
33
+ "gradient_accumulation_steps": "auto",
34
+ "gradient_clipping": "auto",
35
+ "steps_per_print": 2000,
36
+ "train_batch_size": "auto",
37
+ "train_micro_batch_size_per_gpu": "auto",
38
+ "wall_clock_breakdown": false
39
+ }
baselines/MemoChat/code/configs/ds_config_7b.json ADDED
@@ -0,0 +1,49 @@
1
+ {
2
+ "fp16": {
3
+ "enabled": "auto",
4
+ "loss_scale": 0,
5
+ "loss_scale_window": 1000,
6
+ "initial_scale_power": 16,
7
+ "hysteresis": 2,
8
+ "min_loss_scale": 1
9
+ },
10
+ "bf16": {
11
+ "enabled": "auto"
12
+ },
13
+ "optimizer": {
14
+ "type": "AdamW",
15
+ "params": {
16
+ "lr": "auto",
17
+ "betas": "auto",
18
+ "eps": "auto",
19
+ "weight_decay": "auto"
20
+ }
21
+ },
22
+ "scheduler": {
23
+ "type": "WarmupLR",
24
+ "params": {
25
+ "warmup_min_lr": "auto",
26
+ "warmup_max_lr": "auto",
27
+ "warmup_num_steps": "auto"
28
+ }
29
+ },
30
+ "zero_optimization": {
31
+ "stage": 2,
32
+ "offload_optimizer": {
33
+ "device": "cpu",
34
+ "pin_memory": true
35
+ },
36
+ "allgather_partitions": true,
37
+ "allgather_bucket_size": 2e8,
38
+ "overlap_comm": true,
39
+ "reduce_scatter": true,
40
+ "reduce_bucket_size": 2e8,
41
+ "contiguous_gradients": true
42
+ },
43
+ "gradient_accumulation_steps": "auto",
44
+ "gradient_clipping": "auto",
45
+ "steps_per_print": 2000,
46
+ "train_batch_size": "auto",
47
+ "train_micro_batch_size_per_gpu": "auto",
48
+ "wall_clock_breakdown": false
49
+ }
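
Across the four configs, the memory-saving features scale with model size: the 3B config stays at ZeRO stage 1, the 7B config moves to stage 2 with CPU optimizer offload, the 13B config uses stage 3 with optimizer offload, and the 33B config uses stage 3 with both optimizer and parameter offload. A small sketch that summarizes what a given config enables; the two inline configs copy only the relevant fields from the 3B and 33B files:

```python
# Summarize the ZeRO features a DeepSpeed config enables. The inline configs
# copy only the relevant fields from ds_config_3b.json / ds_config_33b.json.
def zero_summary(cfg: dict) -> str:
    zo = cfg.get("zero_optimization", {})
    parts = [f"stage {zo.get('stage', 0)}"]
    if "offload_optimizer" in zo:
        parts.append(f"optimizer offload -> {zo['offload_optimizer']['device']}")
    if "offload_param" in zo:
        parts.append(f"param offload -> {zo['offload_param']['device']}")
    return ", ".join(parts)

cfg_3b = {"zero_optimization": {"stage": 1}}
cfg_33b = {"zero_optimization": {"stage": 3,
                                 "offload_optimizer": {"device": "cpu"},
                                 "offload_param": {"device": "cpu"}}}
print(zero_summary(cfg_3b))   # stage 1
print(zero_summary(cfg_33b))  # stage 3, optimizer offload -> cpu, param offload -> cpu
```
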
baselines/MemoChat/code/scripts/llm_judge.sh ADDED
@@ -0,0 +1,35 @@
1
+ export GLOO_SOCKET_IFNAME=eth0
2
+ export WANDB_MODE=disabled
3
+
4
+ maindir=$1
5
+ datadir=${maindir}data
6
+ codedir=${maindir}code
7
+
8
+ settings=("10k")
9
+ models=("t5-3b" "vicuna-7b" "vicuna-13b" "vicuna-33b")
10
+
11
+ for model in "${models[@]}"
12
+ do
13
+ for setting in "${settings[@]}"
14
+ do
15
+ python3 ${codedir}/codes/api/llm_judge.py \
16
+ ${datadir}/mtbenchplus/mtbenchplus_testing/mtbenchplus_testing_${model}_${setting}.json \
17
+ gpt-4 \
18
+ YourOpenAIKey \
19
+ ${datadir}/llm_judge/llm_judge_gpt-4_${model}_${setting}.json \
20
+ ${datadir}/prompts.json
21
+ done
22
+ done
23
+
24
+ gpt_settings=("2k" "memochat")
25
+
26
+ for gpt_setting in "${gpt_settings[@]}"
27
+ do
28
+ python3 ${codedir}/codes/api/llm_judge.py \
29
+ ${datadir}/mtbenchplus/mtbenchplus_testing/mtbenchplus_testing_gpt-3.5-turbo-${gpt_setting}.json \
30
+ gpt-4 \
31
+ YourOpenAIKey \
32
+ ${datadir}/llm_judge/llm_judge_gpt-4_gpt-3.5-turbo-${gpt_setting}.json \
33
+ ${datadir}/prompts.json
34
+ done
35
+
baselines/MemoChat/code/scripts/memochat.sh ADDED
@@ -0,0 +1,34 @@
1
+ export GLOO_SOCKET_IFNAME=eth0
2
+ export WANDB_MODE=disabled
3
+
4
+ maindir=$1
5
+ datadir=${maindir}data
6
+ codedir=${maindir}code
7
+
8
+ test_data=${datadir}/mtbenchplus/mtbenchplus.json
9
+
10
+ settings=("1k", "10k")
11
+ models=("t5-3b" "vicuna-7b" "vicuna-13b" "vicuna-33b")
12
+
13
+ for model in "${models[@]}"
14
+ do
15
+ for setting in "${settings[@]}"
16
+ do
17
+ finetuned_model_path=${maindir}model/${model}_${setting}/
18
+ case ${model} in
19
+ "vicuna-33b")
20
+ RAYGPUS=2
21
+ ;;
22
+ "t5-3b"|"vicuna-7b"|"vicuna-13b")
23
+ RAYGPUS=1
24
+ ;;
25
+ esac
26
+ python3 ${codedir}/codes/eval/get_model_infer_memochat.py \
27
+ --model-path ${finetuned_model_path} \
28
+ --question-file ${test_data} \
29
+ --answer-file ${datadir}/mtbenchplus/mtbenchplus_testing/mtbenchplus_testing_${model}_${setting}.json \
30
+ --num-gpus $GPU_NUM_PER_NODE \
31
+ --ray-num-gpus ${RAYGPUS} \
32
+ --prompt-path ${datadir}/prompts.json
33
+ done
34
+ done
baselines/MemoChat/code/scripts/memochat_gpt.sh ADDED
@@ -0,0 +1,18 @@
1
+ export GLOO_SOCKET_IFNAME=eth0
2
+ export WANDB_MODE=disabled
3
+
4
+ maindir=$1
5
+ datadir=${maindir}data
6
+ codedir=${maindir}code
7
+
8
+ gpt_settings=("2k" "memochat")
9
+
10
+ for gpt_setting in "${gpt_settings[@]}"
11
+ do
12
+ python3 ${codedir}/codes/api/gpt_${gpt_setting}.py \
13
+ ${datadir}/mtbenchplus/mtbenchplus.json \
14
+ gpt-3.5-turbo \
15
+ YourOpenAIKey \
16
+ ${datadir}/mtbenchplus/mtbenchplus_testing/mtbenchplus_testing_gpt-3.5-turbo-${gpt_setting}.json \
17
+ ${datadir}/prompts.json
18
+ done
baselines/MemoChat/code/scripts/tuning.sh ADDED
@@ -0,0 +1,110 @@
1
+ export GLOO_SOCKET_IFNAME=eth0
2
+ export WANDB_MODE=disabled
3
+
4
+ maindir=$1
5
+ datadir=${maindir}data
6
+ codedir=${maindir}code
7
+
8
+ MAXLEN=2048
9
+ EPOCH=3
10
+ test_data=${datadir}/memochat_instructions/test.jsonl
11
+
12
+ settings=("1k" "10k")
13
+ models=("t5-3b" "vicuna-7b" "vicuna-13b" "vicuna-33b")
14
+
15
+ for model in "${models[@]}"
16
+ do
17
+
18
+ raw_model_path=${maindir}model/fastchat-${model}/
19
+ case ${model} in
20
+ "vicuna-33b")
21
+ RAYGPUS=2
22
+ ;;
23
+ "t5-3b"|"vicuna-7b"|"vicuna-13b")
24
+ RAYGPUS=1
25
+ ;;
26
+ esac
27
+
28
+ # zeroshot inference on one node
29
+ python3 ${codedir}/codes/eval/get_model_infer_simple.py \
30
+ --model-id ${model}_zeroshot \
31
+ --model-path ${raw_model_path} \
32
+ --question-file ${test_data} \
33
+ --answer-file ${datadir}/instruction_testing/instruction_testing_${model}_zeroshot.jsonl \
34
+ --num-gpus $GPU_NUM_PER_NODE \
35
+ --ray-num-gpus ${RAYGPUS}
36
+
37
+ # tuning
38
+ for setting in "${settings[@]}"
39
+ do
40
+ data_path=${datadir}/memochat_instructions/train_${setting}.json
41
+ preprocessed_data_dir=${datadir}/memochat_instructions/processed_${setting}_${model%-*}.pt
42
+ model_output_path=${maindir}model/${model}_${setting}/
43
+ deepspeed_config_path=${codedir}/configs/ds_config_${model#*-}.json
44
+
45
+ case ${model} in
46
+ "t5-3b")
47
+ PER_GPU_BATCH=8
48
+ GRA_ACC=2
49
+ ;;
50
+ "vicuna-7b")
51
+ PER_GPU_BATCH=16
52
+ GRA_ACC=1
53
+ ;;
54
+ "vicuna-13b")
55
+ PER_GPU_BATCH=8
56
+ GRA_ACC=2
57
+ ;;
58
+ "vicuna-33b")
59
+ PER_GPU_BATCH=4
60
+ GRA_ACC=4
61
+ ;;
62
+ esac
63
+
64
+ # train data preprocess
65
+ python3 ${codedir}/codes/train/data_preprocess.py \
66
+ --model_name_or_path ${raw_model_path} \
67
+ --data_path ${data_path} \
68
+ --preprocessing_num_workers=1 \
69
+ --model_max_length ${MAXLEN} \
70
+ --preprocessed_path ${preprocessed_data_dir}
71
+
72
+ # training: avaliable for multi nodes
73
+ torchrun --nnodes=$NODE_NUM \
74
+ --node_rank=$INDEX \
75
+ --nproc_per_node $GPU_NUM_PER_NODE \
76
+ --master_addr $MASTER_ADDR \
77
+ --master_port $MASTER_PORT \
78
+ ${codedir}/codes/train/train.py \
79
+ --model_name_or_path ${raw_model_path} \
80
+ --bf16 True \
81
+ --output_dir ${model_output_path} \
82
+ --num_train_epochs ${EPOCH} \
83
+ --per_device_train_batch_size ${PER_GPU_BATCH} \
84
+ --gradient_accumulation_steps ${GRA_ACC} \
85
+ --save_strategy "steps" \
86
+ --save_steps 1500 \
87
+ --save_total_limit 1 \
88
+ --learning_rate 2e-5 \
89
+ --log_level "info" \
90
+ --logging_strategy "steps" \
91
+ --logging_steps 1 \
92
+ --weight_decay 0. \
93
+ --warmup_ratio 0.04 \
94
+ --lr_scheduler_type "cosine" \
95
+ --deepspeed ${deepspeed_config_path} \
96
+ --tf32 True \
97
+ --model_max_length ${MAXLEN} \
98
+ --preprocessed_path ${preprocessed_data_dir} \
99
+ --gradient_checkpointing True
100
+
101
+ # tuning inference
102
+ python3 ${codedir}/codes/eval/get_model_infer_simple.py \
103
+ --model-id ${model}_${setting} \
104
+ --model-path ${model_output_path} \
105
+ --question-file ${test_data} \
106
+ --answer-file ${datadir}/instruction_testing/instruction_testing_${model}_${setting}.jsonl \
107
+ --num-gpus $GPU_NUM_PER_NODE \
108
+ --ray-num-gpus ${RAYGPUS}
109
+ done
110
+ done
baselines/MemoChat/core_requirement.txt ADDED
@@ -0,0 +1,13 @@
1
+ accelerate==0.19.0
2
+ datasets==2.10.1
3
+ deepspeed==0.9.4
4
+ evaluate==0.4.0
5
+ Faker==18.11.2
6
+ openai==0.27.2
7
+ optimum==1.9.1
8
+ ray==2.5.1
9
+ tiktoken==0.4.0
10
+ tokenizers==0.13.2
11
+ torch==2.0.1
12
+ torchtext==0.15.2
13
+ transformers==4.29.2
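
Several of these pins interact (optimum/transformers for BetterTransformer, deepspeed/torch for ZeRO), so checking that the active environment matches them before a long run can save time. A hedged sketch that only reads installed package metadata and assumes nothing beyond the `name==version` format of the file above:

```python
# Compare installed package versions against a requirements file with "==" pins.
from importlib.metadata import version, PackageNotFoundError

def check_pins(path="core_requirement.txt"):
    for line in open(path):
        line = line.strip()
        if not line or "==" not in line:
            continue
        name, pinned = line.split("==")
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        flag = "OK" if installed == pinned else "MISMATCH"
        print(f"{name:15s} pinned {pinned:10s} installed {installed:12s} {flag}")

check_pins()
```
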
baselines/MemoChat/run_memochat_baseline.py ADDED
@@ -0,0 +1,634 @@
1
+ """
2
+ MemoChat baseline for the EvolV-Mem benchmark.
3
+
4
+ Adapts MemoChat's three-stage pipeline (memo writing, retrieval, chatting)
5
+ to the EvolV-Mem benchmark using Qwen-30B via vLLM.
6
+
7
+ Pipeline per question:
8
+ 1. Memo writing: extract {topic, summary} from each haystack session (cached)
9
+ 2. Embedding pre-filter: SBert selects top-50 memos by similarity
10
+ 3. MemoChat retrieval: LLM selects final relevant topics from top-50
11
+ 4. Answer generation: LLM generates answer from retrieved memos
12
+
13
+ Usage:
14
+ python baselines/MemoChat/run_memochat_baseline.py \
15
+ --in_file dataset/evolv_mem_v4.json \
16
+ --out_file output/memochat_qwen30b_v4.jsonl \
17
+ --sessions_file dataset/all_sessions.json \
18
+ --profile_file metadata/generated_user_profile.json
19
+
20
+ Env vars:
21
+ VLLM_BASE_URL (default http://localhost:8000/v1)
22
+ VLLM_API_KEY (default EMPTY)
23
+ """
24
+
25
+ import argparse
26
+ import json
27
+ import logging
28
+ import os
29
+ import re
30
+ import sys
31
+ import time
32
+ from collections import defaultdict
33
+ from typing import Dict, List, Optional, Tuple
34
+
35
+ import numpy as np
36
+ from tqdm import tqdm
37
+
38
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
39
+
40
+ # ---------------------------------------------------------------------------
41
+ # Load MemoChat prompts
42
+ # ---------------------------------------------------------------------------
43
+ SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
44
+ PROMPTS_PATH = os.path.join(SCRIPT_DIR, "data", "prompts.json")
45
+
46
+
47
+ # ---------------------------------------------------------------------------
48
+ # vLLM LLM helper
49
+ # ---------------------------------------------------------------------------
50
+
51
+ def get_llm_client():
52
+ from openai import OpenAI
53
+ return OpenAI(
54
+ base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
55
+ api_key=os.getenv("VLLM_API_KEY", "EMPTY"),
56
+ )
57
+
58
+
59
+ MODEL_NAME = os.getenv("VLLM_MODEL_NAME", "Qwen/Qwen3-30B-A3B-Instruct-2507")
60
+
61
+
62
+ def llm_call(client, prompt: str, max_tokens: int = 4096, temperature: float = 0.2) -> str:
63
+ """Call the vLLM server with retry logic."""
64
+ for attempt in range(6):
65
+ try:
66
+ response = client.chat.completions.create(
67
+ model=MODEL_NAME,
68
+ messages=[{"role": "user", "content": prompt}],
69
+ max_tokens=max_tokens,
70
+ temperature=temperature,
71
+ )
72
+ content = response.choices[0].message.content if response.choices else None
73
+ if content is None:
74
+ wait = min(2 ** attempt * 2, 30)
75
+ print(f"[WARN] LLM returned None content (attempt {attempt+1}); retrying in {wait}s")
76
+ time.sleep(wait)
77
+ continue
78
+ return content.strip()
79
+ except Exception as e:
80
+ msg = str(e).lower()
81
+ if any(code in msg for code in ("429", "500", "503", "rate limit")):
82
+ wait = min(2 ** attempt * 5, 60)
83
+ print(f"[WARN] LLM retry {attempt+1}/6, sleeping {wait}s: {e}")
84
+ time.sleep(wait)
85
+ continue
86
+ print(f"[ERROR] LLM call failed: {e}")
87
+ raise
88
+ raise RuntimeError("LLM call failed after 6 retries")
89
+
90
+
91
+ # ---------------------------------------------------------------------------
92
+ # MemoChat output parsers (ported from get_model_infer_memochat.py)
93
+ # ---------------------------------------------------------------------------
94
+
95
+ def normalize_model_outputs(model_text: str) -> List[Dict]:
96
+ """Parse memo writing output into structured topic-summary dicts."""
97
+ extracted_elements = [
98
+ re.sub(r'\s+', ' ', mt.replace('"', '').replace("'", ""))
99
+ for mt in re.findall(r"'[^']*'|\"[^\"]*\"|\d+", model_text)
100
+ ]
101
+ model_outputs = []
102
+ ti = 0
103
+ while ti + 7 < len(extracted_elements):
104
+ if (extracted_elements[ti] == "topic"
105
+ and extracted_elements[ti + 2] == "summary"
106
+ and extracted_elements[ti + 4] == "start"
107
+ and extracted_elements[ti + 6] == "end"):
108
+ try:
109
+ model_outputs.append({
110
+ "topic": extracted_elements[ti + 1],
111
+ "summary": extracted_elements[ti + 3],
112
+ "start": int(extracted_elements[ti + 5]),
113
+ "end": int(extracted_elements[ti + 7]),
114
+ })
115
+ except (ValueError, IndexError):
116
+ pass
117
+ ti += 1
118
+ return model_outputs
119
+
120
+
121
+ def normalize_chatting_outputs(model_outputs: str) -> str:
122
+ """Clean up chatting response whitespace."""
123
+ lines = model_outputs.split("\n")
124
+ result = [' '.join(line.split()) for line in lines]
125
+ return '\n'.join(result)
126
+
127
+
128
+ # ---------------------------------------------------------------------------
129
+ # Stage 1: Memo Writing (per session, cached)
130
+ # ---------------------------------------------------------------------------
131
+
132
+ def format_session_for_writing(session_turns: List[Dict]) -> str:
133
+ """Format a session's turns as numbered lines for MemoChat's writing prompt."""
134
+ lines = []
135
+ for i, turn in enumerate(session_turns):
136
+ role = turn.get("role", "user")
137
+ content = turn.get("content", "").replace("\n", " ")
138
+ lines.append(f"(line {i + 1}) {role}: {content}")
139
+ return "\n".join(lines)
140
+
141
+
142
+ def write_memos_for_session(
143
+ session_id: str,
144
+ session_turns: List[Dict],
145
+ client,
146
+ prompts: Dict,
147
+ memo_cache_dir: str,
148
+ ) -> List[Dict]:
149
+ """Extract topic-summary memos from a single session using MemoChat's writing prompt.
150
+
151
+ Returns list of {topic, summary, start, end, session_id}.
152
+ Results are cached to disk.
153
+ """
154
+ cache_path = os.path.join(memo_cache_dir, f"{session_id}.json")
155
+ if os.path.exists(cache_path):
156
+ with open(cache_path) as f:
157
+ return json.load(f)
158
+
159
+ if not session_turns:
160
+ memos = []
161
+ with open(cache_path, "w") as f:
162
+ json.dump(memos, f)
163
+ return memos
164
+
165
+ # Build MemoChat writing prompt
166
+ num_lines = len(session_turns)
167
+ system_instruction = prompts["writing_dialogsum"]["system"]
168
+ task_instruction = prompts["writing_dialogsum"]["instruction"]
169
+
170
+ history_log = "\n\n```\nTask Conversation:\n" + format_session_for_writing(session_turns)
171
+ prompt = (
172
+ system_instruction.replace("LINE", str(num_lines))
173
+ + history_log
174
+ + "\n```"
175
+ + task_instruction.replace("LINE", str(num_lines))
176
+ )
177
+
178
+ # Call LLM
179
+ output = llm_call(client, prompt, max_tokens=512, temperature=0.2)
180
+ memos_raw = normalize_model_outputs(output)
181
+
182
+ # Attach session_id to each memo
183
+ memos = []
184
+ for m in memos_raw:
185
+ memos.append({
186
+ "topic": m["topic"],
187
+ "summary": m["summary"],
188
+ "start": m["start"],
189
+ "end": m["end"],
190
+ "session_id": session_id,
191
+ })
192
+
193
+ # If no memos extracted, create a fallback from the session content
194
+ if not memos:
195
+ # Use first and last turn as a basic summary
196
+ first_content = session_turns[0].get("content", "")[:200] if session_turns else ""
197
+ memos.append({
198
+ "topic": f"session_{session_id}",
199
+ "summary": f"Conversation about: {first_content}...",
200
+ "start": 1,
201
+ "end": num_lines,
202
+ "session_id": session_id,
203
+ })
204
+
205
+ # Cache
206
+ os.makedirs(os.path.dirname(cache_path), exist_ok=True)
207
+ with open(cache_path, "w") as f:
208
+ json.dump(memos, f)
209
+
210
+ return memos
211
+
212
+
213
+ # ---------------------------------------------------------------------------
214
+ # Stage 1 (fast): Build memos from pre-computed session summaries
215
+ # ---------------------------------------------------------------------------
216
+
217
+ def build_memos_from_summaries(
218
+ haystack_session_ids: List[str],
219
+ haystack_dates: List[str],
220
+ summaries: Dict,
221
+ ) -> List[Dict]:
222
+ """Build memo entries directly from all_session_summary.json — no LLM calls."""
223
+ memos = []
224
+ for sid, date_str in zip(haystack_session_ids, haystack_dates):
225
+ summary_data = summaries.get(sid)
226
+ if summary_data is None:
227
+ continue
228
+ text = summary_data.get("session_summary", "")
229
+ if not text:
230
+ turn_sums = summary_data.get("turn_summaries", [])
231
+ if turn_sums:
232
+ text = " ".join(turn_sums)
233
+ else:
234
+ continue
235
+ # Use first ~60 chars as topic, full text as summary
236
+ topic = text[:60].rstrip(". ") if len(text) > 60 else text
237
+ memos.append({
238
+ "topic": topic,
239
+ "summary": text,
240
+ "session_id": sid,
241
+ "session_date": date_str,
242
+ })
243
+ return memos
244
+
245
+
246
+ # ---------------------------------------------------------------------------
247
+ # Stage 2: Embedding pre-filter
248
+ # ---------------------------------------------------------------------------
249
+
250
+ def embed_and_filter(
251
+ question: str,
252
+ all_memos: List[Dict],
253
+ embedding_model,
254
+ top_k: int = 50,
255
+ ) -> List[Dict]:
256
+ """Use SBert to select top-k most relevant memos by cosine similarity."""
257
+ if len(all_memos) <= top_k:
258
+ return all_memos
259
+
260
+ # Build texts to embed
261
+ memo_texts = [f"{m['topic']}. {m['summary']}" for m in all_memos]
262
+
263
+ # Encode
264
+ question_emb = embedding_model.encode(question)
265
+ memo_embs = embedding_model.encode(memo_texts)
266
+
267
+ # Cosine similarity
268
+ question_norm = question_emb / (np.linalg.norm(question_emb) + 1e-10)
269
+ memo_norms = memo_embs / (np.linalg.norm(memo_embs, axis=1, keepdims=True) + 1e-10)
270
+ similarities = memo_norms @ question_norm
271
+
272
+ # Top-k indices
273
+ top_indices = np.argsort(similarities)[::-1][:top_k]
274
+ return [all_memos[i] for i in top_indices]
275
+
276
+
277
+ # ---------------------------------------------------------------------------
278
+ # Stage 3: MemoChat LLM-based retrieval
279
+ # ---------------------------------------------------------------------------
280
+
281
+ def memochat_retrieve(
282
+ question: str,
283
+ candidate_memos: List[Dict],
284
+ client,
285
+ prompts: Dict,
286
+ ) -> List[Dict]:
287
+ """Apply MemoChat's retrieval prompt to select relevant memos from candidates."""
288
+ if not candidate_memos:
289
+ return []
290
+
291
+ # Build topic options list
292
+ topic_options = []
293
+ for i, m in enumerate(candidate_memos):
294
+ topic_options.append(f"({i + 1}) {m['topic']}. {m['summary']}")
295
+
296
+ # Add NOTO option
297
+ noto_idx = len(candidate_memos) + 1
298
+ topic_options.append(f"({noto_idx}) NOTO. None of the others.")
299
+
300
+ system_instruction = prompts["retrieval"]["system"]
301
+ task_instruction = prompts["retrieval"]["instruction"]
302
+
303
+ task_case = (
304
+ "```\nQuery Sentence:\n" + question
305
+ + "\nTopic Options:\n" + "\n".join(topic_options)
306
+ + "\n```"
307
+ )
308
+
309
+ prompt = (
310
+ system_instruction.replace("OPTION", str(len(candidate_memos) + 1))
311
+ + task_case
312
+ + task_instruction.replace("OPTION", str(len(candidate_memos) + 1))
313
+ )
314
+
315
+ output = llm_call(client, prompt, max_tokens=2048, temperature=0.2)
316
+
317
+ # Parse selected indices
318
+ selected_memos = []
319
+ for part in output.split("#"):
320
+ part = part.strip()
321
+ try:
322
+ idx = int(part) - 1
323
+ if 0 <= idx < len(candidate_memos):
324
+ selected_memos.append(candidate_memos[idx])
325
+ except ValueError:
326
+ continue
327
+
328
+ return selected_memos
329
+
330
+
331
+ # ---------------------------------------------------------------------------
332
+ # Stage 4: Answer generation with MemoChat chatting prompt
333
+ # ---------------------------------------------------------------------------
334
+
335
+ def generate_answer(
336
+ question: str,
337
+ question_date: str,
338
+ retrieved_memos: List[Dict],
339
+ user_profile: Optional[str],
340
+ client,
341
+ prompts: Dict,
342
+ ) -> str:
343
+ """Generate an answer using MemoChat's chatting prompt format."""
344
+ system_instruction = prompts["chatting"]["system"]
345
+
346
+ # Build "Related Evidences" section from retrieved memos
347
+ evidence_lines = []
348
+ for i, m in enumerate(retrieved_memos):
349
+ evidence_lines.append(
350
+ f"({i + 1}) {{'Related Topics': '{m['topic']}', "
351
+ f"'Related Summaries': '{m['summary']}'}}"
352
+ )
353
+ evidences_str = "\n".join(evidence_lines) if evidence_lines else "(No related evidences found)"
354
+
355
+ # Build the chatting prompt
356
+ profile_section = ""
357
+ if user_profile:
358
+ profile_section = f"\nUser Profile:\n{user_profile}\n"
359
+
360
+ task_case = (
361
+ f"```\nRelated Evidences:\n{evidences_str}"
362
+ f"\n\nRecent Dialogs:\n(no recent dialogs)"
363
+ f"\n```"
364
+ f"{profile_section}"
365
+ f"\n\nCurrent Date: {question_date}"
366
+ f"\n\nUser Input:\nuser: {question} ### bot: "
367
+ )
368
+
369
+ prompt = system_instruction + task_case
370
+
371
+ output = llm_call(client, prompt, max_tokens=8192, temperature=0.2)
372
+ return normalize_chatting_outputs(output)
373
+
374
+
375
+ # ---------------------------------------------------------------------------
376
+ # Main
377
+ # ---------------------------------------------------------------------------
378
+
379
+ # ---------------------------------------------------------------------------
380
+ # Retrieval metrics
381
+ # ---------------------------------------------------------------------------
382
+
383
+ def evaluate_retrieval(recalled_docs, correct_docs):
384
+ recall_any = float(any(doc in recalled_docs for doc in correct_docs))
385
+ recall_all = float(all(doc in recalled_docs for doc in correct_docs))
386
+ return recall_any, recall_all
387
+
388
+
389
+ def print_average_metrics(retrieval_metric_list):
390
+ metric_sums = defaultdict(float)
391
+ metric_counts = defaultdict(int)
392
+ for metric in retrieval_metric_list:
393
+ for k, v in metric.items():
394
+ metric_sums[k] += v
395
+ metric_counts[k] += 1
396
+ print(" Average retrieval metrics:")
397
+ for k in sorted(metric_sums):
398
+ avg = metric_sums[k] / metric_counts[k]
399
+ print(f" {k}: {avg:.4f}")
400
+
401
+
402
+ # ---------------------------------------------------------------------------
403
+ # Main
404
+ # ---------------------------------------------------------------------------
405
+
406
+ def main():
407
+ parser = argparse.ArgumentParser(description="MemoChat baseline for EvolV-Mem")
408
+ parser.add_argument("--in_file", type=str, required=True,
409
+ help="Path to evolv_mem_v4.json")
410
+ parser.add_argument("--out_file", type=str, required=True,
411
+ help="Output JSONL file")
412
+ parser.add_argument("--sessions_file", type=str, default=None,
413
+ help="Path to all_sessions.json (only needed with --use_llm_memos)")
414
+ parser.add_argument("--summary_file", type=str, default=None,
415
+ help="Path to all_session_summary.json (used by default for memo bank)")
416
+ parser.add_argument("--profile_file", type=str, default=None,
417
+ help="Path to generated_user_profile.json")
418
+ parser.add_argument("--memo_cache_dir", type=str,
419
+ default="baselines/MemoChat/memo_cache",
420
+ help="Directory to cache per-session memos")
421
+ parser.add_argument("--prompt_file", type=str, default=None,
422
+ help="Path to prompts.json (default: baselines/MemoChat/data/prompts.json)")
423
+ # Retrieval params
424
+ parser.add_argument("--embed_top_k", type=int, default=50,
425
+ help="Number of memos to keep after embedding pre-filter (default 50)")
426
+ parser.add_argument("--embedding_model", type=str,
427
+ default="sentence-transformers/multi-qa-mpnet-base-cos-v1",
428
+ help="SentenceTransformer model for embedding pre-filter")
429
+ # Limit (for debugging)
430
+ parser.add_argument("--limit", type=int, default=None,
431
+ help="Process only the first N questions")
432
+ parser.add_argument("--use_llm_memos", action="store_true", default=False,
433
+ help="Use LLM-based memo writing instead of cached session summaries")
434
+ args = parser.parse_args()
435
+
436
+ # -----------------------------------------------------------------------
437
+ # Load data
438
+ # -----------------------------------------------------------------------
439
+ print(f"Loading benchmark from {args.in_file} ...")
440
+ with open(args.in_file) as f:
441
+ benchmark = json.load(f)
442
+ if args.limit:
443
+ benchmark = benchmark[:args.limit]
444
+ print(f" {len(benchmark)} questions loaded.")
445
+
446
+ all_sessions = {}
447
+ if args.sessions_file and os.path.exists(args.sessions_file):
448
+ print(f"Loading sessions from {args.sessions_file} ...")
449
+ with open(args.sessions_file) as f:
450
+ all_sessions = json.load(f)
451
+ print(f" {len(all_sessions)} sessions loaded.")
452
+
453
+ summaries = {}
454
+ if args.summary_file and os.path.exists(args.summary_file):
455
+ print(f"Loading session summaries from {args.summary_file} ...")
456
+ with open(args.summary_file) as f:
457
+ summaries = json.load(f)
458
+ print(f" {len(summaries)} session summaries loaded.")
459
+
460
+ if not args.use_llm_memos and not summaries:
461
+ print("ERROR: --summary_file is required unless --use_llm_memos is set.")
462
+ sys.exit(1)
463
+ if args.use_llm_memos and not all_sessions:
464
+ print("ERROR: --sessions_file is required when --use_llm_memos is set.")
465
+ sys.exit(1)
466
+
467
+ profiles = {}
468
+ if args.profile_file and os.path.exists(args.profile_file):
469
+ print(f"Loading user profiles from {args.profile_file} ...")
470
+ with open(args.profile_file) as f:
471
+ profiles = json.load(f)
472
+ print(f" {len(profiles)} profiles loaded.")
473
+
474
+ prompt_file = args.prompt_file or PROMPTS_PATH
475
+ print(f"Loading prompts from {prompt_file} ...")
476
+ with open(prompt_file) as f:
477
+ prompts = json.load(f)
478
+
479
+ # -----------------------------------------------------------------------
480
+ # Resume support
481
+ # -----------------------------------------------------------------------
482
+ existing_qids = set()
483
+ if os.path.exists(args.out_file):
484
+ with open(args.out_file) as f:
485
+ for line in f:
486
+ line = line.strip()
487
+ if line:
488
+ obj = json.loads(line)
489
+ existing_qids.add(obj["question_id"])
490
+ print(f" Resuming: {len(existing_qids)} questions already processed.")
491
+
492
+ # -----------------------------------------------------------------------
493
+ # Initialize models
494
+ # -----------------------------------------------------------------------
495
+ print("Initializing embedding model ...")
496
+ from sentence_transformers import SentenceTransformer
497
+ embedding_model = SentenceTransformer(args.embedding_model)
498
+
499
+ print("Initializing vLLM client ...")
500
+ client = get_llm_client()
501
+
502
+ os.makedirs(args.memo_cache_dir, exist_ok=True)
503
+
504
+ # -----------------------------------------------------------------------
505
+ # Process questions
506
+ # -----------------------------------------------------------------------
507
+ retrieval_metric_list = []
508
+ out_f = open(args.out_file, "a")
509
+
510
+ for di, entry in enumerate(tqdm(benchmark, desc="MemoChat baseline")):
511
+ qid = entry["question_id"]
512
+ question = entry["question"]
513
+ question_date = entry["question_date"]
514
+
515
+ if qid in existing_qids:
516
+ continue
517
+
518
+ try:
519
+ haystack_session_ids = entry["haystack_session_ids"]
520
+
521
+ # ------ Stage 1: Build memo bank ------
522
+ if args.use_llm_memos:
523
+ # Slow path: LLM-based memo writing (cached per session)
524
+ all_memos = []
525
+ n_cached = 0
526
+ n_written = 0
527
+ date_lookup = dict(zip(
528
+ entry["haystack_session_ids"], entry["haystack_dates"]
529
+ ))
530
+ for sid in haystack_session_ids:
531
+ session_turns = all_sessions.get(sid, [])
532
+ cache_exists = os.path.exists(
533
+ os.path.join(args.memo_cache_dir, f"{sid}.json")
534
+ )
535
+ memos = write_memos_for_session(
536
+ sid, session_turns, client, prompts, args.memo_cache_dir
537
+ )
538
+ for m in memos:
539
+ m["session_date"] = date_lookup.get(sid, "")
540
+ all_memos.extend(memos)
541
+ if cache_exists:
542
+ n_cached += 1
543
+ else:
544
+ n_written += 1
545
+ print(f" [{di}] qid={qid}: {len(all_memos)} memos "
546
+ f"({n_cached} cached, {n_written} new)")
547
+ else:
548
+ # Fast path: use pre-computed session summaries as memos
549
+ all_memos = build_memos_from_summaries(
550
+ haystack_session_ids, entry["haystack_dates"], summaries
551
+ )
552
+ print(f" [{di}] qid={qid}: {len(all_memos)} memos from summaries")
553
+
554
+ if not all_memos:
555
+ result = {
556
+ "q_idx": di,
557
+ "question_id": qid,
558
+ "hypothesis": "Insufficient information to answer.",
559
+ "n_memos": 0,
560
+ }
561
+ print(json.dumps(result), file=out_f, flush=True)
562
+ continue
563
+
564
+ # ------ Stage 2: Embedding pre-filter ------
565
+ filtered_memos = embed_and_filter(
566
+ question, all_memos, embedding_model, top_k=args.embed_top_k
567
+ )
568
+ print(f" [{di}] Embedding filter: {len(all_memos)} -> {len(filtered_memos)} memos")
569
+
570
+ # ------ Stage 3: MemoChat LLM retrieval ------
571
+ retrieved_memos = memochat_retrieve(
572
+ question, filtered_memos, client, prompts
573
+ )
574
+ print(f" [{di}] MemoChat retrieval: {len(filtered_memos)} -> {len(retrieved_memos)} memos")
575
+
576
+ # Fallback: if retrieval selected nothing, use top-5 from embedding filter
577
+ if not retrieved_memos:
578
+ retrieved_memos = filtered_memos[:5]
579
+ print(f" [{di}] Fallback: using top-5 from embedding filter")
580
+
581
+ # ------ Stage 4: Answer generation ------
582
+ user_id = qid.split("_q_")[0] if "_q_" in qid else qid
583
+ user_profile = profiles.get(user_id, None)
584
+
585
+ answer = generate_answer(
586
+ question, question_date, retrieved_memos,
587
+ user_profile, client, prompts
588
+ )
589
+
590
+ # ------ Output ------
591
+ retrieved_session_ids = list(dict.fromkeys(
592
+ m["session_id"] for m in retrieved_memos if "session_id" in m
593
+ ))
594
+
595
+ # Compute retrieval metrics
596
+ answer_session_ids = entry.get("answer_session_ids", [])
597
+ retrieval_metric = {}
598
+ if answer_session_ids and retrieved_session_ids:
599
+ for topk in [5, 10, 20, 30]:
600
+ r_any, r_all = evaluate_retrieval(
601
+ retrieved_session_ids[:topk], answer_session_ids
602
+ )
603
+ retrieval_metric[f"recall_any@{topk}"] = r_any
604
+ retrieval_metric[f"recall_all@{topk}"] = r_all
605
+ retrieval_metric_list.append(retrieval_metric)
606
+ print_average_metrics(retrieval_metric_list)
607
+
608
+ result = {
609
+ "q_idx": di,
610
+ "question_id": qid,
611
+ "hypothesis": answer,
612
+ "n_memos_total": len(all_memos),
613
+ "n_memos_filtered": len(filtered_memos),
614
+ "n_memos_retrieved": len(retrieved_memos),
615
+ "retrieved_session_ids": retrieved_session_ids,
616
+ "retrieval_metric": retrieval_metric,
617
+ }
618
+ print(json.dumps(result), file=out_f, flush=True)
619
+
620
+ print(f" [{di}] Q: {question[:100]}...")
621
+ print(f" [{di}] A: {answer[:200]}...")
622
+
623
+ except Exception as e:
624
+ print(f"[ERROR] q_idx={di} qid={qid} failed: {e}", flush=True)
625
+ import traceback
626
+ traceback.print_exc()
627
+ continue
628
+
629
+ out_f.close()
630
+ print(f"\nDone. Results saved to {args.out_file}")
631
+
632
+
633
+ if __name__ == "__main__":
634
+ main()
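
Two small behaviors of the pipeline above are easy to sanity-check in isolation: `memochat_retrieve` expects the LLM to return `#`-separated option numbers (out-of-range numbers such as the NOTO choice and any stray text are dropped), and `evaluate_retrieval` scores a question by whether any / all gold session ids appear among the retrieved ones. A toy check of both, with invented memos and session ids:

```python
# Toy check of the '#'-separated index parsing and the recall metrics above.
# Memos and session ids are invented.
def parse_selection(output: str, candidates: list) -> list:
    selected = []
    for part in output.split("#"):
        try:
            idx = int(part.strip()) - 1
            if 0 <= idx < len(candidates):
                selected.append(candidates[idx])
        except ValueError:
            continue  # stray text is skipped; a numeric NOTO index fails the range check
    return selected

memos = [{"session_id": "s1"}, {"session_id": "s2"}, {"session_id": "s3"}]
picked = parse_selection("1#3#4", memos)   # "4" is the NOTO option for 3 candidates
print([m["session_id"] for m in picked])   # ['s1', 's3']

recalled = [m["session_id"] for m in picked]
gold = ["s1", "s2"]
recall_any = float(any(s in recalled for s in gold))
recall_all = float(all(s in recalled for s in gold))
print(recall_any, recall_all)              # 1.0 0.0
```
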
baselines/raptor/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License
2
+
3
+ Copyright (c) Parth Sarthi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
baselines/raptor/README.md ADDED
@@ -0,0 +1,204 @@
1
+ <!-- <p align="center">
2
+ <img align="center" src="raptor.jpg" width="1000px" />
3
+ </p>
4
+ <p align="left"> -->
5
+
6
+ <!-- <picture>
7
+ <source media="(prefers-color-scheme: dark)" srcset="raptor.jpg" width="1000px">
8
+ <source media="(prefers-color-scheme: light)" srcset="raptor_dark.png" width="1000px">
9
+
10
+ </picture> -->
11
+
12
+ <picture>
13
+ <source media="(prefers-color-scheme: dark)" srcset="raptor_dark.png">
14
+ <img alt="Shows an illustrated sun in light color mode and a moon with stars in dark color mode." src="raptor.jpg">
15
+ </picture>
16
+
17
+ ## RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
18
+
19
+ **RAPTOR** introduces a novel approach to retrieval-augmented language models by constructing a recursive tree structure from documents. This allows for more efficient and context-aware information retrieval across large texts, addressing common limitations in traditional language models.
20
+
21
+
22
+
23
+ For detailed methodologies and implementations, refer to the original paper:
24
+
25
+ - [RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval](https://arxiv.org/abs/2401.18059)
26
+
27
+ [![Paper page](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm.svg)](https://huggingface.co/papers/2401.18059)
28
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/raptor-recursive-abstractive-processing-for/question-answering-on-quality)](https://paperswithcode.com/sota/question-answering-on-quality?p=raptor-recursive-abstractive-processing-for)
29
+
30
+ ## Installation
31
+
32
+ Before using RAPTOR, ensure Python 3.8+ is installed. Clone the RAPTOR repository and install necessary dependencies:
33
+
34
+ ```bash
35
+ git clone https://github.com/parthsarthi03/raptor.git
36
+ cd raptor
37
+ pip install -r requirements.txt
38
+ ```
39
+
40
+ ## Basic Usage
41
+
42
+ To get started with RAPTOR, follow these steps:
43
+
44
+ ### Setting Up RAPTOR
45
+
46
+ First, set your OpenAI API key and initialize the RAPTOR configuration:
47
+
48
+ ```python
49
+ import os
50
+ os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
51
+
52
+ from raptor import RetrievalAugmentation
53
+
54
+ # Initialize with default configuration. For advanced configurations, check the documentation. [WIP]
55
+ RA = RetrievalAugmentation()
56
+ ```
57
+
58
+ ### Adding Documents to the Tree
59
+
60
+ Add your text documents to RAPTOR for indexing:
61
+
62
+ ```python
63
+ with open('sample.txt', 'r') as file:
64
+ text = file.read()
65
+ RA.add_documents(text)
66
+ ```
67
+
68
+ ### Answering Questions
69
+
70
+ You can now use RAPTOR to answer questions based on the indexed documents:
71
+
72
+ ```python
73
+ question = "How did Cinderella reach her happy ending?"
74
+ answer = RA.answer_question(question=question)
75
+ print("Answer: ", answer)
76
+ ```
77
+
78
+ ### Saving and Loading the Tree
79
+
80
+ Save the constructed tree to a specified path:
81
+
82
+ ```python
83
+ SAVE_PATH = "demo/cinderella"
84
+ RA.save(SAVE_PATH)
85
+ ```
86
+
87
+ Load the saved tree back into RAPTOR:
88
+
89
+ ```python
90
+ RA = RetrievalAugmentation(tree=SAVE_PATH)
91
+ answer = RA.answer_question(question=question)
92
+ ```
93
+
94
+
95
+ ### Extending RAPTOR with other Models
96
+
97
+ RAPTOR is designed to be flexible and allows you to integrate any models for summarization, question-answering (QA), and embedding generation. Here is how to extend RAPTOR with your own models:
98
+
99
+ #### Custom Summarization Model
100
+
101
+ If you wish to use a different language model for summarization, you can do so by extending the `BaseSummarizationModel` class. Implement the `summarize` method to integrate your custom summarization logic:
102
+
103
+ ```python
104
+ from raptor import BaseSummarizationModel
105
+
106
+ class CustomSummarizationModel(BaseSummarizationModel):
107
+ def __init__(self):
108
+ # Initialize your model here
109
+ pass
110
+
111
+ def summarize(self, context, max_tokens=150):
112
+ # Implement your summarization logic here
113
+ # Return the summary as a string
114
+ summary = "Your summary here"
115
+ return summary
116
+ ```
117
+
118
+ #### Custom QA Model
119
+
120
+ For custom QA models, extend the `BaseQAModel` class and implement the `answer_question` method. This method should return the best answer found by your model given a context and a question:
121
+
122
+ ```python
123
+ from raptor import BaseQAModel
124
+
125
+ class CustomQAModel(BaseQAModel):
126
+ def __init__(self):
127
+ # Initialize your model here
128
+ pass
129
+
130
+ def answer_question(self, context, question):
131
+ # Implement your QA logic here
132
+ # Return the answer as a string
133
+ answer = "Your answer here"
134
+ return answer
135
+ ```
136
+
137
+ #### Custom Embedding Model
138
+
139
+ To use a different embedding model, extend the `BaseEmbeddingModel` class. Implement the `create_embedding` method, which should return a vector representation of the input text:
140
+
141
+ ```python
142
+ from raptor import BaseEmbeddingModel
143
+
144
+ class CustomEmbeddingModel(BaseEmbeddingModel):
145
+ def __init__(self):
146
+ # Initialize your model here
147
+ pass
148
+
149
+ def create_embedding(self, text):
150
+ # Implement your embedding logic here
151
+ # Return the embedding as a numpy array or a list of floats
152
+ embedding = [0.0] * embedding_dim # Replace with actual embedding logic
153
+ return embedding
154
+ ```
155
+
156
+ #### Integrating Custom Models with RAPTOR
157
+
158
+ After implementing your custom models, integrate them with RAPTOR as follows:
159
+
160
+ ```python
161
+ from raptor import RetrievalAugmentation, RetrievalAugmentationConfig
162
+
163
+ # Initialize your custom models
164
+ custom_summarizer = CustomSummarizationModel()
165
+ custom_qa = CustomQAModel()
166
+ custom_embedding = CustomEmbeddingModel()
167
+
168
+ # Create a config with your custom models
169
+ custom_config = RetrievalAugmentationConfig(
170
+ summarization_model=custom_summarizer,
171
+ qa_model=custom_qa,
172
+ embedding_model=custom_embedding
173
+ )
174
+
175
+ # Initialize RAPTOR with your custom config
176
+ RA = RetrievalAugmentation(config=custom_config)
177
+ ```
178
+
179
+ Check out `demo.ipynb` for examples on how to specify your own summarization/QA models, such as Llama/Mistral/Gemma, and Embedding Models such as SBERT, for use with RAPTOR.
180
+
181
+ Note: More examples and ways to configure RAPTOR are forthcoming. Advanced usage and additional features will be provided in the documentation and repository updates.
182
+
183
+ ## Contributing
184
+
185
+ RAPTOR is an open-source project, and contributions are welcome. Whether you're fixing bugs, adding new features, or improving documentation, your help is appreciated.
186
+
187
+ ## License
188
+
189
+ RAPTOR is released under the MIT License. See the LICENSE file in the repository for full details.
190
+
191
+ ## Citation
192
+
193
+ If RAPTOR assists in your research, please cite it as follows:
194
+
195
+ ```bibtex
196
+ @inproceedings{sarthi2024raptor,
197
+ title={RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval},
198
+ author={Sarthi, Parth and Abdullah, Salman and Tuli, Aditi and Khanna, Shubh and Goldie, Anna and Manning, Christopher D.},
199
+ booktitle={International Conference on Learning Representations (ICLR)},
200
+ year={2024}
201
+ }
202
+ ```
203
+
204
+ Stay tuned for more examples, configuration guides, and updates.
baselines/raptor/raptor/EmbeddingModels.py ADDED
@@ -0,0 +1,37 @@
1
+ import logging
2
+ from abc import ABC, abstractmethod
3
+
4
+ from openai import OpenAI
5
+ from sentence_transformers import SentenceTransformer
6
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
7
+
8
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
9
+
10
+
11
+ class BaseEmbeddingModel(ABC):
12
+ @abstractmethod
13
+ def create_embedding(self, text):
14
+ pass
15
+
16
+
17
+ class OpenAIEmbeddingModel(BaseEmbeddingModel):
18
+ def __init__(self, model="text-embedding-ada-002"):
19
+ self.client = OpenAI()
20
+ self.model = model
21
+
22
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
23
+ def create_embedding(self, text):
24
+ text = text.replace("\n", " ")
25
+ return (
26
+ self.client.embeddings.create(input=[text], model=self.model)
27
+ .data[0]
28
+ .embedding
29
+ )
30
+
31
+
32
+ class SBertEmbeddingModel(BaseEmbeddingModel):
33
+ def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
34
+ self.model = SentenceTransformer(model_name)
35
+
36
+ def create_embedding(self, text):
37
+ return self.model.encode(text)
baselines/raptor/raptor/FaissRetriever.py ADDED
@@ -0,0 +1,201 @@
1
+ import random
2
+ from concurrent.futures import ProcessPoolExecutor
3
+
4
+ import faiss
5
+ import numpy as np
6
+ import tiktoken
7
+ from tqdm import tqdm
8
+
9
+ from .EmbeddingModels import BaseEmbeddingModel, OpenAIEmbeddingModel
10
+ from .Retrievers import BaseRetriever
11
+ from .utils import split_text
12
+
13
+
14
+ class FaissRetrieverConfig:
15
+ def __init__(
16
+ self,
17
+ max_tokens=100,
18
+ max_context_tokens=3500,
19
+ use_top_k=False,
20
+ embedding_model=None,
21
+ question_embedding_model=None,
22
+ top_k=5,
23
+ tokenizer=tiktoken.get_encoding("cl100k_base"),
24
+ embedding_model_string=None,
25
+ ):
26
+ if max_tokens < 1:
27
+ raise ValueError("max_tokens must be at least 1")
28
+
29
+ if top_k < 1:
30
+ raise ValueError("top_k must be at least 1")
31
+
32
+ if max_context_tokens is not None and max_context_tokens < 1:
33
+ raise ValueError("max_context_tokens must be at least 1 or None")
34
+
35
+ if embedding_model is not None and not isinstance(
36
+ embedding_model, BaseEmbeddingModel
37
+ ):
38
+ raise ValueError(
39
+ "embedding_model must be an instance of BaseEmbeddingModel or None"
40
+ )
41
+
42
+ if question_embedding_model is not None and not isinstance(
43
+ question_embedding_model, BaseEmbeddingModel
44
+ ):
45
+ raise ValueError(
46
+ "question_embedding_model must be an instance of BaseEmbeddingModel or None"
47
+ )
48
+
49
+ self.top_k = top_k
50
+ self.max_tokens = max_tokens
51
+ self.max_context_tokens = max_context_tokens
52
+ self.use_top_k = use_top_k
53
+ self.embedding_model = embedding_model or OpenAIEmbeddingModel()
54
+ self.question_embedding_model = question_embedding_model or self.embedding_model
55
+ self.tokenizer = tokenizer
56
+ self.embedding_model_string = embedding_model_string or "OpenAI"
57
+
58
+ def log_config(self):
59
+ config_summary = """
60
+ FaissRetrieverConfig:
61
+ Max Tokens: {max_tokens}
62
+ Max Context Tokens: {max_context_tokens}
63
+ Use Top K: {use_top_k}
64
+ Embedding Model: {embedding_model}
65
+ Question Embedding Model: {question_embedding_model}
66
+ Top K: {top_k}
67
+ Tokenizer: {tokenizer}
68
+ Embedding Model String: {embedding_model_string}
69
+ """.format(
70
+ max_tokens=self.max_tokens,
71
+ max_context_tokens=self.max_context_tokens,
72
+ use_top_k=self.use_top_k,
73
+ embedding_model=self.embedding_model,
74
+ question_embedding_model=self.question_embedding_model,
75
+ top_k=self.top_k,
76
+ tokenizer=self.tokenizer,
77
+ embedding_model_string=self.embedding_model_string,
78
+ )
79
+ return config_summary
80
+
81
+
82
+ class FaissRetriever(BaseRetriever):
83
+ """
84
+ FaissRetriever is a class that retrieves similar context chunks for a given query using Faiss.
85
+ A separate question_embedding_model can be supplied when queries and context
86
+ chunks should be encoded by different models; by default the same embedding model is used for both.
87
+ """
88
+
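+ # Usage sketch (illustrative only; the methods used here are defined below):
+ #   retriever = FaissRetriever(FaissRetrieverConfig(use_top_k=True, top_k=5))
+ #   retriever.build_from_text(document_text)    # split, embed, and index the document
+ #   context = retriever.retrieve("some query")  # concatenation of the nearest chunks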
89
+ def __init__(self, config):
90
+ self.embedding_model = config.embedding_model
91
+ self.question_embedding_model = config.question_embedding_model
92
+ self.index = None
93
+ self.context_chunks = None
94
+ self.max_tokens = config.max_tokens
95
+ self.max_context_tokens = config.max_context_tokens
96
+ self.use_top_k = config.use_top_k
97
+ self.tokenizer = config.tokenizer
98
+ self.top_k = config.top_k
99
+ self.embedding_model_string = config.embedding_model_string
100
+
101
+ def build_from_text(self, doc_text):
102
+ """
103
+ Builds the index from a given text.
104
+
105
+ :param doc_text: A string containing the document text.
106
+ :param tokenizer: A tokenizer used to split the text into chunks.
107
+ :param max_tokens: An integer representing the maximum number of tokens per chunk.
108
+ """
109
+ self.context_chunks = np.array(
110
+ split_text(doc_text, self.tokenizer, self.max_tokens)
111
+ )
112
+
113
+ with ProcessPoolExecutor() as executor:
114
+ futures = [
115
+ executor.submit(self.embedding_model.create_embedding, context_chunk)
116
+ for context_chunk in self.context_chunks
117
+ ]
118
+
119
+ self.embeddings = []
120
+ for future in tqdm(futures, total=len(futures), desc="Building embeddings"):
121
+ self.embeddings.append(future.result())
122
+
123
+ self.embeddings = np.array(self.embeddings, dtype=np.float32)
124
+
125
+ self.index = faiss.IndexFlatIP(self.embeddings.shape[1])
126
+ self.index.add(self.embeddings)
127
+
128
+ def build_from_leaf_nodes(self, leaf_nodes):
129
+ """
130
+ Builds the index from the pre-embedded leaf nodes of a RAPTOR tree.
131
+
132
+ :param leaf_nodes: A list of Node objects; each node's text becomes a context chunk
133
+ and its stored embedding (keyed by embedding_model_string) is added to the
134
+ Faiss index.
135
+ """
136
+
137
+ self.context_chunks = [node.text for node in leaf_nodes]
138
+
139
+ self.embeddings = np.array(
140
+ [node.embeddings[self.embedding_model_string] for node in leaf_nodes],
141
+ dtype=np.float32,
142
+ )
143
+
144
+ self.index = faiss.IndexFlatIP(self.embeddings.shape[1])
145
+ self.index.add(self.embeddings)
146
+
147
+ def sanity_check(self, num_samples=4):
148
+ """
149
+ Perform a sanity check by recomputing embeddings of a few randomly-selected chunks.
150
+
151
+ :param num_samples: The number of samples to test.
152
+ """
153
+ indices = random.sample(range(len(self.context_chunks)), num_samples)
154
+
155
+ for i in indices:
156
+ original_embedding = self.embeddings[i]
157
+ recomputed_embedding = self.embedding_model.create_embedding(
158
+ self.context_chunks[i]
159
+ )
160
+ assert np.allclose(
161
+ original_embedding, recomputed_embedding
162
+ ), f"Embeddings do not match for index {i}!"
163
+
164
+ print(f"Sanity check passed for {num_samples} random samples.")
165
+
166
+ def retrieve(self, query: str) -> str:
167
+ """
168
+ Retrieves the k most similar context chunks for a given query.
169
+
170
+ :param query: A string containing the query.
171
+ :param k: An integer representing the number of similar context chunks to retrieve.
172
+ :return: A string containing the retrieved context chunks.
173
+ """
174
+ query_embedding = np.array(
175
+ [
176
+ np.array(
177
+ self.question_embedding_model.create_embedding(query),
178
+ dtype=np.float32,
179
+ ).squeeze()
180
+ ]
181
+ )
182
+
183
+ context = ""
184
+
185
+ if self.use_top_k:
186
+ _, indices = self.index.search(query_embedding, self.top_k)
187
+ for i in range(self.top_k):
188
+ context += self.context_chunks[indices[0][i]]
189
+
190
+ else:
191
+ range_ = int(self.max_context_tokens / self.max_tokens)
192
+ _, indices = self.index.search(query_embedding, range_)
193
+ total_tokens = 0
194
+ for i in range(range_):
195
+ tokens = len(self.tokenizer.encode(self.context_chunks[indices[0][i]]))
196
+ context += self.context_chunks[indices[0][i]]
197
+ if total_tokens + tokens > self.max_context_tokens:
198
+ break
199
+ total_tokens += tokens
200
+
201
+ return context
baselines/raptor/raptor/QAModels.py ADDED
@@ -0,0 +1,185 @@
1
+ import logging
2
+ import os
3
+
4
+ from openai import OpenAI
5
+
6
+
7
+ import getpass
8
+ from abc import ABC, abstractmethod
9
+
10
+ import torch
11
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
12
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
13
+
14
+
15
+ class BaseQAModel(ABC):
16
+ @abstractmethod
17
+ def answer_question(self, context, question):
18
+ pass
19
+
20
+
21
+ class GPT3QAModel(BaseQAModel):
22
+ def __init__(self, model="text-davinci-003"):
23
+ """
24
+ Initializes the GPT-3 model with the specified model version.
25
+
26
+ Args:
27
+ model (str, optional): The GPT-3 model version to use for generating summaries. Defaults to "text-davinci-003".
28
+ """
29
+ self.model = model
30
+ self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
31
+
32
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
33
+ def answer_question(self, context, question, max_tokens=150, stop_sequence=None):
34
+ """
35
+ Answers the given question from the provided context using the GPT-3 completions model.
36
+
37
+ Args:
38
+ context (str): The text to summarize.
39
+ max_tokens (int, optional): The maximum number of tokens in the generated summary. Defaults to 150.
40
+ stop_sequence (str, optional): The sequence at which to stop summarization. Defaults to None.
41
+
42
+ Returns:
43
+ str: The generated summary.
44
+ """
45
+ try:
46
+ response = self.client.completions.create(
47
+ prompt=f"Using the following information: {context}. Answer the following question in less than 5-7 words, if possible: {question}",
48
+ temperature=0,
49
+ max_tokens=max_tokens,
50
+ top_p=1,
51
+ frequency_penalty=0,
52
+ presence_penalty=0,
53
+ stop=stop_sequence,
54
+ model=self.model,
55
+ )
56
+ return response.choices[0].text.strip()
57
+
58
+ except Exception as e:
59
+ print(e)
60
+ return ""
61
+
62
+
63
+ class GPT3TurboQAModel(BaseQAModel):
64
+ def __init__(self, model="gpt-3.5-turbo"):
65
+ """
66
+ Initializes the GPT-3 model with the specified model version.
67
+
68
+ Args:
69
+ model (str, optional): The OpenAI chat model to use. Defaults to "gpt-3.5-turbo".
70
+ """
71
+ self.model = model
72
+ self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
73
+
74
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
75
+ def _attempt_answer_question(
76
+ self, context, question, max_tokens=150, stop_sequence=None
77
+ ):
78
+ """
79
+ Answers the given question from the provided context using the chat model.
80
+
81
+ Args:
82
+ context (str): The text to summarize.
83
+ max_tokens (int, optional): The maximum number of tokens in the generated summary. Defaults to 150.
84
+ stop_sequence (str, optional): The sequence at which to stop summarization. Defaults to None.
85
+
86
+ Returns:
87
+ str: The generated summary.
88
+ """
89
+ response = self.client.chat.completions.create(
90
+ model=self.model,
91
+ messages=[
92
+ {"role": "system", "content": "You are Question Answering Portal"},
93
+ {
94
+ "role": "user",
95
+ "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}",
96
+ },
97
+ ],
98
+ temperature=0,
99
+ )
100
+
101
+ return response.choices[0].message.content.strip()
102
+
103
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
104
+ def answer_question(self, context, question, max_tokens=150, stop_sequence=None):
105
+
106
+ try:
107
+ return self._attempt_answer_question(
108
+ context, question, max_tokens=max_tokens, stop_sequence=stop_sequence
109
+ )
110
+ except Exception as e:
111
+ print(e)
112
+ return str(e)  # return the error message as a string rather than the exception object
113
+
114
+
115
+ class GPT4QAModel(BaseQAModel):
116
+ def __init__(self, model="gpt-4"):
117
+ """
118
+ Initializes the GPT-4 model with the specified model version.
119
+
120
+ Args:
121
+ model (str, optional): The OpenAI chat model to use. Defaults to "gpt-4".
122
+ """
123
+ self.model = model
124
+ self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
125
+
126
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
127
+ def _attempt_answer_question(
128
+ self, context, question, max_tokens=150, stop_sequence=None
129
+ ):
130
+ """
131
+ Answers the given question from the provided context using the chat model.
132
+
133
+ Args:
134
+ context (str): The text to summarize.
135
+ max_tokens (int, optional): The maximum number of tokens in the generated summary. Defaults to 150.
136
+ stop_sequence (str, optional): The sequence at which to stop summarization. Defaults to None.
137
+
138
+ Returns:
139
+ str: The generated summary.
140
+ """
141
+ response = self.client.chat.completions.create(
142
+ model=self.model,
143
+ messages=[
144
+ {"role": "system", "content": "You are Question Answering Portal"},
145
+ {
146
+ "role": "user",
147
+ "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}",
148
+ },
149
+ ],
150
+ temperature=0,
151
+ )
152
+
153
+ return response.choices[0].message.content.strip()
154
+
155
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
156
+ def answer_question(self, context, question, max_tokens=150, stop_sequence=None):
157
+
158
+ try:
159
+ return self._attempt_answer_question(
160
+ context, question, max_tokens=max_tokens, stop_sequence=stop_sequence
161
+ )
162
+ except Exception as e:
163
+ print(e)
164
+ return str(e)  # return the error message as a string rather than the exception object
165
+
166
+
167
+ class UnifiedQAModel(BaseQAModel):
168
+ def __init__(self, model_name="allenai/unifiedqa-v2-t5-3b-1363200"):
169
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
170
+ self.model = T5ForConditionalGeneration.from_pretrained(model_name).to(
171
+ self.device
172
+ )
173
+ self.tokenizer = T5Tokenizer.from_pretrained(model_name)
174
+
175
+ def run_model(self, input_string, **generator_args):
176
+ input_ids = self.tokenizer.encode(input_string, return_tensors="pt").to(
177
+ self.device
178
+ )
179
+ res = self.model.generate(input_ids, **generator_args)
180
+ return self.tokenizer.batch_decode(res, skip_special_tokens=True)
181
+
182
+ def answer_question(self, context, question):
183
+ input_string = question + " \\n " + context
184
+ output = self.run_model(input_string)
185
+ return output[0]
baselines/raptor/raptor/RetrievalAugmentation.py ADDED
@@ -0,0 +1,306 @@
 
 
1
+ import logging
2
+ import pickle
3
+
4
+ from .cluster_tree_builder import ClusterTreeBuilder, ClusterTreeConfig
5
+ from .EmbeddingModels import BaseEmbeddingModel
6
+ from .QAModels import BaseQAModel, GPT3TurboQAModel
7
+ from .SummarizationModels import BaseSummarizationModel
8
+ from .tree_builder import TreeBuilder, TreeBuilderConfig
9
+ from .tree_retriever import TreeRetriever, TreeRetrieverConfig
10
+ from .tree_structures import Node, Tree
11
+
12
+ # Define a dictionary to map supported tree builders to their respective configs
13
+ supported_tree_builders = {"cluster": (ClusterTreeBuilder, ClusterTreeConfig)}
14
+
15
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
16
+
17
+
18
+ class RetrievalAugmentationConfig:
19
+ def __init__(
20
+ self,
21
+ tree_builder_config=None,
22
+ tree_retriever_config=None, # Change from default instantiation
23
+ qa_model=None,
24
+ embedding_model=None,
25
+ summarization_model=None,
26
+ tree_builder_type="cluster",
27
+ # New parameters for TreeRetrieverConfig and TreeBuilderConfig
28
+ # TreeRetrieverConfig arguments
29
+ tr_tokenizer=None,
30
+ tr_threshold=0.5,
31
+ tr_top_k=5,
32
+ tr_selection_mode="top_k",
33
+ tr_context_embedding_model="OpenAI",
34
+ tr_embedding_model=None,
35
+ tr_num_layers=None,
36
+ tr_start_layer=None,
37
+ # TreeBuilderConfig arguments
38
+ tb_tokenizer=None,
39
+ tb_max_tokens=100,
40
+ tb_num_layers=5,
41
+ tb_threshold=0.5,
42
+ tb_top_k=5,
43
+ tb_selection_mode="top_k",
44
+ tb_summarization_length=100,
45
+ tb_summarization_model=None,
46
+ tb_embedding_models=None,
47
+ tb_cluster_embedding_model="OpenAI",
48
+ ):
49
+ # Validate tree_builder_type
50
+ if tree_builder_type not in supported_tree_builders:
51
+ raise ValueError(
52
+ f"tree_builder_type must be one of {list(supported_tree_builders.keys())}"
53
+ )
54
+
55
+ # Validate qa_model
56
+ if qa_model is not None and not isinstance(qa_model, BaseQAModel):
57
+ raise ValueError("qa_model must be an instance of BaseQAModel")
58
+
59
+ if embedding_model is not None and not isinstance(
60
+ embedding_model, BaseEmbeddingModel
61
+ ):
62
+ raise ValueError(
63
+ "embedding_model must be an instance of BaseEmbeddingModel"
64
+ )
65
+ elif embedding_model is not None:
66
+ if tb_embedding_models is not None:
67
+ raise ValueError(
68
+ "Only one of 'tb_embedding_models' or 'embedding_model' should be provided, not both."
69
+ )
70
+ tb_embedding_models = {"EMB": embedding_model}
71
+ tr_embedding_model = embedding_model
72
+ tb_cluster_embedding_model = "EMB"
73
+ tr_context_embedding_model = "EMB"
74
+
75
+ if summarization_model is not None and not isinstance(
76
+ summarization_model, BaseSummarizationModel
77
+ ):
78
+ raise ValueError(
79
+ "summarization_model must be an instance of BaseSummarizationModel"
80
+ )
81
+
82
+ elif summarization_model is not None:
83
+ if tb_summarization_model is not None:
84
+ raise ValueError(
85
+ "Only one of 'tb_summarization_model' or 'summarization_model' should be provided, not both."
86
+ )
87
+ tb_summarization_model = summarization_model
88
+
89
+ # Set TreeBuilderConfig
90
+ tree_builder_class, tree_builder_config_class = supported_tree_builders[
91
+ tree_builder_type
92
+ ]
93
+ if tree_builder_config is None:
94
+ tree_builder_config = tree_builder_config_class(
95
+ tokenizer=tb_tokenizer,
96
+ max_tokens=tb_max_tokens,
97
+ num_layers=tb_num_layers,
98
+ threshold=tb_threshold,
99
+ top_k=tb_top_k,
100
+ selection_mode=tb_selection_mode,
101
+ summarization_length=tb_summarization_length,
102
+ summarization_model=tb_summarization_model,
103
+ embedding_models=tb_embedding_models,
104
+ cluster_embedding_model=tb_cluster_embedding_model,
105
+ )
106
+
107
+ elif not isinstance(tree_builder_config, tree_builder_config_class):
108
+ raise ValueError(
109
+ f"tree_builder_config must be a direct instance of {tree_builder_config_class} for tree_builder_type '{tree_builder_type}'"
110
+ )
111
+
112
+ # Set TreeRetrieverConfig
113
+ if tree_retriever_config is None:
114
+ tree_retriever_config = TreeRetrieverConfig(
115
+ tokenizer=tr_tokenizer,
116
+ threshold=tr_threshold,
117
+ top_k=tr_top_k,
118
+ selection_mode=tr_selection_mode,
119
+ context_embedding_model=tr_context_embedding_model,
120
+ embedding_model=tr_embedding_model,
121
+ num_layers=tr_num_layers,
122
+ start_layer=tr_start_layer,
123
+ )
124
+ elif not isinstance(tree_retriever_config, TreeRetrieverConfig):
125
+ raise ValueError(
126
+ "tree_retriever_config must be an instance of TreeRetrieverConfig"
127
+ )
128
+
129
+ # Assign the created configurations to the instance
130
+ self.tree_builder_config = tree_builder_config
131
+ self.tree_retriever_config = tree_retriever_config
132
+ self.qa_model = qa_model or GPT3TurboQAModel()
133
+ self.tree_builder_type = tree_builder_type
134
+
135
+ def log_config(self):
136
+ config_summary = """
137
+ RetrievalAugmentationConfig:
138
+ {tree_builder_config}
139
+
140
+ {tree_retriever_config}
141
+
142
+ QA Model: {qa_model}
143
+ Tree Builder Type: {tree_builder_type}
144
+ """.format(
145
+ tree_builder_config=self.tree_builder_config.log_config(),
146
+ tree_retriever_config=self.tree_retriever_config.log_config(),
147
+ qa_model=self.qa_model,
148
+ tree_builder_type=self.tree_builder_type,
149
+ )
150
+ return config_summary
151
+
152
+
153
+ class RetrievalAugmentation:
154
+ """
155
+ A Retrieval Augmentation class that combines the TreeBuilder and TreeRetriever classes.
156
+ Enables adding documents to the tree, retrieving information, and answering questions.
157
+ """
158
+
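+ # Typical usage sketch (illustrative only; all methods are defined below):
+ #   RA = RetrievalAugmentation()              # default config: OpenAI embeddings, GPT-3.5-turbo QA
+ #   RA.add_documents(long_document_text)      # builds the RAPTOR tree
+ #   answer = RA.answer_question("question?")  # retrieves context and answers
+ #   RA.save("tree.pkl")                       # pickle the tree for later reuse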
159
+ def __init__(self, config=None, tree=None):
160
+ """
161
+ Initializes a RetrievalAugmentation instance with the specified configuration.
162
+ Args:
163
+ config (RetrievalAugmentationConfig): The configuration for the RetrievalAugmentation instance.
164
+ tree: The tree instance or the path to a pickled tree file.
165
+ """
166
+ if config is None:
167
+ config = RetrievalAugmentationConfig()
168
+ if not isinstance(config, RetrievalAugmentationConfig):
169
+ raise ValueError(
170
+ "config must be an instance of RetrievalAugmentationConfig"
171
+ )
172
+
173
+ # Check if tree is a string (indicating a path to a pickled tree)
174
+ if isinstance(tree, str):
175
+ try:
176
+ with open(tree, "rb") as file:
177
+ self.tree = pickle.load(file)
178
+ if not isinstance(self.tree, Tree):
179
+ raise ValueError("The loaded object is not an instance of Tree")
180
+ except Exception as e:
181
+ raise ValueError(f"Failed to load tree from {tree}: {e}")
182
+ elif isinstance(tree, Tree) or tree is None:
183
+ self.tree = tree
184
+ else:
185
+ raise ValueError(
186
+ "tree must be an instance of Tree, a path to a pickled Tree, or None"
187
+ )
188
+
189
+ tree_builder_class = supported_tree_builders[config.tree_builder_type][0]
190
+ self.tree_builder = tree_builder_class(config.tree_builder_config)
191
+
192
+ self.tree_retriever_config = config.tree_retriever_config
193
+ self.qa_model = config.qa_model
194
+
195
+ if self.tree is not None:
196
+ self.retriever = TreeRetriever(self.tree_retriever_config, self.tree)
197
+ else:
198
+ self.retriever = None
199
+
200
+ logging.info(
201
+ f"Successfully initialized RetrievalAugmentation with Config {config.log_config()}"
202
+ )
203
+
204
+ def add_documents(self, docs):
205
+ """
206
+ Adds documents to the tree and creates a TreeRetriever instance.
207
+
208
+ Args:
209
+ docs (str): The input text to add to the tree.
210
+ """
211
+ if self.tree is not None:
212
+ user_input = input(
213
+ "Warning: Overwriting existing tree. Did you mean to call 'add_to_existing' instead? (y/n): "
214
+ )
215
+ if user_input.lower() == "y":
216
+ # self.add_to_existing(docs)
217
+ return
218
+
219
+ self.tree = self.tree_builder.build_from_text(text=docs)
220
+ self.retriever = TreeRetriever(self.tree_retriever_config, self.tree)
221
+
222
+ def retrieve(
223
+ self,
224
+ question,
225
+ start_layer: int = None,
226
+ num_layers: int = None,
227
+ top_k: int = 10,
228
+ max_tokens: int = 3500,
229
+ collapse_tree: bool = True,
230
+ return_layer_information: bool = True,
231
+ ):
232
+ """
233
+ Retrieves information and answers a question using the TreeRetriever instance.
234
+
235
+ Args:
236
+ question (str): The question to answer.
237
+ start_layer (int): The layer to start from. Defaults to self.start_layer.
238
+ num_layers (int): The number of layers to traverse. Defaults to self.num_layers.
239
+ max_tokens (int): The maximum number of tokens. Defaults to 3500.
240
+ use_all_information (bool): Whether to retrieve information from all nodes. Defaults to False.
241
+
242
+ Returns:
243
+ str: The context from which the answer can be found.
244
+
245
+ Raises:
246
+ ValueError: If the TreeRetriever instance has not been initialized.
247
+ """
248
+ if self.retriever is None:
249
+ raise ValueError(
250
+ "The TreeRetriever instance has not been initialized. Call 'add_documents' first."
251
+ )
252
+
253
+ return self.retriever.retrieve(
254
+ question,
255
+ start_layer,
256
+ num_layers,
257
+ top_k,
258
+ max_tokens,
259
+ collapse_tree,
260
+ return_layer_information,
261
+ )
262
+
263
+ def answer_question(
264
+ self,
265
+ question,
266
+ top_k: int = 10,
267
+ start_layer: int = None,
268
+ num_layers: int = None,
269
+ max_tokens: int = 3500,
270
+ collapse_tree: bool = True,
271
+ return_layer_information: bool = False,
272
+ ):
273
+ """
274
+ Retrieves information and answers a question using the TreeRetriever instance.
275
+
276
+ Args:
277
+ question (str): The question to answer.
278
+ start_layer (int): The layer to start from. Defaults to self.start_layer.
279
+ num_layers (int): The number of layers to traverse. Defaults to self.num_layers.
280
+ max_tokens (int): The maximum number of tokens. Defaults to 3500.
281
+ use_all_information (bool): Whether to retrieve information from all nodes. Defaults to False.
282
+
283
+ Returns:
284
+ str: The answer to the question.
285
+
286
+ Raises:
287
+ ValueError: If the TreeRetriever instance has not been initialized.
288
+ """
289
+ # if return_layer_information:
290
+ context, layer_information = self.retrieve(
291
+ question, start_layer, num_layers, top_k, max_tokens, collapse_tree, True
292
+ )
293
+
294
+ answer = self.qa_model.answer_question(context, question)
295
+
296
+ if return_layer_information:
297
+ return answer, layer_information
298
+
299
+ return answer
300
+
301
+ def save(self, path):
302
+ if self.tree is None:
303
+ raise ValueError("There is no tree to save.")
304
+ with open(path, "wb") as file:
305
+ pickle.dump(self.tree, file)
306
+ logging.info(f"Tree successfully saved to {path}")
baselines/raptor/raptor/Retrievers.py ADDED
@@ -0,0 +1,8 @@
 
 
1
+ from abc import ABC, abstractmethod
2
+ from typing import List
3
+
4
+
5
+ class BaseRetriever(ABC):
6
+ @abstractmethod
7
+ def retrieve(self, query: str) -> str:
8
+ pass
baselines/raptor/raptor/SummarizationModels.py ADDED
@@ -0,0 +1,74 @@
 
 
1
+ import logging
2
+ import os
3
+ from abc import ABC, abstractmethod
4
+
5
+ from openai import OpenAI
6
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
7
+
8
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
9
+
10
+
11
+ class BaseSummarizationModel(ABC):
12
+ @abstractmethod
13
+ def summarize(self, context, max_tokens=150):
14
+ pass
15
+
16
+
17
+ class GPT3TurboSummarizationModel(BaseSummarizationModel):
18
+ def __init__(self, model="gpt-3.5-turbo"):
19
+
20
+ self.model = model
21
+
22
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
23
+ def summarize(self, context, max_tokens=500, stop_sequence=None):
24
+
25
+ try:
26
+ client = OpenAI()
27
+
28
+ response = client.chat.completions.create(
29
+ model=self.model,
30
+ messages=[
31
+ {"role": "system", "content": "You are a helpful assistant."},
32
+ {
33
+ "role": "user",
34
+ "content": f"Write a summary of the following, including as many key details as possible: {context}:",
35
+ },
36
+ ],
37
+ max_tokens=max_tokens,
38
+ )
39
+
40
+ return response.choices[0].message.content
41
+
42
+ except Exception as e:
43
+ print(e)
44
+ return str(e)  # return the error message as a string rather than the exception object
45
+
46
+
47
+ class GPT3SummarizationModel(BaseSummarizationModel):
48
+ def __init__(self, model="text-davinci-003"):
49
+
50
+ self.model = model
51
+
52
+ @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
53
+ def summarize(self, context, max_tokens=500, stop_sequence=None):
54
+
55
+ try:
56
+ client = OpenAI()
57
+
58
+ response = client.chat.completions.create(
59
+ model=self.model,
60
+ messages=[
61
+ {"role": "system", "content": "You are a helpful assistant."},
62
+ {
63
+ "role": "user",
64
+ "content": f"Write a summary of the following, including as many key details as possible: {context}:",
65
+ },
66
+ ],
67
+ max_tokens=max_tokens,
68
+ )
69
+
70
+ return response.choices[0].message.content
71
+
72
+ except Exception as e:
73
+ print(e)
74
+ return str(e)  # return the error message as a string rather than the exception object
baselines/raptor/raptor/__init__.py ADDED
@@ -0,0 +1,16 @@
 
 
1
+ # raptor/__init__.py
2
+ from .cluster_tree_builder import ClusterTreeBuilder, ClusterTreeConfig
3
+ from .EmbeddingModels import (BaseEmbeddingModel, OpenAIEmbeddingModel,
4
+ SBertEmbeddingModel)
5
+ from .FaissRetriever import FaissRetriever, FaissRetrieverConfig
6
+ from .QAModels import (BaseQAModel, GPT3QAModel, GPT3TurboQAModel, GPT4QAModel,
7
+ UnifiedQAModel)
8
+ from .RetrievalAugmentation import (RetrievalAugmentation,
9
+ RetrievalAugmentationConfig)
10
+ from .Retrievers import BaseRetriever
11
+ from .SummarizationModels import (BaseSummarizationModel,
12
+ GPT3SummarizationModel,
13
+ GPT3TurboSummarizationModel)
14
+ from .tree_builder import TreeBuilder, TreeBuilderConfig
15
+ from .tree_retriever import TreeRetriever, TreeRetrieverConfig
16
+ from .tree_structures import Node, Tree
baselines/raptor/raptor/cluster_tree_builder.py ADDED
@@ -0,0 +1,151 @@
 
 
1
+ import logging
2
+ import pickle
3
+ from concurrent.futures import ThreadPoolExecutor
4
+ from threading import Lock
5
+ from typing import Dict, List, Set
6
+
7
+ from .cluster_utils import ClusteringAlgorithm, RAPTOR_Clustering
8
+ from .tree_builder import TreeBuilder, TreeBuilderConfig
9
+ from .tree_structures import Node, Tree
10
+ from .utils import (distances_from_embeddings, get_children, get_embeddings,
11
+ get_node_list, get_text,
12
+ indices_of_nearest_neighbors_from_distances, split_text)
13
+
14
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
15
+
16
+
17
+ class ClusterTreeConfig(TreeBuilderConfig):
18
+ def __init__(
19
+ self,
20
+ reduction_dimension=10,
21
+ clustering_algorithm=RAPTOR_Clustering, # Default to RAPTOR clustering
22
+ clustering_params={}, # Pass additional params as a dict
23
+ *args,
24
+ **kwargs,
25
+ ):
26
+ super().__init__(*args, **kwargs)
27
+ self.reduction_dimension = reduction_dimension
28
+ self.clustering_algorithm = clustering_algorithm
29
+ self.clustering_params = clustering_params
30
+
31
+ def log_config(self):
32
+ base_summary = super().log_config()
33
+ cluster_tree_summary = f"""
34
+ Reduction Dimension: {self.reduction_dimension}
35
+ Clustering Algorithm: {self.clustering_algorithm.__name__}
36
+ Clustering Parameters: {self.clustering_params}
37
+ """
38
+ return base_summary + cluster_tree_summary
39
+
40
+
41
+ class ClusterTreeBuilder(TreeBuilder):
42
+ def __init__(self, config) -> None:
43
+ super().__init__(config)
44
+
45
+ if not isinstance(config, ClusterTreeConfig):
46
+ raise ValueError("config must be an instance of ClusterTreeConfig")
47
+ self.reduction_dimension = config.reduction_dimension
48
+ self.clustering_algorithm = config.clustering_algorithm
49
+ self.clustering_params = config.clustering_params
50
+
51
+ logging.info(
52
+ f"Successfully initialized ClusterTreeBuilder with Config {config.log_config()}"
53
+ )
54
+
55
+ def construct_tree(
56
+ self,
57
+ current_level_nodes: Dict[int, Node],
58
+ all_tree_nodes: Dict[int, Node],
59
+ layer_to_nodes: Dict[int, List[Node]],
60
+ use_multithreading: bool = False,
61
+ ) -> Dict[int, Node]:
62
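+ # Layer-by-layer construction: cluster the nodes of the current layer, summarize
+ # each cluster into a new parent node, and repeat on the new layer until
+ # num_layers is reached or too few nodes remain to cluster further.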
+ logging.info("Using Cluster TreeBuilder")
63
+
64
+ next_node_index = len(all_tree_nodes)
65
+
66
+ def process_cluster(
67
+ cluster, new_level_nodes, next_node_index, summarization_length, lock
68
+ ):
69
+ node_texts = get_text(cluster)
70
+
71
+ summarized_text = self.summarize(
72
+ context=node_texts,
73
+ max_tokens=summarization_length,
74
+ )
75
+
76
+ logging.info(
77
+ f"Node Texts Length: {len(self.tokenizer.encode(node_texts))}, Summarized Text Length: {len(self.tokenizer.encode(summarized_text))}"
78
+ )
79
+
80
+ __, new_parent_node = self.create_node(
81
+ next_node_index, summarized_text, {node.index for node in cluster}
82
+ )
83
+
84
+ with lock:
85
+ new_level_nodes[next_node_index] = new_parent_node
86
+
87
+ for layer in range(self.num_layers):
88
+
89
+ new_level_nodes = {}
90
+
91
+ logging.info(f"Constructing Layer {layer}")
92
+
93
+ node_list_current_layer = get_node_list(current_level_nodes)
94
+
95
+ if len(node_list_current_layer) <= self.reduction_dimension + 1:
96
+ self.num_layers = layer
97
+ logging.info(
98
+ f"Stopping Layer construction: Cannot Create More Layers. Total Layers in tree: {layer}"
99
+ )
100
+ break
101
+
102
+ clusters = self.clustering_algorithm.perform_clustering(
103
+ node_list_current_layer,
104
+ self.cluster_embedding_model,
105
+ reduction_dimension=self.reduction_dimension,
106
+ **self.clustering_params,
107
+ )
108
+
109
+ lock = Lock()
110
+
111
+ summarization_length = self.summarization_length
112
+ logging.info(f"Summarization Length: {summarization_length}")
113
+
114
+ if use_multithreading:
115
+ with ThreadPoolExecutor() as executor:
116
+ for cluster in clusters:
117
+ executor.submit(
118
+ process_cluster,
119
+ cluster,
120
+ new_level_nodes,
121
+ next_node_index,
122
+ summarization_length,
123
+ lock,
124
+ )
125
+ next_node_index += 1
126
+ executor.shutdown(wait=True)
127
+
128
+ else:
129
+ for cluster in clusters:
130
+ process_cluster(
131
+ cluster,
132
+ new_level_nodes,
133
+ next_node_index,
134
+ summarization_length,
135
+ lock,
136
+ )
137
+ next_node_index += 1
138
+
139
+ layer_to_nodes[layer + 1] = list(new_level_nodes.values())
140
+ current_level_nodes = new_level_nodes
141
+ all_tree_nodes.update(new_level_nodes)
142
+
143
+ tree = Tree(
144
+ all_tree_nodes,
145
+ layer_to_nodes[layer + 1],
146
+ layer_to_nodes[0],
147
+ layer + 1,
148
+ layer_to_nodes,
149
+ )
150
+
151
+ return current_level_nodes
baselines/raptor/raptor/cluster_utils.py ADDED
@@ -0,0 +1,185 @@
 
 
1
+ import logging
2
+ import random
3
+ from abc import ABC, abstractmethod
4
+ from typing import List, Optional
5
+
6
+ import numpy as np
7
+ import tiktoken
8
+ import umap
9
+ from sklearn.mixture import GaussianMixture
10
+
11
+ # Initialize logging
12
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
13
+
14
+ from .tree_structures import Node
15
+ # Import necessary methods from other modules
16
+ from .utils import get_embeddings
17
+
18
+ # Set a random seed for reproducibility
19
+ RANDOM_SEED = 224
20
+ random.seed(RANDOM_SEED)
21
+
22
+
23
+ def global_cluster_embeddings(
24
+ embeddings: np.ndarray,
25
+ dim: int,
26
+ n_neighbors: Optional[int] = None,
27
+ metric: str = "cosine",
28
+ ) -> np.ndarray:
29
+ if n_neighbors is None:
30
+ n_neighbors = int((len(embeddings) - 1) ** 0.5)
31
+ reduced_embeddings = umap.UMAP(
32
+ n_neighbors=n_neighbors, n_components=dim, metric=metric
33
+ ).fit_transform(embeddings)
34
+ return reduced_embeddings
35
+
36
+
37
+ def local_cluster_embeddings(
38
+ embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = "cosine"
39
+ ) -> np.ndarray:
40
+ reduced_embeddings = umap.UMAP(
41
+ n_neighbors=num_neighbors, n_components=dim, metric=metric
42
+ ).fit_transform(embeddings)
43
+ return reduced_embeddings
44
+
45
+
46
+ def get_optimal_clusters(
47
+ embeddings: np.ndarray, max_clusters: int = 50, random_state: int = RANDOM_SEED
48
+ ) -> int:
49
+ max_clusters = min(max_clusters, len(embeddings))
50
+ n_clusters = np.arange(1, max_clusters)
51
+ bics = []
52
+ for n in n_clusters:
53
+ gm = GaussianMixture(n_components=n, random_state=random_state)
54
+ gm.fit(embeddings)
55
+ bics.append(gm.bic(embeddings))
56
+ optimal_clusters = n_clusters[np.argmin(bics)]
57
+ return optimal_clusters
58
+
59
+
60
+ def GMM_cluster(embeddings: np.ndarray, threshold: float, random_state: int = 0):
61
+ n_clusters = get_optimal_clusters(embeddings)
62
+ gm = GaussianMixture(n_components=n_clusters, random_state=random_state)
63
+ gm.fit(embeddings)
64
+ probs = gm.predict_proba(embeddings)
65
+ labels = [np.where(prob > threshold)[0] for prob in probs]
66
+ return labels, n_clusters
67
+
68
+
69
+ def perform_clustering(
70
+ embeddings: np.ndarray, dim: int, threshold: float, verbose: bool = False
71
+ ) -> List[np.ndarray]:
72
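+ # Two-stage soft clustering, as implemented below:
+ #   1. Reduce all embeddings globally with UMAP, fit a GaussianMixture whose component
+ #      count minimizes BIC, and assign each point to every global cluster whose
+ #      posterior probability exceeds `threshold`.
+ #   2. Within each global cluster, repeat the UMAP + GMM step locally and record the
+ #      local cluster ids, offset by the running cluster total.
+ # Returns one label array per input embedding (soft assignment, possibly several labels).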
+ reduced_embeddings_global = global_cluster_embeddings(embeddings, min(dim, len(embeddings) -2))
73
+ global_clusters, n_global_clusters = GMM_cluster(
74
+ reduced_embeddings_global, threshold
75
+ )
76
+
77
+ if verbose:
78
+ logging.info(f"Global Clusters: {n_global_clusters}")
79
+
80
+ all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
81
+ total_clusters = 0
82
+
83
+ for i in range(n_global_clusters):
84
+ global_cluster_embeddings_ = embeddings[
85
+ np.array([i in gc for gc in global_clusters])
86
+ ]
87
+ if verbose:
88
+ logging.info(
89
+ f"Nodes in Global Cluster {i}: {len(global_cluster_embeddings_)}"
90
+ )
91
+ if len(global_cluster_embeddings_) == 0:
92
+ continue
93
+ if len(global_cluster_embeddings_) <= dim + 1:
94
+ local_clusters = [np.array([0]) for _ in global_cluster_embeddings_]
95
+ n_local_clusters = 1
96
+ else:
97
+ reduced_embeddings_local = local_cluster_embeddings(
98
+ global_cluster_embeddings_, dim
99
+ )
100
+ local_clusters, n_local_clusters = GMM_cluster(
101
+ reduced_embeddings_local, threshold
102
+ )
103
+
104
+ if verbose:
105
+ logging.info(f"Local Clusters in Global Cluster {i}: {n_local_clusters}")
106
+
107
+ for j in range(n_local_clusters):
108
+ local_cluster_embeddings_ = global_cluster_embeddings_[
109
+ np.array([j in lc for lc in local_clusters])
110
+ ]
111
+ indices = np.where(
112
+ (embeddings == local_cluster_embeddings_[:, None]).all(-1)
113
+ )[1]
114
+ for idx in indices:
115
+ all_local_clusters[idx] = np.append(
116
+ all_local_clusters[idx], j + total_clusters
117
+ )
118
+
119
+ total_clusters += n_local_clusters
120
+
121
+ if verbose:
122
+ logging.info(f"Total Clusters: {total_clusters}")
123
+ return all_local_clusters
124
+
125
+
126
+ class ClusteringAlgorithm(ABC):
127
+ @abstractmethod
128
+ def perform_clustering(self, embeddings: np.ndarray, **kwargs) -> List[List[int]]:
129
+ pass
130
+
131
+
132
+ class RAPTOR_Clustering(ClusteringAlgorithm):
133
+ def perform_clustering(
134
+ nodes: List[Node],
135
+ embedding_model_name: str,
136
+ max_length_in_cluster: int = 3500,
137
+ tokenizer=tiktoken.get_encoding("cl100k_base"),
138
+ reduction_dimension: int = 10,
139
+ threshold: float = 0.1,
140
+ verbose: bool = False,
141
+ ) -> List[List[Node]]:
142
+ # Get the embeddings from the nodes
143
+ embeddings = np.array([node.embeddings[embedding_model_name] for node in nodes])
144
+
145
+ # Perform the clustering
146
+ clusters = perform_clustering(
147
+ embeddings, dim=reduction_dimension, threshold=threshold
148
+ )
149
+
150
+ # Initialize an empty list to store the clusters of nodes
151
+ node_clusters = []
152
+
153
+ # Iterate over each unique label in the clusters
154
+ for label in np.unique(np.concatenate(clusters)):
155
+ # Get the indices of the nodes that belong to this cluster
156
+ indices = [i for i, cluster in enumerate(clusters) if label in cluster]
157
+
158
+ # Add the corresponding nodes to the node_clusters list
159
+ cluster_nodes = [nodes[i] for i in indices]
160
+
161
+ # Base case: if the cluster only has one node, do not attempt to recluster it
162
+ if len(cluster_nodes) == 1:
163
+ node_clusters.append(cluster_nodes)
164
+ continue
165
+
166
+ # Calculate the total length of the text in the nodes
167
+ total_length = sum(
168
+ [len(tokenizer.encode(node.text)) for node in cluster_nodes]
169
+ )
170
+
171
+ # If the total length exceeds the maximum allowed length, recluster this cluster
172
+ if total_length > max_length_in_cluster:
173
+ if verbose:
174
+ logging.info(
175
+ f"reclustering cluster with {len(cluster_nodes)} nodes"
176
+ )
177
+ node_clusters.extend(
178
+ RAPTOR_Clustering.perform_clustering(
179
+ cluster_nodes, embedding_model_name, max_length_in_cluster
180
+ )
181
+ )
182
+ else:
183
+ node_clusters.append(cluster_nodes)
184
+
185
+ return node_clusters
baselines/raptor/raptor/tree_builder.py ADDED
@@ -0,0 +1,369 @@
 
 
1
+ import copy
2
+ import logging
3
+ import os
4
+ from abc import abstractclassmethod
5
+ from concurrent.futures import ThreadPoolExecutor, as_completed
6
+ from threading import Lock
7
+ from typing import Dict, List, Optional, Set, Tuple
8
+
9
+ import openai
10
+ import tiktoken
11
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
12
+
13
+ from .EmbeddingModels import BaseEmbeddingModel, OpenAIEmbeddingModel
14
+ from .SummarizationModels import (BaseSummarizationModel,
15
+ GPT3TurboSummarizationModel)
16
+ from .tree_structures import Node, Tree
17
+ from .utils import (distances_from_embeddings, get_children, get_embeddings,
18
+ get_node_list, get_text,
19
+ indices_of_nearest_neighbors_from_distances, split_text)
20
+
21
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
22
+
23
+
24
+ class TreeBuilderConfig:
25
+ def __init__(
26
+ self,
27
+ tokenizer=None,
28
+ max_tokens=None,
29
+ num_layers=None,
30
+ threshold=None,
31
+ top_k=None,
32
+ selection_mode=None,
33
+ summarization_length=None,
34
+ summarization_model=None,
35
+ embedding_models=None,
36
+ cluster_embedding_model=None,
37
+ ):
38
+ if tokenizer is None:
39
+ tokenizer = tiktoken.get_encoding("cl100k_base")
40
+ self.tokenizer = tokenizer
41
+
42
+ if max_tokens is None:
43
+ max_tokens = 100
44
+ if not isinstance(max_tokens, int) or max_tokens < 1:
45
+ raise ValueError("max_tokens must be an integer and at least 1")
46
+ self.max_tokens = max_tokens
47
+
48
+ if num_layers is None:
49
+ num_layers = 5
50
+ if not isinstance(num_layers, int) or num_layers < 1:
51
+ raise ValueError("num_layers must be an integer and at least 1")
52
+ self.num_layers = num_layers
53
+
54
+ if threshold is None:
55
+ threshold = 0.5
56
+ if not isinstance(threshold, (int, float)) or not (0 <= threshold <= 1):
57
+ raise ValueError("threshold must be a number between 0 and 1")
58
+ self.threshold = threshold
59
+
60
+ if top_k is None:
61
+ top_k = 5
62
+ if not isinstance(top_k, int) or top_k < 1:
63
+ raise ValueError("top_k must be an integer and at least 1")
64
+ self.top_k = top_k
65
+
66
+ if selection_mode is None:
67
+ selection_mode = "top_k"
68
+ if selection_mode not in ["top_k", "threshold"]:
69
+ raise ValueError("selection_mode must be either 'top_k' or 'threshold'")
70
+ self.selection_mode = selection_mode
71
+
72
+ if summarization_length is None:
73
+ summarization_length = 100
74
+ self.summarization_length = summarization_length
75
+
76
+ if summarization_model is None:
77
+ summarization_model = GPT3TurboSummarizationModel()
78
+ if not isinstance(summarization_model, BaseSummarizationModel):
79
+ raise ValueError(
80
+ "summarization_model must be an instance of BaseSummarizationModel"
81
+ )
82
+ self.summarization_model = summarization_model
83
+
84
+ if embedding_models is None:
85
+ embedding_models = {"OpenAI": OpenAIEmbeddingModel()}
86
+ if not isinstance(embedding_models, dict):
87
+ raise ValueError(
88
+ "embedding_models must be a dictionary of model_name: instance pairs"
89
+ )
90
+ for model in embedding_models.values():
91
+ if not isinstance(model, BaseEmbeddingModel):
92
+ raise ValueError(
93
+ "All embedding models must be an instance of BaseEmbeddingModel"
94
+ )
95
+ self.embedding_models = embedding_models
96
+
97
+ if cluster_embedding_model is None:
98
+ cluster_embedding_model = "OpenAI"
99
+ if cluster_embedding_model not in self.embedding_models:
100
+ raise ValueError(
101
+ "cluster_embedding_model must be a key in the embedding_models dictionary"
102
+ )
103
+ self.cluster_embedding_model = cluster_embedding_model
104
+
105
+ def log_config(self):
106
+ config_log = """
107
+ TreeBuilderConfig:
108
+ Tokenizer: {tokenizer}
109
+ Max Tokens: {max_tokens}
110
+ Num Layers: {num_layers}
111
+ Threshold: {threshold}
112
+ Top K: {top_k}
113
+ Selection Mode: {selection_mode}
114
+ Summarization Length: {summarization_length}
115
+ Summarization Model: {summarization_model}
116
+ Embedding Models: {embedding_models}
117
+ Cluster Embedding Model: {cluster_embedding_model}
118
+ """.format(
119
+ tokenizer=self.tokenizer,
120
+ max_tokens=self.max_tokens,
121
+ num_layers=self.num_layers,
122
+ threshold=self.threshold,
123
+ top_k=self.top_k,
124
+ selection_mode=self.selection_mode,
125
+ summarization_length=self.summarization_length,
126
+ summarization_model=self.summarization_model,
127
+ embedding_models=self.embedding_models,
128
+ cluster_embedding_model=self.cluster_embedding_model,
129
+ )
130
+ return config_log
131
+
132
+
133
+ class TreeBuilder:
134
+ """
135
+ The TreeBuilder class is responsible for building a hierarchical text abstraction
136
+ structure, known as a "tree," using summarization models and
137
+ embedding models.
138
+ """
139
+
140
+ def __init__(self, config) -> None:
141
+ """Initializes the tokenizer, maximum tokens, number of layers, top-k value, threshold, and selection mode."""
142
+
143
+ self.tokenizer = config.tokenizer
144
+ self.max_tokens = config.max_tokens
145
+ self.num_layers = config.num_layers
146
+ self.top_k = config.top_k
147
+ self.threshold = config.threshold
148
+ self.selection_mode = config.selection_mode
149
+ self.summarization_length = config.summarization_length
150
+ self.summarization_model = config.summarization_model
151
+ self.embedding_models = config.embedding_models
152
+ self.cluster_embedding_model = config.cluster_embedding_model
153
+
154
+ logging.info(
155
+ f"Successfully initialized TreeBuilder with Config {config.log_config()}"
156
+ )
157
+
158
+ def create_node(
159
+ self, index: int, text: str, children_indices: Optional[Set[int]] = None
160
+ ) -> Tuple[int, Node]:
161
+ """Creates a new node with the given index, text, and (optionally) children indices.
162
+
163
+ Args:
164
+ index (int): The index of the new node.
165
+ text (str): The text associated with the new node.
166
+ children_indices (Optional[Set[int]]): A set of indices representing the children of the new node.
167
+ If not provided, an empty set will be used.
168
+
169
+ Returns:
170
+ Tuple[int, Node]: A tuple containing the index and the newly created node.
171
+ """
172
+ if children_indices is None:
173
+ children_indices = set()
174
+
175
+ embeddings = {
176
+ model_name: model.create_embedding(text)
177
+ for model_name, model in self.embedding_models.items()
178
+ }
179
+ return (index, Node(text, index, children_indices, embeddings))
180
+
181
+ def create_embedding(self, text) -> List[float]:
182
+ """
183
+ Generates embeddings for the given text using the specified embedding model.
184
+
185
+ Args:
186
+ text (str): The text for which to generate embeddings.
187
+
188
+ Returns:
189
+ List[float]: The generated embeddings.
190
+ """
191
+ return self.embedding_models[self.cluster_embedding_model].create_embedding(
192
+ text
193
+ )
194
+
195
+ def summarize(self, context, max_tokens=150) -> str:
196
+ """
197
+ Generates a summary of the input context using the specified summarization model.
198
+
199
+ Args:
200
+ context (str, optional): The context to summarize.
201
+ max_tokens (int, optional): The maximum number of tokens in the generated summary. Defaults to 150.
202
+
203
+ Returns:
204
+ str: The generated summary.
205
+ """
206
+ return self.summarization_model.summarize(context, max_tokens)
207
+
208
+ def get_relevant_nodes(self, current_node, list_nodes) -> List[Node]:
209
+ """
210
+ Retrieves the top-k most relevant nodes to the current node from the list of nodes
211
+ based on cosine distance in the embedding space.
212
+
213
+ Args:
214
+ current_node (Node): The current node.
215
+ list_nodes (List[Node]): The list of nodes.
216
+
217
+ Returns:
218
+ List[Node]: The top-k most relevant nodes.
219
+ """
220
+ embeddings = get_embeddings(list_nodes, self.cluster_embedding_model)
221
+ distances = distances_from_embeddings(
222
+ current_node.embeddings[self.cluster_embedding_model], embeddings
223
+ )
224
+ indices = indices_of_nearest_neighbors_from_distances(distances)
225
+
226
+ if self.selection_mode == "threshold":
227
+ best_indices = [
228
+ index for index in indices if distances[index] > self.threshold
229
+ ]
230
+
231
+ elif self.selection_mode == "top_k":
232
+ best_indices = indices[: self.top_k]
233
+
234
+ nodes_to_add = [list_nodes[idx] for idx in best_indices]
235
+
236
+ return nodes_to_add
237
+
238
+ def multithreaded_create_leaf_nodes(self, chunks: List[str]) -> Dict[int, Node]:
239
+ """Creates leaf nodes using multithreading from the given list of text chunks.
240
+
241
+ Args:
242
+ chunks (List[str]): A list of text chunks to be turned into leaf nodes.
243
+
244
+ Returns:
245
+ Dict[int, Node]: A dictionary mapping node indices to the corresponding leaf nodes.
246
+ """
247
+ with ThreadPoolExecutor() as executor:
248
+ future_nodes = {
249
+ executor.submit(self.create_node, index, text): (index, text)
250
+ for index, text in enumerate(chunks)
251
+ }
252
+
253
+ leaf_nodes = {}
254
+ for future in as_completed(future_nodes):
255
+ index, node = future.result()
256
+ leaf_nodes[index] = node
257
+
258
+ return leaf_nodes
259
+
260
+ def build_from_text(self, text: str, use_multithreading: bool = True) -> Tree:
261
+ """Builds a golden tree from the input text, optionally using multithreading.
262
+
263
+ Args:
264
+ text (str): The input text.
265
+ use_multithreading (bool, optional): Whether to use multithreading when creating leaf nodes.
266
+ Default: True.
267
+
268
+ Returns:
269
+ Tree: The golden tree structure.
270
+ """
271
+ chunks = split_text(text, self.tokenizer, self.max_tokens)
272
+
273
+ logging.info("Creating Leaf Nodes")
274
+
275
+ if use_multithreading:
276
+ leaf_nodes = self.multithreaded_create_leaf_nodes(chunks)
277
+ else:
278
+ leaf_nodes = {}
279
+ for index, text in enumerate(chunks):
280
+ __, node = self.create_node(index, text)
281
+ leaf_nodes[index] = node
282
+
283
+ layer_to_nodes = {0: list(leaf_nodes.values())}
284
+
285
+ logging.info(f"Created {len(leaf_nodes)} Leaf Embeddings")
286
+
287
+ logging.info("Building All Nodes")
288
+
289
+ all_nodes = copy.deepcopy(leaf_nodes)
290
+
291
+ root_nodes = self.construct_tree(all_nodes, all_nodes, layer_to_nodes)
292
+
293
+ tree = Tree(all_nodes, root_nodes, leaf_nodes, self.num_layers, layer_to_nodes)
294
+
295
+ return tree
296
+
297
+ @abstractclassmethod
298
+ def construct_tree(
299
+ self,
300
+ current_level_nodes: Dict[int, Node],
301
+ all_tree_nodes: Dict[int, Node],
302
+ layer_to_nodes: Dict[int, List[Node]],
303
+ use_multithreading: bool = True,
304
+ ) -> Dict[int, Node]:
305
+ """
306
+ Constructs the hierarchical tree structure layer by layer by iteratively summarizing groups
307
+ of relevant nodes and updating the current_level_nodes and all_tree_nodes dictionaries at each step.
308
+
309
+ Args:
310
+ current_level_nodes (Dict[int, Node]): The current set of nodes.
311
+ all_tree_nodes (Dict[int, Node]): The dictionary of all nodes.
312
+ use_multithreading (bool): Whether to use multithreading to speed up the process.
313
+
314
+ Returns:
315
+ Dict[int, Node]: The final set of root nodes.
316
+ """
317
+ pass
318
+
319
+ # logging.info("Using Transformer-like TreeBuilder")
320
+
321
+ # def process_node(idx, current_level_nodes, new_level_nodes, all_tree_nodes, next_node_index, lock):
322
+ # relevant_nodes_chunk = self.get_relevant_nodes(
323
+ # current_level_nodes[idx], current_level_nodes
324
+ # )
325
+
326
+ # node_texts = get_text(relevant_nodes_chunk)
327
+
328
+ # summarized_text = self.summarize(
329
+ # context=node_texts,
330
+ # max_tokens=self.summarization_length,
331
+ # )
332
+
333
+ # logging.info(
334
+ # f"Node Texts Length: {len(self.tokenizer.encode(node_texts))}, Summarized Text Length: {len(self.tokenizer.encode(summarized_text))}"
335
+ # )
336
+
337
+ # next_node_index, new_parent_node = self.create_node(
338
+ # next_node_index,
339
+ # summarized_text,
340
+ # {node.index for node in relevant_nodes_chunk}
341
+ # )
342
+
343
+ # with lock:
344
+ # new_level_nodes[next_node_index] = new_parent_node
345
+
346
+ # for layer in range(self.num_layers):
347
+ # logging.info(f"Constructing Layer {layer}: ")
348
+
349
+ # node_list_current_layer = get_node_list(current_level_nodes)
350
+ # next_node_index = len(all_tree_nodes)
351
+
352
+ # new_level_nodes = {}
353
+ # lock = Lock()
354
+
355
+ # if use_multithreading:
356
+ # with ThreadPoolExecutor() as executor:
357
+ # for idx in range(0, len(node_list_current_layer)):
358
+ # executor.submit(process_node, idx, node_list_current_layer, new_level_nodes, all_tree_nodes, next_node_index, lock)
359
+ # next_node_index += 1
360
+ # executor.shutdown(wait=True)
361
+ # else:
362
+ # for idx in range(0, len(node_list_current_layer)):
363
+ # process_node(idx, node_list_current_layer, new_level_nodes, all_tree_nodes, next_node_index, lock)
364
+
365
+ # layer_to_nodes[layer + 1] = list(new_level_nodes.values())
366
+ # current_level_nodes = new_level_nodes
367
+ # all_tree_nodes.update(new_level_nodes)
368
+
369
+ # return new_level_nodes
baselines/raptor/raptor/tree_retriever.py ADDED
@@ -0,0 +1,327 @@
 
 
1
+ import logging
2
+ import os
3
+ from typing import Dict, List, Set
4
+
5
+ import tiktoken
6
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
7
+
8
+ from .EmbeddingModels import BaseEmbeddingModel, OpenAIEmbeddingModel
9
+ from .Retrievers import BaseRetriever
10
+ from .tree_structures import Node, Tree
11
+ from .utils import (distances_from_embeddings, get_children, get_embeddings,
12
+ get_node_list, get_text,
13
+ indices_of_nearest_neighbors_from_distances,
14
+ reverse_mapping)
15
+
16
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
17
+
18
+
19
+ class TreeRetrieverConfig:
20
+ def __init__(
21
+ self,
22
+ tokenizer=None,
23
+ threshold=None,
24
+ top_k=None,
25
+ selection_mode=None,
26
+ context_embedding_model=None,
27
+ embedding_model=None,
28
+ num_layers=None,
29
+ start_layer=None,
30
+ ):
31
+ if tokenizer is None:
32
+ tokenizer = tiktoken.get_encoding("cl100k_base")
33
+ self.tokenizer = tokenizer
34
+
35
+ if threshold is None:
36
+ threshold = 0.5
37
+ if not isinstance(threshold, float) or not (0 <= threshold <= 1):
38
+ raise ValueError("threshold must be a float between 0 and 1")
39
+ self.threshold = threshold
40
+
41
+ if top_k is None:
42
+ top_k = 5
43
+ if not isinstance(top_k, int) or top_k < 1:
44
+ raise ValueError("top_k must be an integer and at least 1")
45
+ self.top_k = top_k
46
+
47
+ if selection_mode is None:
48
+ selection_mode = "top_k"
49
+ if not isinstance(selection_mode, str) or selection_mode not in [
50
+ "top_k",
51
+ "threshold",
52
+ ]:
53
+ raise ValueError(
54
+ "selection_mode must be a string and either 'top_k' or 'threshold'"
55
+ )
56
+ self.selection_mode = selection_mode
57
+
58
+ if context_embedding_model is None:
59
+ context_embedding_model = "OpenAI"
60
+ if not isinstance(context_embedding_model, str):
61
+ raise ValueError("context_embedding_model must be a string")
62
+ self.context_embedding_model = context_embedding_model
63
+
64
+ if embedding_model is None:
65
+ embedding_model = OpenAIEmbeddingModel()
66
+ if not isinstance(embedding_model, BaseEmbeddingModel):
67
+ raise ValueError(
68
+ "embedding_model must be an instance of BaseEmbeddingModel"
69
+ )
70
+ self.embedding_model = embedding_model
71
+
72
+ if num_layers is not None:
73
+ if not isinstance(num_layers, int) or num_layers < 0:
74
+ raise ValueError("num_layers must be an integer and at least 0")
75
+ self.num_layers = num_layers
76
+
77
+ if start_layer is not None:
78
+ if not isinstance(start_layer, int) or start_layer < 0:
79
+ raise ValueError("start_layer must be an integer and at least 0")
80
+ self.start_layer = start_layer
81
+
82
+ def log_config(self):
83
+ config_log = """
84
+ TreeRetrieverConfig:
85
+ Tokenizer: {tokenizer}
86
+ Threshold: {threshold}
87
+ Top K: {top_k}
88
+ Selection Mode: {selection_mode}
89
+ Context Embedding Model: {context_embedding_model}
90
+ Embedding Model: {embedding_model}
91
+ Num Layers: {num_layers}
92
+ Start Layer: {start_layer}
93
+ """.format(
94
+ tokenizer=self.tokenizer,
95
+ threshold=self.threshold,
96
+ top_k=self.top_k,
97
+ selection_mode=self.selection_mode,
98
+ context_embedding_model=self.context_embedding_model,
99
+ embedding_model=self.embedding_model,
100
+ num_layers=self.num_layers,
101
+ start_layer=self.start_layer,
102
+ )
103
+ return config_log
104
+
105
+
106
+ class TreeRetriever(BaseRetriever):
107
+
108
+ def __init__(self, config, tree) -> None:
109
+ if not isinstance(tree, Tree):
110
+ raise ValueError("tree must be an instance of Tree")
111
+
112
+ if config.num_layers is not None and config.num_layers > tree.num_layers + 1:
113
+ raise ValueError(
114
+ "num_layers in config must be less than or equal to tree.num_layers + 1"
115
+ )
116
+
117
+ if config.start_layer is not None and config.start_layer > tree.num_layers:
118
+ raise ValueError(
119
+ "start_layer in config must be less than or equal to tree.num_layers"
120
+ )
121
+
122
+ self.tree = tree
123
+ self.num_layers = (
124
+ config.num_layers if config.num_layers is not None else tree.num_layers + 1
125
+ )
126
+ self.start_layer = (
127
+ config.start_layer if config.start_layer is not None else tree.num_layers
128
+ )
129
+
130
+ if self.num_layers > self.start_layer + 1:
131
+ raise ValueError("num_layers must be less than or equal to start_layer + 1")
132
+
133
+ self.tokenizer = config.tokenizer
134
+ self.top_k = config.top_k
135
+ self.threshold = config.threshold
136
+ self.selection_mode = config.selection_mode
137
+ self.embedding_model = config.embedding_model
138
+ self.context_embedding_model = config.context_embedding_model
139
+
140
+ self.tree_node_index_to_layer = reverse_mapping(self.tree.layer_to_nodes)
141
+
142
+ logging.info(
143
+ f"Successfully initialized TreeRetriever with Config {config.log_config()}"
144
+ )
145
+
146
+ def create_embedding(self, text: str) -> List[float]:
147
+ """
148
+ Generates embeddings for the given text using the specified embedding model.
149
+
150
+ Args:
151
+ text (str): The text for which to generate embeddings.
152
+
153
+ Returns:
154
+ List[float]: The generated embeddings.
155
+ """
156
+ return self.embedding_model.create_embedding(text)
157
+
158
+ def retrieve_information_collapse_tree(self, query: str, top_k: int, max_tokens: int) -> str:
159
+ """
160
+ Retrieves the most relevant information from the tree based on the query.
161
+
162
+ Args:
163
+ query (str): The query text.
164
+ max_tokens (int): The maximum number of tokens.
165
+
166
+ Returns:
167
+ Tuple[List[Node], str]: The selected nodes and the context created from them.
168
+ """
169
+
170
+ query_embedding = self.create_embedding(query)
171
+
172
+ selected_nodes = []
173
+
174
+ node_list = get_node_list(self.tree.all_nodes)
175
+
176
+ embeddings = get_embeddings(node_list, self.context_embedding_model)
177
+
178
+ distances = distances_from_embeddings(query_embedding, embeddings)
179
+
180
+ indices = indices_of_nearest_neighbors_from_distances(distances)
181
+
182
+ total_tokens = 0
183
+ for idx in indices[:top_k]:
184
+
185
+ node = node_list[idx]
186
+ node_tokens = len(self.tokenizer.encode(node.text))
187
+
188
+ if total_tokens + node_tokens > max_tokens:
189
+ break
190
+
191
+ selected_nodes.append(node)
192
+ total_tokens += node_tokens
193
+
194
+ context = get_text(selected_nodes)
195
+ return selected_nodes, context
196
+
197
+ def retrieve_information(
198
+ self, current_nodes: List[Node], query: str, num_layers: int
199
+ ) -> str:
200
+ """
201
+ Retrieves the most relevant information from the tree based on the query.
202
+
203
+ Args:
204
+ current_nodes (List[Node]): A List of the current nodes.
205
+ query (str): The query text.
206
+ num_layers (int): The number of layers to traverse.
207
+
208
+ Returns:
209
+ Tuple[List[Node], str]: The selected nodes and the context created from them.
210
+ """
211
+
212
+ query_embedding = self.create_embedding(query)
213
+
214
+ selected_nodes = []
215
+
216
+ node_list = current_nodes
217
+
218
+ for layer in range(num_layers):
219
+
220
+ embeddings = get_embeddings(node_list, self.context_embedding_model)
221
+
222
+ distances = distances_from_embeddings(query_embedding, embeddings)
223
+
224
+ indices = indices_of_nearest_neighbors_from_distances(distances)
225
+
226
+ if self.selection_mode == "threshold":
227
+ best_indices = [
228
+ index for index in indices if distances[index] > self.threshold
229
+ ]
230
+
231
+ elif self.selection_mode == "top_k":
232
+ best_indices = indices[: self.top_k]
233
+
234
+ nodes_to_add = [node_list[idx] for idx in best_indices]
235
+
236
+ selected_nodes.extend(nodes_to_add)
237
+
238
+ if layer != num_layers - 1:
239
+
240
+ child_nodes = []
241
+
242
+ for index in best_indices:
243
+ child_nodes.extend(node_list[index].children)
244
+
245
+ # take the unique values
246
+ child_nodes = list(dict.fromkeys(child_nodes))
247
+ node_list = [self.tree.all_nodes[i] for i in child_nodes]
248
+
249
+ context = get_text(selected_nodes)
250
+ return selected_nodes, context
251
+
252
+ def retrieve(
253
+ self,
254
+ query: str,
255
+ start_layer: int = None,
256
+ num_layers: int = None,
257
+ top_k: int = 10,
258
+ max_tokens: int = 3500,
259
+ collapse_tree: bool = True,
260
+ return_layer_information: bool = False,
261
+ ) -> str:
262
+ """
263
+ Queries the tree and returns the most relevant information.
264
+
265
+ Args:
266
+ query (str): The query text.
267
+ start_layer (int): The layer to start from. Defaults to self.start_layer.
268
+ num_layers (int): The number of layers to traverse. Defaults to self.num_layers.
269
+ max_tokens (int): The maximum number of tokens. Defaults to 3500.
270
+ collapse_tree (bool): Whether to use collapsed-tree retrieval over all nodes. Defaults to True.
271
+
272
+ Returns:
273
+ str: The retrieved context; when return_layer_information is True, a (context, layer_information) tuple.
274
+ """
275
+
276
+ if not isinstance(query, str):
277
+ raise ValueError("query must be a string")
278
+
279
+ if not isinstance(max_tokens, int) or max_tokens < 1:
280
+ raise ValueError("max_tokens must be an integer and at least 1")
281
+
282
+ if not isinstance(collapse_tree, bool):
283
+ raise ValueError("collapse_tree must be a boolean")
284
+
285
+ # Set defaults
286
+ start_layer = self.start_layer if start_layer is None else start_layer
287
+ num_layers = self.num_layers if num_layers is None else num_layers
288
+
289
+ if not isinstance(start_layer, int) or not (
290
+ 0 <= start_layer <= self.tree.num_layers
291
+ ):
292
+ raise ValueError(
293
+ "start_layer must be an integer between 0 and tree.num_layers"
294
+ )
295
+
296
+ if not isinstance(num_layers, int) or num_layers < 1:
297
+ raise ValueError("num_layers must be an integer and at least 1")
298
+
299
+ if num_layers > (start_layer + 1):
300
+ raise ValueError("num_layers must be less than or equal to start_layer + 1")
301
+
302
+ if collapse_tree:
303
+ logging.info(f"Using collapsed_tree")
304
+ selected_nodes, context = self.retrieve_information_collapse_tree(
305
+ query, top_k, max_tokens
306
+ )
307
+ else:
308
+ layer_nodes = self.tree.layer_to_nodes[start_layer]
309
+ selected_nodes, context = self.retrieve_information(
310
+ layer_nodes, query, num_layers
311
+ )
312
+
313
+ if return_layer_information:
314
+
315
+ layer_information = []
316
+
317
+ for node in selected_nodes:
318
+ layer_information.append(
319
+ {
320
+ "node_index": node.index,
321
+ "layer_number": self.tree_node_index_to_layer[node.index],
322
+ }
323
+ )
324
+
325
+ return context, layer_information
326
+
327
+ return context
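Note: a minimal usage sketch of the retriever defined above, assuming a Tree has already been built and cached by the baseline script; the pickle path and the "EMB" embedding key are illustrative assumptions, not values fixed by this file.

import pickle
from raptor import SBertEmbeddingModel
from raptor.tree_retriever import TreeRetriever, TreeRetrieverConfig

# Load a previously built tree (placeholder path).
with open("baselines/raptor/trees/example_question.pkl", "rb") as f:
    tree = pickle.load(f)

config = TreeRetrieverConfig(
    embedding_model=SBertEmbeddingModel(),   # should match the model used at build time
    context_embedding_model="EMB",           # assumed key under Node.embeddings
    top_k=10,
    selection_mode="top_k",
)
retriever = TreeRetriever(config, tree)

# Collapsed-tree retrieval over all nodes, bounded by a token budget.
context = retriever.retrieve(
    "What did the user plan for the spring trip?",
    top_k=10, max_tokens=3500, collapse_tree=True,
)
print(context[:500])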
baselines/raptor/raptor/tree_structures.py ADDED
@@ -0,0 +1,28 @@
1
+ from typing import Dict, List, Set
2
+
3
+
4
+ class Node:
5
+ """
6
+ Represents a node in the hierarchical tree structure.
7
+ """
8
+
9
+ def __init__(self, text: str, index: int, children: Set[int], embeddings) -> None:
10
+ self.text = text
11
+ self.index = index
12
+ self.children = children
13
+ self.embeddings = embeddings
14
+
15
+
16
+ class Tree:
17
+ """
18
+ Represents the entire hierarchical tree structure.
19
+ """
20
+
21
+ def __init__(
22
+ self, all_nodes, root_nodes, leaf_nodes, num_layers, layer_to_nodes
23
+ ) -> None:
24
+ self.all_nodes = all_nodes
25
+ self.root_nodes = root_nodes
26
+ self.leaf_nodes = leaf_nodes
27
+ self.num_layers = num_layers
28
+ self.layer_to_nodes = layer_to_nodes
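A toy instantiation of these two structures, for illustration only (the "EMB" key, the texts, and the embedding values are made up; real trees are produced by the tree builder):

from raptor.tree_structures import Node, Tree

# Two leaves and one root that summarizes them; `children` holds child node indices.
leaf_a = Node("Session A summary ...", index=0, children=set(), embeddings={"EMB": [0.1, 0.2]})
leaf_b = Node("Session B summary ...", index=1, children=set(), embeddings={"EMB": [0.3, 0.4]})
root = Node("Summary of sessions A and B", index=2, children={0, 1}, embeddings={"EMB": [0.2, 0.3]})

tree = Tree(
    all_nodes={0: leaf_a, 1: leaf_b, 2: root},
    root_nodes={2: root},
    leaf_nodes={0: leaf_a, 1: leaf_b},
    num_layers=1,
    layer_to_nodes={0: [leaf_a, leaf_b], 1: [root]},
)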
baselines/raptor/raptor/utils.py ADDED
@@ -0,0 +1,208 @@
1
+ import logging
2
+ import re
3
+ from typing import Dict, List, Set
4
+
5
+ import numpy as np
6
+ import tiktoken
7
+ from scipy import spatial
8
+
9
+ from .tree_structures import Node
10
+
11
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
12
+
13
+
14
+ def reverse_mapping(layer_to_nodes: Dict[int, List[Node]]) -> Dict[int, int]:
15
+ node_to_layer = {}
16
+ for layer, nodes in layer_to_nodes.items():
17
+ for node in nodes:
18
+ node_to_layer[node.index] = layer
19
+ return node_to_layer
20
+
21
+
22
+ def split_text(
23
+ text: str, tokenizer: tiktoken.get_encoding("cl100k_base"), max_tokens: int, overlap: int = 0
24
+ ):
25
+ """
26
+ Splits the input text into smaller chunks based on the tokenizer and maximum allowed tokens.
27
+
28
+ Args:
29
+ text (str): The text to be split.
30
+ tokenizer (CustomTokenizer): The tokenizer to be used for splitting the text.
31
+ max_tokens (int): The maximum allowed tokens.
32
+ overlap (int, optional): The number of overlapping tokens between chunks. Defaults to 0.
33
+
34
+ Returns:
35
+ List[str]: A list of text chunks.
36
+ """
37
+ # Split the text into sentences using multiple delimiters
38
+ delimiters = [".", "!", "?", "\n"]
39
+ regex_pattern = "|".join(map(re.escape, delimiters))
40
+ sentences = re.split(regex_pattern, text)
41
+
42
+ # Calculate the number of tokens for each sentence
43
+ n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
44
+
45
+ chunks = []
46
+ current_chunk = []
47
+ current_length = 0
48
+
49
+ for sentence, token_count in zip(sentences, n_tokens):
50
+ # If the sentence is empty or consists only of whitespace, skip it
51
+ if not sentence.strip():
52
+ continue
53
+
54
+ # If the sentence is too long, split it into smaller parts
55
+ if token_count > max_tokens:
56
+ sub_sentences = re.split(r"[,;:]", sentence)
57
+
58
+ # there is no need to keep empty or only-spaced strings
59
+ # since spaces will be inserted at the beginning of the full string
60
+ # and in between the strings in the sub_chunk list
61
+ filtered_sub_sentences = [sub.strip() for sub in sub_sentences if sub.strip() != ""]
62
+ sub_token_counts = [len(tokenizer.encode(" " + sub_sentence)) for sub_sentence in filtered_sub_sentences]
63
+
64
+ sub_chunk = []
65
+ sub_length = 0
66
+
67
+ for sub_sentence, sub_token_count in zip(filtered_sub_sentences, sub_token_counts):
68
+ if sub_length + sub_token_count > max_tokens:
69
+
70
+ # if the phrase does not have sub_sentences, it would create an empty chunk
71
+ # this big phrase would be added anyways in the next chunk append
72
+ if sub_chunk:
73
+ chunks.append(" ".join(sub_chunk))
74
+ sub_chunk = sub_chunk[-overlap:] if overlap > 0 else []
75
+ sub_length = sum(sub_token_counts[max(0, len(sub_chunk) - overlap):len(sub_chunk)])
76
+
77
+ sub_chunk.append(sub_sentence)
78
+ sub_length += sub_token_count
79
+
80
+ if sub_chunk:
81
+ chunks.append(" ".join(sub_chunk))
82
+
83
+ # If adding the sentence to the current chunk exceeds the max tokens, start a new chunk
84
+ elif current_length + token_count > max_tokens:
85
+ chunks.append(" ".join(current_chunk))
86
+ current_chunk = current_chunk[-overlap:] if overlap > 0 else []
87
+ current_length = sum(n_tokens[max(0, len(current_chunk) - overlap):len(current_chunk)])
88
+ current_chunk.append(sentence)
89
+ current_length += token_count
90
+
91
+ # Otherwise, add the sentence to the current chunk
92
+ else:
93
+ current_chunk.append(sentence)
94
+ current_length += token_count
95
+
96
+ # Add the last chunk if it's not empty
97
+ if current_chunk:
98
+ chunks.append(" ".join(current_chunk))
99
+
100
+ return chunks
101
+
102
+
103
+ def distances_from_embeddings(
104
+ query_embedding: List[float],
105
+ embeddings: List[List[float]],
106
+ distance_metric: str = "cosine",
107
+ ) -> List[float]:
108
+ """
109
+ Calculates the distances between a query embedding and a list of embeddings.
110
+
111
+ Args:
112
+ query_embedding (List[float]): The query embedding.
113
+ embeddings (List[List[float]]): A list of embeddings to compare against the query embedding.
114
+ distance_metric (str, optional): The distance metric to use for calculation. Defaults to 'cosine'.
115
+
116
+ Returns:
117
+ List[float]: The calculated distances between the query embedding and the list of embeddings.
118
+ """
119
+ distance_metrics = {
120
+ "cosine": spatial.distance.cosine,
121
+ "L1": spatial.distance.cityblock,
122
+ "L2": spatial.distance.euclidean,
123
+ "Linf": spatial.distance.chebyshev,
124
+ }
125
+
126
+ if distance_metric not in distance_metrics:
127
+ raise ValueError(
128
+ f"Unsupported distance metric '{distance_metric}'. Supported metrics are: {list(distance_metrics.keys())}"
129
+ )
130
+
131
+ distances = [
132
+ distance_metrics[distance_metric](query_embedding, embedding)
133
+ for embedding in embeddings
134
+ ]
135
+
136
+ return distances
137
+
138
+
139
+ def get_node_list(node_dict: Dict[int, Node]) -> List[Node]:
140
+ """
141
+ Converts a dictionary of node indices to a sorted list of nodes.
142
+
143
+ Args:
144
+ node_dict (Dict[int, Node]): Dictionary of node indices to nodes.
145
+
146
+ Returns:
147
+ List[Node]: Sorted list of nodes.
148
+ """
149
+ indices = sorted(node_dict.keys())
150
+ node_list = [node_dict[index] for index in indices]
151
+ return node_list
152
+
153
+
154
+ def get_embeddings(node_list: List[Node], embedding_model: str) -> List:
155
+ """
156
+ Extracts the embeddings of nodes from a list of nodes.
157
+
158
+ Args:
159
+ node_list (List[Node]): List of nodes.
160
+ embedding_model (str): The name of the embedding model to be used.
161
+
162
+ Returns:
163
+ List: List of node embeddings.
164
+ """
165
+ return [node.embeddings[embedding_model] for node in node_list]
166
+
167
+
168
+ def get_children(node_list: List[Node]) -> List[Set[int]]:
169
+ """
170
+ Extracts the children of nodes from a list of nodes.
171
+
172
+ Args:
173
+ node_list (List[Node]): List of nodes.
174
+
175
+ Returns:
176
+ List[Set[int]]: List of sets of node children indices.
177
+ """
178
+ return [node.children for node in node_list]
179
+
180
+
181
+ def get_text(node_list: List[Node]) -> str:
182
+ """
183
+ Generates a single text string by concatenating the text from a list of nodes.
184
+
185
+ Args:
186
+ node_list (List[Node]): List of nodes.
187
+
188
+ Returns:
189
+ str: Concatenated text.
190
+ """
191
+ text = ""
192
+ for node in node_list:
193
+ text += f"{' '.join(node.text.splitlines())}"
194
+ text += "\n\n"
195
+ return text
196
+
197
+
198
+ def indices_of_nearest_neighbors_from_distances(distances: List[float]) -> np.ndarray:
199
+ """
200
+ Returns the indices of nearest neighbors sorted in ascending order of distance.
201
+
202
+ Args:
203
+ distances (List[float]): A list of distances between embeddings.
204
+
205
+ Returns:
206
+ np.ndarray: An array of indices sorted by ascending distance.
207
+ """
208
+ return np.argsort(distances)
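A quick sketch of how the chunking and nearest-neighbour helpers above fit together; the two-dimensional embeddings are made up for illustration (real ones come from the configured embedding model):

import tiktoken
from raptor.utils import (split_text, distances_from_embeddings,
                          indices_of_nearest_neighbors_from_distances)

tok = tiktoken.get_encoding("cl100k_base")
chunks = split_text("First sentence. Second sentence! A third, longer one?", tok, max_tokens=10)
print(chunks)  # sentence-level chunks grouped up to ~10 tokens each (exact split depends on token counts)

query_emb = [0.9, 0.1]
candidate_embs = [[0.1, 0.9], [0.8, 0.2]]
dists = distances_from_embeddings(query_emb, candidate_embs)   # cosine distance by default
order = indices_of_nearest_neighbors_from_distances(dists)     # indices sorted by ascending distance
print(order)  # array([1, 0]) -- the second candidate is the nearer one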
baselines/raptor/requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ faiss-cpu
2
+ numpy==1.26.3
3
+ openai==1.3.3
4
+ scikit-learn
5
+ sentence-transformers==2.2.2
6
+ tenacity==8.2.3
7
+ tiktoken==0.5.1
8
+ torch
9
+ transformers==4.38.1
10
+ umap-learn==0.5.5
11
+ urllib3==1.26.6
baselines/raptor/run_raptor_baseline.py ADDED
@@ -0,0 +1,511 @@
1
+ """
2
+ RAPTOR baseline for the EvolV-Mem benchmark.
3
+
4
+ Builds a RAPTOR tree per question from session summaries, retrieves context
5
+ via collapsed-tree retrieval, and generates answers using Qwen-30B via vLLM.
6
+
7
+ Usage:
8
+ python baselines/raptor/run_raptor_baseline.py \
9
+ --in_file dataset/evolv_mem_v4.json \
10
+ --out_file output/raptor_qwen30b.jsonl \
11
+ --summary_file dataset/all_session_summary.json \
12
+ --profile_file metadata/generated_user_profile.json
13
+
14
+ Env vars:
15
+ VLLM_BASE_URL (default http://localhost:8000/v1)
16
+ VLLM_API_KEY (default EMPTY)
17
+ """
18
+
19
+ import argparse
20
+ import copy
21
+ import json
22
+ import logging
23
+ import os
24
+ import pickle
25
+ import re
26
+ import sys
27
+ import time
28
+ from abc import ABC, abstractmethod
29
+ from collections import defaultdict
30
+ from typing import Dict, List, Optional
31
+
32
+ from tqdm import tqdm
33
+
34
+ # ---------------------------------------------------------------------------
35
+ # Add the raptor package to the path so we can import it directly
36
+ # ---------------------------------------------------------------------------
37
+ SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
38
+ sys.path.insert(0, SCRIPT_DIR)
39
+
40
+ from raptor import (
41
+ BaseEmbeddingModel,
42
+ BaseQAModel,
43
+ BaseSummarizationModel,
44
+ RetrievalAugmentation,
45
+ RetrievalAugmentationConfig,
46
+ SBertEmbeddingModel,
47
+ TreeRetriever,
48
+ )
49
+
50
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
51
+
52
+ # ---------------------------------------------------------------------------
53
+ # vLLM-backed Summarization Model
54
+ # ---------------------------------------------------------------------------
55
+
56
+ class VLLMSummarizationModel(BaseSummarizationModel):
57
+ """Summarization model backed by a vLLM OpenAI-compatible server."""
58
+
59
+ def __init__(
60
+ self,
61
+ model_name: str = None,
62
+ base_url: str = None,
63
+ api_key: str = None,
64
+ ):
65
+ from openai import OpenAI
66
+
67
+ self.model_name = (
68
+ model_name
69
+ or os.getenv("VLLM_MODEL_NAME")
70
+ or "Qwen/Qwen3-30B-A3B-Instruct-2507"
71
+ )
72
+ self.client = OpenAI(
73
+ base_url=base_url or os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
74
+ api_key=api_key or os.getenv("VLLM_API_KEY", "EMPTY"),
75
+ )
76
+
77
+ def summarize(self, context, max_tokens=2048):
78
+ for attempt in range(6):
79
+ try:
80
+ response = self.client.chat.completions.create(
81
+ model=self.model_name,
82
+ messages=[
83
+ {"role": "system", "content": "You are a helpful assistant."},
84
+ {
85
+ "role": "user",
86
+ "content": (
87
+ "Write a summary of the following, including as many "
88
+ "key details as possible:\n\n"
89
+ f"{context}"
90
+ ),
91
+ },
92
+ ],
93
+ max_tokens=max_tokens,
94
+ temperature=0.3,
95
+ )
96
+ content = response.choices[0].message.content if response.choices else None
97
+ if content is None:
98
+ wait = min(2 ** attempt * 2, 30)
99
+ print(f"[WARN] LLM returned None content (attempt {attempt+1}); retrying in {wait}s")
100
+ time.sleep(wait)
101
+ continue
102
+ return content.strip()
103
+ except Exception as e:
104
+ msg = str(e).lower()
105
+ if any(code in msg for code in ("429", "500", "503", "rate limit")):
106
+ wait = min(2 ** attempt * 5, 60)
107
+ print(f"[WARN] Summarization retry {attempt+1}/6, sleeping {wait}s: {e}")
108
+ time.sleep(wait)
109
+ continue
110
+ print(f"[ERROR] Summarization failed: {e}")
111
+ raise
112
+ raise RuntimeError("Summarization failed after 6 retries")
113
+
114
+
115
+ # ---------------------------------------------------------------------------
116
+ # vLLM-backed QA Model
117
+ # ---------------------------------------------------------------------------
118
+
119
+ class VLLMQAModel(BaseQAModel):
120
+ """QA model backed by a vLLM OpenAI-compatible server.
121
+
122
+ Mutable attributes `question_date` and `user_profile` should be set
123
+ before each call to include per-question context in the prompt.
124
+ """
125
+
126
+ def __init__(
127
+ self,
128
+ model_name: str = None,
129
+ base_url: str = None,
130
+ api_key: str = None,
131
+ ):
132
+ from openai import OpenAI
133
+
134
+ self.model_name = (
135
+ model_name
136
+ or os.getenv("VLLM_MODEL_NAME")
137
+ or "Qwen/Qwen3-30B-A3B-Instruct-2507"
138
+ )
139
+ self.client = OpenAI(
140
+ base_url=base_url or os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
141
+ api_key=api_key or os.getenv("VLLM_API_KEY", "EMPTY"),
142
+ )
143
+ # Set these per-question before calling answer_question
144
+ self.question_date: Optional[str] = None
145
+ self.user_profile: Optional[str] = None
146
+
147
+ def answer_question(self, context, question):
148
+ # Build prompt matching the project's answer template (main.py:1490)
149
+ parts = []
150
+ parts.append(
151
+ "I will give you several chat history sessions between you and a user. "
152
+ "Please answer the question given the information."
153
+ )
154
+ if self.user_profile:
155
+ parts.append(f"\n\nUser Profile:\n{self.user_profile}")
156
+ parts.append(f"\n\nChat history sessions:\n\n{context}")
157
+ if self.question_date:
158
+ parts.append(f"\n\nCurrent Date: {self.question_date}")
159
+ parts.append(f"\nQuestion: {question}\nAnswer:")
160
+
161
+ prompt = "".join(parts)
162
+
163
+ for attempt in range(6):
164
+ try:
165
+ response = self.client.chat.completions.create(
166
+ model=self.model_name,
167
+ messages=[{"role": "user", "content": prompt}],
168
+ max_tokens=8192,
169
+ temperature=0.3,
170
+ )
171
+ content = response.choices[0].message.content if response.choices else None
172
+ if content is None:
173
+ wait = min(2 ** attempt * 2, 30)
174
+ print(f"[WARN] LLM returned None content (attempt {attempt+1}); retrying in {wait}s")
175
+ time.sleep(wait)
176
+ continue
177
+ return content.strip()
178
+ except Exception as e:
179
+ msg = str(e).lower()
180
+ if any(code in msg for code in ("429", "500", "503", "rate limit")):
181
+ wait = min(2 ** attempt * 5, 60)
182
+ print(f"[WARN] QA retry {attempt+1}/6, sleeping {wait}s: {e}")
183
+ time.sleep(wait)
184
+ continue
185
+ print(f"[ERROR] QA failed: {e}")
186
+ raise
187
+ raise RuntimeError("QA failed after 6 retries")
188
+
189
+
190
+ # ---------------------------------------------------------------------------
191
+ # Data helpers
192
+ # ---------------------------------------------------------------------------
193
+
194
+ def prepare_session_documents(
195
+ haystack_session_ids: List[str],
196
+ haystack_dates: List[str],
197
+ summaries: Dict,
198
+ ) -> List[str]:
199
+ """Format session summaries as RAPTOR leaf documents.
200
+
201
+ Each document is a short block:
202
+ [Session {sid} | Date: {date}]
203
+ {session_summary_text}
204
+ """
205
+ docs = []
206
+ for sid, date_str in zip(haystack_session_ids, haystack_dates):
207
+ summary_data = summaries.get(sid)
208
+ if summary_data is None:
209
+ continue
210
+ text = summary_data.get("session_summary", "")
211
+ if not text:
212
+ # Fallback: join turn summaries
213
+ turn_sums = summary_data.get("turn_summaries", [])
214
+ if turn_sums:
215
+ text = " ".join(turn_sums)
216
+ else:
217
+ continue
218
+ doc = f"[Session {sid} | Date: {date_str}]\n{text}"
219
+ docs.append(doc)
220
+ return docs
221
+
222
+
223
+ # ---------------------------------------------------------------------------
224
+ # Tree building / caching
225
+ # ---------------------------------------------------------------------------
226
+
227
+ def build_or_load_tree(
228
+ question_id: str,
229
+ docs: List[str],
230
+ tree_builder,
231
+ tree_cache_dir: str,
232
+ ):
233
+ """Build a RAPTOR tree from docs, or load from cache."""
234
+ tree_path = os.path.join(tree_cache_dir, f"{question_id}.pkl")
235
+
236
+ if os.path.exists(tree_path):
237
+ logging.info(f"Loading cached tree for {question_id}")
238
+ with open(tree_path, "rb") as f:
239
+ tree = pickle.load(f)
240
+ return tree
241
+
242
+ # Join docs with double-newline separator so RAPTOR's split_text keeps
243
+ # each ~230-token summary as a single leaf node (with tb_max_tokens=300).
244
+ text = "\n\n".join(docs)
245
+
246
+ logging.info(f"Building tree for {question_id} ({len(docs)} docs)")
247
+ tree = tree_builder.build_from_text(text, use_multithreading=True)
248
+
249
+ os.makedirs(tree_cache_dir, exist_ok=True)
250
+ with open(tree_path, "wb") as f:
251
+ pickle.dump(tree, f)
252
+ logging.info(f"Saved tree for {question_id} -> {tree_path}")
253
+
254
+ return tree
255
+
256
+
257
+ # ---------------------------------------------------------------------------
258
+ # Retrieval metrics
259
+ # ---------------------------------------------------------------------------
260
+
261
+ _SESSION_ID_RE = re.compile(r"\[Session\s+(\S+)\s*\|")
262
+
263
+
264
+ def extract_session_ids_from_context(context: str) -> List[str]:
265
+ """Parse session IDs from RAPTOR-retrieved context text.
266
+
267
+ Leaf nodes are formatted as '[Session {sid} | Date: ...]\\n{summary}'.
268
+ Higher-level nodes are summaries of clusters and won't contain session IDs.
269
+ """
270
+ return list(dict.fromkeys(_SESSION_ID_RE.findall(context))) # unique, order-preserving
271
+
272
+
273
+ def evaluate_retrieval(recalled_docs, correct_docs):
274
+ recall_any = float(any(doc in recalled_docs for doc in correct_docs))
275
+ recall_all = float(all(doc in recalled_docs for doc in correct_docs))
276
+ return recall_any, recall_all
277
+
278
+
279
+ def print_average_metrics(retrieval_metric_list):
280
+ metric_sums = defaultdict(float)
281
+ metric_counts = defaultdict(int)
282
+ for metric in retrieval_metric_list:
283
+ for k, v in metric.items():
284
+ metric_sums[k] += v
285
+ metric_counts[k] += 1
286
+ print(" Average retrieval metrics:")
287
+ for k in sorted(metric_sums):
288
+ avg = metric_sums[k] / metric_counts[k]
289
+ print(f" {k}: {avg:.4f}")
290
+
291
+
292
+ # ---------------------------------------------------------------------------
293
+ # Main
294
+ # ---------------------------------------------------------------------------
295
+
296
+ def main():
297
+ parser = argparse.ArgumentParser(description="RAPTOR baseline for EvolV-Mem")
298
+ parser.add_argument("--in_file", type=str, required=True,
299
+ help="Path to evolv_mem_v4.json")
300
+ parser.add_argument("--out_file", type=str, required=True,
301
+ help="Output JSONL file")
302
+ parser.add_argument("--summary_file", type=str, required=True,
303
+ help="Path to all_session_summary.json")
304
+ parser.add_argument("--profile_file", type=str, default=None,
305
+ help="Path to generated_user_profile.json")
306
+ parser.add_argument("--tree_cache_dir", type=str,
307
+ default="baselines/raptor/trees",
308
+ help="Directory to cache built trees")
309
+ # RAPTOR tree builder params
310
+ parser.add_argument("--tb_max_tokens", type=int, default=300,
311
+ help="Max tokens per leaf chunk (default 300)")
312
+ parser.add_argument("--tb_num_layers", type=int, default=3,
313
+ help="Number of tree layers (default 3)")
314
+ parser.add_argument("--tb_summarization_length", type=int, default=200,
315
+ help="Max tokens per cluster summary (default 200)")
316
+ # RAPTOR retrieval params
317
+ parser.add_argument("--tr_top_k", type=int, default=10,
318
+ help="Top-k nodes to retrieve (default 10)")
319
+ parser.add_argument("--max_retrieval_tokens", type=int, default=8000,
320
+ help="Token budget for retrieved context (default 8000)")
321
+ # Embedding model
322
+ parser.add_argument("--embedding_model", type=str,
323
+ default="sentence-transformers/multi-qa-mpnet-base-cos-v1",
324
+ help="SentenceTransformer model for embeddings")
325
+ # Index range (for parallel jobs)
326
+ parser.add_argument("--start_idx", type=int, default=None,
327
+ help="Start index (inclusive) for question subset")
328
+ parser.add_argument("--end_idx", type=int, default=None,
329
+ help="End index (exclusive) for question subset")
330
+ # Limit (for debugging)
331
+ parser.add_argument("--limit", type=int, default=None,
332
+ help="Process only the first N questions")
333
+ args = parser.parse_args()
334
+
335
+ # -----------------------------------------------------------------------
336
+ # Load data
337
+ # -----------------------------------------------------------------------
338
+ print(f"Loading benchmark from {args.in_file} ...")
339
+ with open(args.in_file) as f:
340
+ benchmark = json.load(f)
341
+ if args.start_idx is not None or args.end_idx is not None:
342
+ s = args.start_idx or 0
343
+ e = args.end_idx or len(benchmark)
344
+ benchmark = benchmark[s:e]
345
+ print(f" Using index range [{s}, {e})")
346
+ if args.limit:
347
+ benchmark = benchmark[: args.limit]
348
+ print(f" {len(benchmark)} questions loaded.")
349
+
350
+ print(f"Loading session summaries from {args.summary_file} ...")
351
+ with open(args.summary_file) as f:
352
+ summaries = json.load(f)
353
+ print(f" {len(summaries)} sessions loaded.")
354
+
355
+ profiles = {}
356
+ if args.profile_file and os.path.exists(args.profile_file):
357
+ print(f"Loading user profiles from {args.profile_file} ...")
358
+ with open(args.profile_file) as f:
359
+ profiles = json.load(f)
360
+ print(f" {len(profiles)} profiles loaded.")
361
+
362
+ # -----------------------------------------------------------------------
363
+ # Resume support: load existing output
364
+ # -----------------------------------------------------------------------
365
+ existing_qids = set()
366
+ if os.path.exists(args.out_file):
367
+ with open(args.out_file) as f:
368
+ for line in f:
369
+ line = line.strip()
370
+ if line:
371
+ obj = json.loads(line)
372
+ existing_qids.add(obj["question_id"])
373
+ print(f" Resuming: {len(existing_qids)} questions already processed.")
374
+
375
+ # -----------------------------------------------------------------------
376
+ # Initialize models
377
+ # -----------------------------------------------------------------------
378
+ print("Initializing models ...")
379
+ embedding_model = SBertEmbeddingModel(model_name=args.embedding_model)
380
+ summarization_model = VLLMSummarizationModel()
381
+ qa_model = VLLMQAModel()
382
+
383
+ # -----------------------------------------------------------------------
384
+ # Build RAPTOR config
385
+ # -----------------------------------------------------------------------
386
+ config = RetrievalAugmentationConfig(
387
+ summarization_model=summarization_model,
388
+ qa_model=qa_model,
389
+ embedding_model=embedding_model,
390
+ tb_max_tokens=args.tb_max_tokens,
391
+ tb_num_layers=args.tb_num_layers,
392
+ tb_summarization_length=args.tb_summarization_length,
393
+ tr_top_k=args.tr_top_k,
394
+ )
395
+
396
+ # Pre-create tree builder (reused across questions)
397
+ tree_builder = config.tree_builder_config
398
+ # We need the actual builder instance from a fresh RA to reuse
399
+ ra_template = RetrievalAugmentation(config=config)
400
+ tree_builder_instance = ra_template.tree_builder
401
+
402
+ os.makedirs(args.tree_cache_dir, exist_ok=True)
403
+
404
+ # -----------------------------------------------------------------------
405
+ # Process questions
406
+ # -----------------------------------------------------------------------
407
+ retrieval_metric_list = []
408
+ out_f = open(args.out_file, "a")
409
+
410
+ for di, entry in enumerate(tqdm(benchmark, desc="RAPTOR baseline")):
411
+ qid = entry["question_id"]
412
+ question = entry["question"]
413
+ question_date = entry["question_date"]
414
+
415
+ if qid in existing_qids:
416
+ continue
417
+
418
+ try:
419
+ # 1. Prepare documents from session summaries
420
+ docs = prepare_session_documents(
421
+ entry["haystack_session_ids"],
422
+ entry["haystack_dates"],
423
+ summaries,
424
+ )
425
+
426
+ if not docs:
427
+ print(f"[WARN] q_idx={di} qid={qid}: no session summaries found, skipping.")
428
+ result = {
429
+ "q_idx": di,
430
+ "question_id": qid,
431
+ "hypothesis": "Insufficient information to answer.",
432
+ "n_docs": 0,
433
+ }
434
+ print(json.dumps(result), file=out_f, flush=True)
435
+ continue
436
+
437
+ # 2. Build or load RAPTOR tree
438
+ tree = build_or_load_tree(
439
+ qid, docs, tree_builder_instance, args.tree_cache_dir
440
+ )
441
+
442
+ # 3. Create RA instance with this tree
443
+ ra = RetrievalAugmentation(config=config, tree=tree)
444
+
445
+ # 4. Set per-question context on the QA model
446
+ qa_model.question_date = question_date
447
+ user_id = qid.split("_q_")[0] if "_q_" in qid else qid
448
+ qa_model.user_profile = profiles.get(user_id, None)
449
+
450
+ # 5. Retrieve context (separate from QA so we can extract session IDs)
451
+ context, layer_info = ra.retrieve(
452
+ question=question,
453
+ top_k=args.tr_top_k,
454
+ max_tokens=args.max_retrieval_tokens,
455
+ collapse_tree=True,
456
+ return_layer_information=True,
457
+ )
458
+
459
+ # 5a. Extract retrieved session IDs from context text
460
+ retrieved_session_ids = extract_session_ids_from_context(context)
461
+
462
+ # 5b. Generate answer manually
463
+ answer = qa_model.answer_question(context, question)
464
+
465
+ # 5c. Compute retrieval metrics
466
+ answer_session_ids = entry.get("answer_session_ids", [])
467
+ retrieval_metric = {}
468
+ if answer_session_ids and retrieved_session_ids:
469
+ for topk in [5, 10, 20, 30]:
470
+ r_any, r_all = evaluate_retrieval(
471
+ retrieved_session_ids[:topk], answer_session_ids
472
+ )
473
+ retrieval_metric[f"recall_any@{topk}"] = r_any
474
+ retrieval_metric[f"recall_all@{topk}"] = r_all
475
+ retrieval_metric_list.append(retrieval_metric)
476
+ print_average_metrics(retrieval_metric_list)
477
+
478
+ # 6. Write output
479
+ result = {
480
+ "q_idx": di,
481
+ "question_id": qid,
482
+ "hypothesis": answer,
483
+ "n_docs": len(docs),
484
+ "n_tree_nodes": len(tree.all_nodes) if hasattr(tree, "all_nodes") else -1,
485
+ "n_tree_layers": tree.num_layers if hasattr(tree, "num_layers") else -1,
486
+ "retrieved_session_ids": retrieved_session_ids,
487
+ "retrieval_metric": retrieval_metric,
488
+ }
489
+ print(json.dumps(result), file=out_f, flush=True)
490
+
491
+ print(
492
+ f"[{di}/{len(benchmark)}] qid={qid} | "
493
+ f"docs={len(docs)} | nodes={result['n_tree_nodes']} | "
494
+ f"layers={result['n_tree_layers']} | "
495
+ f"retrieved_sessions={len(retrieved_session_ids)}"
496
+ )
497
+ print(f" Q: {question[:100]}...")
498
+ print(f" A: {answer[:200]}...")
499
+
500
+ except Exception as e:
501
+ print(f"[ERROR] q_idx={di} qid={qid} failed: {e}", flush=True)
502
+ import traceback
503
+ traceback.print_exc()
504
+ continue
505
+
506
+ out_f.close()
507
+ print(f"\nDone. Results saved to {args.out_file}")
508
+
509
+
510
+ if __name__ == "__main__":
511
+ main()
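Since the retrieval metrics in this script operate purely on session-ID strings, they can be sanity-checked in isolation. The session IDs and summaries below are made up for illustration, with extract_session_ids_from_context and evaluate_retrieval imported from (or pasted alongside) this module:

context = (
    "[Session u1_s3 | Date: 2023-05-01]\nUser booked a trip to Kyoto.\n\n"
    "[Session u1_s7 | Date: 2023-06-12]\nUser asked about ramen shops.\n\n"
)
retrieved = extract_session_ids_from_context(context)      # ['u1_s3', 'u1_s7']
r_any, r_all = evaluate_retrieval(retrieved, ["u1_s7", "u1_s9"])
print(r_any, r_all)                                         # 1.0 0.0 -- one of the two gold sessions was retrieved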
baselines/read-agent/read_agent_demo.ipynb ADDED
@@ -0,0 +1,976 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "toc_visible": true
8
+ },
9
+ "kernelspec": {
10
+ "name": "python3",
11
+ "display_name": "Python 3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ }
16
+ },
17
+ "cells": [
18
+ {
19
+ "cell_type": "markdown",
20
+ "source": [
21
+ "![read_agent_teaser](https://read-agent.github.io/img/teaser.png)"
22
+ ],
23
+ "metadata": {
24
+ "id": "1iqyV7VcsiXT"
25
+ }
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "execution_count": null,
30
+ "metadata": {
31
+ "id": "MYOnCMh83ZRE"
32
+ },
33
+ "outputs": [],
34
+ "source": [
35
+ "!wget https://github.com/nyu-mll/quality/raw/main/data/v1.0.1/QuALITY.v1.0.1.htmlstripped.dev\n",
36
+ "import re, time, datetime, json, string, copy"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "code",
41
+ "source": [
42
+ "# @title Using OpenAI GPT model (DO NOT run the next cell if using GPT)\n",
43
+ "!pip3 install openai\n",
44
+ "import openai\n",
45
+ "\n",
46
+ "key = 'YOUR API KEY' #@param {type: \"string\"}\n",
47
+ "gpt_client = openai.OpenAI(api_key=key)\n",
48
+ "model_type = 'gpt'\n",
49
+ "\n",
50
+ "def query_gpt_model(\n",
51
+ " prompt: str,\n",
52
+ " lm: str = 'gpt-3.5-turbo-1106',\n",
53
+ " temperature: float = 0.0,\n",
54
+ " max_decode_steps: int = 512,\n",
55
+ " seconds_to_reset_tokens: float = 30.0,\n",
56
+ ") -> str:\n",
57
+ " while True:\n",
58
+ " try:\n",
59
+ " raw_response = gpt_client.chat.completions.with_raw_response.create(\n",
60
+ " model=lm,\n",
61
+ " max_tokens=max_decode_steps,\n",
62
+ " temperature=temperature,\n",
63
+ " messages=[\n",
64
+ " {'role': 'user', 'content': prompt},\n",
65
+ " ]\n",
66
+ " )\n",
67
+ " completion = raw_response.parse()\n",
68
+ " return completion.choices[0].message.content\n",
69
+ " except openai.RateLimitError as e:\n",
70
+ " print(f'{datetime.datetime.now()}: query_gpt_model: RateLimitError {e.message}: {e}')\n",
71
+ " time.sleep(seconds_to_reset_tokens)\n",
72
+ " except openai.APIError as e:\n",
73
+ " print(f'{datetime.datetime.now()}: query_gpt_model: APIError {e.message}: {e}')\n",
74
+ " print(f'{datetime.datetime.now()}: query_gpt_model: Retrying after 5 seconds...')\n",
75
+ " time.sleep(5)"
76
+ ],
77
+ "metadata": {
78
+ "id": "oz0kOxYJ4n3e",
79
+ "cellView": "form"
80
+ },
81
+ "execution_count": null,
82
+ "outputs": []
83
+ },
84
+ {
85
+ "cell_type": "code",
86
+ "source": [
87
+ "# @title Using Google Gemini model (DO NOT run this if using GPT)\n",
88
+ "!pip3 install -q -U google-generativeai\n",
89
+ "import google.generativeai as genai\n",
90
+ "\n",
91
+ "key = 'YOUR API KEY' #@param {type: \"string\"}\n",
92
+ "\n",
93
+ "genai.configure(api_key=key)\n",
94
+ "model = genai.GenerativeModel('gemini-pro')\n",
95
+ "model_type = 'gemini'\n",
96
+ "\n",
97
+ "def query_gemini_model(\n",
98
+ " prompt: str,\n",
99
+ " retries: int = 10,\n",
100
+ ") -> str:\n",
101
+ " while True and retries > 0:\n",
102
+ " try:\n",
103
+ " response = model.generate_content(prompt)\n",
104
+ " text_response = response.text.replace(\"**\", \"\")\n",
105
+ " return text_response\n",
106
+ " except Exception as e:\n",
107
+ " print(f'{datetime.datetime.now()}: query_gemini_model: Error: {e}')\n",
108
+ " print(f'{datetime.datetime.now()}: query_gemini_model: Retrying after 5 seconds...')\n",
109
+ " retries -= 1\n",
110
+ " time.sleep(5)"
111
+ ],
112
+ "metadata": {
113
+ "cellView": "form",
114
+ "id": "YcP_tIpZKNFY"
115
+ },
116
+ "execution_count": null,
117
+ "outputs": []
118
+ },
119
+ {
120
+ "cell_type": "code",
121
+ "source": [
122
+ "def query_model(prompt):\n",
123
+ " if model_type == \"gpt\":\n",
124
+ " return query_gpt_model(prompt)\n",
125
+ " elif model_type == \"gemini\":\n",
126
+ " return query_gemini_model(prompt)"
127
+ ],
128
+ "metadata": {
129
+ "id": "pYm2GsBGEvAI"
130
+ },
131
+ "execution_count": null,
132
+ "outputs": []
133
+ },
134
+ {
135
+ "cell_type": "code",
136
+ "source": [
137
+ "#@title Load a QuALITY example\n",
138
+ "\n",
139
+ "# Fields that are straight text copies from raw example to processed example.\n",
140
+ "_ONE2ONE_FIELDS = (\n",
141
+ " 'article',\n",
142
+ " 'article_id',\n",
143
+ " 'set_unique_id',\n",
144
+ " 'writer_id',\n",
145
+ " 'source',\n",
146
+ " 'title',\n",
147
+ " 'topic',\n",
148
+ " 'url',\n",
149
+ " 'writer_id',\n",
150
+ " 'author',\n",
151
+ ")\n",
152
+ "\n",
153
+ "quality_dev = []\n",
154
+ "\n",
155
+ "with open('QuALITY.v1.0.1.htmlstripped.dev', 'r') as f:\n",
156
+ " for line in f.readlines():\n",
157
+ " j = json.loads(line)\n",
158
+ " fields = {k: j[k] for k in _ONE2ONE_FIELDS}\n",
159
+ " fields.update({\n",
160
+ " 'questions': [q['question'] for q in j['questions']],\n",
161
+ " 'question_ids': [q['question_unique_id'] for q in j['questions']],\n",
162
+ " 'difficults': [q['difficult'] for q in j['questions']],\n",
163
+ " 'options': [q['options'] for q in j['questions']],\n",
164
+ " })\n",
165
+ "\n",
166
+ " fields.update({\n",
167
+ " 'gold_labels': [q['gold_label'] for q in j['questions']],\n",
168
+ " 'writer_labels': [q['writer_label'] for q in j['questions']],\n",
169
+ " })\n",
170
+ "\n",
171
+ " quality_dev.append(fields)\n",
172
+ "\n",
173
+ "example = quality_dev[13]"
174
+ ],
175
+ "metadata": {
176
+ "id": "1B70Rqg97aXu",
177
+ "cellView": "form"
178
+ },
179
+ "execution_count": null,
180
+ "outputs": []
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "source": [
185
+ "#@title Helper functions\n",
186
+ "\n",
187
+ "all_lowercase_letters = string.ascii_lowercase # \"abcd...xyz\"\n",
188
+ "bracketed_lowercase_letters_set = set(\n",
189
+ " [f\"({l})\" for l in all_lowercase_letters]\n",
190
+ ") # {\"(a)\", ...}\n",
191
+ "bracketed_uppercase_letters_set = set(\n",
192
+ " [f\"({l.upper()})\" for l in all_lowercase_letters]\n",
193
+ ") # {\"(a)\", ...}\n",
194
+ "\n",
195
+ "choices = ['(A)', '(B)', '(C)', '(D)']\n",
196
+ "\n",
197
+ "def get_index_from_symbol(answer):\n",
198
+ " \"\"\"Get the index from the letter symbols A, B, C, D, to extract answer texts.\n",
199
+ "\n",
200
+ " Args:\n",
201
+ " answer (str): the string of answer like \"(B)\".\n",
202
+ "\n",
203
+ " Returns:\n",
204
+ " index (int): how far the given choice is from \"a\", like 1 for answer \"(B)\".\n",
205
+ " \"\"\"\n",
206
+ " answer = str(answer).lower()\n",
207
+ " # extract the choice letter from within bracket\n",
208
+ " if answer in bracketed_lowercase_letters_set:\n",
209
+ " answer = re.findall(r\"\\(.*?\\)\", answer)[0][1]\n",
210
+ " index = ord(answer) - ord(\"a\")\n",
211
+ " return index\n",
212
+ "\n",
213
+ "def count_words(text):\n",
214
+ " \"\"\"Simple word counting.\"\"\"\n",
215
+ " return len(text.split())\n",
216
+ "\n",
217
+ "def quality_gutenberg_parser(raw_article):\n",
218
+ " \"\"\"Parse Gutenberg articles in the QuALITY dataset.\"\"\"\n",
219
+ " lines = []\n",
220
+ " previous_line = None\n",
221
+ " for i, line in enumerate(raw_article.split('\\n')):\n",
222
+ " line = line.strip()\n",
223
+ " original_line = line\n",
224
+ " if line == '':\n",
225
+ " if previous_line == '':\n",
226
+ " line = '\\n'\n",
227
+ " else:\n",
228
+ " previous_line = original_line\n",
229
+ " continue\n",
230
+ " previous_line = original_line\n",
231
+ " lines.append(line)\n",
232
+ " return ' '.join(lines)"
233
+ ],
234
+ "metadata": {
235
+ "id": "nQsb3n6pOlz2",
236
+ "cellView": "form"
237
+ },
238
+ "execution_count": null,
239
+ "outputs": []
240
+ },
241
+ {
242
+ "cell_type": "code",
243
+ "source": [
244
+ "#@title ReadAgent (1) Episode Pagination\n",
245
+ "\n",
246
+ "prompt_pagination_template = \"\"\"\n",
247
+ "You are given a passage that is taken from a larger text (article, book, ...) and some numbered labels between the paragraphs in the passage.\n",
248
+ "Numbered label are in angeled brackets. For example, if the label number is 19, it shows as <19> in text.\n",
249
+ "Please choose one label that it is natural to break reading.\n",
250
+ "Such point can be scene transition, end of a dialogue, end of an argument, narrative transition, etc.\n",
251
+ "Please answer the break point label and explain.\n",
252
+ "For example, if <57> is a good point to break, answer with \\\"Break point: <57>\\n Because ...\\\"\n",
253
+ "\n",
254
+ "Passage:\n",
255
+ "\n",
256
+ "{0}\n",
257
+ "{1}\n",
258
+ "{2}\n",
259
+ "\n",
260
+ "\"\"\"\n",
261
+ "\n",
262
+ "def parse_pause_point(text):\n",
263
+ " text = text.strip(\"Break point: \")\n",
264
+ " if text[0] != '<':\n",
265
+ " return None\n",
266
+ " for i, c in enumerate(text):\n",
267
+ " if c == '>':\n",
268
+ " if text[1:i].isnumeric():\n",
269
+ " return int(text[1:i])\n",
270
+ " else:\n",
271
+ " return None\n",
272
+ " return None\n",
273
+ "\n",
274
+ "\n",
275
+ "def quality_pagination(example,\n",
276
+ " word_limit=600,\n",
277
+ " start_threshold=280,\n",
278
+ " max_retires=10,\n",
279
+ " verbose=True,\n",
280
+ " allow_fallback_to_last=True):\n",
281
+ " article = example['article']\n",
282
+ " title = example['title']\n",
283
+ " print(f\"[Pagination][Article {title}]\")\n",
284
+ " paragraphs = quality_gutenberg_parser(article).split('\\n')\n",
285
+ "\n",
286
+ " i = 0\n",
287
+ " pages = []\n",
288
+ " while i < len(paragraphs):\n",
289
+ " preceding = \"\" if i == 0 else \"...\\n\" + '\\n'.join(pages[-1])\n",
290
+ " passage = [paragraphs[i]]\n",
291
+ " wcount = count_words(paragraphs[i])\n",
292
+ " j = i + 1\n",
293
+ " while wcount < word_limit and j < len(paragraphs):\n",
294
+ " wcount += count_words(paragraphs[j])\n",
295
+ " if wcount >= start_threshold:\n",
296
+ " passage.append(f\"<{j}>\")\n",
297
+ " passage.append(paragraphs[j])\n",
298
+ " j += 1\n",
299
+ " passage.append(f\"<{j}>\")\n",
300
+ " end_tag = \"\" if j == len(paragraphs) else paragraphs[j] + \"\\n...\"\n",
301
+ "\n",
302
+ " pause_point = None\n",
303
+ " if wcount < 350:\n",
304
+ " pause_point = len(paragraphs)\n",
305
+ " else:\n",
306
+ " prompt = prompt_pagination_template.format(preceding, '\\n'.join(passage), end_tag)\n",
307
+ " response = query_model(prompt=prompt).strip()\n",
308
+ " pause_point = parse_pause_point(response)\n",
309
+ " if pause_point and (pause_point <= i or pause_point > j):\n",
310
+ " print(f\"prompt:\\n{prompt},\\nresponse:\\n{response}\\n\")\n",
311
+ " print(f\"i:{i} j:{j} pause_point:{pause_point}\")\n",
312
+ " pause_point = None\n",
313
+ " if pause_point is None:\n",
314
+ " if allow_fallback_to_last:\n",
315
+ " pause_point = j\n",
316
+ " else:\n",
317
+ " raise ValueError(f\"prompt:\\n{prompt},\\nresponse:\\n{response}\\n\")\n",
318
+ "\n",
319
+ " page = paragraphs[i:pause_point]\n",
320
+ " pages.append(page)\n",
321
+ " if verbose:\n",
322
+ " print(f\"Paragraph {i}-{pause_point-1}\", page)\n",
323
+ " i = pause_point\n",
324
+ " print(f\"[Pagination] Done with {len(pages)} pages\")\n",
325
+ " return pages\n",
326
+ "\n",
327
+ "pages = quality_pagination(example)"
328
+ ],
329
+ "metadata": {
330
+ "id": "BfFkEQKx0u9U"
331
+ },
332
+ "execution_count": null,
333
+ "outputs": []
334
+ },
335
+ {
336
+ "cell_type": "code",
337
+ "source": [
338
+ "#@title ReadAgent (2) Memory Gisting\n",
339
+ "\n",
340
+ "prompt_shorten_template = \"\"\"\n",
341
+ "Please shorten the following passage.\n",
342
+ "Just give me a shortened version. DO NOT explain your reason.\n",
343
+ "\n",
344
+ "Passage:\n",
345
+ "{}\n",
346
+ "\n",
347
+ "\"\"\"\n",
348
+ "\n",
349
+ "def quality_gisting(example, pages, word_limit=600, start_threshold=280, verbose=True):\n",
350
+ " article = example['article']\n",
351
+ " title = example['title']\n",
352
+ " word_count = count_words(article)\n",
353
+ " print(f\"[Gisting][Article {title}], {word_count} words\")\n",
354
+ "\n",
355
+ " shortened_pages = []\n",
356
+ " for i, page in enumerate(pages):\n",
357
+ " prompt = prompt_shorten_template.format('\\n'.join(page))\n",
358
+ " response = query_model(prompt)\n",
359
+ " shortened_text = response.strip()\n",
360
+ " shortened_pages.append(shortened_text)\n",
361
+ " if verbose:\n",
362
+ " print(\"[gist] page {}:\".format(i), shortened_text, flush=True)\n",
363
+ " shortened_article = '\\n'.join(shortened_pages)\n",
364
+ " gist_word_count = count_words(shortened_article)\n",
365
+ " if verbose:\n",
366
+ " print(\"Shortened article:\\n\", shortened_article, flush=True)\n",
367
+ " output = copy.deepcopy(example)\n",
368
+ " output.update({'title': title, 'word_count': word_count, 'gist_word_count': gist_word_count, 'shortened_pages': shortened_pages, 'pages': pages})\n",
369
+ " if verbose:\n",
370
+ " print(f\"compression rate {round(100.0 - gist_word_count/word_count*100, 2)}% ({gist_word_count}/{word_count})\")\n",
371
+ " return output\n",
372
+ "example_with_gists = quality_gisting(example, pages)"
373
+ ],
374
+ "metadata": {
375
+ "id": "DLBolKnkS_9y"
376
+ },
377
+ "execution_count": null,
378
+ "outputs": []
379
+ },
380
+ {
381
+ "cell_type": "code",
382
+ "source": [
383
+ "#@title ReadAgent (3) Look-Up\n",
384
+ "\n",
385
+ "prompt_lookup_template = \"\"\"\n",
386
+ "The following text is what you remembered from reading an article and a multiple choice question related to it.\n",
387
+ "You may read 1 to 6 page(s) of the article again to refresh your memory to prepare yourselve for the question.\n",
388
+ "Please respond with which page(s) you would like to read.\n",
389
+ "For example, if your only need to read Page 8, respond with \\\"I want to look up Page [8] to ...\\\";\n",
390
+ "if your would like to read Page 7 and 12, respond with \\\"I want to look up Page [7, 12] to ...\\\";\n",
391
+ "if your would like to read Page 2, 3, 7, 15 and 18, respond with \\\"I want to look up Page [2, 3, 7, 15, 18] to ...\\\".\n",
392
+ "if your would like to read Page 3, 4, 5, 12, 13 and 16, respond with \\\"I want to look up Page [3, 3, 4, 12, 13, 16] to ...\\\".\n",
393
+ "DO NOT select more pages if you don't need to.\n",
394
+ "DO NOT answer the question yet.\n",
395
+ "\n",
396
+ "Text:\n",
397
+ "{}\n",
398
+ "\n",
399
+ "Question:\n",
400
+ "{}\n",
401
+ "{}\n",
402
+ "\n",
403
+ "Take a deep breath and tell me: Which page(s) would you like to read again?\n",
404
+ "\"\"\"\n",
405
+ "\n",
406
+ "prompt_answer_template = \"\"\"\n",
407
+ "Read the following article and answer a multiple choice question.\n",
408
+ "For example, if (C) is correct, answer with \\\"Answer: (C) ...\\\"\n",
409
+ "\n",
410
+ "Article:\n",
411
+ "{}\n",
412
+ "\n",
413
+ "Question:\n",
414
+ "{}\n",
415
+ "{}\n",
416
+ "\n",
417
+ "\"\"\"\n",
418
+ "\n",
419
+ "def quality_parallel_lookup(example, verbose=True):\n",
420
+ " preprocessed_pages = example['pages']\n",
421
+ " article = example['article']\n",
422
+ " title = example['title']\n",
423
+ " word_count = example['word_count']\n",
424
+ " gist_word_count = example['gist_word_count']\n",
425
+ " pages = example['pages']\n",
426
+ " shortened_pages = example['shortened_pages']\n",
427
+ " questions = example['questions']\n",
428
+ " options = example['options']\n",
429
+ " gold_labels = example['gold_labels'] # numerical [1, 2, 3, 4]\n",
430
+ "\n",
431
+ " print(f\"[Look-Up][Article {title}] {word_count} words\")\n",
432
+ "\n",
433
+ " model_choices = []\n",
434
+ " lookup_page_ids = []\n",
435
+ "\n",
436
+ " shortened_pages_pidx = []\n",
437
+ " for i, shortened_text in enumerate(shortened_pages):\n",
438
+ " shortened_pages_pidx.append(\"<Page {}>\\n\".format(i) + shortened_text)\n",
439
+ " shortened_article = '\\n'.join(shortened_pages_pidx)\n",
440
+ "\n",
441
+ " expanded_gist_word_counts = []\n",
442
+ " for i, label in enumerate(gold_labels):\n",
443
+ " # only test the first question for demo\n",
444
+ " if i != 1:\n",
445
+ " continue\n",
446
+ " q = questions[i]\n",
447
+ " print(\"question: \", q)\n",
448
+ " options_i = [f\"{ol} {o}\" for ol, o in zip(choices, options[i])]\n",
449
+ " print(\"options: \", \"\\n\".join(options_i))\n",
450
+ " prompt_lookup = prompt_lookup_template.format(shortened_article, q, '\\n'.join(options_i))\n",
451
+ "\n",
452
+ " page_ids = []\n",
453
+ "\n",
454
+ " response = query_model(prompt=prompt_lookup).strip()\n",
455
+ "\n",
456
+ " try: start = response.index('[')\n",
457
+ " except ValueError: start = len(response)\n",
458
+ " try: end = response.index(']')\n",
459
+ " except ValueError: end = 0\n",
460
+ " if start < end:\n",
461
+ " page_ids_str = response[start+1:end].split(',')\n",
462
+ " page_ids = []\n",
463
+ " for p in page_ids_str:\n",
464
+ " if p.strip().isnumeric():\n",
465
+ " page_id = int(p)\n",
466
+ " if page_id < 0 or page_id >= len(pages):\n",
467
+ " print(\"Skip invalid page number: \", page_id, flush=True)\n",
468
+ " else:\n",
469
+ " page_ids.append(page_id)\n",
470
+ "\n",
471
+ " if verbose:\n",
472
+ " print(\"Model chose to look up page {}\".format(page_ids))\n",
473
+ "\n",
474
+ " # Memory expansion after look-up, replacing the target shortened page with the original page\n",
475
+ " expanded_shortened_pages = shortened_pages[:]\n",
476
+ " if len(page_ids) > 0:\n",
477
+ " for page_id in page_ids:\n",
478
+ " expanded_shortened_pages[page_id] = '\\n'.join(pages[page_id])\n",
479
+ "\n",
480
+ " expanded_shortened_article = '\\n'.join(expanded_shortened_pages)\n",
481
+ " expanded_gist_word_count = count_words(expanded_shortened_article)\n",
482
+ " if verbose:\n",
483
+ " print(\"Expanded shortened article:\\n\", expanded_shortened_article, flush=True)\n",
484
+ " prompt_answer = prompt_answer_template.format(expanded_shortened_article, q, '\\n'.join(options_i))\n",
485
+ "\n",
486
+ " # If the response doesn't follow the template, retry\n",
487
+ " model_choice = None\n",
488
+ " response = query_model(prompt=prompt_answer)\n",
489
+ " response = response.strip()\n",
490
+ " for j, choice in enumerate(choices):\n",
491
+ " if response.startswith(f\"Answer: {choice}\") or response.startswith(f\"Answer: {choice[1]}\"):\n",
492
+ " model_choice = j+1\n",
493
+ " break\n",
494
+ " is_correct = 1 if model_choice == label else 0\n",
495
+ " print(f\"question: {q}\")\n",
496
+ " print(f\"reference answer: {choices[label]}, model prediction: {choices[model_choice]}, is_correct: {is_correct}\")\n",
497
+ " print(f\"compression rate {round(100.0 - gist_word_count/word_count*100, 2)}% ({gist_word_count}/{word_count})\")\n",
498
+ " print(f\"compression rate after look-up {round(100.0 - expanded_gist_word_count/word_count*100, 2)}% ({expanded_gist_word_count}/{word_count})\")\n",
499
+ "\n",
500
+ "quality_parallel_lookup(example_with_gists)"
501
+ ],
502
+ "metadata": {
503
+ "id": "8YKNTyDsXNIn"
504
+ },
505
+ "execution_count": null,
506
+ "outputs": []
507
+ },
508
+ {
509
+ "cell_type": "markdown",
510
+ "source": [
511
+ "#Prompts that we used in the paper\n",
512
+ "\n",
513
+ "In the following we show the prompts that were used for the QuALTIY, QMSum, NarrativeQA datasets with the PaLM 2-L model. While there are slight differences in prompt design, most of these are not due to optimizing prompts for specific datasets but rather a results of that each author wrote the prompts independently."
514
+ ],
515
+ "metadata": {
516
+ "id": "Gn8fjomx7iRz"
517
+ }
518
+ },
519
+ {
520
+ "cell_type": "code",
521
+ "source": [
522
+ "# @title The prompts we used for QuALITY with PaLM 2-L\n",
523
+ "\n",
524
+ "\n",
525
+ "# Pagination\n",
526
+ "pagination_prompt_template = \"\"\"\n",
527
+ "You are given a passage that is taken from a larger text (article, book, ...) and some numbered labels between the paragraphs in the passage.\n",
528
+ "Numbered label are in angeled brackets. For example, if the label number is 19, it shows as <19> in text.\n",
529
+ "Please choose one label that it is natural to break reading.\n",
530
+ "Such point can be scene transition, end of a dialogue, end of an argument, narrative transition, etc.\n",
531
+ "Please answer the break point label and explain.\n",
532
+ "For example, if <57> is a good point to break, answer with \\\"Break point: <57>\\n Because ...\\\"\n",
533
+ "\n",
534
+ "Passage:\n",
535
+ "\n",
536
+ "{passage_text}\n",
537
+ "{end_tag}\n",
538
+ "\n",
539
+ "\"\"\"\n",
540
+ "# passage_text: a chunk of text.\n",
541
+ "# end_tag: a string, whose value is \"\" if the text is at the end of the article, and otherwise \"\\n...\".\n",
542
+ "\n",
543
+ "\n",
544
+ "\n",
545
+ "# Gisting\n",
546
+ "gisting_prompt_template = \"\"\"\n",
547
+ "Please shorten the following passage.\n",
548
+ "Just give me a shortened version. DO NOT explain your reason.\n",
549
+ "\n",
550
+ "Passage:\n",
551
+ "{page_text}\n",
552
+ "\n",
553
+ "\"\"\"\n",
554
+ "# page_text: a page of text\n",
555
+ "\n",
556
+ "\n",
557
+ "\n",
558
+ "# Parallel Look-up (ReadAgent-P, up to 5 pages)\n",
559
+ "parallel_lookup_prompt_template = \"\"\"\n",
560
+ "The following text is what you remembered from reading an article and a multiple choice question related to it.\n",
561
+ "You may read 1 to 5 page(s) of the article again to refresh your memory to prepare yourselve for the question.\n",
562
+ "Please respond with which page(s) you would like to read again.\n",
563
+ "For example, if your would like to only read Page 8, respond with \\\"I want to look up Page [8] to ...\\\";\n",
564
+ "if your would like to read Page 7 and 12, respond with \\\"I want to look up Page [7, 12] to ...\\\";\n",
565
+ "if your would like to read Page 2, 3, 7, 15 and 18, respond with \\\"I want to look up Page [2, 3, 7, 15, 18] to ...\\\".\n",
566
+ "DO NOT select more pages if you don't need to.\n",
567
+ "DO NOT answer the question yet.\n",
568
+ "\n",
569
+ "Text:\n",
570
+ "{concatenated_gists}\n",
571
+ "\n",
572
+ "Question:\n",
573
+ "{question}\n",
574
+ "{options}\n",
575
+ "\n",
576
+ "Take a deep breath and tell me: Which page(s) would you like to read again?\n",
577
+ "\"\"\"\n",
578
+ "# concatenated_gists: concatenated gists\n",
579
+ "# question: a question\n",
580
+ "# options: multiple-choice options\n",
581
+ "\n",
582
+ "\n",
583
+ "\n",
584
+ "# Sequential Look-up (ReadAgent-S, up to 5 pages)\n",
585
+ "sequential_lookup_prompt_template = \"\"\"\n",
586
+ "The following text is what you remember from reading an article, followed by a question about the article.\n",
587
+ "You may read multiple pages of the article again to refresh your memory and prepare to answer the question.\n",
588
+ "Each page that you re-read can significantly improve your chance of answering the question correctly.\n",
589
+ "Please specify a SINGLE page you would like to read again or say \"STOP\".\n",
590
+ "To read a page again, respond with \"Page $PAGE_NUM\", replacing $PAGE_NUM with the target page number.\n",
591
+ "You can only specify a SINGLE page in your response at this time.\n",
592
+ "DO NOT select more pages if you don't need to.\n",
593
+ "To stop, simply say \"STOP\".\n",
594
+ "DO NOT answer the question in your response.\n",
595
+ "\n",
596
+ "Text:\n",
597
+ "{concatenated_gists}\n",
598
+ "End of text.\n",
599
+ "\n",
600
+ "Pages re-read already (DO NOT ask to read them again):\n",
601
+ "{past_page_numbers}\n",
602
+ "\n",
603
+ "Question:\n",
604
+ "{question}\n",
605
+ "{options}\n",
606
+ "\n",
607
+ "Specify a SINGLE page to read again, or say STOP:\n",
608
+ "\"\"\"\n",
609
+ "# concatenated_gists: concatenated gists\n",
610
+ "# past_page_numbers: page numbers that have already been retrieved\n",
611
+ "# question: a question\n",
612
+ "# options: options\n",
613
+ "\n",
614
+ "\n",
615
+ "\n",
616
+ "# Response/Answer\n",
617
+ "answer_prompt_template = \"\"\"\n",
618
+ "Read the following article and answer a multiple choice question.\n",
619
+ "For example, if (C) is correct, answer with \\\"Answer: (C) ...\\\"\n",
620
+ "\n",
621
+ "Article:\n",
622
+ "{concatenated_pages_and_gists}\n",
623
+ "\n",
624
+ "Question:\n",
625
+ "{question}\n",
626
+ "{options}\n",
627
+ "\n",
628
+ "\"\"\"\n",
629
+ "# concatenated_pages_and_gists: concatenated raw pages and gists\n",
630
+ "# question: a question\n",
631
+ "# options: options"
632
+ ],
633
+ "metadata": {
634
+ "id": "PGvYRIpO3J3Y"
635
+ },
636
+ "execution_count": null,
637
+ "outputs": []
638
+ },
639
+ {
640
+ "cell_type": "code",
641
+ "source": [
642
+ "# @title The prompts we used for QMSum with PaLM 2-L\n",
643
+ "\n",
644
+ "\n",
645
+ "# Pagination\n",
646
+ "pagination_prompt_template = \"\"\"\n",
647
+ "You are given a passage that is taken from a larger meeting transcript.\n",
648
+ "There are some numbered labels between the paragraphs (like <0>) in the passage.\n",
649
+ "Please choose one label at a natural transition in the passage.\n",
650
+ "For example, the label can be at the end of a dialogue, the end of an argument, a change in the topic being discussed, etc.\n",
651
+ "Please respond with the label and explain your choice.\n",
652
+ "For example, if <57> is a natural transition, answer with \"Label: <57>\\n Because ...\"\n",
653
+ "\n",
654
+ "Passage:\n",
655
+ "\n",
656
+ "{preceding_text}\n",
657
+ "{passage_text}\n",
658
+ "{end_tag}\n",
659
+ "\n",
660
+ "\"\"\"\n",
661
+ "# preceding_text: a fraction of previous context\n",
662
+ "# passage_text: a chunk of text.\n",
663
+ "# end_tag: a string, whose value is \"\" if the text is at the end of the article, and otherwise \"\\n...\".\n",
664
+ "\n",
665
+ "\n",
666
+ "\n",
667
+ "# Gisting\n",
668
+ "gisting_prompt_template = \"\"\"\n",
669
+ "Please shorten the following passage.\n",
670
+ "Just give a shortened version. DO NOT explain your reasoning.\n",
671
+ "\n",
672
+ "Passage:\n",
673
+ "{page_text}\n",
674
+ "\n",
675
+ "\"\"\"\n",
676
+ "# page_text: a page of text\n",
677
+ "\n",
678
+ "\n",
679
+ "\n",
680
+ "# Parallel Look-up (ReadAgent-P, up to 2 pages)\n",
681
+ "parallel_lookup_prompt_template = \"\"\"\n",
682
+ "The following text is what you remember from reading a meeting transcript, followed by a question about the transcript.\n",
683
+ "You may read 1 or 2 pages of the transcript again to refresh your memory to prepare to answer the question.\n",
684
+ "Please respond with which page(s) you would like to read.\n",
685
+ "For example, if your would only like to read Page 8, respond with \"I want to look up Page [8] ...\"\n",
686
+ "If you would like to read Page 7 and 12, respond with \"I want to look up Page [7, 12] ...\".\n",
687
+ "Only select as many pages as you need, but no more than 2 pages.\n",
688
+ "Don't answer the question yet.\n",
689
+ "\n",
690
+ "Text:\n",
691
+ "{text}\n",
692
+ "End of text.\n",
693
+ "\n",
694
+ "Question:\n",
695
+ "{question}\n",
696
+ "\n",
697
+ "Which page(s) would you like to look up?\n",
698
+ "\"\"\"\n",
699
+ "# concatenated_gists: Concatenated gists\n",
700
+ "# question: a question\n",
701
+ "\n",
702
+ "\n",
703
+ "\n",
704
+ "# Sequential Look-up (ReadAgent-S)\n",
705
+ "sequential_lookup_prompt_template = \"\"\"\n",
706
+ "The following text is what you remember from reading a meeting transcript, followed by a question about the transcript.\n",
707
+ "You may read multiple pages of the transcript again to refresh your memory and prepare to answer the question.\n",
708
+ "Each page that you re-read can significantly improve your chance of answering the question correctly.\n",
709
+ "Please specify a SINGLE page you would like to read again or say \"STOP\".\n",
710
+ "To read a page again, respond with \"Page $PAGE_NUM\", replacing $PAGE_NUM with the target page number.\n",
711
+ "You can only specify a SINGLE page in your response at this time.\n",
712
+ "DO NOT select more pages if you don't need to.\n",
713
+ "To stop, simply say \"STOP\".\n",
714
+ "DO NOT answer the question in your response.\n",
715
+ "\n",
716
+ "Text:\n",
717
+ "{concatenated_gists}\n",
718
+ "End of text.\n",
719
+ "\n",
720
+ "Pages re-read already (DO NOT ask to read them again):\n",
721
+ "{past_page_numbers}\n",
722
+ "\n",
723
+ "Question:\n",
724
+ "{question}\n",
725
+ "\n",
726
+ "Specify a SINGLE page to read again, or say STOP:\n",
727
+ "\"\"\"\n",
728
+ "# concatenated_gists: concatenated gists\n",
729
+ "# past_page_numbers: page numbers that have already been retrieved\n",
730
+ "# question: a question\n",
731
+ "\n",
732
+ "\n",
733
+ "\n",
734
+ "# Response/Answer\n",
735
+ "answer_prompt_template = \"\"\"\n",
736
+ "Read the question and text below and then answer the question.\n",
737
+ "\n",
738
+ "Question:\n",
739
+ "{question}\n",
740
+ "\n",
741
+ "Text:\n",
742
+ "{concatenated_pages_and_gists}\n",
743
+ "End of Text.\n",
744
+ "\n",
745
+ "Answer the question based on the above passage and retrieved pages. Your answer should be short and concise.\n",
746
+ "\"\"\"\n",
747
+ "# question: a question\n",
748
+ "# concatenated_pages_and_gists: concatenated raw pages and gists"
749
+ ],
750
+ "metadata": {
751
+ "id": "Ya48p13EhhhI"
752
+ },
753
+ "execution_count": null,
754
+ "outputs": []
755
+ },
756
+ {
757
+ "cell_type": "code",
758
+ "source": [
759
+ "# @title The prompts we used for NarrativeQA - Gutenburg with PaLM 2-L\n",
760
+ "\n",
761
+ "\n",
762
+ "# Pagination\n",
763
+ "pagination_prompt_template = \"\"\"\n",
764
+ "You are given a passage that is taken from a larger text (article, book, ...) and some numbered labels between the paragraphs in the passage.\n",
765
+ "Numbered label are in angeled brackets. For example, if the label number is 19, it shows as <19> in text.\n",
766
+ "Please choose one label that marks a major section break point.\n",
767
+ "Such points can be the beginning/end of a book, beginning/end of a chapter, end of a content table, a scene transition, end of a dialogue, etc.\n",
768
+ "If a point is chosen for the beginning of a book/chapter/etc and there is a title of the new book/chapter/etc, the break point must be chosen at a position right before the section number and title, not after.\n",
769
+ "\n",
770
+ "Please answer the break point label and explain.\n",
771
+ "For example, if <57> is a good point to break, answer with \\\"Breakpoint: <57> ...\\\"\n",
772
+ "\n",
773
+ "Text:\n",
774
+ "\n",
775
+ "{preceding_text}\n",
776
+ "{passage_text}\n",
777
+ "{end_tag}\n",
778
+ "\n",
779
+ "\"\"\"\n",
780
+ "# preceding_text: a fraction of previous context\n",
781
+ "# passage_text: a chunk of text.\n",
782
+ "# end_tag: a string, whose value is \"\" if the text is at the end of the article, and otherwise \"\\n...\".\n",
783
+ "\n",
784
+ "\n",
785
+ "\n",
786
+ "# Gisting\n",
787
+ "gisting_prompt_template = \"\"\"\n",
788
+ "Please shorten the following passage.\n",
789
+ "Just give me a shortened version. DO NOT explain your reason.\n",
790
+ "\n",
791
+ "Passage:\n",
792
+ "{page_text}\n",
793
+ "\n",
794
+ "\"\"\"\n",
795
+ "# page_text: a page of text\n",
796
+ "\n",
797
+ "\n",
798
+ "\n",
799
+ "# Parallel Look-up (ReadAgent-P, up to 2 pages)\n",
800
+ "parallel_lookup_prompt_template = \"\"\"\n",
801
+ "The following text is what you remembered from reading an article and a question related to it.\n",
802
+ "You may read 1 or 2 page(s) of the article again to refresh your memory to prepare yourselve for the question.\n",
803
+ "Please respond with which page(s) you would like to read in the order of importance, beginning with the most important page number.\n",
804
+ "For example, if your only need to read Page 8, respond with \\\"I want to look up Page [8] to ...\\\";\n",
805
+ "if your would like to read Page 12 and 7, respond with \\\"I want to look up Page [12, 7] to ...\\\";\n",
806
+ "DO NOT select more pages if you don't need to.\n",
807
+ "You don't need to answer the question yet.\n",
808
+ "\n",
809
+ "Text:\n",
810
+ "{concatenated_gists}\n",
811
+ "\n",
812
+ "Question:\n",
813
+ "{question}\n",
814
+ "\n",
815
+ "\"\"\"\n",
816
+ "# concatenated_gists: Concatenated gists\n",
817
+ "# question: a question\n",
818
+ "\n",
819
+ "\n",
820
+ "\n",
821
+ "# Sequential Look-up (ReadAgent-S)\n",
822
+ "sequential_lookup_prompt_template = \"\"\"\n",
823
+ "The following text is what you remember from reading a meeting transcript, followed by a question about the transcript.\n",
824
+ "You may read multiple pages of the transcript again to refresh your memory and prepare to answer the question.\n",
825
+ "Each page that you re-read can significantly improve your chance of answering the question correctly.\n",
826
+ "Please specify a SINGLE page you would like to read again or say \"STOP\".\n",
827
+ "To read a page again, respond with \"Page $PAGE_NUM\", replacing $PAGE_NUM with the target page number.\n",
828
+ "You can only specify a SINGLE page in your response at this time.\n",
829
+ "DO NOT select more pages if you don't need to.\n",
830
+ "To stop, simply say \"STOP\".\n",
831
+ "DO NOT answer the question in your response.\n",
832
+ "\n",
833
+ "Text:\n",
834
+ "{concatenated_gists}\n",
835
+ "End of text.\n",
836
+ "\n",
837
+ "Pages re-read already (DO NOT ask to read them again):\n",
838
+ "{past_page_numbers}\n",
839
+ "\n",
840
+ "Question:\n",
841
+ "{question}\n",
842
+ "\n",
843
+ "Specify a SINGLE page to read again, or say STOP:\n",
844
+ "\"\"\"\n",
845
+ "# concatenated_gists: concatenated gists\n",
846
+ "# past_page_numbers: page numbers that have already been retrieved\n",
847
+ "# question: a question\n",
848
+ "\n",
849
+ "\n",
850
+ "\n",
851
+ "# Response/Answer\n",
852
+ "answer_prompt_template = \"\"\"\n",
853
+ "{concatenated_pages_and_gists}\n",
854
+ "\n",
855
+ "Question:\n",
856
+ "{question}\n",
857
+ "\n",
858
+ "Answer the question based on the above passage and retrieved pages. Your answer should be short and concise.\n",
859
+ "\"\"\"\n",
860
+ "# concatenated_pages_and_gists: concatenated raw pages and gists\n",
861
+ "# question: a question"
862
+ ],
863
+ "metadata": {
864
+ "id": "U210HZJuP4_u"
865
+ },
866
+ "execution_count": null,
867
+ "outputs": []
868
+ },
869
+ {
870
+ "cell_type": "code",
871
+ "source": [
872
+ "# @title The prompts we used for NarrativeQA - Movie Scripts with PaLM 2-L\n",
873
+ "\n",
874
+ "\n",
875
+ "# Pagination\n",
876
+ "pagination_prompt_template = \"\"\"\n",
877
+ "You are given a movie script and some numbered labels between the lines in the script.\n",
878
+ "Numbered label are in angeled brackets.\n",
879
+ "Please choose one label that it is natural to break reading. The label should be between <{start}> and <{end}>.\n",
880
+ "Such point can be scene transition, end of a dialogue, end of an argument, narrative transition, etc.\n",
881
+ "The answer should end with \"The break point is: <number>\", where the break point number is between angeled brackets.\n",
882
+ "\n",
883
+ "Script:\n",
884
+ "\n",
885
+ "{passage_text}\n",
886
+ "\"\"\"\n",
887
+ "# passage_text: a chunk of text.\n",
888
+ "\n",
889
+ "\n",
890
+ "\n",
891
+ "# Gisting\n",
892
+ "gisting_prompt_template = \"\"\"\n",
893
+ "Please shorten the following passage. The shortened passage should be in 128 tokens. Please refer to people with their full names whenever possible.\n",
894
+ "Just give me a shortened version. DO NOT explain your reason. If there is no meaning information in the passage, output \"I don't have enough information to shorten the passage.\"\n",
895
+ "\n",
896
+ "Passage:\n",
897
+ "{page_text}\n",
898
+ "\n",
899
+ "\"\"\"\n",
900
+ "# page_text: a page of text\n",
901
+ "\n",
902
+ "\n",
903
+ "\n",
904
+ "# Parallel Look-up (ReadAgent-P, up to 2 pages)\n",
905
+ "parallel_lookup_prompt_template = \"\"\"\n",
906
+ "The following text includes a summary of each page in a movie script, followed by a question about the script.\n",
907
+ "\n",
908
+ "Summary:\n",
909
+ "{concatenated_gists}\n",
910
+ "\n",
911
+ "Question:\n",
912
+ "{question}\n",
913
+ "\n",
914
+ "Based on the summary of each page, you may read the full details of 1 to 2 page(s) to obtain more information to answer the question.\n",
915
+ "Please respond with which page(s) you would like to read.\n",
916
+ "For example, if you only need to read Page X_0, the answer should end with \\\"I want to look up Page [X_0]\\\";\n",
917
+ "if you would like to read Page X_0 and X_1, the answer should end with \\\"I want to look up Page [X_0, X_1]\\\".\n",
918
+ "X_i above is a page index between 0 and {end}. DO NOT select more pages if you don't need to.\n",
919
+ "You don't need to answer the question yet.\n",
920
+ "\"\"\"\n",
921
+ "# concatenated_gists: Concatenated gists\n",
922
+ "# question: a question\n",
923
+ "\n",
924
+ "\n",
925
+ "\n",
926
+ "# Sequential Look-up (ReadAgent-S)\n",
927
+ "sequential_lookup_prompt_template = \"\"\"\n",
928
+ "The following text includes a summary of each page in a movie script, followed by a question about the script, and the previous answer based on the summary and already re-read pages.\n",
929
+ "\n",
930
+ "Summary:\n",
931
+ "{concatenated_gists}\n",
932
+ "\n",
933
+ "Pages re-read already (DO NOT ask to read them again):\n",
934
+ "{past_page_numbers}\n",
935
+ "\n",
936
+ "Question:\n",
937
+ "{question}\n",
938
+ "\n",
939
+ "Previous Answer:\n",
940
+ "{previous_answer}\n",
941
+ "\n",
942
+ "Based on the summary of each page, you may read the full details of multiple pages to obtain more information to answer the question.\n",
943
+ "To read a page again, respond with \"I want to look up Page $PAGE_NUM\", replacing $PAGE_NUM with the target page number.\n",
944
+ "PAGE_NUM is a page index between 0 and {end}, excluding {pages_reread}.\n",
945
+ "You can only specify a SINGLE page in your response at this time. The page should not be in the re-read pages.\n",
946
+ "To stop, simply say \"STOP\".\n",
947
+ "DO NOT answer the question in your response.\n",
948
+ "You don't need to answer the question yet.\n",
949
+ "\"\"\"\n",
950
+ "# concatenated_gists: concatenated gists\n",
951
+ "# past_page_numbers: (only after the first query) page numbers that have already been retrieved\n",
952
+ "# question: a question\n",
953
+ "# previous_answer: the previous answer given by the model with gists and previously retrieved raw pages\n",
954
+ "\n",
955
+ "\n",
956
+ "\n",
957
+ "# Response/Answer\n",
958
+ "answer_prompt_template = \"\"\"\n",
959
+ "{concatenated_pages_and_gists}\n",
960
+ "\n",
961
+ "Question:\n",
962
+ "{question}\n",
963
+ "\n",
964
+ "Answer the question based on the above passage and retrieved pages. Your answer should be short and concise.\n",
965
+ "\"\"\"\n",
966
+ "# concatenated_pages_and_gists: concatenated raw pages and gists\n",
967
+ "# question: a question"
968
+ ],
969
+ "metadata": {
970
+ "id": "K1pw1NPasHPC"
971
+ },
972
+ "execution_count": null,
973
+ "outputs": []
974
+ }
975
+ ]
976
+ }
baselines/read-agent/run_readagent_baseline.py ADDED
@@ -0,0 +1,424 @@
1
+ """
2
+ ReadAgent baseline for the EvolV-Mem benchmark.
3
+
4
+ Adapts ReadAgent's 3-step pipeline for EvolV-Mem:
5
+ 1. Pagination: each session = one page (natural segmentation, no LLM needed)
6
+ 2. Gisting: use pre-computed session summaries (no LLM needed)
7
+ 3. Look-up + Answer: LLM reads gists, selects pages to expand, then answers
8
+
9
+ Since ~1000 sessions don't fit in one prompt as gists, we pre-filter with SBert
10
+ to top-N sessions, then apply ReadAgent's look-up on those.
11
+
12
+ Usage:
13
+ python baselines/read-agent/run_readagent_baseline.py \
14
+ --in_file dataset/evolv_mem_v4.json \
15
+ --out_file output/readagent_qwen30b_v4.jsonl \
16
+ --summary_file dataset/all_session_summary.json \
17
+ --sessions_file dataset/all_sessions.json \
18
+ --profile_file metadata/generated_user_profile.json
19
+
20
+ Env vars:
21
+ VLLM_BASE_URL (default http://localhost:8000/v1)
22
+ VLLM_API_KEY (default EMPTY)
23
+ """
24
+
25
+ import argparse
26
+ import json
27
+ import logging
28
+ import os
29
+ import re
30
+ import sys
31
+ import time
32
+ from collections import defaultdict
33
+ from typing import Dict, List, Optional
34
+
35
+ import numpy as np
36
+ from tqdm import tqdm
37
+
38
+ logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
39
+
40
+ # ---------------------------------------------------------------------------
41
+ # vLLM LLM helper
42
+ # ---------------------------------------------------------------------------
43
+
44
+ MODEL_NAME = os.getenv("VLLM_MODEL_NAME", "Qwen/Qwen3-30B-A3B-Instruct-2507")
45
+
46
+
47
+ def get_llm_client():
48
+ from openai import OpenAI
49
+ return OpenAI(
50
+ base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
51
+ api_key=os.getenv("VLLM_API_KEY", "EMPTY"),
52
+ )
53
+
54
+
55
+ def llm_call(client, prompt: str, max_tokens: int = 4096, temperature: float = 0.0) -> str:
56
+ for attempt in range(6):
57
+ try:
58
+ response = client.chat.completions.create(
59
+ model=MODEL_NAME,
60
+ messages=[{"role": "user", "content": prompt}],
61
+ max_tokens=max_tokens,
62
+ temperature=temperature,
63
+ )
64
+ content = response.choices[0].message.content if response.choices else None
65
+ if content is None:
66
+ # The endpoint occasionally returns null content; back off and retry the same prompt
67
+ wait = min(2 ** attempt * 2, 30)
68
+ print(f"[WARN] LLM returned None content (attempt {attempt+1}); retrying in {wait}s")
69
+ time.sleep(wait)
70
+ continue
71
+ return content.strip()
72
+ except Exception as e:
73
+ msg = str(e).lower()
74
+ if any(code in msg for code in ("429", "500", "503", "rate limit")):
75
+ wait = min(2 ** attempt * 5, 60)
76
+ print(f"[WARN] LLM retry {attempt+1}/6, sleeping {wait}s: {e}")
77
+ time.sleep(wait)
78
+ continue
79
+ print(f"[ERROR] LLM call failed: {e}")
80
+ raise
81
+ raise RuntimeError("LLM call failed after 6 retries")
82
+
83
+
84
+ # ---------------------------------------------------------------------------
85
+ # ReadAgent prompts (adapted from the notebook for chat-history QA)
86
+ # ---------------------------------------------------------------------------
87
+
88
+ PROMPT_LOOKUP = """The following text contains gists (short summaries) of pages from a user's chat history. Each page is one conversation session.
89
+ You are also given a question about the user's chat history.
90
+
91
+ You may select up to {max_pages} page(s) to read in full to help answer the question.
92
+ Please respond with which page(s) you would like to read.
93
+ For example, if you only need to read Page 3, respond with "I want to look up Page [3] to ...";
94
+ if you would like to read Page 2 and 7, respond with "I want to look up Page [2, 7] to ...".
95
+ DO NOT select more pages than necessary.
96
+ DO NOT answer the question yet.
97
+
98
+ Text:
99
+ {concatenated_gists}
100
+
101
+ Question:
102
+ {question}
103
+
104
+ Current Date: {question_date}
105
+
106
+ Take a deep breath and tell me: Which page(s) would you like to read again?
107
+ """
108
+
109
+ PROMPT_ANSWER = """Read the following chat history and answer the question.
110
+
111
+ {profile_section}
112
+
113
+ Chat History:
114
+ {concatenated_pages_and_gists}
115
+
116
+ Current Date: {question_date}
117
+ Question: {question}
118
+ Answer:"""
119
+
120
+
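+ # Illustrative sketch (not invoked anywhere in this script) of the look-up
+ # round trip: fill PROMPT_LOOKUP with a couple of made-up gists, then parse a
+ # well-formed model reply with parse_lookup_pages (defined further below).
+ def _example_lookup_roundtrip():
+     gists = "\n\n".join([
+         "<Page 0> [Date: 2023-05-01]\nUser planned a trip to Kyoto.",
+         "<Page 1> [Date: 2023-06-12]\nUser mentioned spraining an ankle while hiking.",
+     ])
+     prompt = PROMPT_LOOKUP.format(
+         max_pages=2,
+         concatenated_gists=gists,
+         question="What injury did the user report?",
+         question_date="2023-07-01",
+     )
+     reply = "I want to look up Page [1] to check the injury."
+     return prompt, parse_lookup_pages(reply, max_page=2)  # -> (..., [1])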
121
+ # ---------------------------------------------------------------------------
122
+ # Embedding pre-filter
123
+ # ---------------------------------------------------------------------------
124
+
125
+ def embed_and_filter(
126
+ question: str,
127
+ session_ids: List[str],
128
+ session_gists: List[str],
129
+ embedding_model,
130
+ top_k: int = 100,
131
+ ) -> List[int]:
132
+ """Return indices of top-k most relevant sessions by cosine similarity."""
133
+ if len(session_ids) <= top_k:
134
+ return list(range(len(session_ids)))
135
+
136
+ question_emb = embedding_model.encode(question)
137
+ gist_embs = embedding_model.encode(session_gists)
138
+
139
+ question_norm = question_emb / (np.linalg.norm(question_emb) + 1e-10)
140
+ gist_norms = gist_embs / (np.linalg.norm(gist_embs, axis=1, keepdims=True) + 1e-10)
141
+ similarities = gist_norms @ question_norm
142
+
143
+ top_indices = np.argsort(similarities)[::-1][:top_k].tolist()
144
+ # Sort by original order (chronological) for coherent reading
145
+ top_indices.sort()
146
+ return top_indices
147
+
148
+
149
+ # ---------------------------------------------------------------------------
150
+ # ReadAgent Look-up: parse page selection from LLM response
151
+ # ---------------------------------------------------------------------------
152
+
153
+ def parse_lookup_pages(response: str, max_page: int) -> List[int]:
154
+ """Parse page indices from ReadAgent look-up response like 'Page [2, 7, 12]'."""
155
+ try:
156
+ start = response.index('[')
157
+ end = response.index(']')
158
+ except ValueError:
159
+ return []
160
+
161
+ page_ids = []
162
+ for p in response[start + 1:end].split(','):
163
+ p = p.strip()
164
+ if p.isnumeric():
165
+ pid = int(p)
166
+ if 0 <= pid < max_page:
167
+ page_ids.append(pid)
168
+ return page_ids
169
+
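+ # Example (illustrative): indices outside [0, max_page) are silently dropped.
+ # >>> parse_lookup_pages("I want to look up Page [2, 7] to verify.", max_page=5)
+ # [2]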
170
+
171
+ # ---------------------------------------------------------------------------
172
+ # Retrieval metrics
173
+ # ---------------------------------------------------------------------------
174
+
175
+ def evaluate_retrieval(recalled_docs, correct_docs):
176
+ recall_any = float(any(doc in recalled_docs for doc in correct_docs))
177
+ recall_all = float(all(doc in recalled_docs for doc in correct_docs))
178
+ return recall_any, recall_all
179
+
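+ # Example (illustrative): with gold sessions ["s1", "s2"] and retrieved
+ # sessions ["s1", "s7"], recall_any is 1.0 (s1 was found) but recall_all is 0.0.
+ # >>> evaluate_retrieval(["s1", "s7"], ["s1", "s2"])
+ # (1.0, 0.0)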
180
+
181
+ def print_average_metrics(retrieval_metric_list):
182
+ metric_sums = defaultdict(float)
183
+ metric_counts = defaultdict(int)
184
+ for metric in retrieval_metric_list:
185
+ for k, v in metric.items():
186
+ metric_sums[k] += v
187
+ metric_counts[k] += 1
188
+ print(" Average retrieval metrics:")
189
+ for k in sorted(metric_sums):
190
+ avg = metric_sums[k] / metric_counts[k]
191
+ print(f" {k}: {avg:.4f}")
192
+
193
+
194
+ # ---------------------------------------------------------------------------
195
+ # Main
196
+ # ---------------------------------------------------------------------------
197
+
198
+ def main():
199
+ parser = argparse.ArgumentParser(description="ReadAgent baseline for EvolV-Mem")
200
+ parser.add_argument("--in_file", type=str, required=True,
201
+ help="Path to evolv_mem_v4.json")
202
+ parser.add_argument("--out_file", type=str, required=True,
203
+ help="Output JSONL file")
204
+ parser.add_argument("--summary_file", type=str, required=True,
205
+ help="Path to all_session_summary.json")
206
+ parser.add_argument("--sessions_file", type=str, required=True,
207
+ help="Path to all_sessions.json")
208
+ parser.add_argument("--profile_file", type=str, default=None,
209
+ help="Path to generated_user_profile.json")
210
+ parser.add_argument("--embedding_model", type=str,
211
+ default="sentence-transformers/multi-qa-mpnet-base-cos-v1",
212
+ help="SentenceTransformer model for pre-filtering")
213
+ # ReadAgent params
214
+ parser.add_argument("--gist_top_k", type=int, default=100,
215
+ help="Number of sessions to keep after embedding pre-filter (default 100)")
216
+ parser.add_argument("--max_lookup_pages", type=int, default=10,
217
+ help="Max pages the LLM can select for full reading (default 10)")
218
+ # Limit
219
+ parser.add_argument("--limit", type=int, default=None,
220
+ help="Process only the first N questions")
221
+ args = parser.parse_args()
222
+
223
+ # -----------------------------------------------------------------------
224
+ # Load data
225
+ # -----------------------------------------------------------------------
226
+ print(f"Loading benchmark from {args.in_file} ...")
227
+ with open(args.in_file) as f:
228
+ benchmark = json.load(f)
229
+ if args.limit:
230
+ benchmark = benchmark[:args.limit]
231
+ print(f" {len(benchmark)} questions loaded.")
232
+
233
+ print(f"Loading session summaries from {args.summary_file} ...")
234
+ with open(args.summary_file) as f:
235
+ summaries = json.load(f)
236
+ print(f" {len(summaries)} session summaries loaded.")
237
+
238
+ print(f"Loading sessions from {args.sessions_file} ...")
239
+ with open(args.sessions_file) as f:
240
+ all_sessions = json.load(f)
241
+ print(f" {len(all_sessions)} sessions loaded.")
242
+
243
+ profiles = {}
244
+ if args.profile_file and os.path.exists(args.profile_file):
245
+ print(f"Loading user profiles from {args.profile_file} ...")
246
+ with open(args.profile_file) as f:
247
+ profiles = json.load(f)
248
+ print(f" {len(profiles)} profiles loaded.")
249
+
250
+ # -----------------------------------------------------------------------
251
+ # Resume support
252
+ # -----------------------------------------------------------------------
253
+ existing_qids = set()
254
+ if os.path.exists(args.out_file):
255
+ with open(args.out_file) as f:
256
+ for line in f:
257
+ line = line.strip()
258
+ if line:
259
+ existing_qids.add(json.loads(line)["question_id"])
260
+ print(f" Resuming: {len(existing_qids)} questions already processed.")
261
+
262
+ # -----------------------------------------------------------------------
263
+ # Initialize models
264
+ # -----------------------------------------------------------------------
265
+ print("Initializing embedding model ...")
266
+ from sentence_transformers import SentenceTransformer
267
+ embedding_model = SentenceTransformer(args.embedding_model)
268
+
269
+ print("Initializing vLLM client ...")
270
+ client = get_llm_client()
271
+
272
+ # -----------------------------------------------------------------------
273
+ # Process questions
274
+ # -----------------------------------------------------------------------
275
+ retrieval_metric_list = []
276
+ out_f = open(args.out_file, "a")
277
+
278
+ for di, entry in enumerate(tqdm(benchmark, desc="ReadAgent baseline")):
279
+ qid = entry["question_id"]
280
+ question = entry["question"]
281
+ question_date = entry["question_date"]
282
+
283
+ if qid in existing_qids:
284
+ continue
285
+
286
+ try:
287
+ haystack_sids = entry["haystack_session_ids"]
288
+ haystack_dates = entry["haystack_dates"]
289
+
290
+ # === Step 1 & 2: Pagination + Gisting (free with cached data) ===
291
+ # Each session = one page; session summary = gist
292
+ page_sids = []
293
+ page_dates = []
294
+ page_gists = []
295
+ for sid, date_str in zip(haystack_sids, haystack_dates):
296
+ summary_data = summaries.get(sid)
297
+ if summary_data is None:
298
+ continue
299
+ text = summary_data.get("session_summary", "")
300
+ if not text:
301
+ turn_sums = summary_data.get("turn_summaries", [])
302
+ text = " ".join(turn_sums) if turn_sums else ""
303
+ if not text:
304
+ continue
305
+ page_sids.append(sid)
306
+ page_dates.append(date_str)
307
+ page_gists.append(text)
308
+
309
+ if not page_gists:
310
+ result = {
311
+ "q_idx": di, "question_id": qid,
312
+ "hypothesis": "Insufficient information to answer.",
313
+ "n_pages": 0,
314
+ }
315
+ print(json.dumps(result), file=out_f, flush=True)
316
+ continue
317
+
318
+ # === Embedding pre-filter to top-K gists ===
319
+ filtered_indices = embed_and_filter(
320
+ question, page_sids, page_gists, embedding_model, top_k=args.gist_top_k
321
+ )
322
+ filtered_sids = [page_sids[i] for i in filtered_indices]
323
+ filtered_dates = [page_dates[i] for i in filtered_indices]
324
+ filtered_gists = [page_gists[i] for i in filtered_indices]
325
+
326
+ # Build gist text with page numbers
327
+ gist_lines = []
328
+ for local_idx, (sid, date, gist) in enumerate(
329
+ zip(filtered_sids, filtered_dates, filtered_gists)
330
+ ):
331
+ gist_lines.append(f"<Page {local_idx}> [Date: {date}]\n{gist}")
332
+ concatenated_gists = "\n\n".join(gist_lines)
333
+
334
+ # === Step 3a: ReadAgent Look-Up ===
335
+ lookup_prompt = PROMPT_LOOKUP.format(
336
+ max_pages=args.max_lookup_pages,
337
+ concatenated_gists=concatenated_gists,
338
+ question=question,
339
+ question_date=question_date,
340
+ )
341
+ lookup_response = llm_call(client, lookup_prompt, max_tokens=4096, temperature=0.0)
342
+ selected_page_ids = parse_lookup_pages(lookup_response, len(filtered_sids))
343
+
344
+ print(f" [{di}] Look-up selected pages: {selected_page_ids} "
345
+ f"(out of {len(filtered_sids)} gists)")
346
+
347
+ # === Step 3b: Expand selected pages with full session content ===
348
+ expanded_lines = []
349
+ retrieved_session_ids = []
350
+ for local_idx, (sid, date, gist) in enumerate(
351
+ zip(filtered_sids, filtered_dates, filtered_gists)
352
+ ):
353
+ if local_idx in selected_page_ids:
354
+ # Expand: use full session content
355
+ session_turns = all_sessions.get(sid, [])
356
+ full_text = "\n".join(
357
+ f"{t.get('role', 'user')}: {t.get('content', '')}"
358
+ for t in session_turns
359
+ )
360
+ expanded_lines.append(
361
+ f"<Page {local_idx}> [Session {sid} | Date: {date}] [FULL]\n{full_text}"
362
+ )
363
+ retrieved_session_ids.append(sid)
364
+ else:
365
+ # Keep gist
366
+ expanded_lines.append(
367
+ f"<Page {local_idx}> [Date: {date}] [GIST]\n{gist}"
368
+ )
369
+ concatenated_pages_and_gists = "\n\n".join(expanded_lines)
370
+
371
+ # === Step 3c: Answer ===
372
+ user_id = qid.split("_q_")[0] if "_q_" in qid else qid
373
+ user_profile = profiles.get(user_id, None)
374
+ profile_section = f"User Profile:\n{user_profile}" if user_profile else ""
375
+
376
+ answer_prompt = PROMPT_ANSWER.format(
377
+ profile_section=profile_section,
378
+ concatenated_pages_and_gists=concatenated_pages_and_gists,
379
+ question=question,
380
+ question_date=question_date,
381
+ )
382
+ answer = llm_call(client, answer_prompt, max_tokens=8192, temperature=0.0)
383
+
384
+ # === Retrieval metrics ===
385
+ answer_session_ids = entry.get("answer_session_ids", [])
386
+ retrieval_metric = {}
387
+ if answer_session_ids and retrieved_session_ids:
388
+ for topk in [5, 10, 20, 30]:
389
+ r_any, r_all = evaluate_retrieval(
390
+ retrieved_session_ids[:topk], answer_session_ids
391
+ )
392
+ retrieval_metric[f"recall_any@{topk}"] = r_any
393
+ retrieval_metric[f"recall_all@{topk}"] = r_all
394
+ retrieval_metric_list.append(retrieval_metric)
395
+ print_average_metrics(retrieval_metric_list)
396
+
397
+ # === Output ===
398
+ result = {
399
+ "q_idx": di,
400
+ "question_id": qid,
401
+ "hypothesis": answer,
402
+ "n_pages_total": len(page_gists),
403
+ "n_pages_filtered": len(filtered_sids),
404
+ "n_pages_expanded": len(selected_page_ids),
405
+ "retrieved_session_ids": retrieved_session_ids,
406
+ "retrieval_metric": retrieval_metric,
407
+ }
408
+ print(json.dumps(result), file=out_f, flush=True)
409
+
410
+ print(f" [{di}] Q: {question[:100]}...")
411
+ print(f" [{di}] A: {answer[:200]}...")
412
+
413
+ except Exception as e:
414
+ print(f"[ERROR] q_idx={di} qid={qid} failed: {e}", flush=True)
415
+ import traceback
416
+ traceback.print_exc()
417
+ continue
418
+
419
+ out_f.close()
420
+ print(f"\nDone. Results saved to {args.out_file}")
421
+
422
+
423
+ if __name__ == "__main__":
424
+ main()
evaluate_qa.py ADDED
@@ -0,0 +1,916 @@
1
+ import argparse
2
+ import json
3
+ import os
4
+ import re
5
+ import time
6
+
7
+ import numpy as np
8
+ from openai import OpenAI
9
+ from tqdm import tqdm
10
+
11
+ try:
12
+ from openai import AzureOpenAI
13
+ from azure.identity import (
14
+ AzureCliCredential,
15
+ ChainedTokenCredential,
16
+ ManagedIdentityCredential,
17
+ get_bearer_token_provider,
18
+ )
19
+
20
+ AZURE_OAUTH_SCOPE = os.environ.get("AZURE_OAUTH_SCOPE", "")
21
+ if AZURE_OAUTH_SCOPE:
22
+ credential = get_bearer_token_provider(
23
+ ChainedTokenCredential(
24
+ AzureCliCredential(),
25
+ ManagedIdentityCredential(),
26
+ ),
27
+ AZURE_OAUTH_SCOPE,
28
+ )
29
+ else:
30
+ credential = None
31
+ except ImportError:
32
+ AzureOpenAI = None
33
+ credential = None
34
+
35
+ from model_zoo import model_zoo
36
+
37
+
38
+ # Azure OpenAI endpoint (set AZURE_OPENAI_ENDPOINT env var to your deployment URL).
39
+ endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
40
+ # OpenAI-compatible LiteLLM proxy URL (set LITELLM_BASE_URL env var to your proxy).
41
+ TRITONAI_BASE_URL = os.environ.get("LITELLM_BASE_URL", "")
42
+
43
+ ATOMIC_PROMPT_VERSION = "atomic-v1"
44
+ LEGACY_PROMPT_VERSION = "binary-v0"
45
+ ATOM_SCORES = {
46
+ "correct": 1.0,
47
+ "partially_correct": 0.5,
48
+ "missing": 0.0,
49
+ "incorrect": 0.0,
50
+ }
51
+
52
+
53
+ def _retryable_status(e) -> "int | None":
54
+ status = getattr(e, "status_code", None) or getattr(e, "http_status", None)
55
+ if status in (429, 500, 503, 403):
56
+ return status
57
+ resp = getattr(e, "response", None)
58
+ if resp is not None and getattr(resp, "status_code", None) in (429, 500, 503):
59
+ return resp.status_code
60
+ msg = str(e).lower()
61
+ if "429" in msg or "rate limit" in msg:
62
+ return 429
63
+ if "500" in msg or "internal server error" in msg:
64
+ return 500
65
+ if "503" in msg or "api configuration unavailable" in msg:
66
+ return 503
67
+ return None
68
+
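+ # Example (illustrative): an exception whose message only mentions a rate limit
+ # is still treated as retryable, so the caller backs off instead of raising.
+ # >>> _retryable_status(Exception("Rate limit exceeded, please retry later"))
+ # 429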
69
+
70
+ def parse_json_object(text):
71
+ text = (text or "").strip()
72
+ if text.startswith("```"):
73
+ text = re.sub(r"^```(?:json)?", "", text).strip()
74
+ text = re.sub(r"```$", "", text).strip()
75
+ try:
76
+ return json.loads(text)
77
+ except json.JSONDecodeError:
78
+ start = text.find("{")
79
+ end = text.rfind("}") + 1
80
+ if start >= 0 and end > start:
81
+ return json.loads(text[start:end])
82
+ raise
83
+
84
+
85
+ def sanitize_model_name(name):
86
+ return re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
87
+
88
+
89
+ def read_json_or_jsonl(path):
90
+ try:
91
+ with open(path, "r", encoding="utf-8") as f:
92
+ return json.load(f)
93
+ except json.JSONDecodeError:
94
+ with open(path, "r", encoding="utf-8") as f:
95
+ return [json.loads(line) for line in f if line.strip()]
96
+
97
+
98
+ def read_existing_jsonl(path):
99
+ if not path or not os.path.exists(path):
100
+ return {}
101
+ rows = {}
102
+ with open(path, "r", encoding="utf-8") as f:
103
+ for line in f:
104
+ if not line.strip():
105
+ continue
106
+ obj = json.loads(line)
107
+ if "question_id" in obj:
108
+ rows[obj["question_id"]] = obj
109
+ return rows
110
+
111
+
112
+ def write_json(path, obj):
113
+ tmp_path = path + ".tmp"
114
+ with open(tmp_path, "w", encoding="utf-8") as f:
115
+ json.dump(obj, f, ensure_ascii=False, indent=2)
116
+ f.write("\n")
117
+ os.replace(tmp_path, path)
118
+
119
+
120
+ def append_jsonl(path, row):
121
+ with open(path, "a", encoding="utf-8") as f:
122
+ print(json.dumps(row, ensure_ascii=False), file=f, flush=True)
123
+
124
+
125
+ def default_result_file(hyp_file, metric_model, eval_mode):
126
+ if eval_mode == "legacy":
127
+ return f"{hyp_file}.eval-results-{metric_model}"
128
+ model_tag = sanitize_model_name(metric_model)
129
+ return f"{hyp_file}.eval-results-{model_tag}-{ATOMIC_PROMPT_VERSION}-{eval_mode}"
130
+
131
+
132
+ def default_rubric_file(ref_file):
133
+ return f"{ref_file}.{ATOMIC_PROMPT_VERSION}.rubric.json"
134
+
135
+
136
+ def load_rubric_file(path, ref_file):
137
+ if not path or not os.path.exists(path):
138
+ return {
139
+ "prompt_version": ATOMIC_PROMPT_VERSION,
140
+ "source_ref_file": ref_file,
141
+ "rubrics": {},
142
+ }
143
+ with open(path, "r", encoding="utf-8") as f:
144
+ data = json.load(f)
145
+ if "rubrics" in data:
146
+ data.setdefault("prompt_version", ATOMIC_PROMPT_VERSION)
147
+ data.setdefault("source_ref_file", ref_file)
148
+ return data
149
+ return {
150
+ "prompt_version": ATOMIC_PROMPT_VERSION,
151
+ "source_ref_file": ref_file,
152
+ "rubrics": data,
153
+ }
154
+
155
+
156
+ def question_type_guidance(task):
157
+ if task in ["Information Absence"]:
158
+ return (
159
+ "The correct answer is that the information is unavailable, absent, "
160
+ "not yet known, or not supported. A response that gives a concrete "
161
+ "answer instead of abstaining is incorrect."
162
+ )
163
+ if task in ["Aggregation", "single-session-user", "single-session-assistant", "multi-session"]:
164
+ return (
165
+ "Check every requested item, count, list member, and named fact. "
166
+ "Exact counts and required list coverage matter. Extra material "
167
+ "facts that change the answer should be flagged."
168
+ )
169
+ if task in ["Aggregation + Temporal"]:
170
+ return (
171
+ "Check both the aggregated facts and their time/order associations. "
172
+ "An answer can name the right item but still be wrong if the timing, "
173
+ "ordering, before/after relation, or year is wrong."
174
+ )
175
+ if task in ["Temporal Reasoning", "temporal-reasoning"]:
176
+ return (
177
+ "Check the specific time, date, year, sequence, duration, or temporal "
178
+ "relationship asked for. Accept +/-1 only for day/week/month duration "
179
+ "counts, not for years, event identity, or ordering."
180
+ )
181
+ if task in ["Knowledge Update", "knowledge-update"]:
182
+ return (
183
+ "Check the current or most recent state. Historical context is fine "
184
+ "only if the final/current state is clearly correct. Outdated states "
185
+ "presented as current are incorrect."
186
+ )
187
+ if task == "single-session-preference":
188
+ return (
189
+ "Check whether the response recalls and applies the stated preference. "
190
+ "Do not require unnecessary verbosity, but contradictions are incorrect."
191
+ )
192
+ return "Check whether the model response satisfies all required facts in the reference answer."
193
+
194
+
195
+ def get_anscheck_prompt(task, question, answer, response):
196
+ """Legacy binary yes/no prompt kept for backward compatibility."""
197
+ if task in ["Information Absence"]:
198
+ template = """You are evaluating whether a model correctly identified that a question is unanswerable given the available personal chat history.
199
+
200
+ Question: {question}
201
+
202
+ Explanation of why it is unanswerable: {answer}
203
+
204
+ Model Response: {response}
205
+
206
+ Evaluation criteria:
207
+ - CORRECT if the model explicitly states that the information is not available, insufficient, or that the question cannot be answered from the provided context.
208
+ - INCORRECT if the model fabricates an answer or fails to acknowledge the unanswerable nature of the question.
209
+ - The model does not need to use the exact word "unanswerable" -- expressing uncertainty or lack of information is sufficient.
210
+
211
+ Briefly explain your reasoning (1-2 sentences), then on the last line write only: yes or no"""
212
+ elif task in ["Aggregation", "single-session-user", "single-session-assistant", "multi-session"]:
213
+ template = """You are evaluating whether a model correctly answered a question that requires aggregating specific facts from a user's personal chat history.
214
+
215
+ Question: {question}
216
+
217
+ Reference Answer: {answer}
218
+
219
+ Model Response: {response}
220
+
221
+ Evaluation criteria:
222
+ - CORRECT if the response identifies all key items, facts, or counts present in the reference answer, even if phrased differently or with added context.
223
+ - INCORRECT if the response:
224
+ - States a wrong count (e.g., says "5" when the answer is "3")
225
+ - Omits one or more key items/facts listed in the reference
226
+ - Lists mostly wrong items even if the count is right
227
+ - Partial answers that cover only a subset of required items are INCORRECT.
228
+ - Verbose responses are acceptable as long as all reference items are present within them.
229
+ - If the response contains correct items but also lists additional plausible-sounding but unverified items beyond the reference, this does NOT make it incorrect -- evaluate only whether the reference items are covered.
230
+ - Semantic equivalence counts as correct (e.g., "RSI" = "Repetitive Strain Injury").
231
+
232
+ Briefly explain your reasoning (2-3 sentences), then on the last line write only: yes or no"""
233
+ elif task in ["Aggregation + Temporal"]:
234
+ template = """You are evaluating whether a model correctly answered a question that requires both aggregating facts AND reasoning about their temporal order or time associations from a user's personal chat history.
235
+
236
+ Question: {question}
237
+
238
+ Reference Answer: {answer}
239
+
240
+ Model Response: {response}
241
+
242
+ Evaluation criteria:
243
+ - CORRECT if the response captures both:
244
+ (a) all key events/facts listed in the reference, and
245
+ (b) their correct temporal associations (ordering, time periods, or "before/after" relationships).
246
+ - INCORRECT if the response:
247
+ - Omits one or more key events or facts from the reference
248
+ - Gets the temporal ordering or time associations wrong
249
+ - Captures only the content without the temporal aspects, or vice versa
250
+ - Responses that describe the correct progression/sequence in different words are acceptable.
251
+ - Partial answers covering only some events or ignoring time aspects are INCORRECT.
252
+ - Minor wording differences or additional explanatory context are acceptable.
253
+
254
+ Briefly explain your reasoning (2-3 sentences), then on the last line write only: yes or no"""
255
+ elif task in ["Temporal Reasoning", "temporal-reasoning"]:
256
+ template = """You are evaluating whether a model correctly answered a question about temporal relationships in a user's personal chat history.
257
+
258
+ Question: {question}
259
+
260
+ Reference Answer: {answer}
261
+
262
+ Model Response: {response}
263
+
264
+ Evaluation criteria:
265
+ - CORRECT if the response correctly identifies the specific time, date, year, sequence, or temporal relationship asked about.
266
+ - INCORRECT if the response states a wrong year, wrong sequence, wrong temporal relationship, or misidentifies which event came first/last.
267
+ - Off-by-one tolerance: if the question asks for a count of days, weeks, or months, accept answers that differ by +/-1. This tolerance does NOT apply to years or to identifying specific events/artifacts.
268
+ - Responses that correctly identify the fact but with verbose context are acceptable.
269
+ - If the response hedges but still states the correct answer, it is correct.
270
+
271
+ Briefly explain your reasoning (2-3 sentences), then on the last line write only: yes or no"""
272
+ elif task in ["Knowledge Update", "knowledge-update"]:
273
+ template = """You are evaluating whether a model correctly answered a question about the most recent or current state of something that changed over time in a user's personal chat history.
274
+
275
+ Question: {question}
276
+
277
+ Reference Answer: {answer}
278
+
279
+ Model Response: {response}
280
+
281
+ Evaluation criteria:
282
+ - CORRECT if the response correctly identifies the most recent/current state as described in the reference answer.
283
+ - The response may include earlier historical states as context -- this is acceptable as long as the current/final state is correctly identified and clearly stated.
284
+ - INCORRECT if the response:
285
+ - States an outdated/superseded state as the current one
286
+ - Omits the current state entirely
287
+ - Correctly describes history but draws the wrong conclusion about what the current state is
288
+ - Semantic equivalence counts (e.g., "flexitarian" and "semi-vegetarian diet with occasional meat" are equivalent if contextually clear).
289
+
290
+ Briefly explain your reasoning (2-3 sentences), then on the last line write only: yes or no"""
291
+ elif task == "single-session-preference":
292
+ template = """You are evaluating whether a model correctly answered a personalized question based on a user's stated preferences from their chat history.
293
+
294
+ Question: {question}
295
+
296
+ Reference Rubric: {answer}
297
+
298
+ Model Response: {response}
299
+
300
+ Evaluation criteria:
301
+ - CORRECT if the response recalls and applies the user's personal preferences correctly, even if not covering every point in the rubric.
302
+ - INCORRECT if the response ignores, contradicts, or misremembers the user's preferences.
303
+
304
+ Briefly explain your reasoning (1-2 sentences), then on the last line write only: yes or no"""
305
+ else:
306
+ template = """You are evaluating whether a model's response correctly answers a question based on a user's personal chat history.
307
+
308
+ Question: {question}
309
+
310
+ Reference Answer: {answer}
311
+
312
+ Model Response: {response}
313
+
314
+ Is the response correct? It is correct if it contains all key information from the reference answer, even if phrased differently.
315
+
316
+ Briefly explain your reasoning (1-2 sentences), then on the last line write only: yes or no"""
317
+ return template.format(question=question, answer=answer, response=response)
318
+
319
+
320
+ def build_rubric_prompt(task, question, answer):
321
+ return f"""You are creating an atomic grading rubric for an open-ended QA benchmark.
322
+
323
+ Question type: {task}
324
+ Question:
325
+ {question}
326
+
327
+ Reference answer:
328
+ {answer}
329
+
330
+ Question-type guidance:
331
+ {question_type_guidance(task)}
332
+
333
+ Decompose the reference answer into the smallest independently checkable requirements needed to answer the question.
334
+
335
+ Rules:
336
+ - Each atom should be a single required answer unit: an entity, count, date/year, order relation, current-state conclusion, or abstention requirement.
337
+ - If an entity and its temporal relation are inseparable for correctness, keep them in the same atom.
338
+ - For list/count questions, include an atom for the exact count when the question asks "how many", and atoms for each required listed item when the item identities matter.
339
+ - For Information Absence, usually use one atom requiring the response to clearly state that the information is unavailable/insufficient/not discussed, and add a strict note that concrete fabricated answers are wrong.
340
+ - Do not include supporting evidence requirements or session IDs unless the question explicitly asks for them.
341
+ - Weights should normally be 1.0. Use a higher weight only when one atom is clearly the main answer and other atoms are minor.
342
+
343
+ Return JSON only with this schema:
344
+ {{
345
+ "required_atoms": [
346
+ {{
347
+ "id": "a1",
348
+ "requirement": "short, specific grading requirement",
349
+ "weight": 1.0
350
+ }}
351
+ ],
352
+ "strict_notes": ["short note about exactness, ordering, abstention, or hallucination handling"]
353
+ }}"""
354
+
355
+
356
+ def build_atomic_eval_prompt(task, question, answer, response, rubric):
357
+ rubric_str = json.dumps(
358
+ {
359
+ "required_atoms": rubric["required_atoms"],
360
+ "strict_notes": rubric.get("strict_notes", []),
361
+ },
362
+ ensure_ascii=False,
363
+ indent=2,
364
+ )
365
+ return f"""You are an LLM-as-a-judge evaluating one model response against a reference answer.
366
+
367
+ Question type: {task}
368
+ Question:
369
+ {question}
370
+
371
+ Reference answer:
372
+ {answer}
373
+
374
+ Model response:
375
+ {response}
376
+
377
+ Atomic grading rubric:
378
+ {rubric_str}
379
+
380
+ Question-type guidance:
381
+ {question_type_guidance(task)}
382
+
383
+ Judge each atom independently.
384
+
385
+ Atom labels:
386
+ - correct: the response fully satisfies this atom, allowing semantic paraphrase.
387
+ - partially_correct: the response gets the main idea but is incomplete or slightly underspecified. Use this sparingly. Do not use it for wrong counts, wrong years, wrong named entities, wrong ordering, or a concrete answer to an Information Absence question.
388
+ - missing: the response does not address this atom.
389
+ - incorrect: the response contradicts this atom or gives the wrong count/entity/date/order/current state.
390
+
391
+ Also identify unsupported_or_contradictory material:
392
+ - severity "material": extra answer content that changes the final answer, adds extra items to an exact list/count, gives an outdated current state, fabricates a concrete answer for Information Absence, or contradicts any atom.
393
+ - severity "minor": harmless context or extra explanation that does not change the answer.
394
+
395
+ Return JSON only with this schema:
396
+ {{
397
+ "atom_judgments": [
398
+ {{
399
+ "id": "a1",
400
+ "label": "correct|partially_correct|missing|incorrect",
401
+ "rationale": "brief reason"
402
+ }}
403
+ ],
404
+ "unsupported_or_contradictory": [
405
+ {{
406
+ "text": "extra or contradictory claim",
407
+ "severity": "minor|material",
408
+ "rationale": "brief reason"
409
+ }}
410
+ ],
411
+ "absence_mismatch": false,
412
+ "overall_rationale": "one or two sentence summary"
413
+ }}"""
414
+
415
+
416
+ def llm_call(
417
+ deployment_name: str,
418
+ api_version: str,
419
+ _prompt: str,
420
+ debug: bool = False,
421
+ vllm: bool = False,
422
+ tritonai: bool = False,
423
+ nvidia: bool = False,
424
+ ):
425
+ if nvidia:
426
+ client = OpenAI(
427
+ api_key=os.getenv("NV_API_KEY"),
428
+ base_url="https://inference-api.nvidia.com",
429
+ )
430
+ while True:
431
+ try:
432
+ return client.chat.completions.create(
433
+ model=deployment_name,
434
+ messages=[{"role": "user", "content": _prompt}],
435
+ )
436
+ except Exception as e:
437
+ st = _retryable_status(e)
438
+ if st in (429, 500, 503, 403):
439
+ print(f"[WARN] HTTP {st} from NVIDIA API; sleeping 60s then retrying...", flush=True)
440
+ time.sleep(60)
441
+ continue
442
+ print("One exception captured", repr(e), flush=True)
443
+ raise
444
+
445
+ if tritonai:
446
+ client = OpenAI(
447
+ api_key=os.getenv("TRITONAI_API_KEY"),
448
+ base_url=TRITONAI_BASE_URL,
449
+ )
450
+ while True:
451
+ try:
452
+ return client.chat.completions.create(
453
+ model=deployment_name,
454
+ messages=[{"role": "user", "content": _prompt}],
455
+ )
456
+ except Exception as e:
457
+ st = _retryable_status(e)
458
+ if st in (429, 500, 503, 403):
459
+ print(f"[WARN] HTTP {st} from LiteLLM proxy; sleeping 60s then retrying...", flush=True)
460
+ time.sleep(60)
461
+ continue
462
+ print("One exception captured", repr(e), flush=True)
463
+ raise
464
+
465
+ if deployment_name.startswith("claude-"):
466
+ import anthropic
467
+
468
+ client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
469
+ while True:
470
+ try:
471
+ msg = client.messages.create(
472
+ model=deployment_name,
473
+ max_tokens=1024,
474
+ messages=[{"role": "user", "content": _prompt}],
475
+ )
476
+
477
+ class _Choice:
478
+ class _Msg:
479
+ def __init__(self, text):
480
+ self.content = text
481
+
482
+ def __init__(self, text):
483
+ self.message = self._Msg(text)
484
+
485
+ class _Completion:
486
+ def __init__(self, text):
487
+ self.choices = [_Choice(text)]
488
+
489
+ return _Completion(msg.content[0].text)
490
+ except Exception as e:
491
+ st = _retryable_status(e)
492
+ if st in (429, 500, 503, 403):
493
+ print(f"[WARN] HTTP {st} from Anthropic; sleeping 60s then retrying...", flush=True)
494
+ time.sleep(60)
495
+ continue
496
+ print("One exception captured", repr(e), flush=True)
497
+ raise
498
+
499
+ if vllm:
500
+ client = OpenAI(
501
+ base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
502
+ api_key=os.getenv("VLLM_API_KEY", "EMPTY"),
503
+ )
504
+ elif debug:
505
+ client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
506
+ else:
507
+ client = AzureOpenAI(
508
+ azure_endpoint=endpoint,
509
+ azure_ad_token_provider=credential,
510
+ api_version=api_version,
511
+ )
512
+
513
+ kwargs = {
514
+ "model": deployment_name,
515
+ "messages": [{"role": "system", "content": _prompt}],
516
+ }
517
+ while True:
518
+ try:
519
+ return client.chat.completions.create(**kwargs)
520
+ except Exception as e:
521
+ st = _retryable_status(e)
522
+ if st in (429, 500, 503, 403):
523
+ print(f"[WARN] HTTP {st} from LLM; sleeping 120s then retrying...", flush=True)
524
+ time.sleep(120)
525
+ continue
526
+ print("One exception captured", repr(e), flush=True)
527
+ raise
528
+
529
+
530
+ def call_json_llm(prompt, deployment_name, api_version, args, max_retries=3):
531
+ last_error = None
532
+ for attempt in range(max_retries):
533
+ completion = llm_call(
534
+ deployment_name,
535
+ api_version,
536
+ prompt,
537
+ debug=args.debug,
538
+ vllm=args.vllm,
539
+ tritonai=args.tritonai,
540
+ nvidia=args.nvidia,
541
+ )
542
+ content = completion.choices[0].message.content.strip()
543
+ try:
544
+ return parse_json_object(content), content
545
+ except Exception as e:
546
+ last_error = e
547
+ if attempt < max_retries - 1:
548
+ print(f"[WARN] Failed to parse judge JSON; retrying ({attempt + 1}/{max_retries})", flush=True)
549
+ time.sleep(2)
550
+ raise ValueError(f"Failed to parse JSON response from judge: {last_error}")
551
+
552
+
553
+ def fallback_rubric(qid, task, question, answer):
554
+ return {
555
+ "question_id": qid,
556
+ "question_type": task,
557
+ "question": question,
558
+ "reference_answer": answer,
559
+ "required_atoms": [
560
+ {
561
+ "id": "a1",
562
+ "requirement": f"Response must correctly answer the question according to the reference answer: {answer}",
563
+ "weight": 1.0,
564
+ }
565
+ ],
566
+ "strict_notes": ["Fallback single-atom rubric produced because the generated rubric was invalid."],
567
+ "prompt_version": ATOMIC_PROMPT_VERSION,
568
+ }
569
+
570
+
571
+ def normalize_rubric(qid, task, question, answer, parsed):
572
+ atoms = parsed.get("required_atoms", []) if isinstance(parsed, dict) else []
573
+ norm_atoms = []
574
+ for idx, atom in enumerate(atoms, start=1):
575
+ if not isinstance(atom, dict):
576
+ continue
577
+ requirement = str(atom.get("requirement", "")).strip()
578
+ if not requirement:
579
+ continue
580
+ atom_id = str(atom.get("id", f"a{idx}")).strip() or f"a{idx}"
581
+ try:
582
+ weight = float(atom.get("weight", 1.0))
583
+ except (TypeError, ValueError):
584
+ weight = 1.0
585
+ if weight <= 0:
586
+ weight = 1.0
587
+ norm_atoms.append({"id": atom_id, "requirement": requirement, "weight": weight})
588
+ if not norm_atoms:
589
+ return fallback_rubric(qid, task, question, answer)
590
+ strict_notes = parsed.get("strict_notes", [])
591
+ if not isinstance(strict_notes, list):
592
+ strict_notes = [str(strict_notes)]
593
+ return {
594
+ "question_id": qid,
595
+ "question_type": task,
596
+ "question": question,
597
+ "reference_answer": answer,
598
+ "required_atoms": norm_atoms,
599
+ "strict_notes": [str(x) for x in strict_notes],
600
+ "prompt_version": ATOMIC_PROMPT_VERSION,
601
+ }
602
+
603
+
604
+ def get_or_build_rubric(qdata, rubric_data, rubric_file, deployment_name, api_version, args):
605
+ qid = qdata["question_id"]
606
+ existing = rubric_data["rubrics"].get(qid)
607
+ if (
608
+ existing
609
+ and existing.get("prompt_version") == ATOMIC_PROMPT_VERSION
610
+ and existing.get("required_atoms")
611
+ and not args.force_rebuild_rubric
612
+ ):
613
+ return existing
614
+
615
+ task = qdata["question_type"]
616
+ prompt = build_rubric_prompt(task, qdata["question"], qdata["answer"])
617
+ try:
618
+ parsed, raw = call_json_llm(prompt, deployment_name, api_version, args)
619
+ rubric = normalize_rubric(qid, task, qdata["question"], qdata["answer"], parsed)
620
+ rubric["rubric_raw_response"] = raw
621
+ except Exception as e:
622
+ print(f"[WARN] Falling back to single-atom rubric for {qid}: {e}", flush=True)
623
+ rubric = fallback_rubric(qid, task, qdata["question"], qdata["answer"])
624
+ rubric_data["rubrics"][qid] = rubric
625
+ write_json(rubric_file, rubric_data)
626
+ return rubric
627
+
628
+
629
+ def compute_atomic_scores(rubric, parsed):
630
+ atoms = rubric.get("required_atoms", [])
631
+ judgments_by_id = {}
632
+ raw_judgments = parsed.get("atom_judgments", []) if isinstance(parsed, dict) else []
633
+ if isinstance(raw_judgments, list):
634
+ for judgment in raw_judgments:
635
+ if not isinstance(judgment, dict):
636
+ continue
637
+ atom_id = str(judgment.get("id", "")).strip()
638
+ label = str(judgment.get("label", "")).strip()
639
+ if label not in ATOM_SCORES:
640
+ label = "incorrect"
641
+ judgments_by_id[atom_id] = {
642
+ "id": atom_id,
643
+ "label": label,
644
+ "score": ATOM_SCORES[label],
645
+ "rationale": str(judgment.get("rationale", "")),
646
+ }
647
+
648
+ norm_judgments = []
649
+ weighted_score = 0.0
650
+ total_weight = 0.0
651
+ for atom in atoms:
652
+ atom_id = atom["id"]
653
+ weight = float(atom.get("weight", 1.0))
654
+ judgment = judgments_by_id.get(
655
+ atom_id,
656
+ {"id": atom_id, "label": "missing", "score": 0.0, "rationale": "No judgment returned."},
657
+ )
658
+ judgment["requirement"] = atom["requirement"]
659
+ judgment["weight"] = weight
660
+ norm_judgments.append(judgment)
661
+ weighted_score += judgment["score"] * weight
662
+ total_weight += weight
663
+
664
+ extras = parsed.get("unsupported_or_contradictory", []) if isinstance(parsed, dict) else []
665
+ if not isinstance(extras, list):
666
+ extras = []
667
+ material_extras = [
668
+ x for x in extras
669
+ if isinstance(x, dict) and str(x.get("severity", "")).strip() == "material"
670
+ ]
671
+ absence_mismatch = bool(parsed.get("absence_mismatch", False)) if isinstance(parsed, dict) else False
672
+
673
+ strict_label = (
674
+ bool(norm_judgments)
675
+ and all(j["label"] == "correct" for j in norm_judgments)
676
+ and not material_extras
677
+ and not absence_mismatch
678
+ )
679
+ partial_score = weighted_score / total_weight if total_weight > 0 else 0.0
680
+ if absence_mismatch:
681
+ partial_score = 0.0
682
+ elif material_extras and partial_score > 0.8:
683
+ partial_score = 0.8
684
+
685
+ return {
686
+ "strict_label": strict_label,
687
+ "partial_score": round(partial_score, 4),
688
+ "atom_judgments": norm_judgments,
689
+ "unsupported_or_contradictory": extras,
690
+ "absence_mismatch": absence_mismatch,
691
+ "overall_rationale": str(parsed.get("overall_rationale", "")) if isinstance(parsed, dict) else "",
692
+ }
693
+
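+ # Worked example of the aggregation above (hypothetical labels; assumes ATOM_SCORES maps
+ # "correct" -> 1.0 and "partially_correct" -> 0.5): two atoms of weight 1.0 judged
+ # "correct" and "partially_correct" give weighted_score = 1.5 and total_weight = 2.0, so
+ # partial_score = 0.75 and strict_label = False (not every atom is "correct").
+ # A "material" unsupported claim would further cap partial_score at 0.8, and an
+ # absence_mismatch forces it to 0.0.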
694
+
695
+ def judge_atomic(qdata, hypothesis, rubric, deployment_name, api_version, args):
696
+ prompt = build_atomic_eval_prompt(
697
+ qdata["question_type"],
698
+ qdata["question"],
699
+ qdata["answer"],
700
+ hypothesis,
701
+ rubric,
702
+ )
703
+ parsed, raw = call_json_llm(prompt, deployment_name, api_version, args)
704
+ scores = compute_atomic_scores(rubric, parsed)
705
+ return {
706
+ "model": args.eval_model_name,
707
+ "prompt_version": ATOMIC_PROMPT_VERSION,
708
+ "eval_mode": args.eval_mode,
709
+ "strict_label": scores["strict_label"],
710
+ "partial_score": scores["partial_score"],
711
+ "required_atoms": rubric["required_atoms"],
712
+ "atom_judgments": scores["atom_judgments"],
713
+ "unsupported_or_contradictory": scores["unsupported_or_contradictory"],
714
+ "absence_mismatch": scores["absence_mismatch"],
715
+ "overall_rationale": scores["overall_rationale"],
716
+ "raw_response": raw,
717
+ }
718
+
719
+
720
+ def should_skip_existing(existing_row, eval_mode):
721
+ if eval_mode == "legacy":
722
+ return "autoeval_label" in existing_row
723
+ atomic = existing_row.get("autoeval_atomic")
724
+ return bool(atomic and atomic.get("prompt_version") == ATOMIC_PROMPT_VERSION)
725
+
726
+
727
+ def safe_mean(values):
728
+ if not values:
729
+ return float("nan")
730
+ return float(np.mean(values))
731
+
732
+
733
+ def print_legacy_summary(logs, qtype2acc):
734
+ labels = [1 if x["autoeval_label"]["label"] else 0 for x in logs if "autoeval_label" in x]
735
+ print("Accuracy:", round(safe_mean(labels), 4))
736
+ for k, v in sorted(qtype2acc.items()):
737
+ print("\t{}: {} ({})".format(k, round(safe_mean(v), 4), len(v)))
738
+
739
+
740
+ def print_atomic_summary(logs, qtype2strict, qtype2partial, eval_mode):
741
+ strict_values = [
742
+ 1 if x["autoeval_atomic"]["strict_label"] else 0
743
+ for x in logs
744
+ if "autoeval_atomic" in x
745
+ ]
746
+ partial_values = [
747
+ float(x["autoeval_atomic"]["partial_score"])
748
+ for x in logs
749
+ if "autoeval_atomic" in x
750
+ ]
751
+ if eval_mode in ("strict", "both"):
752
+ print("Strict Accuracy:", round(safe_mean(strict_values), 4))
753
+ for k, v in sorted(qtype2strict.items()):
754
+ print("\t{}: {} ({})".format(k, round(safe_mean(v), 4), len(v)))
755
+ if eval_mode in ("partial", "both"):
756
+ print("Partial Score:", round(safe_mean(partial_values), 4))
757
+ for k, v in sorted(qtype2partial.items()):
758
+ print("\t{}: {} ({})".format(k, round(safe_mean(v), 4), len(v)))
759
+
760
+
761
+ def main():
762
+ parser = argparse.ArgumentParser()
763
+ parser.add_argument("--hyp_file", type=str, default=None)
764
+ parser.add_argument("--ref_file", type=str, required=True)
765
+ parser.add_argument("--eval_model_name", type=str, required=True)
766
+ parser.add_argument(
767
+ "--eval_mode",
768
+ type=str,
769
+ default="both",
770
+ choices=["legacy", "strict", "partial", "both"],
771
+ help="legacy uses the old yes/no judge; strict/partial/both use atomic JSON judging.",
772
+ )
773
+ parser.add_argument("--rubric_file", type=str, default=None)
774
+ parser.add_argument("--build_rubric_only", action="store_true", default=False)
775
+ parser.add_argument("--force_rebuild_rubric", action="store_true", default=False)
776
+ parser.add_argument("--result_file", type=str, default=None)
777
+ parser.add_argument("--debug", action="store_true", default=False)
778
+ parser.add_argument("--vllm", action="store_true", default=False)
779
+ parser.add_argument("--tritonai", action="store_true", default=False,
780
+ help="Use OpenAI-compatible LiteLLM proxy (set TRITONAI_API_KEY env var)")
781
+ parser.add_argument("--nvidia", action="store_true", default=False,
782
+ help="Use NVIDIA inference API (set NV_API_KEY env var)")
783
+ parser.add_argument("--verbose", action=argparse.BooleanOptionalAction, default=True)
784
+ args = parser.parse_args()
785
+
786
+ if not args.build_rubric_only and not args.hyp_file:
787
+ parser.error("--hyp_file is required unless --build_rubric_only is set")
788
+
789
+ metric_model = args.eval_model_name
790
+ deployment_name, api_version = model_zoo[metric_model]
791
+ references = read_json_or_jsonl(args.ref_file)
792
+ qid2qdata = {entry["question_id"]: entry for entry in references}
793
+ qid2qtype = {entry["question_id"]: entry["question_type"] for entry in references}
794
+ qtypes = set(qid2qtype.values())
795
+
796
+ rubric_data = None
797
+ rubric_file = args.rubric_file or default_rubric_file(args.ref_file)
798
+ if args.eval_mode != "legacy" or args.build_rubric_only:
799
+ rubric_data = load_rubric_file(rubric_file, args.ref_file)
800
+
801
+ if args.build_rubric_only:
802
+ for entry in tqdm(references, desc="building rubrics"):
803
+ get_or_build_rubric(entry, rubric_data, rubric_file, deployment_name, api_version, args)
804
+ print(f"Saved rubric file to {rubric_file}")
805
+ return
806
+
807
+ result_file = args.result_file or default_result_file(args.hyp_file, metric_model, args.eval_mode)
808
+ existing = read_existing_jsonl(result_file)
809
+ hypotheses = read_json_or_jsonl(args.hyp_file)
810
+
811
+ qtype2acc = {t: [] for t in qtypes}
812
+ qtype2strict = {t: [] for t in qtypes}
813
+ qtype2partial = {t: [] for t in qtypes}
814
+ logs = []
815
+
816
+ for entry in tqdm(hypotheses):
817
+ qid = entry.get("question_id")
818
+ if qid not in qid2qtype:
819
+ if qid is not None:
820
+ print(f"Warning: skipping {qid} as it is not in reference data.")
821
+ continue
822
+
823
+ if qid in existing and should_skip_existing(existing[qid], args.eval_mode):
824
+ existing_row = existing[qid]
825
+ logs.append(existing_row)
826
+ qtype = qid2qtype[qid]
827
+ if args.eval_mode == "legacy":
828
+ label = existing_row["autoeval_label"]["label"]
829
+ qtype2acc[qtype].append(1 if label else 0)
830
+ else:
831
+ atomic = existing_row["autoeval_atomic"]
832
+ qtype2strict[qtype].append(1 if atomic["strict_label"] else 0)
833
+ qtype2partial[qtype].append(float(atomic["partial_score"]))
834
+ continue
835
+
836
+ qdata = qid2qdata[qid]
837
+ qtype = qdata["question_type"]
838
+ hyp = entry["hypothesis"]
839
+
840
+ if args.eval_mode == "legacy":
841
+ prompt = get_anscheck_prompt(qtype, qdata["question"], qdata["answer"], hyp)
842
+ completion = llm_call(
843
+ deployment_name,
844
+ api_version,
845
+ prompt,
846
+ debug=args.debug,
847
+ vllm=args.vllm,
848
+ tritonai=args.tritonai,
849
+ nvidia=args.nvidia,
850
+ )
851
+ eval_response = completion.choices[0].message.content.strip()
852
+ last_line = next((l.strip().lower() for l in reversed(eval_response.splitlines()) if l.strip()), "")
853
+ label = last_line == "yes" or last_line.startswith("yes")
854
+ row = dict(entry)
855
+ row["autoeval_label"] = {
856
+ "model": metric_model,
857
+ "prompt_version": LEGACY_PROMPT_VERSION,
858
+ "label": label,
859
+ "raw_response": eval_response,
860
+ }
861
+ logs.append(row)
862
+ qtype2acc[qtype].append(1 if label else 0)
863
+ if args.verbose:
864
+ print(json.dumps({
865
+ "question": qdata["question"],
866
+ "answer": qdata["answer"],
867
+ "hypothesis": hyp,
868
+ "autoeval_label": label,
869
+ }, indent=4), flush=True)
870
+ append_jsonl(result_file, row)
871
+ continue
872
+
873
+ rubric = get_or_build_rubric(qdata, rubric_data, rubric_file, deployment_name, api_version, args)
874
+ try:
875
+ atomic_eval = judge_atomic(qdata, hyp, rubric, deployment_name, api_version, args)
876
+ except ValueError as _judge_err:
877
+ print(f"[WARN] judge_atomic failed for {qdata['question_id']}, writing zero score: {_judge_err}", flush=True)
878
+ atoms = rubric.get("required_atoms", [])
879
+ atomic_eval = {
880
+ "model": deployment_name,
881
+ "prompt_version": ATOMIC_PROMPT_VERSION,
882
+ "eval_mode": args.eval_mode,
883
+ "strict_label": False,
884
+ "partial_score": 0.0,
885
+ "required_atoms": atoms,
886
+ "atom_judgments": [{"id": a["id"], "label": "error", "score": 0.0, "rationale": "judge parse error", "requirement": a.get("requirement", ""), "weight": a.get("weight", 1.0)} for a in atoms],
887
+ "unsupported_or_contradictory": [],
888
+ "absence_mismatch": False,
889
+ "overall_rationale": f"Skipped: judge JSON parse error ({_judge_err})",
890
+ }
891
+ row = dict(entry)
892
+ row["autoeval_atomic"] = atomic_eval
893
+ logs.append(row)
894
+ qtype2strict[qtype].append(1 if atomic_eval["strict_label"] else 0)
895
+ qtype2partial[qtype].append(float(atomic_eval["partial_score"]))
896
+ if args.verbose:
897
+ print(json.dumps({
898
+ "question": qdata["question"],
899
+ "answer": qdata["answer"],
900
+ "hypothesis": hyp,
901
+ "strict_label": atomic_eval["strict_label"],
902
+ "partial_score": atomic_eval["partial_score"],
903
+ "atom_judgments": atomic_eval["atom_judgments"],
904
+ }, indent=4), flush=True)
905
+ append_jsonl(result_file, row)
906
+
907
+ if args.eval_mode == "legacy":
908
+ print_legacy_summary(logs, qtype2acc)
909
+ else:
910
+ print_atomic_summary(logs, qtype2strict, qtype2partial, args.eval_mode)
911
+ print(f"Rubric file: {rubric_file}")
912
+ print("Saved to", result_file)
913
+
914
+
915
+ if __name__ == "__main__":
916
+ main()
main.py ADDED
@@ -0,0 +1,1717 @@
1
+ import argparse
2
+ import os
3
+ import re
4
+ import json
5
+ import time
6
+ from json import JSONDecodeError
7
+ from datetime import datetime, timedelta
8
+ from typing import List, Dict, Any
9
+
10
+ from openai import OpenAI
11
+ try:
12
+ from openai import AzureOpenAI
13
+ from azure.identity import ChainedTokenCredential, AzureCliCredential, ManagedIdentityCredential, get_bearer_token_provider
14
+ # Azure scope for the OAuth bearer-token provider; override per deployment.
15
+ AZURE_OAUTH_SCOPE = os.environ.get("AZURE_OAUTH_SCOPE", "")
16
+ if AZURE_OAUTH_SCOPE:
17
+ credential = get_bearer_token_provider(ChainedTokenCredential(
18
+ AzureCliCredential(),
19
+ ManagedIdentityCredential(),
20
+ ), AZURE_OAUTH_SCOPE)
21
+ else:
22
+ credential = None
23
+ except ImportError:
24
+ AzureOpenAI = None
25
+ credential = None
26
+
27
+ from model_zoo import model_zoo
28
+ from memory import EpisodicMemoryStore, SemanticMemoryStore
29
+ try:
30
+ import tiktoken
31
+ except ImportError:
32
+ tiktoken = None
33
+
34
+ try:
35
+ from transformers import AutoTokenizer, PreTrainedTokenizerBase
36
+ except ImportError:
37
+ AutoTokenizer = None
38
+ PreTrainedTokenizerBase = ()
39
+ from collections import defaultdict
40
+
41
+ def get_hf_tokenizer_for_vllm(model_name: str):
42
+ return AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=True)
43
+
44
+
45
+ # Azure OpenAI endpoint (set AZURE_OPENAI_ENDPOINT env var to your deployment URL).
46
+ endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
47
+
48
+ # OpenAI-compatible LiteLLM proxy URL (set LITELLM_BASE_URL env var to your proxy).
49
+ TRITONAI_BASE_URL = os.environ.get("LITELLM_BASE_URL", "")
50
+
51
+ # reading cached files
52
+ qid2plan = {}
53
+ if 'plan_cache' in os.environ:
54
+ plan_cache_file = os.environ['plan_cache']
55
+ if os.path.exists(plan_cache_file):
56
+ qid2plan = json.load(open(plan_cache_file))
57
+ else:
58
+ plan_cache_file = 'response_cache/qa/evolv_mem_v3_plan_cache_gpt5-1'
59
+
60
+ veri_reading_log_file = os.environ['reading_cache']
61
+ qid2rel_sess_ids = {}
62
+ if os.path.exists(veri_reading_log_file):
63
+ qid2rel_sess_ids = json.load(open(veri_reading_log_file))
64
+
65
+
66
+ # Cache file for retrieval results to avoid re-running expensive retrieval operations.
67
+ # Stores pre-computed search results for questions, including:
68
+ # - Question metadata (id, type, text, answer, dates)
69
+ # - Haystack information (session dates, content, IDs)
70
+ # - Retrieved results with query, ranked items, and evaluation metrics
71
+ # Format: JSONL file where each line contains a complete retrieval result for one question
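+ # Illustrative (hypothetical) shape of one cached line; the retrieval helpers below read
+ # 'question_id' and retrieval_results['ranked_items'][i]['corpus_id']:
+ # {"question_id": "q_001", "question_type": "multi-session", "question": "...",
+ #  "answer": "...", "haystack_session_ids": ["sess_12", "sess_47"],
+ #  "retrieval_results": {"query": "...",
+ #    "ranked_items": [{"corpus_id": "sess_47_turn_3", "score": 0.81},
+ #                     {"corpus_id": "sess_12", "score": 0.64}]}}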
72
+ retrieved_log_file = None
73
+ if 'ret_cache' in os.environ:
74
+ retrieved_log_file = os.environ['ret_cache']
75
+
76
+ print("loading existing retrieved results ...")
77
+ retrieved_data = [json.loads(line) for line in open(retrieved_log_file).readlines()]
78
+ retrieved_data_dict = {x['question_id']: x for x in retrieved_data}
79
+ valid_sess_set = set(json.load(open("dataset/all_sessions.json")).keys())
80
+
81
+
82
+ def parse_json(response_content):
83
+ """Safely parse JSON content from a string response."""
84
+ candidates = []
85
+
86
+ if '```json' in response_content:
87
+ start_idx = response_content.find('```json') + 7
88
+ end_idx = response_content.rfind('```')
89
+ if end_idx > start_idx:
90
+ candidates.append(response_content[start_idx:end_idx].strip())
91
+
92
+ # Also try brace-based extraction as fallback
93
+ brace_start = response_content.find('{')
94
+ brace_end = response_content.rfind('}') + 1
95
+ if brace_start >= 0 and brace_end > brace_start:
96
+ candidates.append(response_content[brace_start:brace_end].strip())
97
+
98
+ for json_block in candidates:
99
+ try:
100
+ result = json.loads(json_block)
101
+ return result
102
+ except (JSONDecodeError, ValueError):
103
+ continue
104
+
105
+ print(f"[Warning] Failed to decode JSON from response (all strategies failed)")
106
+ print(f"[Debug] Raw response content (truncated): {response_content[:500]}")
107
+ return {}
108
+
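+ # Intended behavior, illustrated on a hypothetical response: for
+ #   'Sure:\n```json\n{"keywords": ["marathon", "Boston"]}\n```'
+ # the fenced block is tried first and yields {"keywords": ["marathon", "Boston"]};
+ # if every candidate fails to parse, an empty dict is returned.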
109
+
110
+ def _retryable_status(e):
111
+ # Try common attributes first — return any extractable HTTP status code
112
+ status = getattr(e, "status_code", None) or getattr(e, "http_status", None)
113
+ if status is not None:
114
+ return int(status)
115
+ resp = getattr(e, "response", None)
116
+ if resp is not None and getattr(resp, "status_code", None) is not None:
117
+ return int(resp.status_code)
118
+ # Fallback: infer from message text
119
+ msg = str(e).lower()
120
+ if "429" in msg or "rate limit" in msg:
121
+ return 429
122
+ if "500" in msg or "internal server error" in msg:
123
+ return 500
124
+     if "503" in msg or "api configuration unavailable" in msg:
125
+ return 503
126
+ return None
127
+
128
+
129
+ MAX_CONTEXT_TOKENS = 272_000
130
+ #MAX_CONTEXT_TOKENS = 256_000
131
+
132
+
133
+ class CharacterEncoder:
134
+ """Conservative fallback when tokenizer packages are unavailable."""
135
+
136
+ def encode(self, text, **kwargs):
137
+ return list(text)
138
+
139
+ def decode(self, toks):
140
+ return "".join(toks)
141
+
142
+
143
+ def _get_encoder(model_name: str):
144
+ """
145
+ Return a token encoder for the given model name.
146
+     Note: not cached; callers should reuse the returned encoder to avoid reloading HF tokenizers.
147
+ """
148
+ # Prefer explicit handling for Qwen models first
149
+ # Note: Adjust the path below if you have the model downloaded locally
150
+ if AutoTokenizer is not None and any(k in model_name for k in ["Qwen3", "Qwen/", "Qwen"]):
151
+ try:
152
+ return AutoTokenizer.from_pretrained(
153
+ "Qwen/Qwen3-30B-A3B-Instruct-2507",
154
+ trust_remote_code=True,
155
+ use_fast=False
156
+ )
157
+ except Exception as e:
158
+ print(f"[WARN] Failed to load Qwen tokenizer: {e}. Falling back to tiktoken.")
159
+
160
+ # For non-Qwen models, rely on tiktoken's mapping when possible
161
+ if tiktoken is not None:
162
+ try:
163
+ return tiktoken.encoding_for_model(model_name)
164
+ except Exception:
165
+ # Generic safe fallback
166
+ return tiktoken.get_encoding("cl100k_base")
167
+ return CharacterEncoder()
168
+
169
+
170
+ def _truncate_to_tokens(text, enc, max_tokens) -> str:
171
+ """
172
+ Truncates text to the last `max_tokens`.
173
+ Compatible with both tiktoken and Hugging Face AutoTokenizers.
174
+ """
175
+ # Handle Hugging Face Tokenizers
176
+ if PreTrainedTokenizerBase and isinstance(enc, PreTrainedTokenizerBase):
177
+ # add_special_tokens=False is crucial here to avoid double counting
178
+ # or inserting BOS/EOS in the middle of text during length checks
179
+ toks = enc.encode(text, add_special_tokens=False)
180
+ # Handle tiktoken
181
+ else:
182
+ toks = enc.encode(text, disallowed_special=())
183
+
184
+ if len(toks) <= max_tokens:
185
+ return text
186
+
187
+ # Keep the tail (usually the most relevant for instructions / recent context)
188
+ toks = toks[-max_tokens:]
189
+
190
+ return enc.decode(toks)
191
+
192
+
193
+ def truncate_chat_prompt(tokenizer, messages, max_context, max_output, overhead=256):
194
+ # Apply the model's chat template so token counting matches the server
195
+ prompt_text = tokenizer.apply_chat_template(
196
+ messages,
197
+ tokenize=False,
198
+ add_generation_prompt=True,
199
+ )
200
+ input_ids = tokenizer(prompt_text, add_special_tokens=False).input_ids
201
+
202
+ budget = max_context - max_output - overhead
203
+ if budget < 0:
204
+ raise ValueError("max_output_tokens + overhead exceeds max_context_tokens")
205
+
206
+ if len(input_ids) > budget:
207
+ input_ids = input_ids[-budget:] # or keep the *start* depending on your needs
208
+ prompt_text = tokenizer.decode(input_ids, skip_special_tokens=False)
209
+
210
+ return prompt_text
211
+
212
+
213
+ def llm_call(deployment_name: str,
214
+ api_version: str,
215
+ _prompt: str,
216
+ max_context_tokens: int = MAX_CONTEXT_TOKENS,
217
+ max_output_tokens: int = 1024,
218
+ extra_overhead_tokens: int = 32,
219
+ debug: bool = False,
220
+ vllm: bool = False,
221
+ tritonai: bool = False,
222
+ nvidia: bool = False):
223
+ if nvidia:
224
+ client = OpenAI(
225
+ api_key=os.getenv("NV_API_KEY"),
226
+ base_url="https://inference-api.nvidia.com/v1",
227
+ )
228
+ elif tritonai:
229
+ client = OpenAI(
230
+ api_key=os.getenv("TRITONAI_API_KEY"),
231
+ base_url=TRITONAI_BASE_URL,
232
+ )
233
+ max_context_tokens = 131_072 # DeepSeek R1 128K context
234
+ # DeepSeek R1 uses thinking tokens before the answer; raise output budget
235
+ if max_output_tokens < 4096:
236
+ max_output_tokens = 4096
237
+ elif vllm:
238
+ # Use local vLLM OpenAI-compatible server
239
+ vllm_base_url = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
240
+ vllm_api_key = os.getenv("VLLM_API_KEY", "EMPTY")
241
+ client = OpenAI(
242
+ base_url=vllm_base_url,
243
+ api_key=vllm_api_key,
244
+ )
245
+ # Override deployment_name with the vLLM-served model name when set
246
+ # (needed when main model is LiteLLM proxy but reading uses local vLLM)
247
+ deployment_name = os.getenv("VLLM_MODEL_NAME", deployment_name)
248
+ # vLLM Qwen3-30B-A3B-Instruct-2507 has 131,072-token context
249
+ max_context_tokens = 131_072
250
+ elif debug:
251
+ client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
252
+ else:
253
+ client = AzureOpenAI(
254
+ azure_endpoint=endpoint,
255
+ azure_ad_token_provider=credential,
256
+ api_version=api_version,
257
+ )
258
+
259
+ enc = _get_encoder(deployment_name)
260
+
261
+ # How many tokens we can spend on the input
262
+ budget = max_context_tokens - max_output_tokens - extra_overhead_tokens
263
+ if budget < 0:
264
+ raise ValueError("max_output_tokens + overhead exceeds max_context_tokens")
265
+
266
+ prompt_truncated = _truncate_to_tokens(_prompt, enc, budget)
267
+
268
+ # Strip control characters and fix broken Unicode that break JSON serialization
269
+ if nvidia or tritonai:
270
+ prompt_truncated = prompt_truncated.encode('utf-8', errors='replace').decode('utf-8', errors='replace')
271
+ prompt_truncated = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', prompt_truncated)
272
+ # Verify it's valid JSON-serializable
273
+ json.dumps(prompt_truncated)
274
+
275
+ # OpenAI-compatible proxy requires at least one user message; Azure accepts system role
276
+ msg_role = "user" if (tritonai or nvidia) else "system"
277
+ kwargs = {
278
+ 'model': deployment_name,
279
+ 'messages':[
280
+ {"role": msg_role, "content": prompt_truncated}
281
+ ]
282
+ }
283
+
284
+ while True:
285
+ try:
286
+ completion = client.chat.completions.create(**kwargs)
287
+ break
288
+ except Exception as e:
289
+ from openai import APITimeoutError as _APITimeoutError
290
+ if isinstance(e, _APITimeoutError):
291
+ print(f"[WARN] APITimeoutError from LLM; sleeping 30s then retrying...", flush=True)
292
+ time.sleep(30)
293
+ continue
294
+ st = _retryable_status(e)
295
+ # 404 from LiteLLM proxy/Bedrock is intermittent (model temporarily unavailable)
296
+ retryable = (429, 500, 503, 403) + ((404,) if tritonai else ())
297
+ if st in retryable:
298
+ print(
299
+                     f"[WARN] HTTP {st} from LLM; sleeping 60s then retrying...",
300
+ flush=True
301
+ )
302
+ time.sleep(60)
303
+ continue
304
+ # Non-retryable -> re-raise
305
+ print('One exception captured', repr(e), flush=True)
306
+ raise
307
+
308
+ #answer = (completion.choices[0].message.content or "").strip()
309
+ #return answer
310
+ return completion
311
+
312
+ def custom_to_iso8601(time_str):
313
+ """
314
+ Convert '2023/04/10 (Mon) 23:07' to '2023-04-10T23:07:00'
315
+ """
316
+ # Remove the weekday (e.g., "(Mon)")
317
+ clean = time_str.split('(')[0].strip() + ' ' + time_str.split(')')[-1].strip()
318
+ # Parse the cleaned string
319
+ dt = datetime.strptime(clean, "%Y/%m/%d %H:%M")
320
+ # Format as ISO 8601
321
+ return dt.isoformat()
322
+
323
+
324
+ def evaluate_retrieval(recalled_docs, correct_docs, k=10):
325
+ #recalled_docs = set(corpus_ids[idx] for idx in rankings[:k])
326
+ recall_any = float(any(doc in recalled_docs for doc in correct_docs))
327
+ recall_all = float(all(doc in recalled_docs for doc in correct_docs))
328
+ return recall_any, recall_all
329
+
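+ # Minimal usage sketch: with recalled_docs = {"sess_1", "sess_3"} and
+ # correct_docs = ["sess_3", "sess_7"], recall_any is 1.0 (sess_3 was recalled)
+ # while recall_all is 0.0 (sess_7 was not).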
330
+
331
+ def print_average_metrics(retrieval_metric_list):
332
+ metric_sums = defaultdict(float)
333
+ metric_counts = defaultdict(int)
334
+
335
+ for metric in retrieval_metric_list:
336
+ for k, v in metric.items():
337
+ metric_sums[k] += v
338
+ metric_counts[k] += 1
339
+
340
+ print("\t\t Average metrics:")
341
+ for k in sorted(metric_sums):
342
+ avg = metric_sums[k] / metric_counts[k]
343
+ print(f"\t\t {k}: {avg:.4f}")
344
+
345
+
346
+ # Load prompt template
347
+ prompt_path = "prompts/agentic_retrieval_prompt.txt"
348
+ with open(prompt_path, "r", encoding="utf-8") as f:
349
+ stg_prompt = f.read()
350
+
351
+
352
+ class ChatHistory:
353
+ def __init__(self, data: Dict[str, Any] = None, sessions: List = None):
354
+ assert not (data is not None and sessions is not None), "ChatHistory: Only one of data or sessions may be provided."
355
+
356
+ if data is not None: # From raw data dict
357
+ self.raw_data = data
358
+ self.sessions = []
359
+ self._parse_sessions()
360
+ elif sessions is not None: # From provided sessions list
361
+ self.sessions = sessions
362
+ self.messages = []
363
+ for sess in self.sessions:
364
+ session_id = sess['session_id']
365
+ timestamp = sess['timestamp']
366
+ for turn_idx, msg in enumerate(sess['session']):
367
+ entry = {
368
+ "role": msg.get("role"),
369
+ "content": msg.get("content"),
370
+ "session_id": session_id,
371
+ "turn_index": turn_idx,
372
+ "timestamp": timestamp,
373
+ "iso_datetime": timestamp.isoformat(),
374
+ "has_answer": msg.get("has_answer", False)
375
+ }
376
+ self.messages.append(entry)
377
+ else:
378
+ self.sessions = []
379
+ self.messages = []
380
+
381
+ def get_contents(self, granularity='turn', _format='json') -> list:
382
+ if granularity == "turn":
383
+ if _format == "json":
384
+ return [json.dumps(msg) for msg in self.messages]
385
+ else:
386
+ return [msg['content'] for msg in self.messages]
387
+ else: # granularity == 'session'
388
+ if _format == "json":
389
+                 # default=str handles the datetime 'timestamp' field, which json cannot serialize
+                 return [json.dumps(session, default=str) for session in self.sessions]
390
+ else:
391
+                 return [json.dumps([{"role": m["role"], "content": m["content"]} for m in session["session"]]) for session in self.sessions]
393
+
394
+ def to_prompt(self, granularity='session', _format="json"):
395
+ history_str = ""
396
+ for session in self.sessions:
397
+ sess_str = json.dumps([{"role": x["role"], "content": x["content"]} for x in session['session']])
398
+ history_str += f"Session Date: {session['session_date']}\nSession Content:\n{sess_str}\n"
399
+ return history_str
400
+
401
+ def get_session_ids(self):
402
+ return [s['session_id'] for s in self.sessions]
403
+
404
+ @staticmethod
405
+ def _parse_date(date_str: str) -> datetime:
406
+ # Convert '2023/04/10 (Mon) 17:50' to datetime
407
+ # Remove weekday in parentheses
408
+ date_part, time_part = date_str.split('(')[0].strip(), date_str.split(')')[-1].strip()
409
+ dt = datetime.strptime(date_part + time_part, "%Y/%m/%d%H:%M")
410
+ return dt
411
+
412
+ def _parse_sessions(self):
413
+ """
414
+ Flattens sessions into a list of messages, each with ISO 8601 date, session ID, and turn index
415
+ """
416
+ self.messages = []
417
+ for date_str, session_id, session, topic in zip(
418
+ self.raw_data['haystack_dates'],
419
+ self.raw_data['haystack_session_ids'],
420
+ self.raw_data['haystack_sessions'],
421
+ self.raw_data['haystack_topics']
422
+ ):
423
+ timestamp = self._parse_date(date_str)
424
+ for turn_idx, msg in enumerate(session):
425
+ entry = {
426
+ "role": msg.get("role"),
427
+ "content": msg.get("content"),
428
+ "session_id": session_id,
429
+ "turn_index": turn_idx,
430
+ "timestamp": timestamp,
431
+ "iso_datetime": timestamp.isoformat(),
432
+ "session_date": date_str,
433
+ "has_answer": msg.get("has_answer", False)
434
+ }
435
+ self.messages.append(entry)
436
+ self.sessions.append({
437
+ "session_date": date_str,
438
+ "timestamp": timestamp,
439
+ "session_id": session_id,
440
+ "session": session,
441
+ "topic": topic,
442
+ })
443
+ # Optionally, sort by time (ascending)
444
+ #self.sessions.sort(key=lambda x: x["timestamp"])
445
+ #self.messages.sort(key=lambda x: x["timestamp"])
446
+
447
+ def __len__(self):
448
+ return len(self.sessions)
449
+
450
+ def __getitem__(self, idx) -> Dict[str, any]:
451
+ return self.sessions[idx] # Return session dict
452
+
453
+ def get_item_by_index(self, idx):
454
+ if isinstance(idx, range) or isinstance(idx, list):
455
+ max_idx = len(self.sessions)
456
+ valid_indices = [i for i in idx if 0 <= i < max_idx]
457
+ selected_sessions = [self.sessions[i] for i in valid_indices]
458
+ return ChatHistory(sessions=selected_sessions)
459
+ else:
460
+ raise ValueError("Input must be a list or range of indices.")
461
+
462
+ def get_item_by_session_ids(self, sess_set):
463
+ if not isinstance(sess_set, set):
464
+ sess_set = set(sess_set)
465
+ new_sessions = []
466
+ for sess in self.sessions:
467
+ if sess['session_id'] in sess_set:
468
+ new_sessions.append(sess)
469
+
470
+ return ChatHistory(sessions=new_sessions)
471
+
472
+ def get_item_by_ranked_session(self, sess_id_sorted):
473
+ new_sessions = []
474
+ for sess_id in sess_id_sorted:
475
+ for sess in self.sessions:
476
+ if sess['session_id'] in sess_id:
477
+ new_sessions.append(sess)
478
+
479
+ return ChatHistory(sessions=new_sessions)
480
+
481
+ def get_item_by_topics(self, topics):
482
+ new_sessions = []
483
+ new_sess_ids = set()
484
+ for sess in self.sessions:
485
+ for tp in sess['topic']:
486
+ if tp in topics and sess['session_id'] not in new_sess_ids:
487
+ new_sessions.append(sess)
488
+ new_sess_ids.add(sess['session_id'])
489
+ break
490
+
491
+ return ChatHistory(sessions=new_sessions)
492
+
493
+ def merge_rel_sess(self, new_sessions):
494
+ # Gather all current and new sessions in a dict keyed by session_id
495
+ all_sessions = {s["session_id"]: s for s in self.sessions}
496
+
497
+ # add if new
498
+ for s in new_sessions.sessions:
499
+ if s["session_id"] not in all_sessions:
500
+ all_sessions[s["session_id"]] = s
501
+
502
+ # Reconstruct raw_data for new ChatHistory
503
+ merged_raw_data = {
504
+ "haystack_dates": [s["session_date"] for k, s in all_sessions.items()],
505
+ "haystack_session_ids": [s["session_id"] for k, s in all_sessions.items()],
506
+ "haystack_sessions": [s["session"] for k, s in all_sessions.items()],
507
+ "haystack_topics": [s["topic"] for k, s in all_sessions.items()],
508
+ }
509
+         merged = ChatHistory(data=merged_raw_data)
+         self.sessions = merged.sessions
+         self.messages = merged.messages
510
+
511
+
512
+
513
+ def generate_keywords(question: str, deployment_name, api_version, debug=False, vllm=None,
514
+ tritonai=False, nvidia=False) -> List[str]:
515
+ # read prompt from `keyword_search_prompt.txt` file
516
+ with open('prompts/keyword_search_prompt.txt') as f:
517
+ prompt_template = f.read()
518
+
519
+ prompt = prompt_template + question
520
+ # Call the LLM to generate keywords
521
+ completion = llm_call(
522
+ deployment_name,
523
+ api_version,
524
+ prompt,
525
+ debug=debug,
526
+ vllm=vllm,
527
+ tritonai=tritonai,
528
+ nvidia=nvidia,
529
+ )
530
+
531
+ response_content = (completion.choices[0].message.content or "").strip()
532
+ result = parse_json(response_content)
533
+ keywords = result["keywords"] if "keywords" in result else []
534
+ return keywords
535
+
536
+ def keyword_search(chat_history: ChatHistory, keywords: list):
537
+ print(f"\t\t** keyword search **: {keywords}")
538
+ # Gather all messages that match
539
+ start_time = time.time()
540
+ matched_msgs = [
541
+ msg for msg in chat_history.messages
542
+ if any(kw.lower() in (msg.get("content") or "").lower() for kw in keywords)
543
+ ]
544
+ end_time = time.time()
545
+ execution_time = end_time - start_time
546
+
547
+ if matched_msgs:
548
+ new_sess_ids = set()
549
+ for msg in matched_msgs:
550
+ key = msg["session_id"]
551
+ new_sess_ids.add(key)
552
+ new_chat_history = chat_history.get_item_by_session_ids(new_sess_ids)
553
+ else:
554
+ new_chat_history = ChatHistory()
555
+
556
+ return new_chat_history
557
+
558
+
559
+ def is_turn_id(text):
560
+ pattern = r'_\d+$'
561
+ return bool(re.search(pattern, text))
562
+
563
+
564
+ def embedding_search(chat_history: ChatHistory, qid: str, top_k: int = 50, exclude_sess=None):
565
+ print("\t\t** embedding based retrieval **")
566
+ new_sess_ids = []
567
+
568
+ curr_all_sess = set(chat_history.get_session_ids())
569
+ if exclude_sess:
570
+ curr_all_sess = curr_all_sess - set(exclude_sess)
571
+
572
+ for item in retrieved_data_dict[qid]["retrieval_results"]["ranked_items"]:
573
+ if item["corpus_id"] in valid_sess_set:
574
+ sid = item["corpus_id"]
575
+ else:
576
+ tokens = item["corpus_id"].split("_")
577
+
578
+ if "_turn" in item["corpus_id"]:
579
+ sid = item["corpus_id"].split("_turn")[0]
580
+ elif "_fact" in item["corpus_id"]:
581
+ sid = item["corpus_id"].split("_fact")[0]
582
+ elif "noans" in item["corpus_id"]:
583
+ sid = item["corpus_id"].replace("noans", "answer")
584
+ elif is_turn_id(item["corpus_id"]):
585
+ sid = "_".join(tokens[:-1]) # remove turn index
586
+ else:
587
+ sid = item["corpus_id"]
588
+
589
+ if sid not in valid_sess_set:
590
+ print(item["corpus_id"], sid)
591
+
592
+ assert sid in valid_sess_set
593
+
594
+ if sid in curr_all_sess:
595
+ new_sess_ids.append(sid)
596
+ if len(new_sess_ids) == top_k:
597
+ break
598
+ new_chat_history = chat_history.get_item_by_ranked_session(new_sess_ids)
599
+ return new_chat_history
600
+
601
+
602
+ def filter_out_by_embedding(chat_history: ChatHistory, qid: str, top_k: int = 50):
603
+ print("\t\t** [filter_out] embedding based retrieval - loading existing results ...")
604
+ new_sess_ids = []
605
+ curr_all_sess = set(chat_history.get_session_ids())
606
+ for item in retrieved_data_dict[qid]["retrieval_results"]["ranked_items"]:
607
+ if item["corpus_id"] in valid_sess_set:
608
+ sid = item["corpus_id"]
609
+ else:
610
+ tokens = item["corpus_id"].split("_")
611
+
612
+ if "_turn" in item["corpus_id"]:
613
+ sid = item["corpus_id"].split("_turn")[0]
614
+ elif "_fact" in item["corpus_id"]:
615
+ sid = item["corpus_id"].split("_fact")[0]
616
+ elif "noans" in item["corpus_id"]:
617
+ sid = item["corpus_id"].replace("noans", "answer")
618
+ elif is_turn_id(item["corpus_id"]):
619
+ sid = "_".join(tokens[:-1]) # remove turn index
620
+ else:
621
+ sid = item["corpus_id"]
622
+
623
+ if sid not in valid_sess_set:
624
+ print(item["corpus_id"], sid)
625
+
626
+ assert sid in valid_sess_set
627
+
628
+ if sid in curr_all_sess:
629
+ new_sess_ids.append(sid)
630
+ if len(new_sess_ids) == top_k:
631
+ break
632
+ new_chat_history = chat_history.get_item_by_ranked_session(new_sess_ids)
633
+ return new_chat_history, 0.0
634
+
635
+
636
+ def flat_embedding_top_k_ids(qid: str, haystack_sess_ids: List[str], top_k: int) -> List[str]:
637
+ """
638
+ Pull the top_k session IDs from the global GTE retrieval cache (retrieved_data_dict),
639
+ constrained to the question's haystack. Mirrors embedding_search() but operates on
640
+ IDs only (no ChatHistory). Used by hier_union to widen the Stage-2 pool.
641
+ """
642
+ haystack_set = set(haystack_sess_ids)
643
+ ids: List[str] = []
644
+ for item in retrieved_data_dict[qid]["retrieval_results"]["ranked_items"]:
645
+ cid = item["corpus_id"]
646
+ if cid in valid_sess_set:
647
+ sid = cid
648
+ elif "_turn" in cid:
649
+ sid = cid.split("_turn")[0]
650
+ elif "_fact" in cid:
651
+ sid = cid.split("_fact")[0]
652
+ elif "noans" in cid:
653
+ sid = cid.replace("noans", "answer")
654
+ elif is_turn_id(cid):
655
+ sid = "_".join(cid.split("_")[:-1])
656
+ else:
657
+ sid = cid
658
+ if sid in haystack_set and sid not in ids:
659
+ ids.append(sid)
660
+ if len(ids) == top_k:
661
+ break
662
+ return ids
663
+
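+ # Illustrative ID mapping performed above (session IDs are hypothetical): a cached
+ # corpus_id of "sess_12_turn_4" or "sess_12_fact_2" maps back to session "sess_12",
+ # a bare trailing "_3" turn suffix is stripped, and a "noans" marker is rewritten to
+ # "answer", so each ID matches the session-level haystack before the top_k cutoff.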
664
+
665
+ def semantic_embedding_search(
666
+ qid: str,
667
+ haystack_sess_ids: List[str],
668
+ semantic_retrieved_dict: dict,
669
+ top_k: int = 50,
670
+ ) -> List[str]:
671
+ """
672
+ Like embedding_search() but reads from the pre-computed semantic-gte retrieval cache.
673
+ Returns an ordered list of up to top_k session IDs from the haystack.
674
+ """
675
+ print("\t\t** semantic embedding retrieval **")
676
+ haystack_set = set(haystack_sess_ids)
677
+ ranked_ids: List[str] = []
678
+ for item in semantic_retrieved_dict[qid]["retrieval_results"]["ranked_items"]:
679
+ sid = item["corpus_id"] # already session-level (no turn suffix)
680
+ if sid in haystack_set and sid not in ranked_ids:
681
+ ranked_ids.append(sid)
682
+ if len(ranked_ids) == top_k:
683
+ break
684
+ return ranked_ids
685
+
686
+
687
+ def time_filter(chat_history: ChatHistory, start_date: str, end_date: str) -> ChatHistory:
688
+ # Returns all messages with timestamp in the ISO date range [start_date, end_date] (inclusive).
689
+ start_time = time.time()
690
+ try:
691
+ start = datetime.fromisoformat(start_date)
692
+ end = datetime.fromisoformat(end_date)
693
+ filtered_msgs = [msg for msg in chat_history.messages if start.date() <= msg["timestamp"].date() <= end.date()]
694
+ except Exception as e:
695
+ print("Converting date error: ", e)
696
+ filtered_msgs = []
697
+ end_time = time.time()
698
+ execution_time = end_time - start_time
699
+
700
+ if filtered_msgs:
701
+ new_sess_ids = set()
702
+ for msg in filtered_msgs:
703
+ key = msg["session_id"]
704
+ new_sess_ids.add(key)
705
+ new_chat_history = chat_history.get_item_by_session_ids(new_sess_ids)
706
+ else:
707
+ new_chat_history = ChatHistory()
708
+
709
+ return new_chat_history
710
+
711
+
712
+ class RetrievalAgent:
713
+ def __init__(
714
+ self,
715
+ history: List[Dict],
716
+ topics: List[str],
717
+ user_profile: str = None,
718
+ debug: bool = False,
719
+ vllm: bool = False,
720
+ vllm_reading: bool = False,
721
+ tritonai: bool = False,
722
+ nvidia: bool = False,
723
+ n_chunks: int = 10,
724
+ topic_filter: bool = True,
725
+ no_time_filter: bool = False,
726
+ semantic_store: SemanticMemoryStore = None,
727
+ episodic_store: EpisodicMemoryStore = None,
728
+ hier_v2: bool = False,
729
+ hier_union: bool = False,
730
+ hier_union_flat_k: int = 20,
731
+ no_early_answer: bool = False,
732
+ ):
733
+ self.chat_history = history
734
+ self.user_profile = user_profile
735
+ self.topics = topics
736
+ self.rel_sess = ChatHistory()
737
+ self.evidence = []
738
+ self.debug = debug
739
+ self.vllm = vllm
740
+ self.vllm_reading = vllm_reading # use vLLM only for _read_and_verify
741
+ self.tritonai = tritonai # use LiteLLM proxy for non-reading LLM calls
742
+ self.nvidia = nvidia # use NVIDIA inference API
743
+ self.no_time_filter = no_time_filter # skip time_filter steps in strategy
744
+ self.n_chunks = n_chunks
745
+ self.topic_filter = topic_filter
746
+ self.semantic_store = semantic_store
747
+ self.episodic_store = episodic_store
748
+ self.hier_v2 = hier_v2
749
+ self.hier_union = hier_union
750
+ self.hier_union_flat_k = hier_union_flat_k
751
+ self.no_early_answer = no_early_answer
752
+ self.token_budget = {
753
+ 'planning': {'prompt_tokens': 0, 'completion_tokens': 0, 'n_calls': 0},
754
+ 'verification_reading': {'prompt_tokens': 0, 'completion_tokens': 0, 'n_calls': 0},
755
+ 'is_answerable': {'prompt_tokens': 0, 'completion_tokens': 0, 'n_calls': 0},
756
+ 'final_answer': {'prompt_tokens': 0, 'completion_tokens': 0, 'n_calls': 0},
757
+ }
758
+
759
+ if self.user_profile:
760
+ with open('prompts/read_and_extract_prompt.txt') as f:
761
+ self.read_prompt_template = f.read()
762
+ else: # ablation: wo_profile
763
+ with open('prompts/agentic_retrieval_prompt_wo_profile.txt') as f:
764
+ self.read_prompt_template = f.read()
765
+
766
+ def _track_usage(self, component: str, completion) -> None:
767
+ """Accumulate prompt/completion token counts for a named component."""
768
+ usage = getattr(completion, 'usage', None)
769
+ if usage is None:
770
+ return
771
+ self.token_budget[component]['prompt_tokens'] += getattr(usage, 'prompt_tokens', 0) or 0
772
+ self.token_budget[component]['completion_tokens'] += getattr(usage, 'completion_tokens', 0) or 0
773
+ self.token_budget[component]['n_calls'] += 1
774
+
775
+ def get_token_budget(self) -> dict:
776
+ """Return token_budget with an added 'total' entry."""
777
+ total = {'prompt_tokens': 0, 'completion_tokens': 0, 'n_calls': 0}
778
+ for v in self.token_budget.values():
779
+ total['prompt_tokens'] += v['prompt_tokens']
780
+ total['completion_tokens'] += v['completion_tokens']
781
+ total['n_calls'] += v['n_calls']
782
+ return {**self.token_budget, 'total': total}
783
+
784
+ def is_answerable(self, question: str, question_date: str, retrieved_sess, evidence, model_info, context_str: str = None) -> bool:
785
+ context = ""
786
+ for k, v in evidence.items(): # key: "profile", "tags", "chat_clues"
787
+ for e in v:
788
+ context += f"{e}\n"
789
+
790
+ # context_str overrides retrieved_sess.to_prompt() (used in Stage 1 with semantic context)
791
+ sess_str = context_str if context_str is not None else retrieved_sess.to_prompt()
792
+
793
+ # Include user profile in the prompt when available
794
+ profile_section = ""
795
+ if self.user_profile:
796
+ profile_section = f"\nUser Profile:\n{self.user_profile}\n"
797
+
798
+ ia_prompt_prefix = f"""
799
+ You are a decision-making agent tasked with determining when sufficient information has been gathered to answer a user's question.
800
+
801
+ Your Task:
802
+ Analyze the provided question, current date, available memory context, and available evidence to make a binary decision: Answerable or Not answerable. If the information is not sufficient, explain what specific information is needed to provide hints for the next retrieval stage.
803
+
804
+ Question: {question}
805
+ Current Date: {question_date}
806
+ {profile_section}
807
+ Memory Context:
808
+ """
809
+
810
+ output_str = """
811
+ Output (always JSON — choose fields per the rules above)
812
+
813
+ Case 1 — Answerable:
814
+ {
815
+ "is_answerable": true,
816
+ "answer": "<concise answer grounded strictly in the Evidence>"
817
+ }
818
+
819
+ Case 2 — Not answerable:
820
+ {
821
+ "is_answerable": false,
822
+ "info_needed": ["<specific missing detail 1>", "<specific missing detail 2>"]
823
+ }
824
+ """
825
+ deployment_name, api_version = model_info
826
+
827
+ # ------------------------------------------------------------------
828
+ # Token-based truncation: keep the *end* of sess_str under budget
829
+ # ------------------------------------------------------------------
830
+ enc = _get_encoder(deployment_name)
831
+
832
+ # Model/context limits
833
+ if self.vllm or self.tritonai:
834
+ # Large context for vLLM / LiteLLM proxy models
835
+ model_max_ctx = 131_072
836
+ else:
837
+ model_max_ctx = MAX_CONTEXT_TOKENS
838
+
839
+ max_output_tokens = 1024
840
+ extra_overhead_tokens = 32
841
+
842
+ # Total budget available for input tokens
843
+ budget = model_max_ctx - max_output_tokens - extra_overhead_tokens
844
+ if budget <= 0:
845
+ raise ValueError(
846
+ f"max_output_tokens ({max_output_tokens}) + overhead "
847
+ f"({extra_overhead_tokens}) exceeds model_max_ctx ({model_max_ctx})"
848
+ )
849
+
850
+ # Token lengths of static pieces
851
+ prefix_tokens = enc.encode(ia_prompt_prefix, disallowed_special=())
852
+ output_tokens = enc.encode(output_str, disallowed_special=())
853
+ sess_tokens = enc.encode(sess_str, disallowed_special=())
854
+
855
+ # Budget for sess_str tokens
856
+ available_for_sess = budget - len(prefix_tokens) - len(output_tokens)
857
+
858
+ if available_for_sess <= 0:
859
+ # No room for history at all; drop it
860
+ truncated_sess_str = ""
861
+ else:
862
+ if len(sess_tokens) > available_for_sess:
863
+ # Keep the *last* available_for_sess tokens (drop oldest history)
864
+ truncated_sess_tokens = sess_tokens[-available_for_sess:]
865
+ truncated_sess_str = enc.decode(truncated_sess_tokens)
866
+ else:
867
+ truncated_sess_str = sess_str
868
+
869
+ ia_prompt = ia_prompt_prefix + truncated_sess_str + "\n"
870
+
871
+ # ------------------------------------------------------------------
872
+ # Call the LLM with the already-truncated prompt
873
+ # ------------------------------------------------------------------
874
+
875
+ completion = llm_call(
876
+ deployment_name,
877
+ api_version,
878
+ ia_prompt + output_str,
879
+ max_context_tokens=model_max_ctx, # matches what we used for budgeting
880
+ max_output_tokens=max_output_tokens,
881
+ extra_overhead_tokens=extra_overhead_tokens,
882
+ debug=self.debug,
883
+ vllm=self.vllm,
884
+ tritonai=self.tritonai,
885
+ nvidia=self.nvidia,
886
+ )
887
+ self._track_usage('is_answerable', completion)
888
+ response_content = (completion.choices[0].message.content or "").strip()
889
+ print(f"\t\t[Agent] is_answerable: {response_content}")
890
+ result = parse_json(response_content)
891
+ if not result:
892
+ print("[Warning] Empty or invalid JSON in is_answerable() response.")
893
+ return {"is_answerable": False, "info_needed": ["Parsing failed"]}
894
+ return result
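Aside on the truncation logic above: the prompt keeps the tail of the serialized session history so the newest turns survive when the token budget is exceeded. Below is a minimal, self-contained sketch of that strategy, assuming a tiktoken-style encoder; the helper name `truncate_tail` is illustrative and not part of the released code.

```python
import tiktoken

def truncate_tail(text: str, budget: int, model: str = "gpt-4o") -> str:
    """Keep only the last `budget` tokens of `text`, dropping the oldest history."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown deployment names fall back to a generic encoder.
        enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text, disallowed_special=())
    if len(tokens) <= budget:
        return text
    return enc.decode(tokens[-budget:])

# Budget mirrors the arithmetic above: context window minus output and overhead.
budget = 131_072 - 1024 - 32
shortened = truncate_tail("user: hello\nassistant: hi\n" * 50_000, budget)
```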
895
+
896
+ def _read_and_verify(self, question: str, question_date: str, evidence: ChatHistory, n_chunks=10) -> "tuple[ChatHistory, list]":
897
+ # Read the evidence in chunks and select only the sessions needed to answer the question
898
+ relevant_indices = []
899
+ evidence_list = []
900
+ max_idx = len(evidence)
901
+ for j in range(0, len(evidence), n_chunks):
902
+ chunk_range = range(j, j+n_chunks)
903
+ valid_indices = [i for i in chunk_range if 0 <= i < max_idx]
904
+ cur_chunk = evidence.get_item_by_index(valid_indices)
905
+ cur_chunk_sess = [[{"role": m["role"], "content": m["content"]} for m in sess['session']]
906
+ for sess in cur_chunk.sessions]
907
+ cur_chunk_sess_date = [sess['session_date'] for sess in cur_chunk.sessions]
908
+ sess_input_str = "\n".join([
909
+ f"### Session Index: {i}\n### Session Date: {sess_date}\n\n{json.dumps(sess)}\n"
910
+ for i, (sess, sess_date) in enumerate(zip(cur_chunk_sess, cur_chunk_sess_date))
911
+ ])
912
+ _prompt = self.read_prompt_template + f"## Question: {question}\n## Question Date: {question_date}\n## Session list:\n\n{sess_input_str}\nNow, identify **only the sessions strictly necessary to answer the question**."
913
+ completion = llm_call(
914
+ deployment_name,
915
+ api_version,
916
+ _prompt,
917
+ debug=self.debug,
918
+ vllm=self.vllm or self.vllm_reading,
919
+ nvidia=self.nvidia,
920
+ )
921
+ self._track_usage('verification_reading', completion)
922
+ response_content = (completion.choices[0].message.content or "").strip()
923
+
924
+ print(f"\t\t {valid_indices[0]}~{valid_indices[-1]}: response: {response_content.replace(chr(10), '')}")
925
+ try:
926
+ start_idx = response_content.rfind('{')
927
+ end_idx = response_content.rfind('}') + 1
928
+ json_block = response_content[start_idx:end_idx]
929
+ result = json.loads(json_block)
930
+
931
+ if "index" in result and result['index'] and 'evidence' and result:
932
+ relevant_indices.extend([j + idx for idx in result['index']])
933
+ evidence_list.extend(result['evidence'])
934
+ except Exception as e:
935
+ print(f"Error parsing LLM response: {e}")
936
+
937
+ if relevant_indices:
938
+ return evidence.get_item_by_index(relevant_indices), evidence_list
939
+ else:
940
+ return ChatHistory(), []
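The chunk loop above asks the reader LLM for chunk-local indices and then shifts them by the window start, which is what `j + idx` does. A stand-alone sketch of that index bookkeeping, with `select_relevant` standing in for the LLM call:

```python
def chunked_select(items, n_chunks, select_relevant):
    """Walk `items` in windows of `n_chunks`; `select_relevant` returns indices
    local to the window, which are offset by the window start `j` to recover
    global positions (mirroring the `j + idx` bookkeeping above)."""
    global_indices = []
    for j in range(0, len(items), n_chunks):
        chunk = items[j:j + n_chunks]
        for idx in select_relevant(chunk):   # chunk-local indices
            global_indices.append(j + idx)   # map back to global positions
    return global_indices

# Toy check: select every "x"
items = ["a", "x", "b", "x", "c", "x"]
assert chunked_select(items, 2, lambda ch: [i for i, v in enumerate(ch) if v == "x"]) == [1, 3, 5]
```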
941
+
942
+ def _read_and_verify_with_cache(self, qid:str, pool):
943
+ relevant_sess_ids = []
944
+
945
+ for sess in pool.sessions:
946
+ sess_id = sess['session_id']
947
+ if sess_id in qid2rel_sess_ids[qid]:
948
+ relevant_sess_ids.append(sess_id)
949
+
950
+ if len(relevant_sess_ids) > 0:
951
+ return pool.get_item_by_session_ids(relevant_sess_ids), []
952
+ else:
953
+ return ChatHistory(), []
954
+
955
+ def _plan(self, query: str, query_date: str, attempt_record: list, model_info) -> str:
956
+ if self.user_profile:
957
+ template = """
958
+ ### User profile: {user_profile}
959
+ ### Chat history topics: {topics}
960
+ ### User query: {query}
961
+ ### User query date: {query_date}
962
+ ### Previous attempts:
963
+ {strategies_info}
964
+ """
965
+ else:
966
+ template = """
967
+ ### Chat history topics: {topics}
968
+ ### User query: {query}
969
+ ### User query date: {query_date}
970
+ ### Previous attempts:
971
+ {strategies_info}
972
+ """
973
+
974
+ if attempt_record:
975
+ strategies_info = ""
976
+ for loop_num, entry in enumerate(attempt_record):
977
+ strategies_info += f"\nloop_iteration: {loop_num+1}\n"
978
+ strategies_info += "\n".join(entry.get('step_logs', []))
979
+ if 'n_retrieved_sess' in entry and 'evidence' in entry:
980
+ strategies_info += f"Retrieved {entry['n_retrieved_sess']} docs, observed_evidence: {entry['evidence']}"
981
+ if entry['n_retrieved_sess'] == 0:
982
+ strategies_info += f"Additional Instruction: Re-try without filter methods if the previous paln includes topics or time-filtering\n"
983
+ else:
984
+ strategies_info = "(No previous attempt exists)"
985
+
986
+ if self.user_profile:
987
+ prompt_filled = stg_prompt + template.format(
988
+ user_profile=self.user_profile,
989
+ topics=",".join(self.topics),
990
+ query=query,
991
+ query_date=query_date,
992
+ strategies_info=strategies_info)
993
+ else:
994
+ prompt_filled = stg_prompt + template.format(
995
+ topics=",".join(self.topics),
996
+ query=query,
997
+ query_date=query_date,
998
+ strategies_info=strategies_info)
999
+
1000
+ deployment_name, api_version = model_info
1001
+ completion = llm_call(
1002
+ deployment_name,
1003
+ api_version,
1004
+ prompt_filled,
1005
+ debug=self.debug,
1006
+ vllm=self.vllm,
1007
+ tritonai=self.tritonai,
1008
+ nvidia=self.nvidia,
1009
+ )
1010
+ self._track_usage('planning', completion)
1011
+ response_content = (completion.choices[0].message.content or "").strip()
1012
+ _plan = parse_json(response_content)
1013
+ if not _plan:
1014
+ print("[Warning] Failed to parse plan JSON — retrying once.")
1015
+ completion = llm_call(
1016
+ deployment_name,
1017
+ api_version,
1018
+ prompt_filled,
1019
+ debug=self.debug,
1020
+ vllm=self.vllm,
1021
+ tritonai=self.tritonai,
1022
+ nvidia=self.nvidia,
1023
+ )
1024
+ self._track_usage('planning', completion)
1025
+ response_content = (completion.choices[0].message.content or "").strip()
1026
+ _plan = parse_json(response_content)
1027
+ if not _plan:
1028
+ print("[Warning] Failed to parse plan JSON after retry — returning fallback plan.")
1029
+ _plan = {"answer": "none", "reason": "invalid JSON response", "topics": [], "strategy": []}
1030
+ return _plan
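`parse_json` itself is defined earlier in main.py and does not appear in this hunk; the sketch below shows the kind of lenient extraction the retry logic assumes (take the outermost brace pair, which also works when the model wraps its JSON in a code fence). The name `parse_json_loose` is illustrative only.

```python
import json

def parse_json_loose(text: str):
    """Best-effort extraction of a JSON object from an LLM response.
    Returns a dict on success, or None so the caller can retry or fall back."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None
```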
1031
+
1032
+ def _run_stage1(
1033
+ self,
1034
+ qid: str,
1035
+ question: str,
1036
+ question_date: str,
1037
+ top_k: int,
1038
+ model_info,
1039
+ haystack_sess_ids: List[str],
1040
+ date_lookup: Dict[str, str],
1041
+ semantic_ret_dict: dict,
1042
+ ) -> dict:
1043
+ """
1044
+ Stage 1: retrieve and evaluate using semantic memory only (summaries + facts).
1045
+
1046
+ Returns a dict with:
1047
+ is_answerable : bool
1048
+ answer : str | None (set when is_answerable is True)
1049
+ candidate_ids : list[str] (top-K session IDs from semantic retrieval)
1050
+ attempt_record : list
1051
+ """
1052
+ print(f"\t[Stage 1] Semantic memory retrieval for qid={qid}")
1053
+
1054
+ # --- 1a. Semantic embedding search ---
1055
+ candidate_ids = semantic_embedding_search(
1056
+ qid, haystack_sess_ids, semantic_ret_dict, top_k=top_k
1057
+ )
1058
+
1059
+ # hier_v2: skip plan / keyword / time_filter / is_answerable. Stage 1 is candidate-only.
1060
+ if self.hier_v2:
1061
+ print(f"\t[Stage 1 hier_v2] embedding-only candidates: {len(candidate_ids)}")
1062
+ return {
1063
+ "is_answerable": False,
1064
+ "answer": None,
1065
+ "candidate_ids": candidate_ids[:top_k],
1066
+ "attempt_record": [{
1067
+ "stage": "semantic_v2",
1068
+ "plan": {},
1069
+ "n_candidates": len(candidate_ids),
1070
+ "candidate_ids": candidate_ids[:top_k],
1071
+ }],
1072
+ }
1073
+
1074
+ # --- 1b. Plan for keywords / time filter (reuse existing _plan) ---
1075
+ plan = self._plan(question, question_date, [], model_info)
1076
+ print(json.dumps(plan, indent=4), flush=True)
1077
+
1078
+ # If the planner already has a direct answer, return it
1079
+ if "answer" in plan and plan["answer"].lower() != "none":
1080
+ return {
1081
+ "is_answerable": True,
1082
+ "answer": plan["answer"],
1083
+ "candidate_ids": candidate_ids,
1084
+ "attempt_record": [{"plan": plan, "stage": "semantic"}],
1085
+ }
1086
+
1087
+ # --- 1c. Semantic keyword search ---
1088
+ keyword_ids: List[str] = []
1089
+ for step in plan.get("strategy", []):
1090
+ if step.get("method") == "keyword":
1091
+ kws = step.get("keywords", [])
1092
+ matched = self.semantic_store.keyword_search(kws, haystack_sess_ids)
1093
+ print(f"\t\t** semantic keyword search **: {kws} -> {len(matched)} matches")
1094
+ keyword_ids.extend(sid for sid in matched if sid not in keyword_ids)
1095
+
1096
+ # --- 1d. Time filter on candidate set ---
1097
+ for step in plan.get("strategy", []):
1098
+ if self.no_time_filter:
1099
+ break # skip all time_filter steps
1100
+ if step.get("method") == "time_filter":
1101
+ if "time_range" not in step or len(step["time_range"]) != 2:
1102
+ continue
1103
+ start_str, end_str = step["time_range"]
1104
+ from datetime import datetime
1105
+ try:
1106
+ start_dt = datetime.fromisoformat(start_str)
1107
+ end_dt = datetime.fromisoformat(end_str)
1108
+ candidate_ids = [
1109
+ sid for sid in candidate_ids
1110
+ if sid in date_lookup and
1111
+ start_dt.date() <= EpisodicMemoryStore._parse_date(date_lookup[sid]).date() <= end_dt.date()
1112
+ ]
1113
+ keyword_ids = [
1114
+ sid for sid in keyword_ids
1115
+ if sid in date_lookup and
1116
+ start_dt.date() <= EpisodicMemoryStore._parse_date(date_lookup[sid]).date() <= end_dt.date()
1117
+ ]
1118
+ print(f"\t\t** semantic time_filter **: {start_str}..{end_str} -> "
1119
+ f"{len(candidate_ids)} embed, {len(keyword_ids)} keyword")
1120
+ except Exception as e:
1121
+ print(f"\t\t[WARN] time_filter parse error: {e}")
1122
+
1123
+ # --- 1e. Merge candidates (keyword union with embedding, preserve rank) ---
1124
+ all_candidate_ids: List[str] = list(candidate_ids)
1125
+ for sid in keyword_ids:
1126
+ if sid not in all_candidate_ids:
1127
+ all_candidate_ids.append(sid)
1128
+
1129
+ # Cap to top_k
1130
+ all_candidate_ids = all_candidate_ids[:top_k]
1131
+
1132
+ if not all_candidate_ids:
1133
+ print("\t[Stage 1] No candidates found in semantic memory.")
1134
+ return {
1135
+ "is_answerable": False,
1136
+ "answer": None,
1137
+ "candidate_ids": [],
1138
+ "attempt_record": [{"plan": plan, "stage": "semantic", "n_candidates": 0}],
1139
+ }
1140
+
1141
+ # --- 1f. Build semantic context string ---
1142
+ semantic_context_str = self.semantic_store.to_prompt(all_candidate_ids, date_lookup)
1143
+ print(f"\t[Stage 1] Built semantic context for {len(all_candidate_ids)} sessions "
1144
+ f"({len(semantic_context_str)} chars)")
1145
+
1146
+ # --- 1g. is_answerable check on semantic context ---
1147
+ accumulated_evidence = {"profile": [], "chat_clues": []}
1148
+ answerable_response = self.is_answerable(
1149
+ question, question_date,
1150
+ retrieved_sess=None,
1151
+ evidence=accumulated_evidence,
1152
+ model_info=model_info,
1153
+ context_str=semantic_context_str,
1154
+ )
1155
+ print(f"\t[Stage 1] is_answerable: {answerable_response}")
1156
+
1157
+ attempt_record = [{
1158
+ "stage": "semantic",
1159
+ "plan": plan,
1160
+ "n_candidates": len(all_candidate_ids),
1161
+ "candidate_ids": all_candidate_ids,
1162
+ "is_answerable": answerable_response.get("is_answerable", False),
1163
+ }]
1164
+
1165
+ if answerable_response.get("is_answerable"):
1166
+ return {
1167
+ "is_answerable": True,
1168
+ "answer": answerable_response.get("answer"),
1169
+ "candidate_ids": all_candidate_ids,
1170
+ "attempt_record": attempt_record,
1171
+ }
1172
+
1173
+ print(f"\t[Stage 1] Not answerable from semantic memory. "
1174
+ f"Info needed: {answerable_response.get('info_needed', [])}")
1175
+ return {
1176
+ "is_answerable": False,
1177
+ "answer": None,
1178
+ "candidate_ids": all_candidate_ids,
1179
+ "attempt_record": attempt_record,
1180
+ }
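The time_filter branch above compares ISO-8601 dates from the plan against haystack session dates parsed by `EpisodicMemoryStore._parse_date`. A compact sketch of that date-window filtering over session IDs; the function and variable names here are illustrative.

```python
from datetime import datetime

def filter_ids_by_date(sess_ids, date_lookup, start_iso, end_iso, parse_date):
    """Keep session IDs whose haystack date falls inside [start_iso, end_iso]."""
    start = datetime.fromisoformat(start_iso).date()
    end = datetime.fromisoformat(end_iso).date()
    kept = []
    for sid in sess_ids:
        if sid not in date_lookup:
            continue
        try:
            day = parse_date(date_lookup[sid]).date()
        except Exception:
            continue  # skip sessions whose date string cannot be parsed
        if start <= day <= end:
            kept.append(sid)
    return kept

# Example with the haystack date format used in this repo.
lookup = {"s1": "2023/05/10 (Wed) 09:00", "s2": "2023/06/01 (Thu) 12:00"}
parse = lambda s: datetime.strptime(
    s.split('(')[0].strip() + s.split(')')[-1].strip(), "%Y/%m/%d%H:%M"
)
assert filter_ids_by_date(["s1", "s2"], lookup, "2023-05-08", "2023-05-22", parse) == ["s1"]
```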
1181
+
1182
+ def run(self, qid:str, question: str, question_date: str, top_k: int, model_info, max_loops=3,
1183
+ semantic_ret_dict: dict = None, haystack_sess_ids: List[str] = None,
1184
+ date_lookup: Dict[str, str] = None, topic_lookup: Dict[str, List[str]] = None):
1185
+ accumulated_evidence = {"profile": [], "chat_clues": []}
1186
+ attempt_record = []
1187
+ loop_num = 0
1188
+
1189
+ # ----------------------------------------------------------------
1190
+ # Stage 1: Semantic memory — only runs when stores are provided
1191
+ # ----------------------------------------------------------------
1192
+ stage1_candidate_ids: List[str] = []
1193
+ if (self.semantic_store is not None
1194
+ and semantic_ret_dict is not None
1195
+ and haystack_sess_ids is not None):
1196
+ stage1_result = self._run_stage1(
1197
+ qid, question, question_date, top_k, model_info,
1198
+ haystack_sess_ids, date_lookup or {}, semantic_ret_dict,
1199
+ )
1200
+ attempt_record.extend(stage1_result["attempt_record"])
1201
+ stage1_candidate_ids = stage1_result["candidate_ids"]
1202
+
1203
+ if stage1_result["is_answerable"] and not self.no_early_answer:
1204
+ print(f"\t[Stage 1] Answered from semantic memory.")
1205
+ # Wrap answer in attempt_record format expected by caller
1206
+ if "answer" in stage1_result and stage1_result["answer"]:
1207
+ attempt_record[0]["plan"] = {
1208
+ **attempt_record[0].get("plan", {}),
1209
+ "answer": stage1_result["answer"],
1210
+ }
1211
+ return ChatHistory(), attempt_record
1212
+ if stage1_result["is_answerable"] and self.no_early_answer:
1213
+ print(f"\t[Stage 1] is_answerable=True but --no_early_answer set; proceeding to Stage-2.")
1214
+
1215
+ # hier_union: widen Stage-2 pool with flat-embedding top-K from the global GTE cache.
1216
+ # This makes hier a strict superset of flat by construction; targets the recall gap
1217
+ # (semantic-over-summary embeddings rank worse than full-session embeddings).
1218
+ if self.hier_union and qid in retrieved_data_dict:
1219
+ flat_ids = flat_embedding_top_k_ids(qid, haystack_sess_ids, self.hier_union_flat_k)
1220
+ before = len(stage1_candidate_ids)
1221
+ for sid in flat_ids:
1222
+ if sid not in stage1_candidate_ids:
1223
+ stage1_candidate_ids.append(sid)
1224
+ print(f"\t[hier_union] semantic_top_k={before} + flat_top_{self.hier_union_flat_k}={len(flat_ids)} -> union={len(stage1_candidate_ids)}")
1225
+ attempt_record.append({
1226
+ "stage": "hier_union",
1227
+ "plan": {},
1228
+ "n_semantic": before,
1229
+ "n_flat": len(flat_ids),
1230
+ "n_union": len(stage1_candidate_ids),
1231
+ })
1232
+
1233
+ # Stage 2: load episodic sessions only for top-K candidates
1234
+ if self.episodic_store is not None and stage1_candidate_ids:
1235
+ print(f"\t[Stage 2] Loading episodic memory for "
1236
+ f"{len(stage1_candidate_ids)} candidate sessions.")
1237
+ raw_sessions = self.episodic_store.get_raw_sessions(
1238
+ stage1_candidate_ids, date_lookup or {}, topic_lookup
1239
+ )
1240
+ self.chat_history = ChatHistory(sessions=raw_sessions)
1241
+ print(f"\t[Stage 2] Loaded {len(self.chat_history)} sessions into episodic pool.")
1242
+
1243
+ # hier_v2: skip the agent loop. Use raw turns of the candidate sessions directly.
1244
+ # Rationale: the agent loop's verification can reject all of a 20-session pool,
1245
+ # leaving empty retrieved. Strong models answer better from raw turns of K=20
1246
+ # semantic-selected sessions than from over-aggressive verification.
1247
+ if self.hier_v2:
1248
+ print(f"\t[hier_v2] bypassing agent loop; returning {len(self.chat_history)} candidate sessions as retrieved")
1249
+ return self.chat_history, attempt_record
1250
+
1251
+ pool = self.chat_history
1252
+ retrieved = ChatHistory()
1253
+
1254
+ while loop_num < max_loops:
1255
+ loop_num += 1
1256
+ if loop_num == 1 and qid in qid2plan:
1257
+ plan = qid2plan[qid]
1258
+ else:
1259
+ plan = self._plan(question, question_date, attempt_record, model_info)
1260
+ qid2plan[qid] = plan
1261
+ print(json.dumps(plan, indent=4), flush=True)
1262
+
1263
+ if "answer" in plan and not ("none" in plan["answer"].lower()):
1264
+ print(f"{qid}\t{question}\t{plan['answer']}", flush=True)
1265
+ return ChatHistory(), [{"plan": plan}]
1266
+
1267
+ # 2) Execute the plan -> retrieve candidates
1268
+ try:
1269
+ candidates, step_logs = self._execute_strategy(pool, plan, question)
1270
+ except Exception as e:
1271
+ print(f"[Error] Failed during _execute_strategy: {e}")
1272
+ candidates, step_logs = ChatHistory(), [f"Execution failed: {e}"]
1273
+
1274
+ if len(candidates) == 0:
1275
+ attempt_record.append({
1276
+ "loop_iteration": loop_num,
1277
+ "plan": plan,
1278
+ "evidence": accumulated_evidence,
1279
+ "n_candidates_sess": len(candidates),
1280
+ "n_verified_sess": 0,
1281
+ "n_pool": len(pool),
1282
+ "step_logs": step_logs,
1283
+ })
1284
+ continue
1285
+ else:
1286
+ remaining = set(pool.get_session_ids()) - set(candidates.get_session_ids())
1287
+ pool = self.chat_history.get_item_by_session_ids(remaining)
1288
+
1289
+ # 3) Verification Reading
1290
+ if qid in qid2rel_sess_ids:
1291
+ verified, evidence_list = self._read_and_verify_with_cache(qid, candidates)
1292
+ else:
1293
+ verified, evidence_list = self._read_and_verify(question, question_date, candidates, n_chunks=self.n_chunks)
1294
+ qid2rel_sess_ids[qid] = verified.get_session_ids()
1295
+
1296
+ if len(verified) == 0:
1297
+ attempt_record.append({
1298
+ "loop_iteration": loop_num,
1299
+ "plan": plan,
1300
+ "evidence": accumulated_evidence,
1301
+ "n_candidates_sess": len(candidates),
1302
+ "candidates_sess_ids": candidates.get_session_ids(),
1303
+ "n_verified_sess": len(verified),
1304
+ "verified_sess_ids": [],
1305
+ "n_pool": len(pool),
1306
+ "step_logs": step_logs,
1307
+ })
1308
+ continue
1309
+ else:
1310
+ retrieved.merge_rel_sess(verified)
1311
+
1312
+ for ev in evidence_list:
1313
+ if ev not in accumulated_evidence['chat_clues']:
1314
+ accumulated_evidence['chat_clues'].append(ev)
1315
+
1316
+ attempt_record.append({
1317
+ "loop_iteration": loop_num,
1318
+ "plan": plan,
1319
+ "evidence": accumulated_evidence,
1320
+ "n_candidates_sess": len(candidates),
1321
+ "candidates_sess_ids": candidates.get_session_ids(),
1322
+ "n_verified_sess": len(verified),
1323
+ "verified_sess_ids": verified.get_session_ids(),
1324
+ "n_retrieved_sess": len(retrieved),
1325
+ "retrieved_sess_ids": retrieved.get_session_ids(),
1326
+ "n_pool": len(pool),
1327
+ "step_logs": step_logs,
1328
+ })
1329
+
1330
+ # 4) Decide if continue or not
1331
+ if len(retrieved) > top_k:
1332
+ retrieved, _ = filter_out_by_embedding(retrieved, qid=qid, top_k=top_k)
1333
+
1334
+ answerable_response = self.is_answerable(question, question_date, retrieved, accumulated_evidence, model_info)
1335
+ if answerable_response["is_answerable"]:
1336
+ plan["answer"] = answerable_response["answer"]
1337
+ return retrieved, attempt_record
1338
+
1339
+ if len(pool) == 0:
1340
+ break
1341
+
1342
+ return retrieved, attempt_record
1343
+
1344
+ def _execute_strategy(self, pool, plan, question):
1345
+ step_logs: List[str] = []
1346
+
1347
+ # Start from all chat items
1348
+ if self.topic_filter and len(plan.get('topics', [])) > 0:
1349
+ pool = pool.get_item_by_topics(plan['topics'])
1350
+
1351
+ retrieved = ChatHistory()
1352
+
1353
+ strategy = plan["strategy"]
1354
+ for step in strategy:
1355
+ method = step.get("method")
1356
+ if method == "keyword":
1357
+ kws = step.get("keywords", [])
1358
+ matched = keyword_search(pool, kws)
1359
+ step_logs.append(f"Method: keyword - matched {len(matched)}/{len(pool)} using {kws}")
1360
+ if len(matched) > 0:
1361
+ retrieved.merge_rel_sess(matched)
1362
+ elif method == "embedding":
1363
+ top_k = 50
1364
+ matched = embedding_search(pool, qid, top_k=top_k)
1365
+ step_logs.append(f"Method: embedding - top_k={top_k}, matched {len(matched)}/{len(pool)}")
1366
+ if len(matched) > 0:
1367
+ retrieved.merge_rel_sess(matched)
1368
+ elif method == "time_filter":
1369
+ if self.no_time_filter:
1370
+ step_logs.append(f"Method: time_filter - skipped (--no_time_filter)")
1371
+ continue
1372
+ if 'time_range' not in step or len(step['time_range']) != 2:
1373
+ continue
1374
+ if len(retrieved) > 0:
1375
+ retrieved = time_filter(retrieved, start_date=step['time_range'][0], end_date=step['time_range'][1])
1376
+ else:
1377
+ retrieved = time_filter(pool, start_date=step['time_range'][0], end_date=step['time_range'][1])
1378
+ step_logs.append(f"Method: time_filter - kept {len(retrieved)}/{len(pool)} in {step['time_range'][0]}..{step['time_range'][1]}")
1379
+ #if len(matched) > 0:
1380
+ # retrieved.merge_rel_sess(matched)
1381
+ else:
1382
+ step_logs.append(f"unknown method: {method}")
1383
+
1384
+ if len(retrieved) > 100:
1385
+ top_k = 100
1386
+ retrieved = embedding_search(retrieved, qid, top_k=top_k)
1387
+ step_logs.append(
1388
+ f"too many sess ({len(pool)}) - embedding top_k={top_k} matched {len(retrieved)}/{len(pool)}"
1389
+ )
1390
+
1391
+ return retrieved, step_logs
1392
+
1393
+ def merge_rel_sess(self, new_sessions: ChatHistory):
1394
+ # Gather all current and new sessions in a dict keyed by session_id
1395
+ all_sessions = {s["session_id"]: s for s in self.rel_sess.sessions}
1396
+
1397
+ # add if new
1398
+ for s in new_sessions.sessions:
1399
+ if s["session_id"] not in all_sessions:
1400
+ all_sessions[s["session_id"]] = s
1401
+
1402
+ # Optional: sort sessions by timestamp for consistent ordering
1403
+ #merged_sessions = list(all_sessions.values())
1404
+ #merged_sessions.sort(key=lambda x: x["timestamp"])
1405
+
1406
+ # Reconstruct raw_data for new ChatHistory
1407
+ merged_raw_data = {
1408
+ "haystack_dates": [s["session_date"] for k, s in all_sessions.items()],
1409
+ "haystack_session_ids": [s["session_id"] for k, s in all_sessions.items()],
1410
+ "haystack_sessions": [s["session"] for k, s in all_sessions.items()],
1411
+ }
1412
+ self.rel_sess = ChatHistory(merged_raw_data)
1413
+
1414
+
1415
+ def merge_evidence(self, new_evidence: list):
1416
+ self.evidence = self.evidence + new_evidence
1417
+ print(f"\t\t Updated evidence: {self.evidence}")
1418
+
1419
+
1420
+ if __name__ == "__main__":
1421
+ parser = argparse.ArgumentParser()
1422
+ parser.add_argument('--in_file', type=str, required=True)
1423
+ parser.add_argument('--out_file', type=str, required=True)
1424
+ parser.add_argument('--model_name', type=str, required=True)
1425
+ parser.add_argument('--top_k', type=int, required=True)
1426
+ parser.add_argument('--debug', action='store_true', default=False)
1427
+ parser.add_argument('--vllm', action='store_true', default=False)
1428
+ parser.add_argument('--tritonai', action='store_true', default=False,
1429
+ help='Use OpenAI-compatible LiteLLM proxy for non-reading LLM calls (set TRITONAI_API_KEY)')
1430
+ parser.add_argument('--nvidia', action='store_true', default=False,
1431
+ help='Use NVIDIA inference API (set NV_API_KEY)')
1432
+ parser.add_argument('--vllm_reading', action='store_true', default=False,
1433
+ help='Use vLLM only for verification reading; all other LLM calls use the proprietary API')
1434
+ parser.add_argument('--n_chunks', type=int, default=10,
1435
+ help='Number of sessions per verification-reading LLM call (default 10)')
1436
+ parser.add_argument('--max_loops', type=int, default=3,
1437
+ help='Maximum retrieval/planning loops for agent mode (default 3)')
1438
+ parser.add_argument('--mode', type=str, default="agent", choices=['agent', 'embed', 'keyword'])
1439
+ parser.add_argument('--topic_filter', type=bool, default=True)
1440
+ parser.add_argument('--user_profile', action=argparse.BooleanOptionalAction, default=True,
1441
+ help='Include user profile in prompts (default: True, use --no-user_profile to disable)')
1442
+ parser.add_argument('--no_semantic', action='store_true', default=False,
1443
+ help='Skip semantic Stage 1; run episodic Stage 2 only on all haystack sessions')
1444
+ parser.add_argument('--no_time_filter', action='store_true', default=False,
1445
+ help='Disable time_filter steps in strategy execution (can reuse plan cache)')
1446
+ # Two-stage memory arguments
1447
+ parser.add_argument('--semantic_ret_cache', type=str, default=None,
1448
+ help='Path to semantic-gte retrieval log (JSONL) for Stage 1')
1449
+ parser.add_argument('--summary_file', type=str, default=None,
1450
+ help='Path to all_session_summary.json for SemanticMemoryStore')
1451
+ parser.add_argument('--facts_file', type=str, default=None,
1452
+ help='Path to all_session_user_facts.json for SemanticMemoryStore')
1453
+ parser.add_argument('--all_sessions_file', type=str, default=None,
1454
+ help='Path to all_sessions.json for lazy episodic loading')
1455
+ parser.add_argument('--no_save_cache', action='store_true', default=False,
1456
+ help='Disable saving plan/reading caches to disk after the run')
1457
+ parser.add_argument('--hier_v2', action='store_true', default=False,
1458
+ help='Stage 1 produces candidates only: skip early-answer return, semantic keyword expansion, time_filter, and is_answerable shortcut')
1459
+ parser.add_argument('--hier_union', action='store_true', default=False,
1460
+ help='hier mode: union Stage-1 semantic candidates with flat-embedding top-K and run agent loop on the merged pool')
1461
+ parser.add_argument('--hier_union_flat_k', type=int, default=20,
1462
+ help='How many flat-embedding top-K IDs to union into the Stage-2 pool (default 20)')
1463
+ parser.add_argument('--no_early_answer', action='store_true', default=False,
1464
+ help='Disable Stage-1 is_answerable early-return shortcut; always proceed to Stage-2 agent loop')
1465
+ parser.add_argument('--answer_prompt_v2', action='store_true', default=False,
1466
+ help='Use the v2 answer prompt with explicit guidance for aggregation, temporal reasoning, knowledge updates, and absence cases.')
1467
+ args = parser.parse_args()
1468
+
1469
+ # Rebind reading cache to include n_chunks so different chunk sizes get separate caches
1470
+ veri_reading_log_file = os.environ['reading_cache'] + f'_nchunks{args.n_chunks}'
1471
+ qid2rel_sess_ids = {}
1472
+ if os.path.exists(veri_reading_log_file):
1473
+ qid2rel_sess_ids = json.load(open(veri_reading_log_file))
1474
+ print(f'Reading cache: {veri_reading_log_file} ({len(qid2rel_sess_ids)} cached entries)')
1475
+
1476
+ in_data = json.load(open(args.in_file))
1477
+ top_k = args.top_k
1478
+ out_file = args.out_file
1479
+
1480
+ model_info = model_zoo[args.model_name]
1481
+ deployment_name, api_version = model_info
1482
+
1483
+ existings = set()
1484
+ retrieval_metric_list = []
1485
+ if os.path.exists(out_file):
1486
+ for line in open(out_file):
1487
+ obj = json.loads(line)
1488
+ existings.add(obj['question_id'])
1489
+ if 'retrieval_metric' in obj:
1490
+ retrieval_metric_list.append(obj['retrieval_metric'])
1491
+
1492
+ out_f = open(out_file, 'a')
1493
+
1494
+ ############# read meta files #####################
1495
+ qid2profiles = {}
1496
+ with open("metadata/generated_user_profile.json") as f:
1497
+ qid2profiles = json.load(f)
1498
+ sess2topic = {}
1499
+ with open("metadata/sessions_with_topic.json") as f:
1500
+ sess2topic = json.load(f)
1501
+
1502
+ # ----------------------------------------------------------------
1503
+ # Two-stage memory stores (optional; activated by CLI args)
1504
+ # ----------------------------------------------------------------
1505
+ semantic_store = None
1506
+ episodic_store = None
1507
+ semantic_ret_dict = None
1508
+
1509
+ if args.summary_file and args.facts_file:
1510
+ semantic_store = SemanticMemoryStore(args.summary_file, args.facts_file)
1511
+
1512
+ if args.all_sessions_file:
1513
+ episodic_store = EpisodicMemoryStore(args.all_sessions_file)
1514
+
1515
+ if args.semantic_ret_cache:
1516
+ print(f"Loading semantic retrieval cache from {args.semantic_ret_cache} ...")
1517
+ sem_ret_data = [json.loads(line) for line in open(args.semantic_ret_cache)]
1518
+ semantic_ret_dict = {x['question_id']: x for x in sem_ret_data}
1519
+ print(f" Loaded {len(semantic_ret_dict)} entries.")
1520
+
1521
+ retrieval_metric_list = []
1522
+ for di, entry in enumerate(in_data):
1523
+ item_start_time = time.time()
1524
+ qid, question, q_date = entry['question_id'], entry['question'], entry['question_date']
1525
+ q_date = entry['question_date']
1526
+
1527
+ if qid in existings:
1528
+ continue
1529
+
1530
+ haystack_sess_ids = entry['haystack_session_ids']
1531
+ haystack_topics = [sess2topic.get(sid, {}).get('category', []) for sid in haystack_sess_ids]
1532
+ date_lookup = dict(zip(haystack_sess_ids, entry['haystack_dates']))
1533
+ topic_lookup = dict(zip(haystack_sess_ids, haystack_topics))
1534
+
1535
+ # Build ChatHistory: lazily from episodic store (two-stage) or from raw data (legacy)
1536
+ if episodic_store is not None:
1537
+ # Two-stage mode: start with full haystack loaded from episodic store
1538
+ # (Stage 1 will narrow this down before Stage 2 runs)
1539
+ raw_sessions = episodic_store.get_raw_sessions(
1540
+ haystack_sess_ids, date_lookup, topic_lookup
1541
+ )
1542
+ chat_history = ChatHistory(sessions=raw_sessions)
1543
+ else:
1544
+ chat_history = ChatHistory({
1545
+ "haystack_dates": entry['haystack_dates'],
1546
+ "haystack_session_ids": entry['haystack_session_ids'],
1547
+ "haystack_sessions": entry['haystack_sessions'],
1548
+ "haystack_topics": haystack_topics,
1549
+ })
1550
+
1551
+ topic_set = set()
1552
+ for ht in haystack_topics:
1553
+ topic_set.update(ht)
1554
+
1555
+ if args.user_profile:
1556
+ # user profile
1557
+ temp_qid = qid
1558
+ if '_q_' in qid:
1559
+ temp_qid = qid.split("_q_")[0]
1560
+ user_profile = qid2profiles[temp_qid]
1561
+ agent = RetrievalAgent(
1562
+ chat_history,
1563
+ list(topic_set),
1564
+ user_profile=user_profile,
1565
+ debug=args.debug,
1566
+ vllm=args.vllm,
1567
+ vllm_reading=args.vllm_reading,
1568
+ tritonai=args.tritonai,
1569
+ nvidia=args.nvidia,
1570
+ n_chunks=args.n_chunks,
1571
+ topic_filter=args.topic_filter,
1572
+ no_time_filter=args.no_time_filter,
1573
+ semantic_store=None if args.no_semantic else semantic_store,
1574
+ episodic_store=episodic_store,
1575
+ hier_v2=args.hier_v2,
1576
+ hier_union=args.hier_union,
1577
+ hier_union_flat_k=args.hier_union_flat_k,
1578
+ no_early_answer=args.no_early_answer,
1579
+ )
1580
+ else:
1581
+ agent = RetrievalAgent(
1582
+ chat_history,
1583
+ list(topic_set),
1584
+ debug=args.debug,
1585
+ vllm=args.vllm,
1586
+ vllm_reading=args.vllm_reading,
1587
+ tritonai=args.tritonai,
1588
+ nvidia=args.nvidia,
1589
+ n_chunks=args.n_chunks,
1590
+ topic_filter=args.topic_filter,
1591
+ no_time_filter=args.no_time_filter,
1592
+ semantic_store=None if args.no_semantic else semantic_store,
1593
+ episodic_store=episodic_store,
1594
+ hier_v2=args.hier_v2,
1595
+ hier_union=args.hier_union,
1596
+ hier_union_flat_k=args.hier_union_flat_k,
1597
+ no_early_answer=args.no_early_answer,
1598
+ )
1599
+
1600
+ try:
1601
+ if args.mode == 'embed':
1602
+ final_sess = embedding_search(chat_history, qid, top_k=top_k)
1603
+ attempt_record = [{"plan": {"answer": "none", "reason": "embedding retrieval only"}}]
1604
+ elif args.mode == 'keyword':
1605
+ keywords = generate_keywords(question, deployment_name, api_version,
1606
+ debug=args.debug, vllm=args.vllm,
1607
+ tritonai=args.tritonai, nvidia=args.nvidia)
1608
+ final_sess = keyword_search(chat_history, keywords=keywords)
1609
+ attempt_record = [{"plan": {"answer": "none", "reason": "keyword retrieval only"}}]
1610
+ else: # agent
1611
+ final_sess, attempt_record = agent.run(
1612
+ qid, question, q_date, top_k, model_info,
1613
+ max_loops=args.max_loops,
1614
+ semantic_ret_dict=semantic_ret_dict,
1615
+ haystack_sess_ids=haystack_sess_ids,
1616
+ date_lookup=date_lookup,
1617
+ topic_lookup=topic_lookup,
1618
+ )
1619
+
1620
+ if len(attempt_record) == 1 and "answer" in attempt_record[0]["plan"] and not ("none" in attempt_record[0]["plan"]["answer"].lower()):
1621
+ answer = attempt_record[0]["plan"]["answer"]
1622
+ token_budget = agent.get_token_budget() if args.mode == 'agent' else {}
1623
+ wall_time_sec = time.time() - item_start_time
1624
+
1625
+ print(json.dumps({"q_idx": di, 'question_id': qid, 'question': entry['question'],
1626
+ 'answer': answer, 'n_retrieved': len(final_sess),
1627
+ 'wall_time_sec': round(wall_time_sec, 3)}, indent=4), flush=True)
1628
+ print(json.dumps({"q_idx": di, 'question_id': qid,
1629
+ 'hypothesis': answer,
1630
+ "attempt_record": attempt_record,
1631
+ "token_budget": token_budget,
1632
+ "wall_time_sec": wall_time_sec}), file=out_f, flush=True)
1633
+ else:
1634
+ if len(final_sess) > top_k and retrieved_log_file is not None:
1635
+ final_top_k_sess, _ = filter_out_by_embedding(final_sess, qid=qid, top_k=top_k)
1636
+ retrieved_str = final_top_k_sess.to_prompt(granularity="session", _format="json")
1637
+ else:
1638
+ retrieved_str = final_sess.to_prompt(granularity="session", _format="json")
1639
+
1640
+ if args.answer_prompt_v2:
1641
+ answer_prompt_template = (
1642
+ "You are answering a question using a list of chat-session transcripts between the user and an assistant.\n"
1643
+ "\n"
1644
+ "How to answer:\n"
1645
+ "1. Scan ALL retrieved sessions in chronological order. The SESSION DATE on each transcript is when that conversation occurred. The Current Date below is when the question was asked, not when events happened.\n"
1646
+ "2. Identify every session containing a candidate fact. If sessions conflict, prefer the most RECENT session that addresses the same fact (knowledge update).\n"
1647
+ "3. For aggregation questions ('how many', 'list all', 'between X and Y'), enumerate matches across ALL relevant sessions; do not stop at the first.\n"
1648
+ "4. For temporal queries ('last Friday', 'two weeks ago'), resolve the relative date against the SESSION DATE of the session that uses that phrase, not the Current Date.\n"
1649
+ "5. If the retrieved sessions do NOT contain the answer, reply exactly 'Insufficient information in retrieved sessions.' Do not fabricate.\n"
1650
+ "6. Be terse: state the direct answer first, then one short sentence citing the session date(s) you relied on.\n"
1651
+ "\n"
1652
+ "Chat history sessions:\n"
1653
+ "\n"
1654
+ "{}\n"
1655
+ "\n"
1656
+ "Current Date: {}\n"
1657
+ "Question: {}\n"
1658
+ "Answer:"
1659
+ )
1660
+ else:
1661
+ answer_prompt_template = "I will give you several chat history sessions between you and a user. Please answer the question given the information.\n\n\nChat history sessions:\n\n{}\n\nCurrent Date: {}\nQuestion: {}\nAnswer:"
1662
+ answer_prompt = answer_prompt_template.format(retrieved_str, entry['question_date'], entry['question'])
1663
+
1664
+ completion = llm_call(
1665
+ deployment_name,
1666
+ api_version,
1667
+ answer_prompt,
1668
+ debug=args.debug,
1669
+ vllm=args.vllm,
1670
+ tritonai=args.tritonai,
1671
+ nvidia=args.nvidia,
1672
+ )
1673
+ answer = (completion.choices[0].message.content or "").strip()
1674
+
1675
+ if args.mode == 'agent':
1676
+ agent._track_usage('final_answer', completion)
1677
+ token_budget = agent.get_token_budget() if args.mode == 'agent' else {}
1678
+
1679
+ retrieval_metric = {}
1680
+ if len(final_sess) > 0 and retrieved_log_file is not None:
1681
+ sess_sorted = embedding_search(final_sess, qid, top_k=20)
1682
+ sess_id_sorted = sess_sorted.get_session_ids()
1683
+
1684
+ for topk in [5, 10, 20, 30]:
1685
+ recall_any, recall_all = evaluate_retrieval(sess_id_sorted[:topk], entry['answer_session_ids'])
1686
+ retrieval_metric.update({
1687
+ 'recall_any@{}'.format(topk): recall_any,
1688
+ 'recall_all@{}'.format(topk): recall_all
1689
+ })
1690
+ retrieval_metric_list.append(retrieval_metric)
1691
+ print_average_metrics(retrieval_metric_list)
1692
+
1693
+ print(json.dumps({"q_idx": di, 'n_prompt_tok': completion.usage.prompt_tokens,
1694
+ 'n_completion_tok': completion.usage.completion_tokens,
1695
+ 'hypothesis': answer,
1696
+ 'wall_time_sec': round(time.time() - item_start_time, 3)}), flush=True)
1697
+ print(json.dumps({"q_idx": di, 'question_id': qid,
1698
+ 'hypothesis': answer,
1699
+ 'n_prompt_tok': completion.usage.prompt_tokens,
1700
+ 'n_completion_tok': completion.usage.completion_tokens,
1701
+ "attempt_record": attempt_record,
1702
+ "retrieved_sess_ids": final_sess.get_session_ids(),
1703
+ "retrieval_metric": retrieval_metric,
1704
+ "token_budget": token_budget,
1705
+ "wall_time_sec": time.time() - item_start_time}), file=out_f, flush=True)
1706
+ except Exception as e:
1707
+ print(f"[ERROR] q_idx={di} qid={qid} failed: {e}", flush=True)
1708
+ continue
1709
+
1710
+ ############# save cache ##########################
1711
+
1712
+ if not args.no_save_cache:
1713
+ with open(plan_cache_file, "w") as fw:
1714
+ json.dump(qid2plan, fw, indent=2)
1715
+
1716
+ with open(veri_reading_log_file, "w") as fw:
1717
+ json.dump(qid2rel_sess_ids, fw, indent=2)
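Each processed question is appended to `out_file` as one JSON line carrying `hypothesis`, `attempt_record`, `retrieval_metric`, `token_budget`, and `wall_time_sec`. A small post-hoc reader that averages the recall metrics from such a log could look like this sketch (the path in the commented call is hypothetical):

```python
import json
from collections import defaultdict

def summarize_run(out_file: str):
    """Average the recall_any@K / recall_all@K metrics over a per-question JSONL log."""
    sums, counts, n = defaultdict(float), defaultdict(int), 0
    with open(out_file) as f:
        for line in f:
            rec = json.loads(line)
            n += 1
            for k, v in rec.get("retrieval_metric", {}).items():
                sums[k] += v
                counts[k] += 1
    print(f"{n} questions")
    for k in sorted(sums):
        print(f"{k}: {sums[k] / counts[k]:.3f}")

# summarize_run("outputs/agent_top20.jsonl")  # hypothetical output path
```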
memory/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from memory.episodic_store import EpisodicMemoryStore
2
+ from memory.semantic_store import SemanticMemoryStore
memory/episodic_store.py ADDED
@@ -0,0 +1,62 @@
1
+ """
2
+ episodic_store.py
3
+
4
+ Lazy loader for all_sessions.json. Keeps the full raw-turn data in memory
5
+ and returns sessions on demand by session ID, avoiding loading all sessions
6
+ into ChatHistory upfront.
7
+ """
8
+
9
+ import json
10
+ from datetime import datetime
11
+ from typing import Dict, List, Optional
12
+
13
+
14
+ class EpisodicMemoryStore:
15
+ def __init__(self, all_sessions_path: str):
16
+ print(f"[EpisodicMemoryStore] Loading {all_sessions_path} ...")
17
+ with open(all_sessions_path) as f:
18
+ self._data: Dict[str, List] = json.load(f)
19
+ print(f"[EpisodicMemoryStore] Loaded {len(self._data)} sessions.")
20
+
21
+ @staticmethod
22
+ def _parse_date(date_str: str) -> datetime:
23
+ """Convert '2023/04/10 (Mon) 17:50' to datetime."""
24
+ date_part = date_str.split('(')[0].strip()
25
+ time_part = date_str.split(')')[-1].strip()
26
+ return datetime.strptime(date_part + time_part, "%Y/%m/%d%H:%M")
27
+
28
+ def get_raw_sessions(
29
+ self,
30
+ sess_ids: List[str],
31
+ date_lookup: Dict[str, str],
32
+ topic_lookup: Optional[Dict[str, List[str]]] = None,
33
+ ) -> List[dict]:
34
+ """
35
+ Return a list of session dicts compatible with ChatHistory(sessions=...).
36
+
37
+ Args:
38
+ sess_ids: Ordered list of session IDs to load.
39
+ date_lookup: Mapping sess_id -> date string (e.g. '2023/04/10 (Mon) 17:50').
40
+ topic_lookup: Optional mapping sess_id -> list of topic strings.
41
+
42
+ Returns:
43
+ List of session dicts, each with keys:
44
+ session_id, session_date, session, topic, timestamp
45
+ """
46
+ sessions = []
47
+ for sid in sess_ids:
48
+ if sid not in self._data:
49
+ continue
50
+ date_str = date_lookup.get(sid, "")
51
+ try:
52
+ ts = self._parse_date(date_str) if date_str else datetime.min
53
+ except Exception:
54
+ ts = datetime.min
55
+ sessions.append({
56
+ "session_id": sid,
57
+ "session_date": date_str,
58
+ "session": self._data[sid],
59
+ "topic": topic_lookup.get(sid, []) if topic_lookup else [],
60
+ "timestamp": ts,
61
+ })
62
+ return sessions
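A quick sanity check of the date format handled by `_parse_date` above, using the example string from the docstring:

```python
from datetime import datetime

# '2023/04/10 (Mon) 17:50' -> drop the '(Mon)' part, stitch date + time, parse.
date_str = "2023/04/10 (Mon) 17:50"
date_part = date_str.split('(')[0].strip()   # '2023/04/10'
time_part = date_str.split(')')[-1].strip()  # '17:50'
ts = datetime.strptime(date_part + time_part, "%Y/%m/%d%H:%M")
assert ts == datetime(2023, 4, 10, 17, 50)
```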
memory/semantic_store.py ADDED
@@ -0,0 +1,87 @@
1
+ """
2
+ semantic_store.py
3
+
4
+ Wrapper around all_session_summary.json and all_session_user_facts.json.
5
+ Provides:
6
+ - keyword_search(): find sessions whose semantic text contains given keywords
7
+ - to_prompt(): format semantic context for LLM consumption
8
+ - get_text(): return raw semantic text for a session (for embedding/search)
9
+ """
10
+
11
+ import json
12
+ from typing import Dict, List, Optional
13
+
14
+
15
+ class SemanticMemoryStore:
16
+ def __init__(self, summary_path: str, facts_path: str):
17
+ print(f"[SemanticMemoryStore] Loading {summary_path} ...")
18
+ with open(summary_path) as f:
19
+ self._summaries: Dict[str, dict] = json.load(f)
20
+
21
+ print(f"[SemanticMemoryStore] Loading {facts_path} ...")
22
+ with open(facts_path) as f:
23
+ self._facts: Dict[str, list] = json.load(f)
24
+
25
+ print(f"[SemanticMemoryStore] Loaded {len(self._summaries)} summaries, "
26
+ f"{len(self._facts)} fact entries.")
27
+
28
+ def get_summary(self, sess_id: str) -> str:
29
+ """Return the session-level summary string, or empty string."""
30
+ entry = self._summaries.get(sess_id, {})
31
+ return entry.get("session_summary", "").strip()
32
+
33
+ def get_facts_text(self, sess_id: str) -> str:
34
+ """Return user facts as a single joined string, or empty string."""
35
+ fact_list = self._facts.get(sess_id, [])
36
+ if not fact_list:
37
+ return ""
38
+ return " ".join(
39
+ f["user-info"] for f in fact_list
40
+ if isinstance(f, dict) and f.get("user-info")
41
+ ).strip()
42
+
43
+ def get_text(self, sess_id: str) -> str:
44
+ """Return summary + facts combined (for keyword search or display)."""
45
+ parts = [self.get_summary(sess_id), self.get_facts_text(sess_id)]
46
+ return " ".join(p for p in parts if p)
47
+
48
+ def keyword_search(self, keywords: List[str], haystack_sess_ids: List[str]) -> List[str]:
49
+ """
50
+ Search semantic text (summary + facts) of the given sessions for any keyword.
51
+
52
+ Returns:
53
+ List of matching session IDs (preserving haystack order).
54
+ """
55
+ matched = []
56
+ kws_lower = [kw.lower() for kw in keywords if kw]
57
+ for sid in haystack_sess_ids:
58
+ text = self.get_text(sid).lower()
59
+ if any(kw in text for kw in kws_lower):
60
+ matched.append(sid)
61
+ return matched
62
+
63
+ def to_prompt(self, sess_ids: List[str], date_lookup: Optional[Dict[str, str]] = None) -> str:
64
+ """
65
+ Format semantic context for these sessions as a prompt string.
66
+
67
+ Each session block:
68
+ Session Date: <date>
69
+ Summary: <session_summary>
70
+ User Facts: <fact1>; <fact2>; ...
71
+ """
72
+ lines = []
73
+ for sid in sess_ids:
74
+ date_str = date_lookup.get(sid, "") if date_lookup else ""
75
+ summary = self.get_summary(sid)
76
+ facts_text = self.get_facts_text(sid)
77
+
78
+ block = f"Session ID: {sid}"
79
+ if date_str:
80
+ block += f"\nSession Date: {date_str}"
81
+ if summary:
82
+ block += f"\nSummary: {summary}"
83
+ if facts_text:
84
+ block += f"\nUser Facts: {facts_text}"
85
+ lines.append(block)
86
+
87
+ return "\n\n".join(lines)
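A throwaway usage sketch for `SemanticMemoryStore`, assuming the repo root is on PYTHONPATH; the fixture contents are invented purely for illustration but follow the schemas the store reads (`session_summary` per session, `user-info` per fact):

```python
import json, os, tempfile
from memory.semantic_store import SemanticMemoryStore

summaries = {"s1": {"session_summary": "User planned a Grand Canyon trip with Sarah."}}
facts = {"s1": [{"user-info": "User's sister is named Sarah."}]}

tmp = tempfile.mkdtemp()
sum_path = os.path.join(tmp, "all_session_summary.json")
facts_path = os.path.join(tmp, "all_session_user_facts.json")
with open(sum_path, "w") as f:
    json.dump(summaries, f)
with open(facts_path, "w") as f:
    json.dump(facts, f)

store = SemanticMemoryStore(sum_path, facts_path)
assert store.keyword_search(["grand canyon"], ["s1"]) == ["s1"]
print(store.to_prompt(["s1"], {"s1": "2023/04/10 (Mon) 17:50"}))
```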
model_zoo.py ADDED
@@ -0,0 +1,31 @@
1
+ model_zoo = {
2
+ # OpenAI / Azure-hosted models (deployment string, api_version)
3
+ 'gpt-5': ("gpt-5_2025-08-07", "2024-12-01-preview"),
4
+ "gpt-4.1-azure": ("gpt-4.1_2025-04-14", "2025-04-01-preview"),
5
+ 'gpt-4o': ('gpt-4o_2024-11-20', '2024-10-21'),
6
+ 'gpt-4o-mini': ("gpt-4o-mini", ""),
7
+ 'gpt-5-openai': ("gpt-5", ""),
8
+ 'gpt-5-mini-openai': ("gpt-5-mini", ""),
9
+
10
+ # vLLM-hosted models (OpenAI-compatible server)
11
+ 'Qwen3-30B-A3B-Instruct-2507': ("Qwen/Qwen3-30B-A3B-Instruct-2507", ""),
12
+ 'Qwen3-VL-30B-A3B-Instruct': ("Qwen3-VL-30B-A3B-Instruct", ""),
13
+
14
+ # Anthropic models via direct Anthropic API (uses ANTHROPIC_API_KEY)
15
+ 'claude-opus-4-6': ("claude-opus-4-6", ""),
16
+ 'claude-sonnet-4-6': ("claude-sonnet-4-6", ""),
17
+
18
+ # Anthropic / DeepSeek via an OpenAI-compatible LiteLLM proxy
19
+ # (uses LITELLM_API_KEY; selected by main.py's --tritonai flag)
20
+ 'claude-opus-4-6-tritonai': ("us.anthropic.claude-opus-4-6-v1", ""),
21
+ 'claude-sonnet-4-6-tritonai': ("us.anthropic.claude-sonnet-4-6-v1", ""),
22
+ 'deepseek-r1-tritonai': ("us.deepseek.r1-v1:0", ""),
23
+
24
+ # Models served via an OpenAI-compatible inference API (uses NV_API_KEY)
25
+ 'gpt-5.1': ("openai/openai/gpt-5.1", ""),
26
+ 'gpt-5.2': ("openai/openai/gpt-5.2", ""),
27
+ 'gpt-5.5': ("openai/openai/gpt-5.5", ""),
28
+ 'gpt-4.1': ("us/azure/openai/gpt-4.1", ""),
29
+ 'Qwen3.5-397B-A17B': ("nvidia/qwen/qwen3-5-397b-a17b", ""),
30
+ 'Kimi-K2.6': ("nvidia/moonshotai/kimi-k2.6", ""),
31
+ }
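Entries are consumed by tuple unpacking, as main.py does; `api_version` only matters for the Azure-hosted deployments and stays empty for the OpenAI-compatible endpoints:

```python
from model_zoo import model_zoo

deployment_name, api_version = model_zoo["gpt-4o"]
print(deployment_name)                      # 'gpt-4o_2024-11-20'
print(api_version or "<no api_version>")    # Azure entries carry a version string
```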
prompts/agentic_retrieval_prompt.txt ADDED
@@ -0,0 +1,226 @@
1
+ You are an intelligent assistant for memory retrieval.
2
+
3
+ Your goal is to **design and refine retrieval strategies** for answering a user’s query by leveraging the user’s memory resources (profile, topic, chat history, and other contextual knowledge).
4
+
5
+ ### Core Principles
6
+
7
+ 1. Decision-first retrieval: Decide whether retrieval is necessary.
8
+
9
+ * If the query can be answered **directly from the user’s profile**, retrieval is unnecessary.
10
+ * Otherwise, retrieval strategies must be proposed.
11
+ * If the user's profile contains some useful information, keep it and continue retrieving.
12
+
13
+ 2. Identify relevant topics:
14
+
15
+ * Given the topic list from user's chat history, identify topics related to the query.
16
+ * The topics will be used to narrow down the search space. Be inclusive but not too general.
17
+
18
+ 3. Multi-method retrieval: Multiple retrieval methods may be combined:
19
+
20
+ * **Keyword-based retrieval** for unique names, places, or identifiers.
21
+ * **Embedding-based semantic search** when the query is vague, abstract, or conversational.
22
+ * **Time-based filtering** ONLY when the query references dates, ranges, or relative temporal expressions (e.g., “last week,” “yesterday”); use the given `query_date` to resolve them into precise ISO 8601 ranges.
23
+
24
+ 4. Loop-aware evidence collection: Retrieval may occur in **multiple iterations (loops)**. At each loop:
25
+
26
+ * Collect **evidence** from:
27
+ * User profile (static attributes like age, location, job, preferences)
28
+ * Topics (higher-level semantic indexing)
29
+ * Raw chat sessions (extract **key clue sentences**)
30
+ * Incorporate **previous retrieval attempts** (if provided) and refine the strategy. If your previous attempt failed to retrieve relevant sessions or evidence, try again without filter methods such as topic and time filters.
31
+
32
+ 5. Consistent JSON output: All outputs must follow the unified schema to enable downstream automation.
33
+
34
+ ---
35
+
36
+ ## JSON Output Schema
37
+
38
+ ```json
39
+ {
40
+ "query": "<original user query>",
41
+ "query_date": "<ISO 8601 date string>",
42
+ "loop_iteration": <integer, starts at 1>,
43
+ "retrieval_decision": "<none | retrieval_required>",
44
+ "answer": "<none | answer if possilbe>",
45
+ "topics": ["<list of relevant topics>"],
46
+ "strategy": [
47
+ {
48
+ "method": "<keyword | time_filter | embedding>",
49
+ "conditions": "<why this method is chosen>",
50
+ "keywords": ["<list of keywords>"],
51
+ "time_range": ["<start_date>", "<end_date>"]
52
+ }
53
+ ],
54
+ "evidence": {
55
+ "profile": ["<relevant profile snippets>"],
56
+ "chat_clues": ["<list of key sentences extracted from chat history>"]
57
+ },
58
+ "previous_attempts": [
59
+ {
60
+ "loop_iteration": <integer>,
61
+ "strategy": [ ... ],
62
+ "evidence": { ... },
63
+ "outcome": "<insufficient | useful | final_answer_ready>"
64
+ }
65
+ ]
66
+ }
67
+ ```
68
+
69
+ ---
70
+
71
+ ## Examples
72
+
73
+ ### Example 1: No retrieval needed (profile sufficient)
74
+
75
+ ```json
76
+ {
77
+ "query": "Where do I live?",
78
+ "query_date": "2025-08-29",
79
+ "loop_iteration": 1,
80
+ "retrieval_decision": "none",
81
+ "answer": "San Diego, California",
82
+ "topics": [],
83
+ "strategy": [],
84
+ "evidence": {
85
+ "profile": ["User lives in San Diego, California."],
86
+ "chat_clues": []
87
+ },
88
+ "previous_attempts": []
89
+ }
90
+ ```
91
+
92
+ ---
93
+
94
+ ### Example 2: Keyword strategy
95
+
96
+ ```json
97
+ {
98
+ "query": "Who did I go to the Grand Canyon with?",
99
+ "query_date": "2025-08-29",
100
+ "loop_iteration": 1,
101
+ "retrieval_decision": "retrieval_required",
102
+ "answer": "none",
103
+ "topics": ["Travel & Transportation", "Family & Relationships", "Personal Development"],
104
+ "strategy": [
105
+ {
106
+ "method": "keyword",
107
+ "conditions": "Query contains specific event and place name.",
108
+ "keywords": ["Grand Canyon"]
109
+ }
110
+ ],
111
+ "evidence": {
112
+ "profile": [],
113
+ "chat_clues": []
114
+ },
115
+ "previous_attempts": []
116
+ }
117
+ ```
118
+
119
+ ---
120
+
121
+ ### Example 3: Time-filter strategy with relative date
122
+
123
+ Query: How many miles have I hiked in the past two weeks?
124
+ Query date: 2023/05/22 (Mon) 23:31
125
+
126
+ ```json
127
+ {
128
+ "query": "How many miles have I hiked in the past two weeks?",
129
+ "query_date": "2023-05-22 (Mon) 23:31",
130
+ "loop_iteration": 1,
131
+ "retrieval_decision": "retrieval_required",
132
+ "answer": "none",
133
+ "topics": ["Sports & Fitness", "Health & Wellness", "Travel & Transportation", "Personal Development"],
134
+ "strategy": [
135
+ {
136
+ "method": "time_filter",
137
+ "conditions": "Temporal phrase 'past two weeks' resolved using query_date.",
138
+ "time_range": ["2023-05-08", "2023-05-22"]
139
+ }
140
+ ],
141
+ "evidence": {
142
+ "profile": [],
143
+ "chat_clues": []
144
+ },
145
+ "previous_attempts": [],
146
+ }
147
+ ```
148
+
149
+ ---
150
+
151
+ ### Example 4: Keyword + Embedding
152
+
153
+ ```json
154
+ {
155
+ "query": "What did my doctor say about my back pain treatment options?",
156
+ "query_date": "2025-08-29",
157
+ "loop_iteration": 1,
158
+ "retrieval_decision": "retrieval_required",
159
+ "answer": "none",
160
+ "topics": ["Health & Wellness", "Work & Career"],
161
+ "strategy": [
162
+ {
163
+ "method": "keyword",
164
+ "conditions": "Query contains specific medical terms that should be matched exactly.",
165
+ "keywords": ["doctor", "back pain", "treatment"]
166
+ },
167
+ {
168
+ "method": "embedding",
169
+ "conditions": "Query involves medical advice and recommendations which may be expressed in various conversational ways in chat history."
170
+ }
171
+ ],
172
+ "evidence": {
173
+ "profile": [],
174
+ "chat_clues": []
175
+ },
176
+ "previous_attempts": []
177
+ }
178
+ ```
179
+
180
+ ### Example 5: Multi-method strategy with loop refinement
181
+
182
+ Loop 1 outcome was insufficient, Loop 2 continues with broader retrieval.
183
+
184
+ ```json
185
+ {
186
+ "query": "What were we discussing about public issues?",
187
+ "query_date": "2025-08-29",
188
+ "loop_iteration": 2,
189
+ "retrieval_decision": "retrieval_required",
190
+ "answer": "none",
191
+ "topics": ["Government & Politics", "Environment & Sustainability", "Legal"],
192
+ "strategy": [
193
+ {
194
+ "method": "embedding",
195
+ "conditions": "To capture abstract conversational references."
196
+ }
197
+ ],
198
+ "evidence": {
199
+ "profile": [],
200
+ "chat_clues": [
201
+ "User debated about climate policy impacts.",
202
+ "Conversation on local housing regulations was tagged as 'public issues'."
203
+ ]
204
+ },
205
+ "previous_attempts": [
206
+ {
207
+ "loop_iteration": 1,
208
+ "strategy": [
209
+ {"method": "keyword", "conditions": "Query asks about 'public issue'", "keywords": ["public issue", "public issues", "issue"]}
210
+ ],
211
+ "evidence": {
212
+ "profile": [],
213
+ "chat_clues": []
214
+ },
215
+ "outcome": "insufficient"
216
+ }
217
+ ]
218
+ }
219
+ ```
220
+
221
+ Use clear reasoning for how each method applies (or not) to the specific query. Be concise and precise.
222
+
223
+ **Do not include strategies in the strategy list unless they are needed for this query. Omit unused methods.**
224
+ **Always use the provided `query_date` to resolve any relative dates in the query.**
225
+ **Strictly follow the given JSON output format. Return only **one** JSON output.**
226
+ **Use time-based filtering ONLY when the query references dates, ranges, or relative temporal expressions (e.g., "last week," "yesterday," "in the last 3 months").**
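A lightweight structural check one might run on the planner's JSON before executing the strategy; field names follow the schema documented above, and the validator itself is only an illustrative sketch, not part of the released pipeline.

```python
ALLOWED_METHODS = {"keyword", "time_filter", "embedding"}

def validate_plan(plan: dict) -> list:
    """Return a list of problems found in a plan dict (empty means it looks well-formed)."""
    problems = []
    for key in ("query", "retrieval_decision", "answer", "topics", "strategy"):
        if key not in plan:
            problems.append(f"missing key: {key}")
    for step in plan.get("strategy", []):
        method = step.get("method")
        if method not in ALLOWED_METHODS:
            problems.append(f"unknown method: {method}")
        if method == "keyword" and not step.get("keywords"):
            problems.append("keyword step without keywords")
        if method == "time_filter" and len(step.get("time_range", [])) != 2:
            problems.append("time_filter step without a [start, end] range")
    return problems
```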
prompts/agentic_retrieval_prompt_wo_profile.txt ADDED
@@ -0,0 +1,203 @@
1
+ You are an intelligent assistant for memory retrieval.
2
+
3
+ Your goal is to **design and refine retrieval strategies** for answering a user’s query by leveraging the user’s memory resources (topic, chat history, and other contextual knowledge).
4
+
5
+ ### Core Principles
6
+
7
+ 1. Decision-first retrieval: Decide whether retrieval is necessary.
8
+
9
+ * If the query can be answered **directly**, retrieval is unnecessary.
10
+ * Otherwise, retrieval strategies must be proposed.
11
+
12
+ 2. Identify relevant topics:
13
+
14
+ * Given the topic list from user's chat history, identify topics related to the query.
15
+ * The topics will be used to narrow down the search space. Be inclusive but not too general.
16
+
17
+ 3. Multi-method retrieval: Multiple retrieval methods may be combined:
18
+
19
+ * **Keyword-based retrieval** for unique names, places, or identifiers.
20
+ * **Embedding-based semantic search** when the query is vague, abstract, or conversational.
21
+ * **Time-based filtering** ONLY when the query references dates, ranges, or relative temporal expressions (e.g., “last week,” “yesterday”); use the given `query_date` to resolve them into precise ISO 8601 ranges.
22
+
23
+ 4. Loop-aware evidence collection: Retrieval may occur in **multiple iterations (loops)**. At each loop:
24
+
25
+ * Collect **evidence** from:
26
+ * Topics (higher-level semantic indexing)
27
+ * Raw chat sessions (extract **key clue sentences**)
28
+ * Incorporate **previous retrieval attempts** (if provided) and refine the strategy. If your previous attempt failed to retrieve relevant sessions or evidence, try again without filter methods such as topic and time filters.
29
+
30
+ 5. Consistent JSON output: All outputs must follow the unified schema to enable downstream automation.
31
+
32
+ ---
33
+
34
+ ## JSON Output Schema
35
+
36
+ ```json
37
+ {
38
+ "query": "<original user query>",
39
+ "query_date": "<ISO 8601 date string>",
40
+ "loop_iteration": <integer, starts at 1>,
41
+ "retrieval_decision": "<none | retrieval_required>",
42
+ "answer": "<none | answer if possilbe>",
43
+ "topics": ["<list of relevant topics>"],
44
+ "strategy": [
45
+ {
46
+ "method": "<keyword | time_filter | embedding>",
47
+ "conditions": "<why this method is chosen>",
48
+ "keywords": ["<list of keywords>"],
49
+ "time_range": ["<start_date>", "<end_date>"]
50
+ }
51
+ ],
52
+ "evidence": {
53
+ "profile": [],
54
+ "chat_clues": ["<list of key sentences extracted from chat history>"]
55
+ },
56
+ "previous_attempts": [
57
+ {
58
+ "loop_iteration": <integer>,
59
+ "strategy": [ ... ],
60
+ "evidence": { ... },
61
+ "outcome": "<insufficient | useful | final_answer_ready>"
62
+ }
63
+ ]
64
+ }
65
+ ```
66
+
67
+ ---
68
+
69
+ ## Examples
70
+
71
+ ### Example 1: Keyword strategy
72
+
73
+ ```json
74
+ {
75
+ "query": "Who did I go to the Grand Canyon with?",
76
+ "query_date": "2025-08-29",
77
+ "loop_iteration": 1,
78
+ "retrieval_decision": "retrieval_required",
79
+ "answer": "none",
80
+ "topics": ["Travel & Transportation", "Family & Relationships", "Personal Development"],
81
+ "strategy": [
82
+ {
83
+ "method": "keyword",
84
+ "conditions": "Query contains specific event and place name.",
85
+ "keywords": ["Grand Canyon"]
86
+ }
87
+ ],
88
+ "evidence": {
89
+ "profile": [],
90
+ "chat_clues": []
91
+ },
92
+ "previous_attempts": []
93
+ }
94
+ ```
95
+
96
+ ---
97
+
98
+ ### Example 2: Time-filter strategy with relative date
99
+
100
+ Query: How many miles have I hiked in the past two weeks?
101
+ Query date: 2023/05/22 (Mon) 23:31
102
+
103
+ ```json
104
+ {
105
+ "query": "How many miles have I hiked in the past two weeks?",
106
+ "query_date": "2023-05-22 (Mon) 23:31",
107
+ "loop_iteration": 1,
108
+ "retrieval_decision": "retrieval_required",
109
+ "answer": "none",
110
+ "topics": ["Sports & Fitness", "Health & Wellness", "Travel & Transportation", "Personal Development"],
111
+ "strategy": [
112
+ {
113
+ "method": "time_filter",
114
+ "conditions": "Temporal phrase 'past two weeks' resolved using query_date.",
115
+ "time_range": ["2023-05-08", "2023-05-22"]
116
+ }
117
+ ],
118
+ "evidence": {
119
+ "profile": [],
120
+ "chat_clues": []
121
+ },
122
+ "previous_attempts": [],
123
+ }
124
+ ```
125
+
126
+ ---
127
+
128
+ ### Example 3: Keyword + Embedding
129
+
130
+ ```json
131
+ {
132
+ "query": "What did my doctor say about my back pain treatment options?",
133
+ "query_date": "2025-08-29",
134
+ "loop_iteration": 1,
135
+ "retrieval_decision": "retrieval_required",
136
+ "answer": "none",
137
+ "topics": ["Health & Wellness", "Work & Career"],
138
+ "strategy": [
139
+ {
140
+ "method": "keyword",
141
+ "conditions": "Query contains specific medical terms that should be matched exactly.",
142
+ "keywords": ["doctor", "back pain", "treatment"]
143
+ },
144
+ {
145
+ "method": "embedding",
146
+ "conditions": "Query involves medical advice and recommendations which may be expressed in various conversational ways in chat history."
147
+ }
148
+ ],
149
+ "evidence": {
150
+ "profile": [],
151
+ "chat_clues": []
152
+ },
153
+ "previous_attempts": []
154
+ }
155
+ ```
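A rough sketch of how the two strategies in this example could be executed together; `embed` and `sessions` are hypothetical stand-ins for whatever encoder and session store the surrounding pipeline uses, and the union rule is an assumption.

```python
# Illustrative only: union of keyword hits and top-k embedding hits.
import numpy as np

def hybrid_retrieve(query, keywords, sessions, embed, top_k=5):
    # Keyword pass: case-insensitive substring match.
    keyword_hits = {
        i for i, text in enumerate(sessions)
        if any(kw.lower() in text.lower() for kw in keywords)
    }
    # Embedding pass: cosine similarity between the query and each session.
    q = embed(query)
    sims = []
    for text in sessions:
        v = embed(text)
        sims.append(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8)))
    embedding_hits = {int(i) for i in np.argsort(sims)[::-1][:top_k]}
    return sorted(keyword_hits | embedding_hits)
```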
156
+
157
+ ### Example 4: Multi-method strategy with loop refinement
158
+
159
+ Loop 1's outcome was insufficient, so Loop 2 continues with a broader retrieval strategy.
160
+
161
+ ```json
162
+ {
163
+ "query": "What were we discussing about public issues?",
164
+ "query_date": "2025-08-29",
165
+ "loop_iteration": 2,
166
+ "retrieval_decision": "retrieval_required",
167
+ "answer": "none",
168
+ "topics": ["Government & Politics", "Environment & Sustainability", "Legal"],
169
+ "strategy": [
170
+ {
171
+ "method": "embedding",
172
+ "conditions": "To capture abstract conversational references."
173
+ }
174
+ ],
175
+ "evidence": {
176
+ "profile": [],
177
+ "chat_clues": [
178
+ "User debated about climate policy impacts.",
179
+ "Conversation on local housing regulations was tagged as 'public issues'."
180
+ ]
181
+ },
182
+ "previous_attempts": [
183
+ {
184
+ "loop_iteration": 1,
185
+ "strategy": [
186
+ {"method": "keyword", "conditions": "Query asks about 'public issue'", "keywords": ["public issue", "public issues", "issue"]}
187
+ ],
188
+ "evidence": {
189
+ "profile": [],
190
+ "chat_clues": []
191
+ },
192
+ "outcome": "insufficient"
193
+ }
194
+ ]
195
+ }
196
+ ```
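The loop refinement shown in this example can be pictured as a small driver like the sketch below; `plan`, `execute`, and `judge` are hypothetical callables standing in for the planner LLM call, the retrieval step, and the sufficiency check, not functions from the released code.

```python
# Illustrative only: feed earlier attempts back into the planner on each iteration.
def retrieval_loop(query, query_date, plan, execute, judge, max_loops=3):
    previous_attempts, sessions = [], []
    for i in range(1, max_loops + 1):
        attempt = plan(query, query_date, loop_iteration=i,
                       previous_attempts=previous_attempts)
        sessions = execute(attempt["strategy"])
        # outcome is one of "insufficient" | "useful" | "final_answer_ready"
        attempt["outcome"] = judge(query, sessions)
        previous_attempts.append(attempt)
        if attempt["outcome"] == "final_answer_ready":
            break
    return sessions, previous_attempts
```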
197
+
198
+ Use clear reasoning for how each method applies (or not) to the specific query. Be concise and precise.
199
+
200
+ **Do not include strategies in the strategy list unless they are needed for this query. Omit unused methods.**
201
+ **Always use the provided `query_date` to resolve any relative dates in the query.**
202
+ **Strictly follow the given JSON output format. Return only **one** JSON output.**
203
+ **Use time-based filtering ONLY when the query references dates, ranges, or relative temporal expressions (e.g., "last week," "yesterday," "in the last 3 months").**
prompts/keyword_search_prompt.txt ADDED
@@ -0,0 +1,31 @@
1
+ You are a memory retrieval assistant specialized in extracting highly specific search keywords from user queries.
2
+
3
+ Your task: Analyze the user's query and identify the most distinctive, specific keywords that will precisely match relevant memories.
4
+
5
+ Guidelines:
6
+ - Extract ONLY specific, unique terms (proper nouns, specific places, distinct events, particular objects)
7
+ - Prioritize specificity over completeness - fewer precise keywords are better than many generic ones
8
+ - EXCLUDE generic words like: trip, visit, appointment, gift, location, person, time, day
9
+ - EXCLUDE question words: who, what, when, where, why, how
10
+ - EXCLUDE auxiliary verbs: did, was, have, can, should
11
+ - Include 1-3 keywords maximum, ordered by specificity
12
+ - Preserve exact phrasing for proper nouns and named entities
13
+
14
+ Query: "Who did I go to the Grand Canyon with?"
15
+
16
+ Output format (JSON):
17
+ ```json
18
+ {
19
+ "query": "Who did I go to the Grand Canyon with?",
20
+ "keywords": ["Grand Canyon"]
21
+ }
22
+ ```
23
+
24
+ Additional examples:
25
+ - "What did Sarah give me for my birthday?" → ["Sarah", "birthday"]
26
+ - "When did I last visit Dr. Martinez?" → ["Dr. Martinez"]
27
+ - "Where did I put my car keys?" → ["car keys"]
28
+ - "What happened at the team meeting?" → ["team meeting"]
29
+ - "Did I finish the Johnson report?" → ["Johnson report"]
30
+
31
+ Query:
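Once keywords have been extracted with this prompt, applying them to stored sessions can be as simple as the sketch below; the session format mirrors the examples in these prompt files, while the substring-matching rule is an assumption rather than the repo's actual matcher.

```python
# Illustrative only: return indices of sessions that mention any extracted keyword.
def keyword_filter(sessions, keywords):
    hits = []
    for idx, session in enumerate(sessions):
        text = " ".join(turn["content"] for turn in session).lower()
        if any(kw.lower() in text for kw in keywords):
            hits.append(idx)
    return hits

sessions = [
    [{"role": "user", "content": "I went hiking at Mount Rainier on May 10."}],
    [{"role": "user", "content": "The weather was great in Yosemite."}],
]
print(keyword_filter(sessions, ["Mount Rainier"]))  # [0]
```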
prompts/read_and_extract_prompt.txt ADDED
@@ -0,0 +1,176 @@
1
+ You are given a list of chat sessions between a user and an AI assistant.
2
+
3
+ Your task:
4
+ Given a question, identify the sessions that are relevant to answer the question.
5
+
6
+ **Output format:**
7
+
8
+ ```json
9
+ {"index": [<list of 0-based session indices>], "evidence": [<list of sentences that serve as evidence to answer the question>]}
10
+ ```
11
+
12
+ If none are relevant, return:
13
+
14
+ ```json
15
+ {"index": [], "evidence": []}
16
+ ```
17
+
18
+ ---
19
+
20
+ # Examples
21
+
22
+ ---
23
+
24
+ ## Example 1 (multiple relevant sessions)
25
+
26
+ Question: Who did I go hiking with at Mount Rainier?
27
+ Sessions:
28
+
29
+ ### Session Index: 0
30
+
31
+ [{"role": "user", "content": "I went hiking at Mount Rainier on May 10."}, {"role": "assistant", "content": "Nice! Which trail did you take?"}, {"role": "user", "content": "Skyline Trail."}]
32
+
33
+ ### Session Index: 1
34
+
35
+ [{"role": "user", "content": "The weather was great in Yosemite."}, {"role": "assistant", "content": "Did you go there recently?"}]
36
+
37
+ ### Session Index: 2
38
+
39
+ [{"role": "user", "content": "On May 10, I hiked with Sarah and John."}, {"role": "assistant", "content": "Sounds like a fun group!"}]
40
+
41
+ ### Session Index: 5
42
+
43
+ [{"role": "user", "content": "I heard Sarah recently broke up with her boyfriend."}, {"role": "assistant", "content": "Sarah is your close friend from Seattle, right?"}, {"role": "user", "content": "Yes, we’ve been friends since college."}]
44
+
45
+ Explanation:
46
+
47
+ * Session 0 is relevant because it provides the **location and date** (“Mount Rainier on **May 10**”) but no names.
48
+ * Session 1 is irrelevant because it concerns Yosemite, not Mount Rainier.
49
+ * Session 2 is relevant because it provides the **names and date** (“**On May 10**, I hiked with **Sarah and John**”) but no location.
50
+ * Session 5 adds background about Sarah but is not needed to answer the question.
51
+
52
+ The answer requires **combining the shared date (May 10)** across Sessions 0 and 2 to link the names to Mount Rainier.
53
+
54
+ **Final JSON output:**
55
+
56
+ ```json
57
+ {"index": [0, 2], "evidence": ["User went hiking at Mount Rainier on May 10.", "User hiked with Sarah and John on May 10."]}
58
+ ```
59
+
60
+ ---
61
+
62
+ ## Example 2 (no relevant sessions)
63
+
64
+ Question: "When did I buy my new iPhone?"
65
+ Sessions:
66
+
67
+ ### Session Index: 0
68
+
69
+ [{"role": "user", "content": "I love my iPhone 14 camera!"}, {"role": "assistant", "content": "Yes, it takes great photos."}]
70
+
71
+ ### Session Index: 3
72
+
73
+ [{"role": "user", "content": "I’m thinking about buying a new iPhone soon."}, {"role": "assistant", "content": "The new model will be released in the fall."}]
74
+
75
+ Explanation:
76
+
77
+ * None of the sessions give an exact purchase date.
78
+ * Session 0 confirms ownership but not when it was purchased.
79
+ * Session 3 is about a future plan, not an actual purchase.
80
+
81
+ **Final JSON output:**
82
+
83
+ ```json
84
+ {"index": [], "evidence": []}
85
+ ```
86
+
87
+ ---
88
+
89
+ ## Example 3 (single relevant session)
90
+
91
+ Question: "What is the name of my cat?"
92
+ Sessions:
93
+
94
+ ### Session Index: 0
95
+
96
+ [{"role": "user", "content": "I just adopted a cat named Luna."}, {"role": "assistant", "content": "She must be adorable!"}, {"role": "user", "content": "Yes, she’s very playful."}]
97
+
98
+ ### Session Index: 4
99
+
100
+ [{"role": "assistant", "content": "Dogs are usually easier to train than cats."}, {"role": "user", "content": "Yeah, but I love cats."}]
101
+
102
+ Explanation:
103
+
104
+ * Only session 0 contains the name of the cat, “Luna”.
105
+ * Session 4 is general pet advice and irrelevant.
106
+
107
+ **Final JSON output:**
108
+
109
+ ```json
110
+ {"index": [0], "evidence": ["User adopted a cat named Luna."]}
111
+ ```
112
+
113
+ ---
114
+
115
+ ## Example 4 (combining across sessions)
116
+
117
+ Question: "What was the distance of my last two hikes?"
118
+ Sessions:
119
+
120
+ ### Session Index: 0
121
+
122
+ [{"role": "user", "content": "Last weekend, I hiked 5 miles at Storm King Trail."}, {"role": "assistant", "content": "That’s a nice trail."}, {"role": "user", "content": "Yes, the views were amazing."}]
123
+
124
+ ### Session Index: 1
125
+
126
+ [{"role": "user", "content": "Two weeks ago, I did a 7-mile hike at Rattlesnake Ridge."}, {"role": "assistant", "content": "That’s a great workout!"}, {"role": "user", "content": "It was challenging but worth it."}]
127
+
128
+ ### Session Index: 3
129
+
130
+ [{"role": "assistant", "content": "Next time, try Mount Si!"}, {"role": "user", "content": "I’ll add it to my list."}]
131
+
132
+ Explanation:
133
+
134
+ * Session 0 provides the first hike’s distance.
135
+ * Session 1 provides the second hike’s distance.
136
+ * Session 3 only contains the assistant's suggestion for another hike; it has no distance information, and there is no indication that the user actually went on that hike.
137
+
138
+ **Final JSON output:**
139
+
140
+ ```json
141
+ {"index": [0, 1], "evidence": ["User hiked 5 miles at Storm King Trail last weekend.", "User hiked 7 miles at Rattlesnake Ridge two weeks ago."]}
142
+ ```
143
+
144
+ ---
145
+
146
+ ## Example 5 (time reference)
147
+
148
+ Question: "What did I do last Friday?"
149
+ Question date: 2023/05/23 (Tue) 12:26
150
+ Sessions:
151
+
152
+ ### Session Index: 0
153
+ ### Session Date: 2023/05/20 (Sat) 03:14
154
+
155
+ [{"role": "user", "content": "Yesterday, I went to a concert downtown."}, {"role": "assistant", "content": "Who performed?"}, {"role": "user", "content": "It was The Lumineers."}]
156
+
157
+ ### Session Index: 2
158
+ ### Session Date: 2023/05/22 (Mon) 11:59
159
+
160
+ [{"role": "user", "content": "I went hiking last weekend."}, {"role": "assistant", "content": "Where did you go?"}, {"role": "user", "content": "Mount Si."}]
161
+
162
+ Explanation:
163
+
164
+ * Only session 0 is relevant: the session took place on **Saturday, May 20**, and the user says “Yesterday,” which, relative to that session date, refers to **Friday, May 19**. Given the question date (**Tuesday, May 23**), “last Friday” is also **May 19**, so the concert downtown with The Lumineers is what the user did last Friday and answers the question.
165
+ * Session 2 is irrelevant because “last weekend” (relative to Monday, May 22) refers to May 20–21, not **Friday, May 19**, so it does not answer the question.
166
+
167
+ **Final JSON output:**
168
+
169
+ ```json
170
+ {"index": [0], "evidence": ["User went to a concert downtown on last Friday, May 19 and user said it was the Lumineers."]}
171
+ ```
172
+
173
+ ---
174
+
175
+ Use exact indices from the provided list of sessions in your JSON output.
176
+ If the question is related to time, specify the date in the evidence.
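For reference, a minimal sketch of how candidate sessions might be serialized into this prompt and the reply parsed back; the helper names and the reply handling below are illustrative assumptions, not the repo's actual call sites.

```python
# Illustrative only: build the "### Session Index" blocks and parse the JSON reply.
import json

def format_sessions(sessions, dates=None):
    blocks = []
    for idx, session in enumerate(sessions):
        header = f"### Session Index: {idx}"
        if dates is not None:
            header += f"\n### Session Date: {dates[idx]}"
        blocks.append(f"{header}\n\n{json.dumps(session)}")
    return "\n\n".join(blocks)

def parse_reply(raw):
    # Strip an optional markdown code fence around the JSON reply.
    cleaned = raw.strip().strip("`").strip()
    if cleaned.startswith("json"):
        cleaned = cleaned[len("json"):]
    out = json.loads(cleaned.strip())
    return out.get("index", []), out.get("evidence", [])
```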