Passing new external memories without re-loading model

#2
by xmrt - opened

Hello,

I'm using extended-mind-mpt-7b to answer multiple questions, each with a different set of external memories from a dataset. Is there a way to feed new memories to the model without having to call AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", external_memories=memory_ids, trust_remote_code=True) each time?

Thanks!

Normal Computing org

Absolutely. Simply set:

model.empty_memories()  # or, equivalently: model.memories = None
model._memories = memory_ids

where memory_ids are your new tokenized memories (like you'd pass to the .from_pretrained() call).

When .generate() is called, it checks:

if self._memories is not None and self.memories is None:  # init memories once on first call
    self.memories = self.generate_cache(self._memories, cache_type=self.memory_type)
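To make the reset-then-reassign pattern concrete, here is a minimal sketch. It does not load the real model: StubModel is a hypothetical stand-in that reproduces only the lazy-init check quoted above, and swap_memories is an illustrative helper name, not part of the released code.

```python
class StubModel:
    """Hypothetical stand-in; mimics only the lazy-init check quoted above."""

    def __init__(self, external_memories=None):
        self._memories = external_memories  # raw tokenized memories
        self.memories = None                # built cache, filled on first generate()
        self.cache_builds = 0               # counts how often the cache is rebuilt

    def generate_cache(self, memory_ids):
        # Placeholder for the real key/value cache construction.
        self.cache_builds += 1
        return list(memory_ids)

    def generate(self, prompt):
        # Same condition as in the reply: build the cache once, on first call.
        if self._memories is not None and self.memories is None:
            self.memories = self.generate_cache(self._memories)
        return f"{prompt} | memories={self.memories}"


def swap_memories(model, memory_ids):
    """Drop the cached memories and queue new ones for the next generate() call."""
    model.memories = None
    model._memories = memory_ids


model = StubModel(external_memories=[1, 2, 3])
first = model.generate("q1")    # cache built from [1, 2, 3]
swap_memories(model, [7, 8])
second = model.generate("q2")   # cache rebuilt from [7, 8]
```

With the real model, the same two assignments inside swap_memories should be all that is needed between questions, assuming the check in .generate() is as quoted.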

We'll likely make this more user-friendly in upcoming versions!

Perfect, thanks a lot! It's a super interesting model you've developed.

I actually have another question. In the article you write: "The choice of which memories to attend to is made using cosine similarity within each decoder layer and attention head". I'm having a hard time finding this in the code. Is it at line 112 of scaled_multihead_dot_product_attention in attention.py (https://huggingface.co/normalcomputing/extended-mind-mpt-7b/blob/main/attention.py)?

Normal Computing org

Indeed! When we compute the inner product (sim = q_n.matmul(k_n)) of the queries with the keys from our external memories, we've already normalized them (see lines 109, 110) so this is exactly the cosine similarity. We make the choice to normalize (regular attention uses unnormalized inner product) to mimic the way vectors are usually retrieved from a vector database.
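As a rough illustration of why normalizing first turns the inner product into cosine similarity, here is a plain-Python sketch for a single query vector. The helper names and the top-k selection step are assumptions for illustration only; the actual code works on batched tensors per layer and head.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_top_k(query, memory_keys, k):
    """Return (index, similarity) for the k memory keys most similar to the query."""
    q_n = normalize(query)
    sims = []
    for idx, key in enumerate(memory_keys):
        k_n = normalize(key)
        # Inner product of unit vectors equals the cosine of the angle between them.
        sims.append((sum(a * b for a, b in zip(q_n, k_n)), idx))
    sims.sort(reverse=True)
    return [(idx, sim) for sim, idx in sims[:k]]


query = [1.0, 0.0]
memory_keys = [[10.0, 0.0], [0.0, 5.0], [1.0, 1.0]]
top = cosine_top_k(query, memory_keys, k=2)
```

Because both sides are normalized, the magnitude of a key (such as the 10.0 in the first key) has no effect on its score; only its direction relative to the query matters.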

Thanks for the questions!

Great, I see. Thanks for the fast replies!
