Out of memory when passing external memories

#3
by xmrt

Hi (again :)),
I'm having trouble when I try to run model.generate with a lot of external memories (10 documents that together come to approximately 100,000 words). Even when I run with topk=0 it runs out of memory after an hour and does not finish a single question. Ideally, I would like to be able to run the model on the 100,000 tokens with topk=10. I am using an instance with 72 GiB of memory.

Here is how I am loading my model:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

configuration = transformers.AutoConfig.from_pretrained("normalcomputing/extended-mind-mpt-7b", trust_remote_code=True)
configuration.max_seq_len = 2048
configuration.init_device = "meta"  # initialize weights lazily; from_pretrained materializes them
configuration.attn_config['alibi'] = True
configuration.attn_config['attn_impl'] = 'torch'  # note: a string, not the torch module
configuration.use_cache = True

generator = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", device_map="cpu", config=configuration, trust_remote_code=True)
generator.empty_memories()  # clear any memories attached to the model

tokenizer = AutoTokenizer.from_pretrained("normalcomputing/extended-mind-mpt-7b", padding_side='left')

# MPT's tokenizer has no pad token by default, so reuse the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

And this is the tokenisation and generation:

import random
from tqdm import tqdm

device = "cpu"  # matches the device_map used above

for question, question_index in tqdm(zip(question_data, question_indices), total=len(question_indices)):
    print(question['answers'])
    userprompt = question['question']

    # Get this question's documents, plus contexts from other questions
    docs = list(question['contexts'])  # copy so the question data isn't mutated
    doc_indices = random.sample(range(10), 10)  # all 10 indices, in random order
    for i in doc_indices:
        docs.extend(data[i]['contexts'])

    # Create external memories by concatenating and tokenising the documents
    external_memories = " ".join(docs)
    memory_ids = tokenizer(external_memories, return_tensors='pt')['input_ids'].to(device)

If you have any input on what should be changed so that the model can run with this many memories, I would be very happy to hear it! Is there, for example, a possibility to send the tokenised external memories into the model in batches?

Normal Computing org

Hey! I'd recommend using memory_type=faiss, for starters. You can also try increasing the stride parameter in the generate_cache method. This may result in lower-quality memories, but will be faster! The stride is used analogously to the sliding-window stride in this tutorial, if you want to check it out: https://huggingface.co/docs/transformers/en/perplexity. Let me know if that helps!
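For reference, a minimal sketch of those two changes. The exact attribute name for memory_type and the signature of generate_cache are assumptions here, so check the model repo for the real ones:

import transformers
from transformers import AutoModelForCausalLM

configuration = transformers.AutoConfig.from_pretrained(
    "normalcomputing/extended-mind-mpt-7b", trust_remote_code=True)
configuration.memory_type = 'faiss'  # assumption: store memories in a faiss index rather than in-RAM tensors

generator = AutoModelForCausalLM.from_pretrained(
    "normalcomputing/extended-mind-mpt-7b",
    config=configuration, trust_remote_code=True)

# A larger stride means fewer (but coarser) passes when building the memory cache
generator.generate_cache(memory_ids, stride=2048)  # assumption: stride passed as a keyword here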

Thanks a lot for your quick response!

I have tried to set memory_type=faiss and increased the stride to 2048; however, it still runs out of memory. Is there a way to estimate how much memory is expected to be used with large external memories? Then I can try to upgrade my resources to match these requirements :)

Normal Computing org

If you're using faiss, the main cost is generating the cache before you pass the vectors to the db store. That cost (if you're using stride=2048) is roughly n = input_length // 2048 passes through the model. (You'll need memory for the model plus ~2048 inputs at a time, as well as for the growing vector db.) Hope that helps!
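To make that concrete for the numbers in this thread (assuming roughly 1.3 tokens per word, a typical ratio for English text that isn't stated above):

# Back-of-the-envelope pass count for the cache-generation step
words = 100_000
tokens_per_word = 1.3        # rough assumption for English text
stride = 2048

input_length = int(words * tokens_per_word)   # ~130,000 tokens
n_passes = input_length // stride             # ~63 passes through the model
print(n_passes)

Each pass only needs a ~2048-token window in memory, so peak usage is dominated by the 7B model weights (roughly 28 GB in fp32, 14 GB in fp16) plus the vector store, which grows with the number of cached keys and values.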

It does indeed! Thanks :)
