Inference VRAM usage is abnormally large

#37
by manu - opened

I get GPU OOM on 40GB A100s with a batch size of 1 and context lengths of just a tad more than 512 tokens in greedy search, even though I can train the model with a micro batch size of 4 in low-rank (bf16). I don't have this problem with comparable sizes of llama, pythia, bloom, opt, etc...
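For reference, here is a minimal sketch of how I measure peak VRAM during greedy generation; the checkpoint name is just a placeholder, swap in the model you are testing:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-rw-1b"  # placeholder checkpoint, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

prompt = "Hello " * 512  # roughly the context length where the OOM appears
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy search
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```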

I am wondering if this is caused by the multi-query attention key-value cache, which could be badly configured in the rw_modeling file?
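For scale, here is a rough back-of-envelope comparison of the KV cache size under multi-query attention (one shared K/V head per layer) versus a cache that mistakenly keeps a full K/V pair per attention head. The layer count, head count, and head dimension below are assumptions for illustration, not values read from the actual config:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; bf16 -> 2 bytes per element
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

n_layers, n_heads, head_dim = 60, 64, 128  # assumed geometry, for illustration only
seq_len = 512

mqa = kv_cache_bytes(1, seq_len, n_layers, 1, head_dim)             # shared KV head
per_head = kv_cache_bytes(1, seq_len, n_layers, n_heads, head_dim)  # per-head KV

print(f"MQA cache:      {mqa / 1e9:.3f} GB")
print(f"Per-head cache: {per_head / 1e9:.3f} GB  ({n_heads}x larger)")
```

The cache itself stays small in both cases at 512 tokens, so if the cache shape is wrong the extra memory probably comes from intermediate tensors (e.g. attention scores) broadcast against the mis-shaped cache rather than from the cache storage alone.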
