Locally run instance on an RTX 3090 - Performance?

#119
by byeai - opened

Hello,

I'm a hobbyist looking to follow the excellent tutorial by @arteagac to run my own instance of BLOOM-176B as a personal project on an Nvidia RTX 3090. From what I've seen, several users in this forum are already doing this with a slight adjustment to the code.

How long does generating a token take on this GPU? The article mentions that it takes about 3 minutes per token on an i5 processor. Is there any notable performance to be gained from the GPU, or is the SSD the deciding bottleneck here? (I'm using a PCIe 4.0 Samsung 980 Pro, just like in the article.)

Thank you for your help.

Here's the mentioned article: https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32

Hi @byeai, you can easily run it on your 3090 by simply setting device='cuda:0' in the script. However, I don't think this will significantly speed up token generation, because, as you said, the bottleneck is reading the model blocks from disk. In a hypothetical scenario where you had enough RAM (>400 GB) to fit the whole model at once, you would likely see significant speed improvements from using a GPU. In that case, the only bottleneck would be moving the model back and forth between RAM and GPU memory, but that should be manageable.
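For reference, a minimal sketch of that one-line change, assuming the script follows the blog post (the `device` variable name is an assumption; adjust it to your copy):

```python
import torch

# Point the script at the 3090 instead of the CPU; everything else stays the same.
# Falls back to CPU if CUDA is unavailable.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
```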

@arteagac Apologies for the bump.

So given a system configuration of a single RTX 3090 plus 512 GB of ordinary RAM, would there be any changes to the script (besides device='cuda:0') to ensure that it loads the whole model into memory instead of running from disk?

Hi @cliv24,

If you have enough RAM to fit the entire model, then you only need some minimal changes to the script. First, remove the portion that loads each BLOOM block in the forward method. Second, load the entire model at the beginning of the script and store it in a list of blocks (for this you can re-use the code you removed from the forward method). Finally, use this list of blocks inside the forward method to process each block. Note that in this final step you need to properly load and offload each block from the GPU. If you need further assistance, please let me know, I would be more than happy to help.
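Roughly, the structure would look like this (a minimal sketch with tiny placeholder `nn.Linear` modules standing in for real BLOOM blocks; the names, shapes, and block count are illustrative, not the actual script):

```python
import torch
import torch.nn as nn

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Load every block once at startup and keep it in CPU RAM.
# Placeholders stand in for the BLOOM blocks the real script reads from disk.
blocks = [nn.Linear(1024, 1024) for _ in range(4)]

# The forward pass iterates over the in-RAM blocks, moving one at a time to the GPU.
def forward(hidden_states):
    hidden_states = hidden_states.to(device)
    for block in blocks:
        block.to(device)        # nn.Module.to() moves the weights in place
        with torch.no_grad():
            hidden_states = block(hidden_states)
        block.to('cpu')         # move the block back so the next one fits in VRAM
    return hidden_states.cpu()

out = forward(torch.randn(1, 8, 1024))
print(out.shape)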

If you've got 500 GB of RAM lying around, you may be able to allocate most of it as a RAM disk, move the model weights there, and use the model as you normally would... Assuming the IO pipeline is reasonably optimized - no overhead of checking file hashes every time you read a file, etc. - you might get pretty close to full speed without any extra work. (I haven't tried this for the models in question, but have used this trick effectively in many other contexts.)
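A rough sketch of that idea, assuming a tmpfs RAM disk is already mounted (the mount point and shard paths below are placeholders, not the article's actual layout):

```python
import shutil
from pathlib import Path

# Assumes a tmpfs mount created beforehand, e.g.:
#   sudo mount -t tmpfs -o size=400G tmpfs /mnt/ramdisk
SSD_WEIGHTS = Path('/data/bloom')           # placeholder: where the downloaded shards live
RAM_WEIGHTS = Path('/mnt/ramdisk/bloom')    # placeholder: RAM-disk destination

RAM_WEIGHTS.mkdir(parents=True, exist_ok=True)
for shard in sorted(SSD_WEIGHTS.glob('*.bin')):
    shutil.copy2(shard, RAM_WEIGHTS / shard.name)

# Point the script's model path at RAM_WEIGHTS and run it unchanged;
# every "disk" read is then served from RAM.
```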

Hi @arteagac, would it be possible for you to post the modified script for this, for anyone who has the hardware to try it?

512 GB of RAM is around $1800, and I'm tempted to give it a try, assuming the extra RAM does give a significant speedup.

Hi @nightfuryx, sure, in the link below you can find a modified script that assumes you have enough RAM to fit the entire model. Unfortunately, the script is untested, as I don't have enough RAM to run it, but if you run into any errors, please let me know and I would be happy to help you fix them.

https://github.com/arteagac/arteagac.github.io/blob/master/blog/bloom_ram.ipynb

Hi @arteagac,
Thank you very much for taking the time to update the script. I tried a similar approach over the weekend; however, it seems only the original implementation offloads the model correctly. When I load multiple blocks and then move them to the GPU one by one, the GPU does not release a block before the next one is loaded, so I keep running into CUDA out-of-memory errors and haven't found a good solution for that.

Hi @nightfuryx, yes, I just noticed that this is a potential issue. To address it, you need a way to move the blocks to the GPU one at a time. I assume you can create a copy of the block, send the copy to the GPU, and then overwrite the same copy object in the next iteration.
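A hedged sketch of that copy-and-overwrite idea (`blocks`, `hidden_states`, and the block call are placeholders; adapt them to the notebook above). One detail worth noting: `nn.Module.to()` moves a module's parameters in place, so copying first keeps the original block in CPU RAM while the GPU copy can be dropped:

```python
import copy
import torch
import torch.nn as nn

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
blocks = [nn.Linear(1024, 1024) for _ in range(4)]   # placeholder blocks kept in CPU RAM
hidden_states = torch.randn(1, 8, 1024).to(device)

for cpu_block in blocks:
    # Copy first: calling .to(device) on cpu_block directly would move the
    # RAM-resident block itself onto the GPU and leave it there.
    gpu_block = copy.deepcopy(cpu_block).to(device)   # rebinding drops the previous GPU copy
    with torch.no_grad():
        hidden_states = gpu_block(hidden_states)

del gpu_block                 # drop the last GPU copy
if device.startswith('cuda'):
    torch.cuda.empty_cache()  # return cached memory to the driver
```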

Hi @arteagac, I wonder if there is an even cheaper way to run the model at a reasonable speed.
For example, RAID 0 with 3+ U.2 SSDs should easily reach DDR4 RAM throughput, or even exceed it with more disks.
But the remaining problem is the GPU: do you think the model can run on a weaker GPU like a 1060?
