Can Bloom-176B really be evaluated on normal hardware at a rate of 3 minutes per token?

#87
by Philimipp - opened

I'm following this wonderful article by @arteagac, trying to run BLOOM-176B on my own machine by loading it into memory block by block. In a reply to a comment, the author claims that generating a token from a prompt takes roughly 3 minutes on off-the-shelf hardware.

My own experiments yield different results, however. On my machine it takes around 4 minutes just to evaluate a single block (the call to block(...)), so the loop over all ~70 blocks would take hours to generate a single token. May I kindly ask what your own experience is? I'm trying to find the errors on my part that are causing this huge slowdown.

While I'm using a Threadripper 3960X, a 3090 and 64GB of RAM, hardly any of it is being used: all I get is single-core evaluation when I run the code from the article.

Hi. I'm the author of the blog post. I had a similar experience when I used an AMD processor. Inference was significantly slower, and I think it has to do with PyTorch's limited performance on AMD processors (see this discussion: https://www.reddit.com/r/MachineLearning/comments/iap6yo ). I addressed it to some degree by compiling PyTorch from scratch against OpenBLAS. However, even with the from-scratch build, a modest 12th-gen Intel i5 was faster than the AMD 5900X. To use the GPU, you can change device = 'cpu' to device = 'cuda:0' in the code.

In addition, make sure you have a fast SSD, as it might be the main bottleneck when running the code in the blog post.
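
For example, a small variation on the device line from the blog post keeps the same script running on machines without a GPU:

import torch

# use the GPU when one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"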

Hi Cristian, thanks for the reply. I wasn't aware of the disparity between AMD and Intel CPUs, very curious. I'll have to do some reading, and we'll see whether compiling PyTorch for my particular instruction set or moving to my 3090 helps. Will keep you posted.

Also, I do have an SSD, and I've benchmarked loading a block into memory, which takes 1 to 2 seconds, roughly the order of magnitude I'd expect for ~4GB chunks at a peak read speed of 7500MB/s. It's really the evaluation of the block that takes those aforementioned 4 minutes.

Hi both, having tried to deploy and use the model on substantial AWS infrastructure, I can confirm that AMD processors are not giving good performance. The difference between AMD and Intel processors is night and day (seconds versus long minutes).

Good news: running on the GPU really is performant and reduces the evaluation time per block to about 2ms. So the only real time sink is loading blocks into VRAM at roughly 3s per block, which works out to about 3.5 minutes per token (~70 blocks × 3s ≈ 210s).

I'll see whether I can compile PyTorch to perform properly on my AMD CPU.

Hi, could you share the code you used to load the blocks onto the GPU and cycle through them?

@derduff I use the author's exact code, with the exception of setting device = "cuda:0".

You'll have to make sure you install PyTorch with CUDA support. I'm working in a conda env with the following PyTorch and CUDA toolkit: pytorch=1.12.1=py3.10_cuda11.6_cudnn8.3.2_0 and cudatoolkit=11.6.0=hecad31d_10.

For a quick check that all is well, try this:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"CUDA arch list: {torch.cuda.get_arch_list()}")
print(f"CUDNN available: {torch.backends.cudnn.is_available()}")
print(f"CUDNN version: {torch.backends.cudnn.version()}")

I don't want to hijack this thread... BUT I notice the creator of that article, @arteagac, is here...

I used
git lfs install
export GIT_LFS_SKIP_SMUDGE=1
git clone https://huggingface.co/bigscience/bloom
cd bloom
git lfs fetch origin 2a3d62e

to retrieve the checkpoint, and it was about 350GB (the folder says 330GB, so that feels right)... BUT the .bin file for each shard is only 1 KB?

BigScience Workshop org

@MusashiGarami GIT_LFS_SKIP_SMUDGE=1 only fetches the LFS pointers to the files. You want to clone without that set to get the actual files.

Oh... how big would the download be without that line?

Hi, my device = 'cpu' run is doing just fine, but it takes ~8 minutes per token on a laptop with a Core i7-11800H, 40GB of RAM and an RTX 3070.
Since I want more speed, when I change to device = 'cuda:0' I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 6.70 GiB (GPU 0; 8.00 GiB total capacity; 5.08 GiB already allocated; 0 bytes free; 6.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Help needed: what should I do here? I really want this running on the GPU. Thanks.
PS: my bloom directory is 681GB in the end, nowhere near the 350GB others are getting.

@stormchaser If it's anything like what I've experienced, make sure you don't load the model separately when you want to run inference. Let me explain:
when you call model = ..., it downloads the weights if they're not there and loads them into memory. But what you actually use for inference is the pipeline, and the pipeline does the same thing. So you end up with your memory loaded twice with the same weights.

So just make sure to only create the pipeline and the tokenizer.
As for the GPU side of things, I haven't tried it myself yet, as I have access to machines that are good enough to do the job on CPU and RAM.
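
Here's a minimal sketch of what I mean (standard transformers API, assuming the bigscience/bloom checkpoint and enough RAM to hold the full model; adapt the name to a local path if needed):

from transformers import AutoTokenizer, pipeline

checkpoint = "bigscience/bloom"  # or the path to your local copy

# load the tokenizer and build the pipeline straight from the checkpoint name,
# instead of creating a separate model object first, so the ~350GB of weights
# are only materialized once
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
pipe = pipeline("text-generation", model=checkpoint, tokenizer=tokenizer)

print(pipe("A SQL query to list all customers is", max_new_tokens=20))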

The problem is I'm not a Python dev, nor am I in the AI field as such; I just make boring database apps lol, and I'm so excited to run this on my machine, which is a reasonably good one. What you're saying will have to be done either by the person who wrote the blog post, i.e. @arteagac, or by someone here who understands your point and can make the change in the script we're running.

Hi, I am currently designing a setup to run this on 2 nodes. I'm thinking of loading half the blocks onto the GPUs of one machine and the other half onto the other machine. The idea is to transfer the hidden states over LAN (or InfiniBand) to the second machine. How big is the hidden states array? Is it only 14336 floating-point numbers, or more than that?

Hi @stormchaser, I suspect 8GB of GPU memory is not enough. You could try running the word embeddings layer and the language model head (the largest layers) on the CPU and only using the GPU to run the Bloom blocks.
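
In rough terms, the split would look like this (a sketch only: load_block is a placeholder loader, and the variable names just follow the structure of the blog post's code):

import torch

device = "cuda:0"

# the embeddings and the language model head stay on the CPU; only one
# Bloom block at a time is moved to the GPU for its forward pass
hidden_states = word_embeddings_layernorm(word_embeddings(input_ids))  # CPU

for block_file in block_files:                      # the ~70 block shards
    block = load_block(block_file).to(device)       # placeholder loader
    hidden_states = block(hidden_states.to(device),
                          alibi=alibi.to(device),
                          attention_mask=attention_mask.to(device))[0].cpu()
    del block
    torch.cuda.empty_cache()                        # release the block's VRAM

logits = lm_head(ln_f(hidden_states))               # back on the CPU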

Hi @derduff. The shape of the hidden states is (n_input_tokens, hidden_size). If you have a sentence with, say, 10 tokens, your hidden states would have shape (10, 14336). You simply need to compute how much this is in MB for the bfloat16 dtype.
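
For example, as a quick back-of-the-envelope check:

# size of the hidden states for a 10-token input in bfloat16 (2 bytes/value)
n_tokens, hidden_size, bytes_per_value = 10, 14336, 2
size_mb = n_tokens * hidden_size * bytes_per_value / 1e6
print(f"{size_mb:.2f} MB")  # ~0.29 MB, so the network transfer itself is cheap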

Hi @MusashiGarami,
I edited the instructions. You additionally need to run git lfs checkout. Unfortunately, this downloads several linked git files, which makes the download almost 700GB. I haven't figured out a way to download only the 330GB that corresponds to BLOOM.

Hi @cakiki. This sounds great. If it's fine with you, would you mind sharing a specific script to achieve this for BLOOM? I think it would benefit several people who are trying to download only the specific BLOOM checkpoint that weighs 330GB.

@stormchaser Your directory is double the size because git lfs pull downloads the binaries into your .git directory (in a specific binary format). Only after you run git lfs checkout are the real binary files (e.g. pytorch_model_00054-of-00072.bin) constructed from the contents of the .git directory. After git lfs checkout has finished and you've verified that everything works, you can simply delete the .git directory and free those extra 300GB.

Also, may I kindly ask those who have questions unrelated to this topic to simply create a new thread? There's no shame in asking, even if you think it's a "noob" question or you're not a programmer or ML professional. Just ask away, I'll gladly share what I've learned, but not inside this spaghetti mess of a thread.

Hi @arteagac. I'm trying to run BLOOM inference on my Apple silicon Mac (20 cores, 128GB). However, the model runs extremely slowly on the CPU (60s/layer, seemingly not properly parallelized), and the mps backend doesn't work properly either (it outputs identical tokens for various inputs, although at 0.1s/layer).

I'm trying to mitigate this by adopting tensorflow-metal (by Apple), which is presumably more polished on macOS, but I couldn't find an easy way to convert the BLOOM PyTorch checkpoint back to TensorFlow checkpoints.
I tried to mildly modify both transformers/convert_pytorch_checkpoint_to_tf2.py and transformers/commands/pt_to_tf.py and take a chance that either of them would work, but no luck.
The closest I got was forcing pt_to_tf.py to load the whole model and then save it, but a SIGKILL (most likely an OOM kill) kicked in.

Do you have, by any chance, an idea of how to properly convert the bin files back to TensorFlow checkpoints layer by layer, on the fly, like what you did for loading and inference?

@Philimipp Sorry if I deviated from the original topic as well.
I'm not sure whether we're suffering from a similar problem, as I do think the M1 shares some similarities with AMD processors in terms of CPU cache architecture and the lack of AVX-512 support... and maybe (not sure if it's related, or even a true claim) the negative optimization for non-Intel CPUs in Intel's OpenMP framework.

Hi @willianz. Converting the checkpoints on the fly might be time consuming. Perhaps a better option is to convert the entire model to TF checkpoints. This should be doable by creating a script that takes each of the 72 PyTorch bin files and converts them to TF checkpoints, one by one to avoid OOM. However, even if you have the model in TF format, you may need to write your own BLOOM TensorFlow source code, because as far as I know, the current BLOOM source code in the Transformers repo is PyTorch only. Have you alternatively tried running an optimized version of PyTorch for Mac (e.g., https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/)?
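
Something along these lines might work for the shard-by-shard conversion (an untested sketch; the file layout and naming are assumptions, and casting to float32 roughly doubles the on-disk size, since numpy has no bfloat16):

import glob
import torch
import tensorflow as tf

# convert one PyTorch shard at a time so peak memory stays at roughly one shard
for shard_path in sorted(glob.glob("bloom/pytorch_model_*.bin")):
    state_dict = torch.load(shard_path, map_location="cpu")
    # sanitize parameter names into valid identifiers and cast to float32
    variables = {name.replace(".", "__"): tf.Variable(tensor.float().numpy())
                 for name, tensor in state_dict.items()}
    tf.train.Checkpoint(**variables).save(shard_path.replace(".bin", "_tf"))
    del state_dict, variables  # free memory before the next shard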

@arteagac

Sorry for the confusing wording with "on the fly"; I actually meant, as you mentioned, converting them layer by layer, in contrast to what pt_to_tf.py does (load the whole model, then re-save).

Regarding "you may need to write your own BLOOM TensorFlow source code":

From my shallow understanding, it should be pretty much identical to what you've written in the Medium blog post; the only difference would be that instead of using a pretrained PyTorch module (completely describable with TorchScript), I'd be using a TF Graph or TF Model with trivial loading code for those converted checkpoints.
What I'm not sure about is that, from a quick glance at the BLOOM convert script, there seems to be some custom op-conversion logic included for the torch module construction; I'm not sure whether that is also needed for the conversion back to TF.

Regarding "Have you alternatively tried running an optimized version of PyTorch for Mac":

Yes I did; in fact, that is exactly the 0.1s/layer "mps" backend I was talking about. =)
I have actually submitted an issue on this, most likely the same problem, reproduced with a smaller model.

@arteagac Many thanks for this work, it's exciting to run the 'essence of the internet' :) in your local env!
For the benefit of others who may still arrive here, I wanted to note that a few corrections were needed in the code to make it run as expected, e.g. to reproduce the SQL-command example from the article.
Most importantly, I needed to copy the _prepare_attn_mask function from BloomModel to create the causal mask and pass that to the block's forward instead of the original attention mask (which is still used to create the alibi). And speaking of alibi, the other change needed to make the code run at all was passing attention_mask to the build_alibi_tensor function as its first argument. A rough sketch of the adjusted call follows below.
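
Concretely, the per-block call ends up looking roughly like this (a sketch for newer transformers versions; _prepare_attn_mask is the helper copied out of BloomModel, and config, batch_size, seq_length etc. come from the surrounding code of the article):

import torch
from transformers.models.bloom.modeling_bloom import build_alibi_tensor

# alibi is built from the plain (batch, seq_len) attention mask...
alibi = build_alibi_tensor(attention_mask, config.n_head, dtype=torch.bfloat16)
# ...while the block itself now expects the causal mask
causal_mask = _prepare_attn_mask(attention_mask,
                                 input_shape=(batch_size, seq_length),
                                 past_key_values_length=0)
hidden_states = block(hidden_states,
                      alibi=alibi,
                      attention_mask=causal_mask)[0]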

I was running the code on a server without a GPU but with 8 cores, and the raw calculations were very fast, around a second per block, nicely burning all CPU cores :) Unfortunately, the disk was horribly slow; loading a block took minutes, resulting in a few hours to get one token.
Then I moved to a machine with 2 GPUs (24GB each), 64GB of (CPU) RAM and a decent HDD. The first setback was that the CPU calculations didn't run multithreaded, despite a decent Intel (an i9 with 24 cores) on board. I pushed all calculations to the GPU, and then it started to run within reasonable limits. I made a few efforts to speed things up, e.g. keeping everything besides the blocks on the GPU (embeddings, lnorm and head) and as many blocks as possible in CPU RAM, pushing them to the GPU when needed. I can confirm that it's possible to get one token generated in under 5 minutes :)

Hi @Marek-Labuzek-at-Capg, I am glad you found the blog post useful. Perhaps you needed to make some corrections because of the version of the transformers library you use. Are you using transformers==4.20.0 as specified in the blog post? In any case, I think you figured out the updates necessary to make it work with newer versions of transformers, and I am glad it worked out for you.
In terms of speed, using a very fast drive will definitely help. I think using a GPU helps, but the speed gains are not significant, as the bottleneck is reading from disk. In a hypothetical scenario where you had enough RAM to fit the entire model at once, the GPU would significantly reduce processing times.

@arteagac you are right, that must be the reason; I'm on transformers 4.24.0 and PyTorch 1.12.1, which may also make a difference.

Hi everyone,

I am quite a newbie with no prior experience in machine learning. I have followed the instructions in the original article by Cristian Arteaga. What originally sparked my curiosity was that I was very impressed by ChatGPT, but it was not always available, so I thought I might try to run something similar, but truly open source, on my own computer, even if it weren't as good, and maybe tweak it once I got more familiar with it. I found Arteaga's article through Googling, and it was the first and only article I've found online with detailed enough instructions to get someone like me started. Anyway, after following every step and some Googling to fill in the gaps, I was able to duplicate the result on a machine equipped with 4 x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and 512GB of RAM, with no GPU, running Ubuntu 22.04 LTS, with the following observations:

  1. performance was about 6.8 minutes per token;
  2. CPU usage seemed to oscillate between using anywhere from 1 to 8 cores, based on %CPU reported by top, even though the machine has 80 cores;
  3. Memory usage seemed to oscillate between 6GB and 10GB;
  4. Virtual memory usage was around 25GB.

I have some questions too:

  1. Since the machine has a large amount of RAM, could that be used to improve performance?
  2. Same with CPU utilization: it seems to be using only a small fraction of the cores. Can that be improved?
  3. How can I get it to run interactively, in a chatbot fashion?
  4. I've noticed that there are newer versions than commit ID 2a3d62e used in the original article. Do later commits require more hard drive space and RAM to run?

Thanks!

Hi @Hexeon
Nice to hear that somebody is following the same path :)

  1. As mentioned above, I've cached as much data as possible in CPU & GPU memory and got some boost, with time per token going down from ca. 4 minutes to below 3 minutes. Still not satisfying :) but some gain nonetheless...
  2. Try checking and setting the number of threads PyTorch is using: torch.set/get_num_threads and torch.set/get_num_interop_threads (see the snippet below this list).
    ADDITIONALLY: I also observe on one of my servers (Intel based, with MKL installed, though I still need to check that everything is OK there) that only a few threads are used, fewer than are available (20 threads), while on the other I can see that all 24 are used (for the blocks' forward calculations).
  3. BLOOM is a generative model like GPT-3. ChatGPT was created by fine-tuning GPT-3 with the RLHF approach (check the articles by OpenAI or Hugging Face). There is also the BLOOMZ model, check:
    https://huggingface.co/bigscience/bloomz
  4. The commit history says that the commit only changed documentation :)
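
Here's a quick sketch of what I mean in point 2 (the numbers are only examples for a 24-core machine):

import torch

print(torch.get_num_threads(), torch.get_num_interop_threads())

# set these early, before any tensors are created or any parallel work runs
torch.set_num_threads(24)         # intra-op threads, e.g. one per physical core
torch.set_num_interop_threads(4)  # inter-op threads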

Hi @Marek-Labuzek-at-Capg,
your results are interesting :) I ran the script from Arteaga and get a similar problem on my computer (Intel i9-13900K, 64GB RAM, GeForce RTX 2080 Super, Samsung 980 PRO M.2 NVMe).
@arteagac: many thanks for your work. You helped a lot in understanding the topic.

When I run the script on Ubuntu 22.04 with kernel 6.1.11-060111-generic, it takes 33 minutes for one token. I ran the same script on a freshly installed Win10 with the newest drivers, and it's a little faster at 25 minutes per token. In both cases, only one core out of 32 is used while executing block(...), which takes the longest time.
I also checked torch.get_num_threads and torch.get_num_interop_threads. They are set to 24 and 16 by default.

Is it possible that PyTorch doesn't support this CPU, and that's the reason the script can't run multithreaded?

Hi @EsaAI,
As you have a GPU on board, you can move the calculations to it. If you set the variable device='cuda:0' (or whatever it is on your setup), it should work out of the box.

I'm observing the model leaking memory, both in the setup from the article (loading blocks one by one) and when running on a machine with enough (CPU) memory to load the whole model at once.
I've added memory-usage dumps around the evaluation of each block, and usage constantly goes up over time, in self_attention.forward and mlp.
The effect is that the more blocks are evaluated, the more memory is needed.
Is anybody observing the same? Any fixes / ideas?
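
For reference, the memory dumps I mention are roughly this (a sketch using psutil, which is an extra dependency, not part of the article's code):

import os
import psutil

process = psutil.Process(os.getpid())

def log_mem(tag):
    # resident set size of the Python process, in GB
    print(f"{tag}: {process.memory_info().rss / 1e9:.2f} GB")

# called right before and after each block(...) evaluation in the loop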
