TPU Benchmarks

Opened by Divyanshu

Are there any benchmarks for this on TPU vs GPU generation?

Hi @Divyanshu -- I collected some data, but nothing nearly as extensive as the GPU benchmarks. I also ran the code as-is, so it is not optimized for TPUs.

Running on one core of a v3-8, generation was ~8x slower than on an A100 for the tests I did. With data parallelism across all 8 cores, the v3-8 would therefore roughly match the A100's throughput -- which implies the A100 is more cost-efficient. Looking at the TensorBoard profiler, 50-70% of the time (depending on the model) is spent reshaping data or updating slices. If we can remove that bottleneck, TPUs would become competitive.
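For reference, here is a minimal sketch of how a comparable XLA-compiled generation run can be set up and profiled -- the checkpoint, sequence lengths, and log directory are illustrative, and on an actual TPU the profile is typically captured with `tf.profiler.experimental.client.trace` instead of the host-side capture shown here:

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Padded, fixed-shape inputs keep XLA from recompiling on every new length.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

# jit_compile=True lowers generate() to XLA, the same path used on TPU.
xla_generate = tf.function(model.generate, jit_compile=True)

inputs = tokenizer(
    ["TPU benchmarks for text generation"],
    return_tensors="tf", padding="max_length", max_length=64,
)

xla_generate(**inputs, max_new_tokens=8)  # first call compiles; exclude from timing

tf.profiler.experimental.start("logs/xla_generate")  # inspect in TensorBoard
outputs = xla_generate(**inputs, max_new_tokens=8)
tf.profiler.experimental.stop()
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```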

The tests were done with 32-bit variables and weights, and with TF32 enabled on the GPU. Fiddling with the data types will probably tilt the benchmarks in TPUs' favor.
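As a rough illustration of the knobs involved -- the bfloat16 policy below is an assumption about what that fiddling could look like, not a configuration that was benchmarked:

```python
import tensorflow as tf

# TF32 execution on Ampere GPUs (the setting used in the benchmarks above;
# it is on by default in recent TF releases).
tf.config.experimental.enable_tensor_float_32_execution(True)

# TPUs compute natively in bfloat16, so a mixed-precision policy is one way
# the comparison could tilt in their favor (untested assumption here).
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
```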

Hi @joaogante, thanks for letting me know. I have been using Google Colab TPUs and Hugging Face Accelerate for a sequence-to-sequence research project; training is fast, but generation seems to fall back on CPUs in a TPU runtime. This post made me optimistic that I may be able to generate on TPUs and speed up my workflow. Are you planning to remove the bottleneck you mentioned above in an upcoming release? Also, I have been using Accelerate and transformers to generate text on TPUs, as mentioned in https://github.com/huggingface/transformers/issues/12322, but they keep falling back on CPUs. Any idea whether we can use XLA and PyTorch to generate text on TPUs?

Anyway, thanks for the benchmark numbers!

> I have been using Google Colab TPUs and Hugging Face Accelerate for a sequence-to-sequence research project; training is fast, but generation seems to fall back on CPUs in a TPU runtime.

@Divyanshu That is because the PyTorch text generation function has to undergo the same refactor its TF counterpart just did :D Like the previous TF implementation, it contains operations that can't be mapped to XLA, which explains what you're seeing. I can't commit to a timeline for the refactor, as our focus at the moment is on making text generation more user-friendly -- perhaps late Q4 :)
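For anyone who wants to verify where their PyTorch generation actually runs, here is a minimal torch_xla sketch -- t5-small is just a placeholder checkpoint, and per the caveat above you should expect generate() to remain slow on TPU until the refactor lands:

```python
import torch_xla.core.xla_model as xm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = xm.xla_device()  # a TPU core when run in a TPU runtime
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").to(device)

inputs = tokenizer(
    "translate English to German: TPU generation is fast", return_tensors="pt"
).to(device)

# The weights and inputs live on the XLA device, but generate() as of this
# thread contains ops XLA can't lower, so steps fall back to the host.
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```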
