Why is embedding on the CPU far faster than on the GPU?

#10
by ViGeng - opened

Hello guys!
I am playing around with ViT inference speed and have measured the time cost of the embedding and the encoder separately on CPU and GPU.
The results, to my surprise, are:

batch_size  embedding device  encoder device  img_processor (ms)  embedding (ms)  encoder (ms)  total (ms)
         1  CPU               CPU                              4               1            71          73
         1  CPU               GPU                              3               1           164         166
         1  GPU               GPU                              4             330             5         349
        16  GPU               GPU                             47             319             7         326
        16  CPU               CPU                             54               8           961         970

GPU: RTX 3090 Ti
CPU: Intel i9-12900KF
Pretrained weights: google/vit-base-patch16-224-in21k

I can understand that the GPU is faster than the CPU for encoding. But:

  • Why is the CPU faster than the GPU for embedding, given that both the embedding and the encoder are neural-network components performing matrix multiplications? (See the timing sketch after this list.)
  • When I use the CPU for embedding and the GPU for encoding, I save time on embedding but lose some time on encoding, which I also cannot explain.
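
One thing worth checking before trusting the per-stage numbers: CUDA kernels launch asynchronously, so wrapping a single submodule in time.time() can charge one stage with work (or one-time CUDA setup) that belongs elsewhere. Below is a minimal timing sketch with explicit torch.cuda.synchronize() calls and a dummy input; the timed helper is mine, not part of my original test:

import time

import torch
from transformers import ViTModel

device = 'cuda:0'
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').to(device).eval()

def timed(label, fn):
    torch.cuda.synchronize()  # drain pending GPU work before starting the clock
    start = time.time()
    out = fn()
    torch.cuda.synchronize()  # wait until this stage has actually finished
    print(f"{label}: {(time.time() - start) * 1000:.1f} ms")
    return out

pixel_values = torch.randn(1, 3, 224, 224, device=device)  # dummy 224x224 input
with torch.no_grad():
    hidden = timed('embedding', lambda: model.embeddings(pixel_values))
    encoded = timed('encoder', lambda: model.encoder(hidden).last_hidden_state)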

Here is my test class; you may need to add some .to(device) calls to both the class and ViTModel to control where each part runs:

import time

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel


class ObjectDetector:
    def __init__(self, cuda_device='cuda:0'):
        self.device = cuda_device

        self.img_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')

        self.model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').eval()
        # Move the whole model (or individual submodules) to control placement, e.g.:
        # self.model = self.model.to(self.device)
        # self.model.embeddings = self.model.embeddings.to('cpu')

    # load a single image or a list of images
    def extract(self, image):
        before_time = time.time()
        inputs = self.img_processor(images=image, return_tensors="pt")
        after_img_processor = time.time()
        with torch.no_grad():  # inference only; skip autograd bookkeeping
            outputs = self.model(**inputs)
        after_model = time.time()
        print(f"Time taken for image processor: {(after_img_processor - before_time) * 1000:.1f} ms")
        print(f"Time taken for model: {(after_model - after_img_processor) * 1000:.1f} ms")
        return outputs


def main():
    detector = ObjectDetector()
    images = [Image.open(f'/home/rowan/source/edge-apps/datasets/batch/{i}.jpg') for i in range(16)]
    outputs = detector.extract(images)
    print(outputs.keys())


if __name__ == "__main__":
    main()
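
For the mixed rows (embedding on CPU, encoder on GPU), a split along these lines should work. This is a sketch that assumes ViTModel's internal submodules embeddings, encoder, and layernorm, which recent transformers releases expose:

import torch
from transformers import ViTImageProcessor, ViTModel

device = 'cuda:0'
img_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').eval()

# Patch/position embeddings stay on the CPU; the transformer encoder
# and its final layer norm go to the GPU.
model.encoder.to(device)
model.layernorm.to(device)

def extract_split(image):
    pixel_values = img_processor(images=image, return_tensors='pt')['pixel_values']
    with torch.no_grad():
        hidden = model.embeddings(pixel_values)            # runs on CPU
        hidden = hidden.to(device)                         # one host-to-device copy
        encoded = model.encoder(hidden).last_hidden_state  # runs on GPU
        return model.layernorm(encoded)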

Any comments and discussion will be appreciated!

Okay, I finally found the answer:

  • The first batch of inference is slow, but the following batches behave as expected (see the sketch below).
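
A simple loop makes the warm-up effect visible: only the first iteration pays for CUDA context creation and allocator/cuDNN warm-up. A sketch with a dummy batch:

import time

import torch
from transformers import ViTModel

device = 'cuda:0'
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').to(device).eval()
pixel_values = torch.randn(16, 3, 224, 224, device=device)  # dummy batch of 16

with torch.no_grad():
    for i in range(5):
        torch.cuda.synchronize()
        start = time.time()
        model(pixel_values=pixel_values)
        torch.cuda.synchronize()
        print(f"batch {i}: {(time.time() - start) * 1000:.1f} ms")
# Expected pattern: batch 0 is much slower than batches 1-4.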
Google org

@ViGeng
You're absolutely right. CPU embedding can look faster on the ViT model's first inference because it avoids GPU memory allocation and host-to-device data transfer. But this is deceptive: once the CUDA context is warm, subsequent GPU inferences are significantly faster thanks to cached memory and optimized data transfers.
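
You can isolate that transfer and allocation cost: the first copy into a fresh CUDA context is far more expensive than the ones that follow (a quick sketch):

import time

import torch

x = torch.randn(16, 3, 224, 224)
for i in range(3):
    start = time.time()
    y = x.to('cuda:0')
    torch.cuda.synchronize()
    print(f"copy {i}: {(time.time() - start) * 1000:.1f} ms")
# copy 0 includes CUDA context creation and the first allocation; later
# copies reuse PyTorch's caching allocator and are much cheaper.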
