CPU or GPU for running a PyTorch model on Azure?
This is not a code-fix question as such; it is more of a compute capacity question for a production environment. I have created an endpoint for this model on Azure using a STANDARD_DS4_V2 instance (8 cores, 28 GB RAM, 56 GB disk) to score texts that arrive in batches. This is the production environment of a call center, so you can imagine how many rows of transcript are flowing in (streaming data).
My question is: what type of compute instance do you use for this model in production? It is a ~400 MB PyTorch model. For inference, do you use a CPU or a GPU instance? Does the choice matter for inference as well? I know it makes a big difference in training, but is the same true for inference?
In the monitoring tab of the Azure endpoint, I can see that the endpoint is struggling to keep up with the incoming data even though auto-scaling is enabled. Does anyone have experience running this model in production? Which instance types are you using: compute optimized or memory optimized?
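For reference, this is the kind of back-of-the-envelope capacity check I have in mind. It is only a sketch: `replicas_needed` is a hypothetical helper, and the arrival rate, batch size, and per-batch latencies are made-up placeholder numbers, not measurements from my endpoint.

```python
import math


def replicas_needed(rows_per_second, batch_size, seconds_per_batch,
                    target_utilization=0.7):
    """Estimate how many instances are needed to keep up with the stream.

    Hypothetical helper: assumes each instance scores one batch at a time
    and that latency per batch is roughly constant.
    """
    if min(rows_per_second, batch_size, seconds_per_batch) <= 0:
        raise ValueError("rates, batch size and latency must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Rows a single instance can score per second when fully busy.
    rows_per_instance = batch_size / seconds_per_batch
    # Keep some headroom so bursts do not immediately cause a backlog.
    usable_rows_per_instance = rows_per_instance * target_utilization
    return math.ceil(rows_per_second / usable_rows_per_instance)


# Placeholder numbers only; replace with latencies measured on real instances.
cpu_replicas = replicas_needed(rows_per_second=200, batch_size=32,
                               seconds_per_batch=1.6)
gpu_replicas = replicas_needed(rows_per_second=200, batch_size=32,
                               seconds_per_batch=0.2)
print("CPU instances:", cpu_replicas, "GPU instances:", gpu_replicas)
```

With real per-batch latencies measured on a CPU instance and on a GPU instance, the two replica counts could be multiplied by the hourly price of each instance type to compare cost at the same throughput.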