Prompt and performance

#1
by Sciumo - opened

It was unclear what the prompt format should be. Shouldn't model cards include the associated prompt template?
Here is what I used:
```python
template = """{instruct}
USER: {question}
ASSISTANT:
"""
```
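To make the format concrete, here is a minimal sketch of filling that template with Python's `str.format`. The system instruction and question below are hypothetical placeholders, not values from the model card:

```python
template = """{instruct}
USER: {question}
ASSISTANT:
"""

# Hypothetical values for illustration only.
prompt = template.format(
    instruct="You are a helpful assistant.",
    question="What is the capital of France?",
)
print(prompt)
```

The resulting string ends with `ASSISTANT:` so the model continues from the assistant turn.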
The performance was 446.87 ms per token on a TR Pro 3995 with 64 cores and 256 GB RAM, which I classify as slow.
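For reference, converting that latency to throughput shows just how slow it is:

```python
# 446.87 ms per token from the run above.
ms_per_token = 446.87

# Throughput in tokens per second.
tokens_per_sec = 1000.0 / ms_per_token
print(f"{tokens_per_sec:.2f} tokens/s")  # roughly 2.24 tokens/s
```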
Apparently adding CPU cores doesn't really help: on a single NUMA node, the threads just spin waiting on memory access. I'm going to try https://github.com/huggingface/text-generation-inference
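If text-generation-inference works out, requests go to its `/generate` REST endpoint. A sketch of the request body, assuming a server already running on `localhost:8080` (the parameter values here are arbitrary examples):

```python
import json

# Prompt built with the template from the top of this thread.
prompt = (
    "You are a helpful assistant.\n"
    "USER: What is the capital of France?\n"
    "ASSISTANT:\n"
)

# Request body for TGI's /generate endpoint.
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}
body = json.dumps(payload)
print(body)
# To send it: requests.post("http://localhost:8080/generate", json=payload)
```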

![ms per token](mspertoken_1.png)

Here is the CPU utilization:
![CPU utilization](cpu_1.png)
