Every response starts with <|start_header_id|>assistant<|end_header_id|>
#2 · opened by notadib
vLLM parameters:
vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic --max-model-len 32000 --max_num_batched_tokens 32000 -tp 2 --max_num_seqs 256 --gpu-memory-utilization 0.95 --tokenizer-pool-size 4 --num_scheduler_steps 16 --max_logprobs 20
It seems to be related to the tokenizer config. Since this quantization uses the same configuration as the original model, the error is unlikely to be caused by the quantization itself.
Did you encounter the same error when running the original model?
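If it helps narrow things down, here is a minimal sketch for diffing the tokenizer config of the quant against the base model. The base-model ID and the specific fields to compare are my assumptions; a leaked `<|start_header_id|>assistant<|end_header_id|>` prefix usually points at the chat template or at special tokens not being skipped during decoding:

```python
# Sketch: compare the tokenizer/chat-template config of the FP8 quant
# against the original model (base-model ID assumed; it is gated on the Hub).
from transformers import AutoTokenizer

quant = AutoTokenizer.from_pretrained("cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic")
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Diff the fields most likely to cause a leaked assistant header.
print("chat_template matches:", quant.chat_template == base.chat_template)
print("special tokens match: ", quant.special_tokens_map == base.special_tokens_map)

# Render a one-turn prompt with both templates to spot any divergence.
msgs = [{"role": "user", "content": "Hello"}]
print(quant.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
print(base.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```

If both templates match, the prefix is more likely coming from the serving side (e.g. special tokens not being stripped from the output) than from this repo's config.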