fix(remove): use_cached_output is not an option
main.py CHANGED
@@ -64,7 +64,6 @@ engine_llama_3_2: LLM = LLM(
     # Your Tesla T4 GPU has compute capability 7.5.
     # You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
     dtype='half', # Use 'half' for T4
-    use_cached_outputs=True, # Enable caching
 )

 # ValueError: max_num_batched_tokens (512) is smaller than max_model_len (32768).
@@ -80,7 +79,6 @@ engine_sailor_chat: LLM = LLM(
     max_model_len=32768,
     enforce_eager=True, # Disable CUDA graph
     dtype='half', # Use 'half' for T4
-    use_cached_outputs=True, # Enable caching
 )
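
For context, a minimal sketch of what one of these engine initializations looks like after this commit. The model ID below is a placeholder (the real one sits outside these hunks), and `max_num_batched_tokens` is an assumed addition shown only to illustrate clearing the quoted ValueError, since it must be at least `max_model_len`:

from vllm import LLM

# Post-commit sketch; the model ID is hypothetical, not taken from this diff.
engine_sailor_chat: LLM = LLM(
    model="sail/Sailor-1.8B-Chat",   # placeholder model ID, for illustration only
    max_model_len=32768,
    max_num_batched_tokens=32768,    # assumed fix: must be >= max_model_len
    enforce_eager=True,              # Disable CUDA graph
    dtype='half',                    # T4 (compute capability 7.5) lacks bfloat16
    # use_cached_outputs=True,       # removed: not an accepted LLM/EngineArgs option
)

Since `LLM` forwards unrecognized keyword arguments to `EngineArgs`, a field that does not exist there is rejected as an unexpected keyword, which is why the commit drops the argument outright rather than setting it to False.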