runtime error

peating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MiB
llama_new_context_with_model: total VRAM used: 4505.56 MiB (model: 4349.55 MiB, context: 156.00 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Caching examples at: '/home/user/app/gradio_cached_examples/16'
Caching example 1/1
llama_print_timings:        load time =  1143.70 ms
llama_print_timings:      sample time =   321.29 ms /   608 runs   (  0.53 ms per token, 1892.37 tokens per second)
llama_print_timings: prompt eval time =  1143.41 ms /   147 tokens (  7.78 ms per token,  128.56 tokens per second)
llama_print_timings:        eval time = 45757.31 ms /   607 runs   ( 75.38 ms per token,   13.27 tokens per second)
llama_print_timings:       total time = 49455.69 ms
Traceback (most recent call last):
  File "/home/user/app/app.py", line 71, in <module>
    demo.queue(concurrency_count=1, max_size=5)
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1715, in queue
    raise DeprecationWarning(
DeprecationWarning: concurrency_count has been deprecated. Set the concurrency_limit directly on event listeners e.g. btn.click(fn, ..., concurrency_limit=10) or gr.Interface(concurrency_limit=10). If necessary, the total number of workers can be configured via `max_threads` in launch().
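The model loads and generates fine; the crash comes from the call at app.py line 71, where `concurrency_count` is passed to `demo.queue()` on a Gradio version that no longer accepts it. Following the deprecation message, the limit now belongs on the event listener (or on `gr.Interface`), with `max_size` still set on `queue()` and the overall worker count optionally capped via `max_threads` in `launch()`. Below is a minimal sketch of that change; the handler name `respond` and the Blocks layout are hypothetical stand-ins for whatever app.py actually builds around the llama.cpp model.

def respond(message: str) -> str:
    # Placeholder for the real llama.cpp-backed generation function.
    return message

import gradio as gr

with gr.Blocks() as demo:
    box = gr.Textbox(label="Prompt")
    out = gr.Textbox(label="Completion")
    btn = gr.Button("Generate")
    # concurrency_count moves from queue() onto the event listener:
    btn.click(respond, inputs=box, outputs=out, concurrency_limit=1)

demo.queue(max_size=5)       # max_size is still passed to queue()
demo.launch(max_threads=40)  # optional: cap the total number of worker threads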
