Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 128 as batch size
    And 1024 KV cache size
    And 128 max tokens to predict
    And continuous batching
  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And 1.0 temperature
    Then the server is starting
    Then the server is healthy
    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      # FIXME: unified KV cache nondeterminism
      # | 2     |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And 1.0 temperature
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And <temp> temperature
    # And 0 as draft
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And <n_kv> KV cache size
    And 1.0 temperature
    And <n_predict> max tokens to predict
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "The meaning of life is" with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |