Spaces:

DanofficeIT
/

privatellm

Runtime error

privatellm / examples /server /tests /features /results.feature

lhhj

first

57e3690 19 days ago

4.35 kB

	@llama.cpp
	@results
	Feature: Results

	Background: Server startup
	Given a server listening on localhost:8080
	And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
	And a model file test-model-00001-of-00003.gguf
	And 128 as batch size
	And 1024 KV cache size
	And 128 max tokens to predict
	And continuous batching

	Scenario Outline: consistent results with same seed
	Given <n_slots> slots
	And 1.0 temperature
	Then the server is starting
	Then the server is healthy

	Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42

	Given concurrent completion requests
	Then the server is busy
	Then the server is idle
	And all slots are idle
	Then all predictions are equal
	Examples:
	\| n_slots \|
	\| 1 \|
	# FIXME: unified KV cache nondeterminism
	# \| 2 \|

	Scenario Outline: different results with different seed
	Given <n_slots> slots
	And 1.0 temperature
	Then the server is starting
	Then the server is healthy

	Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
	Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
	Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
	Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45

	Given concurrent completion requests
	Then the server is busy
	Then the server is idle
	And all slots are idle
	Then all predictions are different
	Examples:
	\| n_slots \|
	\| 1 \|
	\| 2 \|

	Scenario Outline: consistent results with same seed and varying batch size
	Given 4 slots
	And <temp> temperature
	# And 0 as draft
	Then the server is starting
	Then the server is healthy

	Given 1 prompts "Write a very long story about AI." with seed 42
	And concurrent completion requests
	# Then the server is busy # Not all slots will be utilized.
	Then the server is idle
	And all slots are idle

	Given <n_parallel> prompts "Write a very long story about AI." with seed 42
	And concurrent completion requests
	# Then the server is busy # Not all slots will be utilized.
	Then the server is idle
	And all slots are idle

	Then all predictions are equal
	Examples:
	\| n_parallel \| temp \|
	\| 1 \| 0.0 \|
	\| 1 \| 1.0 \|
	# FIXME: unified KV cache nondeterminism
	# See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
	# and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
	# and https://github.com/ggerganov/llama.cpp/pull/7347 .
	# \| 2 \| 0.0 \|
	# \| 4 \| 0.0 \|
	# \| 2 \| 1.0 \|
	# \| 4 \| 1.0 \|

	Scenario Outline: consistent token probs with same seed and prompt
	Given <n_slots> slots
	And <n_kv> KV cache size
	And 1.0 temperature
	And <n_predict> max tokens to predict
	Then the server is starting
	Then the server is healthy

	Given 1 prompts "The meaning of life is" with seed 42
	And concurrent completion requests
	# Then the server is busy # Not all slots will be utilized.
	Then the server is idle
	And all slots are idle

	Given <n_parallel> prompts "The meaning of life is" with seed 42
	And concurrent completion requests
	# Then the server is busy # Not all slots will be utilized.
	Then the server is idle
	And all slots are idle

	Then all token probabilities are equal
	Examples:
	\| n_slots \| n_kv \| n_predict \| n_parallel \|
	\| 4 \| 1024 \| 1 \| 1 \|
	# FIXME: unified KV cache nondeterminism
	# See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
	# and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
	# and https://github.com/ggerganov/llama.cpp/pull/7347 .
	# \| 4 \| 1024 \| 1 \| 4 \|
	# \| 4 \| 1024 \| 100 \| 1 \|
	# This test still fails even the above patches; the first token probabilities are already different.
	# \| 4 \| 1024 \| 100 \| 4 \|