Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   a model file test-model.gguf
    And   a model alias tinyllama-2
    And   BOS token is 1
    And   42 as server seed
      # KV Cache corresponds to the total amount of tokens
      # that can be stored across all independent sequences: #4130
      # see --ctx-size and #5568
    And   256 KV cache size
    And   32 as batch size
    And   2 slots
    And   64 server max tokens to predict
    And   prometheus compatible metrics exposed
    Then  the server is starting
    Then  the server is healthy
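
    # The Background above roughly corresponds to launching the server with
    # flags such as --alias tinyllama-2 --seed 42 --ctx-size 256 --batch-size 32
    # --parallel 2 --n-predict 64 --metrics; the exact flag names are an
    # assumption about the llama.cpp server CLI, not asserted by these steps.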

  Scenario: Health
    Then the server is ready
    And  all slots are idle
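
    # A minimal sketch of what this scenario polls, assuming a GET /health
    # route (endpoint name and response shape are assumptions here):
    #   import requests
    #   assert requests.get("http://localhost:8080/health").status_code == 200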

  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   the completion is <truncated> truncated
    And   <n_prompt> prompt tokens are processed
    And   prometheus metrics are exposed
    And   metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt                                                                    | n_predict | re_content                                  | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is                                          | 8         | (read\|going)+                              | 18       | 8           | not       |
      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids\|Anna\|forest)+ | 46       | 64          | not       |
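
    # A hedged sketch of the request this outline exercises, assuming the
    # server's /completion endpoint with "prompt"/"n_predict" in the request
    # and "content" in the response (field names are assumptions):
    #   import requests
    #   r = requests.post("http://localhost:8080/completion",
    #                     json={"prompt": "I believe the meaning of life is", "n_predict": 8})
    #   print(r.json()["content"])
    # The prometheus steps then read counters from the GET /metrics route.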

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And the completion is truncated
    And 109 prompt tokens are processed
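
    # Interpretation, not an asserted step: with 2 slots sharing a 256-token
    # KV cache, each slot has roughly 128 tokens of context, so a 109-token
    # prompt plus up to 64 predicted tokens overflows it and the prompt is
    # truncated.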

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   <n_prompt> prompt tokens are processed
    And   the completion is <truncated> truncated

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                        | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+                     | 77       | 8           | disabled         | not       |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird\|Annabyear)+ | -1       | 64          | enabled          |           |
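
    # A rough equivalent of the first example row, assuming the OpenAI-style
    # /v1/chat/completions route (payload fields follow the OpenAI API; the
    # route and response shape are assumptions here):
    #   import requests
    #   r = requests.post("http://localhost:8080/v1/chat/completions",
    #                     json={"model": "llama-2",
    #                           "messages": [{"role": "system", "content": "Book"},
    #                                        {"role": "user", "content": "What is the best book"}],
    #                           "max_tokens": 8, "stream": False})
    #   print(r.json()["choices"][0]["message"]["content"])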

  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And   a system prompt test
    And   a user prompt test
    And   a response format <response_format>
    And   10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format                                                      | n_predicted | re_content  |
      | {"type": "json_object", "schema": {"const": "42"}}                  | 6           | "42"        |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10          | \[ -300 \]  |
      | {"type": "json_object"}                                              | 10          | \{ " Jacky. |

  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
    And  tokens do not begin with BOS
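
    # A sketch of the round trip, assuming /tokenize and /detokenize routes
    # with "content"/"tokens" fields (names are assumptions):
    #   import requests
    #   base = "http://localhost:8080"
    #   toks = requests.post(base + "/tokenize",
    #                        json={"content": "What is the capital of France ?"}).json()["tokens"]
    #   text = requests.post(base + "/detokenize", json={"tokens": toks}).json()["content"]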

  Scenario: Tokenize w/ BOS
    Given adding special tokens
    When tokenizing:
    """
    What is the capital of Germany?
    """
    Then tokens begin with BOS
    Given first token is removed
    Then tokens can be detokenized
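
    # "adding special tokens" presumably sets an add_special-style flag on the
    # tokenize request (the parameter name is an assumption), so the output
    # starts with the BOS token, id 1 per the Background.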

  Scenario: Tokenize with pieces
    When tokenizing with pieces:
    """
    What is the capital of Germany?
    媽
    """
    Then tokens are given with pieces
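
    # Presumably the tokenize request asks for token/piece pairs (e.g. a
    # with_pieces-style option; the name is an assumption), so multi-byte
    # characters such as 媽 can be checked piece by piece.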

  Scenario: Models available
    Given available models
    Then  1 models are supported
    Then  model 0 is identified by tinyllama-2
    Then  model 0 is trained on 128 tokens context
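
    # Likely backed by an OpenAI-style GET /v1/models route (an assumption);
    # the single entry should report the tinyllama-2 alias from the Background
    # and the 128-token training context of the stories260K model.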