Meta-Llama-3-8B-Instruct: "max_new_tokens" is not working for /v1/chat/completions

#137
by kuswe

I've been exploring the serverless Inference API, which lets us send a payload and receive a response. For meta-llama/Meta-Llama-3-8B-Instruct, I'm opting for the OpenAI-style /v1/chat/completions route, because the model returns well-formatted responses there and we can skip the parsing step and extract the useful content directly. I had previously written code against the raw inference endpoint, but I ran into numerous issues parsing and extracting the relevant content from its responses, which is why I prefer this method.
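
For context, my earlier code called the raw text-generation route, roughly like this (a sketch from memory; on that task endpoint max_new_tokens is a documented parameter, but the reply is a generated_text string that I had to parse myself):

# Sketch of my earlier call against the raw text-generation route;
# the response is a JSON list whose "generated_text" field I parsed myself.
curl --location 'https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct' \
--header 'cookie: token=XXXXXXX' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "Give me JAVA and C++ code to build a binary heap tree",
    "parameters": {
        "max_new_tokens": 700
    },
    "options": {
        "use_cache": false,
        "wait_for_model": true
    }
}'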

Despite increasing the max_new_tokens field in the API request to 500 or more, my responses are always truncated. Here's a curl example:

curl --location 'https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct/v1/chat/completions' \
--header 'cookie: token=XXXXXXX' \
--header 'Content-Type: application/json' \
--data '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "Give me JAVA and C++ code to build a binary heap tree"
        }
    ],
    "parameters": {
        "max_new_tokens": 700,
        "stop": [
            "",
            ""
        ]
    },
    "options" : {
        "use_cache": false,
        "wait_for_model": true 
    },
    "stream": false
}'

Response:

{
    "id": "",
    "object": "text_completion",
    "created": 1717570199,
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "system_fingerprint": "2.0.5-dev0-sha-23dfd92",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Here is the code for a binary heap in Java and C++:\n\n**Java:**\n```java\npublic class BinaryHeap {\n    private int[] heap;\n\n    public BinaryHeap(int size) {\n        heap = new int[size + 1];\n        for (int i = 0; i <= size; i++) {\n            heap[i] = Integer.MAX_VALUE; // initialize all elements to infinity\n        }\n    }\n\n    public void insert(int value) {\n        int i = heap.length"
            },
            "logprobs": null,
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 23,
        "completion_tokens": 100,
        "total_tokens": 123
    }
}

Upon inspecting the response, I noticed the completion_tokens field, which reports the number of generated tokens. It consistently maxes out at 100 regardless of the value I pass in max_new_tokens, and finish_reason is "length", so generation is being cut off at a 100-token cap. Any insights on what might be causing this discrepancy?
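
One thing I still want to try: since /v1/chat/completions follows the OpenAI schema, perhaps the TGI-style parameters object is silently ignored on this route and the limit has to be passed as a top-level max_tokens instead. Something like this (an untested sketch; I'm assuming the endpoint reads max_tokens at the top level, per the OpenAI spec):

# Sketch: same request, but with an OpenAI-style top-level max_tokens.
# Assumption: this route ignores "parameters" and honors "max_tokens" instead.
curl --location 'https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct/v1/chat/completions' \
--header 'cookie: token=XXXXXXX' \
--header 'Content-Type: application/json' \
--data '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "Give me JAVA and C++ code to build a binary heap tree"
        }
    ],
    "max_tokens": 700,
    "stream": false
}'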
