
Custom handler for HF Inference Endpoint for LLMLingua

LLMLingua

https://github.com/microsoft/LLMLingua
https://llmlingua.com/

LLMLingua compresses the prompt and the KV-cache to speed up LLM inference and sharpen the model's perception of key information, achieving up to 20x compression with minimal performance loss.

Model: NousResearch/Llama-2-7b-hf

https://huggingface.co/NousResearch/Llama-2-7b-hf

Inference Endpoint Configuration

Task: Custom
Container Type: Default
Instance Type: GPU, Nvidia A10G, 24 GB
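A custom-task endpoint loads a `handler.py` defining an `EndpointHandler` class. Below is a minimal sketch of what such a handler for LLMLingua could look like; it assumes the `llmlingua` package's `PromptCompressor` and maps the payload parameters shown under Usage onto `compress_prompt`. The exact handler shipped with this repo may differ.

```python
# handler.py -- minimal sketch of a custom Inference Endpoint handler
# for LLMLingua (assumes the `llmlingua` package is installed).
from typing import Any, Dict


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Lazy import so this module can be read without llmlingua installed.
        from llmlingua import PromptCompressor

        # `path` is the model directory the endpoint runtime passes in;
        # fall back to the model this card is based on.
        self.compressor = PromptCompressor(
            model_name=path or "NousResearch/Llama-2-7b-hf"
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Payload shape mirrors the sample payload in this card.
        prompt = data["inputs"]
        params = data.get("parameters", {})
        return self.compressor.compress_prompt(
            prompt.split("\n"),
            instruction=params.get("instruction", ""),
            question=params.get("question", ""),
            target_token=params.get("target_token", 200),
            context_budget=params.get("context_budget", "*1.5"),
            iterative_size=params.get("iterative_size", 100),
        )
```

`compress_prompt` returns a dict with `compressed_prompt`, `origin_tokens`, `compressed_tokens`, `ratio`, and `saving`, which is what the endpoint serializes back to the caller (see Expected output below).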

Usage

Sample payload

{
    "inputs": "A long prompt to optimize for the LLM",
    "parameters": {
        "instruction": "",
        "question": "",
        "target_token": 200,
        "context_budget": "*1.5",
        "iterative_size": 100
    }
}
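The payload above can be POSTed to the endpoint like any other HF Inference Endpoint. The sketch below uses only the standard library; `ENDPOINT_URL` and `HF_TOKEN` are placeholders you must fill in with your own endpoint URL and access token.

```python
# Sketch of a client call to the endpoint; URL and token are placeholders.
import json
import urllib.request

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder

payload = {
    "inputs": "A long prompt to optimize for the LLM",
    "parameters": {
        "instruction": "",
        "question": "",
        "target_token": 200,
        "context_budget": "*1.5",
        "iterative_size": 100,
    },
}


def query(url: str = ENDPOINT_URL, token: str = HF_TOKEN) -> dict:
    """POST the compression payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

In practice, `"inputs"` would carry a long prompt such as the GSM8K chain-of-thought sample linked below.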

Prompt sample text:
https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt

Expected output (note: the compressed prompt is lossy by design, so garbled-looking text is expected)

{
    "compressed_prompt": "Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 00ters in total\nSam then took5 boxes 6ters0ters\nHe sold these boxes for 5 *5\nAfterelling these  boxes there were 30330ters remaining\nese form 330 /30 of three\n sold each for2 each, so made * =0 from\n total, he0 $15\nSince his original1 he earned $120 = $115 in profit.\nThe answer is 115",
    "origin_tokens": 2365,
    "compressed_tokens": 174,
    "ratio": "13.6x",
    "saving": ", Saving $0.1 in GPT-4."
}
Model size: 6.74B params (Safetensors, tensor types F32 and FP16)