# Custom handler for HF Inference Endpoint for LLMLingua
## LLMLingua

- https://github.com/microsoft/LLMLingua
- https://llmlingua.com/

LLMLingua compresses the prompt and KV-cache to speed up LLM inference and help the model perceive key information, achieving up to 20x compression with minimal performance loss.
## Model

NousResearch/Llama-2-7b-hf
https://huggingface.co/NousResearch/Llama-2-7b-hf
## Inference Endpoint Configuration

- Task: Custom
- Container Type: Default
- Instance Type: GPU, Nvidia A10G, 24 GB
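For reference, a custom endpoint of this kind is driven by a `handler.py` exposing an `EndpointHandler` class, per Hugging Face's custom handler convention. The sketch below is a minimal, hypothetical version of such a handler: the `PromptCompressor` class and `compress_prompt` method come from the `llmlingua` package, but the exact defaults and field handling here are assumptions, not the repository's actual code.

```python
# Hypothetical handler.py sketch for this endpoint (assumptions noted above).
from typing import Any, Dict


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Imported lazily so the class can be inspected without llmlingua installed.
        from llmlingua import PromptCompressor

        # `path` is the model directory the endpoint mounts (Llama-2-7b here).
        self.compressor = PromptCompressor(
            model_name=path or "NousResearch/Llama-2-7b-hf"
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        prompt = data.get("inputs", "")
        params = data.get("parameters", {}) or {}
        # Forward the payload fields to LLMLingua's prompt compression.
        return self.compressor.compress_prompt(
            context=[prompt],
            instruction=params.get("instruction", ""),
            question=params.get("question", ""),
            target_token=params.get("target_token", 200),
            context_budget=params.get("context_budget", "*1.5"),
            iterative_size=params.get("iterative_size", 100),
        )
```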
## Usage

### Sample payload

```json
{
  "inputs": "A long prompt to optimize for the LLM",
  "parameters": {
    "instruction": "",
    "question": "",
    "target_token": 200,
    "context_budget": "*1.5",
    "iterative_size": 100
  }
}
```
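The payload above can be sent to the endpoint with a small standard-library client. This is a sketch: the endpoint URL and token are placeholders you must fill in, and `build_payload` is a helper introduced here for illustration, mirroring the sample payload fields.

```python
# Minimal client sketch using only the standard library (URL/token are placeholders).
import json
from urllib import request


def build_payload(prompt: str, target_token: int = 200) -> dict:
    # Mirrors the sample payload above.
    return {
        "inputs": prompt,
        "parameters": {
            "instruction": "",
            "question": "",
            "target_token": target_token,
            "context_budget": "*1.5",
            "iterative_size": 100,
        },
    }


def query(api_url: str, token: str, payload: dict) -> dict:
    req = request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a deployed endpoint):
# result = query("https://<endpoint>.endpoints.huggingface.cloud", "<HF_TOKEN>",
#                build_payload("A long prompt to optimize for the LLM"))
# print(result["compressed_prompt"], result["ratio"])
```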
Sample prompt text:
https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
### Expected output

Note that the `compressed_prompt` text is intentionally lossy; the fragments below are the actual compressed output, not a transcription error.

```json
{
  "compressed_prompt": "Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 00ters in total\nSam then took5 boxes 6ters0ters\nHe sold these boxes for 5 *5\nAfterelling these boxes there were 30330ters remaining\nese form 330 /30 of three\n sold each for2 each, so made * =0 from\n total, he0 $15\nSince his original1 he earned $120 = $115 in profit.\nThe answer is 115",
  "origin_tokens": 2365,
  "compressed_tokens": 174,
  "ratio": "13.6x",
  "saving": ", Saving $0.1 in GPT-4."
}
```
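The `ratio` field can be sanity-checked from the other two numbers: it is simply `origin_tokens / compressed_tokens`, rounded to one decimal place.

```python
# Verify the reported compression ratio from the token counts above.
origin_tokens = 2365
compressed_tokens = 174
ratio = f"{origin_tokens / compressed_tokens:.1f}x"
print(ratio)  # 13.6x
```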