Multi-round conversation w/ PKV cache example code

#5 opened by Xenova (HF staff)

Hi there! As seen in your README, the model seemingly supports multi-round conversations. Does this also work with passing past key values? If so, could you provide example code for this, as it will dramatically improve performance? Thanks!
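A minimal sketch of the pattern being asked about, using a toy stand-in for a decoder's per-layer KV cache (all names here are illustrative, not the model's actual API): on the second round, the keys/values computed in round one are reused, so only the newly appended tokens need to be processed.

```python
class ToyAttentionCache:
    """Illustrative KV cache: stores one (key, value) entry per processed token."""
    def __init__(self):
        self.keys = []
        self.values = []

def process_tokens(tokens, cache):
    """Compute K/V only for tokens the cache hasn't seen yet; return that count."""
    new = tokens[len(cache.keys):]       # skip tokens already in the cache
    for t in new:
        cache.keys.append(f"K({t})")     # stand-in for a real key projection
        cache.values.append(f"V({t})")   # stand-in for a real value projection
    return len(new)

cache = ToyAttentionCache()
round1 = ["<user>", "Hi", "<assistant>"]
computed1 = process_tokens(round1, cache)   # full prefill: all 3 tokens

# Round two: the full history plus 4 new tokens.
round2 = round1 + ["Hello!", "<user>", "Bye", "<assistant>"]
computed2 = process_tokens(round2, cache)   # only the 4 new tokens are computed
```

With a real model the cache would hold per-layer tensors rather than strings, but the bookkeeping is the same: the amount of work per round is proportional to the new tokens, not the whole history.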

Hi @Xenova , i honestly do not know the answer, i will look into it to see if it is possible.

Great! It will greatly speed up time-to-first-token for the web demo I'm working on. If it doesn't work, that's alright: it will produce the same results, just a bit slower, since the KV cache needs to be recomputed on the second run.
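As a rough illustration of the cost being avoided (the token counts below are made up, not measured from the model): without cache reuse, every round re-prefills the entire conversation history before the first new token can be produced.

```python
HISTORY = 500     # hypothetical prompt-plus-round-one length, in tokens
PER_ROUND = 50    # hypothetical tokens added per subsequent round

def prefill_without_cache(n):
    """Tokens processed at the start of round n when the cache is discarded."""
    return HISTORY + PER_ROUND * (n - 1)

def prefill_with_cache(n):
    """Tokens processed at the start of round n when past K/V are reused."""
    return HISTORY if n == 1 else PER_ROUND

cost_without = prefill_without_cache(5)   # 500 + 50*4 = 700 tokens recomputed
cost_with = prefill_with_cache(5)         # only the 50 new tokens
```

The gap grows linearly with conversation length, which is why reuse matters most for time-to-first-token in long multi-round chats.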

Okay, I've got it working! It currently doesn't work in transformers due to a bug: when the user passes in a past KV cache, generation always looks at only the last input token, even when the user supplies more than one new input token.
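The shape of that bug can be sketched with plain list slicing (a simplified stand-in for the input-preparation step, not the library's actual code): when a past KV cache of length `past_length` is supplied, the correct behaviour is to keep every token the cache hasn't covered yet, not just the final one.

```python
def select_tokens_buggy(input_ids, past_length):
    """Buggy behaviour: always keep only the final token."""
    return input_ids[-1:]

def select_tokens_fixed(input_ids, past_length):
    """Fixed behaviour: keep every token beyond the cached prefix."""
    return input_ids[past_length:]

ids = [101, 7, 8, 9, 10]   # 5 tokens total (illustrative IDs)
past = 2                   # 2 tokens already covered by the KV cache

buggy = select_tokens_buggy(ids, past)   # drops tokens 8 and 9 entirely
fixed = select_tokens_fixed(ids, past)   # keeps all three uncached tokens
```

In a multi-round chat the user's whole new turn is appended at once, so silently dropping all but its last token produces wrong outputs whenever a cache is passed in externally.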

I've updated this in transformers.js and will put out a demo with this.

I've updated the model card + released the demo! :)

Model: https://huggingface.co/Xenova/nanoLLaVA
Demo: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu
