[TemporarySolution] - Using KoboldAI for inference with WizardLM-7B

#6
by ddaattaa - opened

Hello, friends. For those of you experiencing significant slowdowns when inferencing with text-generation-webui (myself included), I have what I hope is an adequate solution. It's possible to load WizardLM-7B through KoboldAI, where it performs much faster than it does in textgen for whatever reason. This example output took roughly six seconds to generate:
(Screenshot: Screenshot_2023_04_27-9.png)

In order to use KoboldAI w/ WizardLM-7B, the following needs to be done:

  1. Download and install this fork of KoboldAI that supports loading models in 4bit quantization: https://github.com/0cc4m/KoboldAI.
  2. Once installed, launch it by running the play.bat file. It should take you to the KoboldAI webpage.
  3. Click the "Try New UI" button; this is the only way to use 4bit quantization.
  4. Once it takes you to the New UI, click the "Interface" button.
  5. Expand the "UI" dropdown menu and turn on the "Experimental UI" feature; this is what enables you to use 4bit quantization.
  6. Put your preferred model, in this case WizardLM-7B, in the KoboldAI\models folder.
  7. Trying to use the safetensors file won't work (at least, it didn't in my case); you'll have to use one of the .pt files instead. They were recently removed from the HF repo, so I've uploaded them here: https://mega.nz/folder/IeUgUbaZ#C8Ng-81-DAV_qfWqbVMoEw.
  8. Rename one of the .pt files to "4bit-128g.pt"; this is how KoboldAI detects which quantization and group size it should load the model with (see the rename sketch after this list).
  9. Click "Load Model", select the WizardLM-7B folder, and make sure "Use 4bit mode" is turned on.
  10. You should be able to load the model just fine and start generating. Have fun!
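
As an aside on step 8, here's a minimal Python sketch of the rename. The install path and the original .pt filename are assumptions; adjust both to match your setup:

```python
from pathlib import Path

# Assumed locations -- adjust to your own install and download paths.
model_dir = Path(r"C:\KoboldAI\models\WizardLM-7B")
source_pt = model_dir / "wizardlm-7b-4bit-128g.pt"  # hypothetical original filename

# KoboldAI infers the quantization (4bit) and group size (128) from this exact name.
target_pt = model_dir / "4bit-128g.pt"

if source_pt.exists() and not target_pt.exists():
    source_pt.rename(target_pt)
    print(f"Renamed {source_pt.name} -> {target_pt.name}")
```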

I was able to get it to run faster on my system. In the config.json file, I set "use_cache" to true and went from 1.8 tokens/s to 11.5 tokens/s.

Running a 2080 Ti on Windows.
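
If you'd rather script that tweak than edit the file by hand, here's a minimal sketch; the model path is an assumption, so point it at your own models folder:

```python
import json
from pathlib import Path

# Assumed path -- point this at your own KoboldAI models folder.
config_path = Path(r"C:\KoboldAI\models\WizardLM-7B\config.json")

config = json.loads(config_path.read_text())
# use_cache enables the key/value cache, so attention over past tokens isn't
# recomputed on every generation step -- that's where the speedup comes from.
config["use_cache"] = True
config_path.write_text(json.dumps(config, indent=2))
print("use_cache =", config["use_cache"])
```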


Just tried it out, and you were not kidding. The difference is night and day; this may very well have been the initial problem.


You beauty! That was it!

Thank you! I will inform everyone that the problem is solved, and I'll know to look for that in the future!
