
sophosympatheia/Midnight-Miqu-70B-v1.5 quantized for use with MLC-LLM

MLC-LLM is an Apache TVM-based inference framework with a neat trick: of all the frameworks that support tensor-parallel inference, MLC-LLM is by far the easiest to install for single-user inference. Tensor parallelism allows for linear-ish performance scaling on 2x 3090, 2x 4090, or 2x 7900 XTX, achieving about 30 tokens per second on this model. In particular, this is faster than a single 48GB card that costs much more.

MLC-LLM requires Ubuntu 22.04 or above (the prebuilt wheels will not work on Ubuntu 20.04). For CUDA users:

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

git clone https://huggingface.co/bayley/Midnight-Miqu-70B-v1.5-q4f16_1-MLC

(Make sure git-lfs is installed first, or the clone will fetch pointer files instead of the weights.)

mlc_llm compile Midnight-Miqu-70B-v1.5-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=<number of gpus>" -o Midnight-Miqu-70B-v1.5-q4f16_1-cuda.so

mlc_llm serve Midnight-Miqu-70B-v1.5-q4f16_1-MLC --model-lib Midnight-Miqu-70B-v1.5-q4f16_1-cuda.so --host 0.0.0.0

This should start an OpenAI-compatible REST server serving a chat completions endpoint that you can connect to with your favorite frontend.
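
For example, you can exercise the endpoint with the official `openai` Python client. This is a minimal sketch, assuming the server's default port of 8000 and that the model is addressed by the same name passed to `mlc_llm serve`:

```python
# Minimal smoke test for the local MLC-LLM server.
# Assumes the default port (8000); adjust base_url if you changed it.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",  # the local server does not validate API keys
)

response = client.chat.completions.create(
    model="Midnight-Miqu-70B-v1.5-q4f16_1-MLC",
    messages=[
        {"role": "user", "content": "Write a limerick about tensor parallelism."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```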

This quant inherits Midnight-Miqu-70B's complex licensing history, which includes the leaked miqu-1-70b in its merge lineage. As such, it should probably not be used for anything that matters.
