Serverless GPUs let you scale machine learning inference without managing servers, and deploy complex custom models with ease.

Follow this tutorial to quickly deploy Mixtral-8x7B-v0.1 using Inferless.

Mixtral-8x7B - GPTQ

Model creator: Mistralai
Original model: Mixtral-8x7B-v0.1

Description

This repo contains GPTQ model files for Mistralai's Mixtral-8x7B-v0.1.

About GPTQ

GPTQ is a post-training quantization method that reduces model size and speeds up inference. It quantizes the weights in a single pass after training, using a calibration dataset to minimize the mean squared error that quantization introduces.
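
To make the idea of weight quantization concrete, here is a minimal sketch of group-wise 4-bit quantization and the mean squared error it introduces. It is not GPTQ itself: it uses plain round-to-nearest on the weights, whereas GPTQ additionally uses calibration data to compensate for the error of each rounded weight. The function names, the synthetic weights, and the group size of 128 are illustrative assumptions.

```python
import random


def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one group of weights."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not divide by zero.
    scale = (hi - lo) / levels or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, bits=4, group_size=128):
    """Quantize and immediately dequantize, one group at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, lo = quantize_group(group, bits)
        restored.extend(dequantize_group(codes, scale, lo))
    return restored


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for a weight row (hypothetical data, not model weights).
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
restored = quantize(weights)
error = mse(weights, restored)
print(f"weight MSE after 4-bit quantization: {error:.3e}")
```

Each group stores only small integer codes plus one scale and one offset, which is where the memory saving comes from.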

It is supported by:

Text Generation Webui - using Loader: AutoGPTQ
vLLM - version 0.2.2 or later, which supports all model types
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoGPTQ - for use from Python code

Shared files, and GPTQ parameters

Models are released as sharded safetensors files.

Branch	Bits	Group Size	GPTQ Dataset	Seq Len	Size
main	4	128	VMware Open Instruct	4096	5.96 GB
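
As a rough illustration of how the Bits and Group Size columns interact, the sketch below estimates the storage cost per quantized weight. It assumes one 16-bit scale and one 4-bit zero point are stored per group, which is a common GPTQ layout but is not confirmed by this repo. It counts quantized weights only, so it is not meant to reproduce the Size column. The function names are hypothetical.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits stored per weight, including per-group scale and zero point.

    scale_bits and zero_bits are assumptions about the storage layout.
    """
    return bits + (scale_bits + zero_bits) / group_size


def estimated_size_gb(num_weights, **kwargs):
    """Approximate size in GB (10^9 bytes) of num_weights quantized weights."""
    return num_weights * effective_bits_per_weight(**kwargs) / 8 / 1e9


print(effective_bits_per_weight())        # overhead of group size 128 at 4 bits
print(estimated_size_gb(1_000_000_000))   # per billion quantized weights
```

A smaller group size tracks the weights more closely but stores more scales and zero points, so the per-weight cost rises.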

How to use

You will need the following software packages and Python libraries:

build:
  cuda_version: "12.1.1"
  system_packages:
    - "libssl-dev"
  python_packages:
    - "torch==2.1.2"
    - "vllm==0.2.6"
    - "transformers==4.36.2"
    - "accelerate==0.25.0"
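
With those packages installed, inference could look roughly like the sketch below, which loads the model through vLLM inside a class shaped like an Inferless handler. Several things here are assumptions rather than details taken from this repo: the repository ID `Inferless/Mixtral-8x7B-v0.1-int8-GPTQ`, the `InferlessPythonModel` class with `initialize`, `infer`, and `finalize` methods, the `prompt` and `generated_text` keys, and the sampling values. Loading requires a CUDA GPU with enough memory for the model.

```python
# Assumed repository ID; adjust it if the model lives elsewhere.
MODEL_ID = "Inferless/Mixtral-8x7B-v0.1-int8-GPTQ"


def build_sampling_kwargs(temperature=0.7, top_p=0.95, max_tokens=256):
    """Illustrative sampling settings (hypothetical defaults)."""
    return {"temperature": temperature, "top_p": top_p, "max_tokens": max_tokens}


class InferlessPythonModel:
    """Handler sketch; the class and method names are an assumed convention."""

    def initialize(self):
        # Imported lazily so the module can be inspected without a GPU.
        from vllm import LLM

        # quantization="gptq" tells vLLM to load GPTQ weights.
        self.llm = LLM(model=MODEL_ID, quantization="gptq", dtype="float16")

    def infer(self, inputs):
        from vllm import SamplingParams

        params = SamplingParams(**build_sampling_kwargs())
        outputs = self.llm.generate([inputs["prompt"]], params)
        # Each request returns a list of completions; take the first one.
        return {"generated_text": outputs[0].outputs[0].text}

    def finalize(self):
        # Drop the reference so the engine can be garbage collected.
        self.llm = None
```

A local smoke test on a GPU machine would call `initialize()` once and then `infer({"prompt": "What is Mixtral?"})`.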
