
This is a quantized version of Llama-3 70B Instruct, produced with GPTQ, a post-training quantization method developed at IST Austria, using the following configuration (see the sketch after the list):

  • Bits: 4 (an 8-bit version will follow)
  • Act order: True
  • Group size: 128
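
For reference, here is a minimal sketch of the same configuration expressed as a transformers GPTQConfig; the calibration dataset and remaining hyperparameters used by the authors are not documented here, so treat it as illustrative:

    from transformers import GPTQConfig

    # Approximate settings taken from the list above; everything else
    # (calibration data, damping, etc.) is left at library defaults.
    quantization_config = GPTQConfig(
        bits=4,          # 4-bit weights (an 8-bit variant will follow)
        group_size=128,  # one scale/zero-point per group of 128 weights
        desc_act=True,   # act order: quantize columns by decreasing activation magnitude
    )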

Usage

Install vLLM (pip install vllm) and start the OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server --model cortecs/Meta-Llama-3-70B-Instruct-GPTQ
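
At 4-bit precision the 70B weights still occupy roughly 40 GB, so the model may not fit on a single GPU. On a multi-GPU machine (such as the 2x L40S setup in the Performance section below), vLLM can shard it via tensor parallelism:

    python -m vllm.entrypoints.openai.api_server \
        --model cortecs/Meta-Llama-3-70B-Instruct-GPTQ \
        --tensor-parallel-size 2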

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Meta-Llama-3-70B-Instruct-GPTQ",
        "prompt": "San Francisco is a"
    }'
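
The same endpoint can also be queried with the openai Python client (pip install openai); a minimal sketch, assuming the server above is running on localhost:8000:

    from openai import OpenAI

    # vLLM does not verify the API key by default, so any placeholder works.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="cortecs/Meta-Llama-3-70B-Instruct-GPTQ",
        prompt="San Francisco is a",
        max_tokens=64,
    )
    print(completion.choices[0].text)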

Evaluations

| English | Llama-3 70B Instruct | Llama-3 70B Instruct GPTQ | Mixtral Instruct |
|---|---|---|---|
| Avg. | 76.19 | 75.14 | 73.17 |
| ARC | 71.6 | 70.7 | 71.0 |
| Hellaswag | 77.3 | 76.4 | 77.0 |
| MMLU | 79.66 | 78.33 | 71.52 |

| French | Llama-3 70B Instruct | Llama-3 70B Instruct GPTQ | Mixtral Instruct |
|---|---|---|---|
| Avg. | 70.97 | 70.27 | 68.7 |
| ARC_fr | 65.0 | 64.7 | 63.9 |
| Hellaswag_fr | 72.4 | 71.4 | 77.1 |
| MMLU_fr | 75.5 | 74.7 | 65.1 |

| German | Llama-3 70B Instruct | Llama-3 70B Instruct GPTQ | Mixtral Instruct |
|---|---|---|---|
| Avg. | 68.43 | 66.93 | 66.47 |
| ARC_de | 64.2 | 62.6 | 62.8 |
| Hellaswag_de | 67.8 | 66.7 | 72.1 |
| MMLU_de | 73.3 | 71.5 | 64.5 |

| Italian | Llama-3 70B Instruct | Llama-3 70B Instruct GPTQ | Mixtral Instruct |
|---|---|---|---|
| Avg. | 70.17 | 68.63 | 67.17 |
| ARC_it | 64.0 | 62.1 | 63.8 |
| Hellaswag_it | 72.6 | 71.0 | 75.6 |
| MMLU_it | 73.9 | 72.8 | 62.1 |

| Safety | Llama-3 70B Instruct | Llama-3 70B Instruct GPTQ | Mixtral Instruct |
|---|---|---|---|
| Avg. | 64.28 | 63.64 | 63.56 |
| RealToxicityPrompts | 97.9 | 98.1 | 93.2 |
| TruthfulQA | 61.91 | 59.91 | 64.61 |
| CrowS | 33.04 | 32.92 | 32.86 |

| Spanish | Llama-3 70B Instruct | Llama-3 70B Instruct GPTQ | Mixtral Instruct |
|---|---|---|---|
| Avg. | 72.5 | 71.3 | 68.8 |
| ARC_es | 66.7 | 65.7 | 64.4 |
| Hellaswag_es | 75.8 | 74.0 | 77.5 |
| MMLU_es | 75.0 | 74.2 | 64.6 |

Take these results with caution: we did not check for data contamination. Evaluation was done with the LM Evaluation Harness, using limit=1000 for large datasets.
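
The exact invocation is not documented; with the current lm-evaluation-harness CLI, a comparable run would look roughly like this (the task list shown is illustrative, not the authors' exact setup):

    lm_eval --model hf \
        --model_args pretrained=cortecs/Meta-Llama-3-70B-Instruct-GPTQ \
        --tasks arc_challenge,hellaswag,mmlu \
        --limit 1000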

Performance

| Hardware | requests/s | tokens/s |
|---|---|---|
| NVIDIA L40Sx2 | 2 | 951.28 |
