---
library_name: transformers
license: llama3.1
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# This model has been xMADified!

This repository contains [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) quantized from 16-bit floats to 4-bit integers, using xMAD.ai proprietary technology.

# Why should I use this model?

1. **Accuracy:** This xMADified model is the best **quantized** version of the `meta-llama/Llama-3.1-8B-Instruct` model, outperforming the most downloaded quantized version(s) on most benchmarks (see _Table 1_ below).

2. **Memory-efficiency:** The full-precision model is around 16 GB, while this xMADified model is only 5.7 GB, making it feasible to run on an 8 GB GPU.

3. **Fine-tuning:** These models can be fine-tuned on the same reduced-memory (5.7 GB) hardware in just three clicks. Watch our product demo [here](https://www.youtube.com/watch?v=S0wX32kT90s&list=TLGGL9fvmJ-d4xsxODEwMjAyNA), and see the adapter sketch below this list.
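
As a rough illustration of the reduced-hardware fine-tuning claim, here is a minimal LoRA adapter sketch that trains on top of the quantized weights. It is **not** the 3-click xMAD workflow from the demo; it assumes the checkpoint loads through the open-source `transformers` GPTQ integration (with `optimum`, `auto-gptq`, and `peft` installed; see the package prerequisites below), and the LoRA hyperparameters are illustrative placeholders.

```python
# Hedged sketch: generic LoRA fine-tuning over the quantized checkpoint.
# NOT the official xMAD 3-click workflow; hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
# Loads the 4-bit GPTQ weights; only ~5.7 GB of VRAM is needed for the base model.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Freeze the quantized base weights and enable gradient flow for adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                      # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
# From here, plug `model` into your usual transformers Trainer / training loop.
```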


## Table 1: xMAD vs. Unsloth vs. Meta

| Model                                                                                                               | MMLU      | Arc Challenge | Arc Easy  | LAMBADA Standard | LAMBADA OpenAI | PIQA      | Winogrande | HellaSwag |
| ----------------------------------------------------------------------------------------------------------------- | --------- | ------------- | --------- | ---------------- | -------------- | --------- | ---------- | --------- |
| [xmadai/Llama-3.1-8B-Instruct-xMADai-INT4](https://huggingface.co/xmadai/Llama-3.1-8B-Instruct-xMADai-INT4)       | **66.83** | **52.3**      | **82.11** | **65.73**        | **73.30**      | **79.88** | **72.77**  | **58.49** |
| [unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit) | 65.91     | 51.37         | 80.89     | 63.98            | 71.49          | 79.43     | 73.80      | 58.51     |
| [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)                       | 68.05     | 51.71         | 81.9      | 66.18            | 73.55          | 79.87     | 73.72      | 59.10     |



# How to Run the Model

Loading this xMADified model's checkpoint requires less than 6 GiB of VRAM, so it can run efficiently on an 8 GB GPU.

**Package prerequisites**:

1. Run the following commands to install the required packages.
```bash
pip install torch==2.4.0  # If you have CUDA 11.8, use instead: pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/AutoGPTQ.git@v0.7.1"
```
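
Before loading the model, you can optionally sanity-check the environment. The quick check below only assumes the packages above installed correctly.

```python
# Optional sanity check: CUDA-enabled torch and a working AutoGPTQ build.
import torch
import auto_gptq  # raises ImportError if the AutoGPTQ extension failed to build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```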



**Sample Inference Code**

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Load the tokenizer that ships with the xMADified checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

# Apply the Llama 3.1 chat template and move the token ids to the GPU.
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit GPTQ-quantized model weights.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Generate a response and decode it (the prompt is echoed in the output).
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Here's a sample output of the model, using the code above:

> ["system\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant, that responds as a pirate.user\n\nWhat's Deep Learning?assistant\n\nDeep Learning be a fascinatin' field, matey! It's a form o' artificial intelligence that's based on deep neural networks, which be a type o' machine learning algorithm.\n\nYer see, traditional machine learnin' algorithms be based on shallow nets, meaning they've just one or two layers. But deep learnin' takes it to a whole new level, with multiple layers stacked on top o' each other like a chest overflowin' with booty!\n\nEach o' these layers be responsible fer processin' a different aspect o' the data, from basic features to more abstract representations. It's like navigatin' through a treasure map, with each layer helpin' ye uncover the hidden patterns and patterns hidden within the data.\n\nDeep learnin' be often used in image and speech recognition, natural language processing, and even robotics. But it be a complex and challengin' field, matey, and it requires a strong grasp o' mathematics and computer science.\n\nSo hoist the sails and set course fer the world o' deep learnin', me hearty!"]

# Contact Us

For additional xMADified models, access to fine-tuning, and general questions, please contact us at support@xmad.ai and join our waiting list.