
Comparison with bitsandbytes NF4, hoping to increase GPTQ accuracy

#11
by AIReach - opened

I compared FreeWilly2 with bitsandbytes NF4 and this GPTQ main branch. I hope the GPTQ version's accuracy can be improved while preserving its inference speed. Thank you.

Running on two 4090s.
Based on text-generation-webui.
Inference speed: GPTQ 15 tokens/s, NF4 7 tokens/s
Loading speed: GPTQ 8 s, NF4 152 s
GPTQ main branch vs. bitsandbytes v0.40 NF4
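
For reference, the NF4 side goes through the Transformers bitsandbytes integration; below is a rough standalone sketch of that loading path (the model ID, device map, and generation settings are my placeholders, not the exact tgwui configuration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization via bitsandbytes, computing in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "stabilityai/FreeWilly2"  # placeholder; point at your local copy
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across both GPUs
)

prompt = "Do math 99*99-10"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))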

Math calculation:

System:

You are an AI assistant that follows instruction extremely well. Help as much as you can. Think carefully step by step.

User:

Do math 99*99-10

Response:

Comparison: GPTQ reads the number 99 as 9; NF4 calculates 99*99-10 successfully.

GPTQ results:
Sure, let's do the calculation step by step:
First, we need to calculate 9 * 9 which is 81 (9 multiplied by itself). So, 9 * 9 = 81.
Now, subtract 10 from this result: 81 - 10 = 71.
Therefore, the final answer is 71.

NF4 results:
To perform the calculation, follow these steps:
Step 1: Multiply 99 and 99 to find their product (99 * 99).
(99 * 99) = 9801
Step 2: Subtract 10 from the result of Step 1.
(9801 - 10) = 9791
The final answer is 9791.

Python coding problems:

System:

You are an AI assistant that follows instruction extremely well. Help as much as you can. Think carefully step by step.

User:

write a python code, which generate 10 random numbers and sort with 4 different methods from low to high.

Response:

Comparison: GPTQ failed; NF4 defined four sorting functions without issues.

GPTQ results:
import random
numbers = [random.randint(0, 9) for _ in range(10)] # Generate 10 random integers between 0-9
print("Original list of numbers:", numbers)
sorted_list = sorted(numbers) # Sort using built-in 'sorted' function
print("Sorted using 'sorted': ", sorted_list)
sorted_list2 = list(map(lambda x: abs(x), numbers) ) # Convert negative values to positive
sorted_list3 = sorted(sorted_list2) # Sort using lambda function
print("Sorted using lambda function: ", sorted_list3)
sorted_list4 = sorted(numbers, key=abs) # Use 'key' parameter to convert negatives to positives before sorting
print("Sorted using 'key': ", sorted_list4)
sorted_list5 = sorted(numbers, reverse=True) # Reverse order (highest first)
print("Reversed sorted: ", sorted_list5[::-1])

NF4 results:
import random
from time import sleep

def bubblesort(arr):
    for i in range(len(arr)):
        for j in range(i+1, len(arr)):
            if arr[j] < arr[i]:
                arr[i], arr[j] = arr[j], arr[i]

def selectionsort(arr):
    for i in range(len(arr)):
        min_idx = i
        for j in range(i+1, len(arr)):
            if arr[j] < arr[min_idx]:
                min_idx = j
        arr[i], arr[min_idx] = arr[min_idx], arr[i]

def insertionsort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i-1
        while j >= 0 and arr[j] > key:
            arr[j+1] = arr[j]
            arr[j] = key
            j -= 1

def mergesort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = arr[:mid]
    right = arr[mid:]
    left = mergesort(left)
    right = mergesort(right)
    result = []
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result += left[i:] + right[j:]
    return result

numbers = [random.randint(-999, 999) for _ in range(10)]
print("Original list of random numbers:", numbers)
sleep(3)

bubblesort(numbers)
print("Sorted list by Bubble Sort:", numbers)
sleep(3)

selectionsort(numbers)
print("Sorted list by Selection Sort:", numbers)
sleep(3)

insertionsort(numbers)
print("Sorted list by Insertion Sort:", numbers)
sleep(3)

mergesort(numbers)
print("Sorted list by Merge Sort:", numbers)
sleep(3)
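
One small usage note on the generated code (my observation, not part of the model output): unlike the other three functions, mergesort returns a new list instead of sorting in place, so the sorted result only shows up if the return value is captured:

numbers = mergesort(numbers)
print("Sorted list by Merge Sort:", numbers)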

Loading speed 8 s? I have 2x3090, but my loading speed is 78.54 seconds in text-generation-webui/ExLlama. How did you load the model? It shouldn't be that different, as your cards also have GDDR6X VRAM.

Thank you for your feedback, AIReach.
I notice an issue with numbers in all GPTQ versions of LLaMA-2: some are sometimes dropped, and sometimes extra ones are added.
I need to test with bitsandbytes.

Loading speed 8 s? I have 2x3090, but my loading speed is 78.54 seconds in text-generation-webui/ExLlama. How did you load the model? It shouldn't be that different, as your cards also have GDDR6X VRAM.

I tested again; the exllama-hf and exllama loaders take around 10 seconds for the first fresh load and 5.2 seconds for subsequent reloads.

The GPTQ model was stored on a 980 PRO SSD, with 500 GB of its 1 TB free.

🤦‍♂️ Of course the SSD is the bottleneck for loading :) I have 500 GB free on a Kingston 1 TB, but it's a WSL ext4 vhdx, so there's a lot of room for improvement I guess :)

Edit:
It loads in 48 seconds now, just by moving it to a WD Black with only 400 GB of 2 TB free.
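
To check whether the drive really is the limiting factor, one rough test (my own suggestion, with a hypothetical path) is to time a cold read of the model shards and compare the implied MB/s against the drive's rated sequential read speed:

import time
from pathlib import Path

# Hypothetical location of the downloaded GPTQ model
model_dir = Path("models/FreeWilly2-GPTQ")

start = time.time()
total_bytes = 0
for shard in sorted(model_dir.glob("*.safetensors")):
    with open(shard, "rb") as f:
        # Read in 64 MB chunks so we measure disk throughput, not RAM
        while chunk := f.read(64 * 1024 * 1024):
            total_bytes += len(chunk)
elapsed = time.time() - start

print(f"Read {total_bytes / 1e9:.1f} GB in {elapsed:.1f} s "
      f"({total_bytes / 1e6 / elapsed:.0f} MB/s)")

(Run it right after a reboot or cache drop; otherwise the OS page cache makes the numbers look far better than the disk really is.)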

That could be the reason. Mine runs on a Linux system. Not a big deal; I will focus on bitsandbytes, which is slow.

I confirm that by using an HF version on tgwui with Transformers and NF4, the issues with numbers no longer appear. That's good to know.
Currently, I have not found a GPTQ version of LLaMA-2 that does not suffer from this number problem. I mention numbers because the problem is glaring with them, but I imagine it is not limited to numbers. However, numbers are important: with a simple text translation, we lose the date, the amounts mentioned, etc. In short, in their current state, the GPTQ versions are almost unusable.
The GPTQ versions of LLaMA-1 did not have this problem.
So, for now, it's bitsandbytes, and we cross our fingers hoping that GPTQ will correct this issue.
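
One way to see why numbers are so fragile here (my own diagnostic idea, not something established in this thread): the LLaMA tokenizer splits numbers into individual digit tokens, so a single dropped or mis-predicted token is enough to turn 99 into 9. A quick check, assuming the Transformers tokenizer from a LLaMA-2 repo:

from transformers import AutoTokenizer

# Any LLaMA-2 tokenizer will do; the repo name here is a placeholder
tokenizer = AutoTokenizer.from_pretrained("stabilityai/FreeWilly2")

print(tokenizer.tokenize("Do math 99*99-10"))
# Each digit of 99 should appear as its own token, so losing one token
# changes the whole number.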

Thanks for the deeper research, tsalvoch. I hope GPTQ will solve this issue.

Does this problem extend to the GGML quant version? Looking for the best resource/performance/quality quant version for this model.

GGML has the best q4 quantization (only q4_2, not the normal q4), while GPTQ and bitsandbytes are behind.

GGML has the best q4 quantization (only q4_2, not the normal q4), while GPTQ and bitsandbytes are behind.

Thanks, that's exactly the info I needed! Greatly appreciate TheBloke providing these!

GGML has the best q4 quantization (only q4_2, not the normal q4), while GPTQ and bitsandbytes are behind.

Wait, I don't see q4_2 in his GGML list, did you mean 'q4_K_M'?

Yes, I confirm that the GGML 4-bit (5- and 6-bit as well) versions do not have this issue with numbers. They work quite well with LLaMA-2, although they are about 4 times slower than GPTQ with ExLlama. It's the best compromise for me for 13B models until I see what happens with GPTQ. For the 70B models, GGML does not yet seem to work with GPUs and requires using bitsandbytes instead.
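
For the 13B-class models mentioned above, here is a minimal sketch of running a GGML file with partial GPU offload via llama-cpp-python (the file name and layer count are placeholders; adjust n_gpu_layers to your VRAM):

from llama_cpp import Llama

# Hypothetical path to a q4_K_M GGML file; point this at your own download
llm = Llama(
    model_path="models/llama-2-13b.ggmlv3.q4_K_M.bin",
    n_gpu_layers=40,  # layers to offload to the GPU; lower if VRAM runs out
    n_ctx=2048,       # context window
)

output = llm("Do math 99*99-10", max_tokens=128)
print(output["choices"][0]["text"])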
