Unexpected mma -> mma layout conversion failed

#8
by jl303 - opened

When I try to run inference with qwopqwop200/GPTQ-for-LLaMa, I get "Unexpected mma -> mma layout conversion failed."
Which branch was used to quantize the model?
Here's my command.

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python llama_inference.py ../Manticore-13B-GPTQ --load ../Manticore-13B-GPTQ/Manticore-13B-GPTQ-4bit-128g.no-act-order.safetensors --wbits 4 --groupsize 128 --device 0 --text "once upon a time, "

Here's the output.

2023-05-30 10:02:43.888606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading model ...
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage
will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage()
instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
Found 3 unique KN Linear values.
Warming up autotune cache ...
100% 12/12 [00:40<00:00,  3.39s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
  0% 0/12 [00:00<?, ?it/s]python3: /project/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Thanks for your help!

Oh weird.

I am currently quantising with the 'old' CUDA fork. Specifically, the fork put up by oobabooga, here: https://github.com/oobabooga/GPTQ-for-LLaMa

I use this old fork because it maximises compatibility for the majority of people. I've found that if I quantise with newer releases of GPTQ-for-LLaMa, it can cause various problems for users who are still running older versions. Unfortunately there's no version that is guaranteed to work for absolutely everyone, so I picked the one that works for the most people.

TBH I thought it would work fine with newer GPTQ-for-LLaMa code as well, but qwopqwop does keep changing things in his version, so obviously something has broken.

To ensure compatibility with what I run, you can use either the old oobabooga fork, or the newer and better AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ), which also works well. AutoGPTQ is what I recommend to everyone who can use it, and I plan to start using it for my quantisations as soon as it's ready for mass adoption. There are a couple more things that need to happen before then, most notably pre-built binaries becoming available, and those are coming soon.
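
For example, here's a minimal sketch of loading this repo's no-act-order file with AutoGPTQ, assuming you already have AutoGPTQ installed and working on your machine. The quantize settings just mirror the --wbits 4 --groupsize 128 command above, and the paths/variable names are only illustrative:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "../Manticore-13B-GPTQ"  # directory holding the .safetensors and tokenizer files

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# model_basename is the safetensors filename minus the .safetensors extension
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename="Manticore-13B-GPTQ-4bit-128g.no-act-order",
    use_safetensors=True,
    device="cuda:0",
    # 4-bit, group size 128, no act-order, matching the filename above
    quantize_config=BaseQuantizeConfig(bits=4, group_size=128, desc_act=False),
)

prompt = "once upon a time, "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

If the model directory already contains a quantize_config.json, you can drop the quantize_config argument and AutoGPTQ will pick up the settings from there.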

Let me know if I can provide any further help with that.