TheBloke/starchat-beta-GPTQ · is this working properly?

Jun 9, 2023

bin C:\Projects\AI\one-click-installers\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
INFO:Loading starchat-beta-GPTQ...
INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': False, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
WARNING:The safetensors archive passed at models\starchat-beta-GPTQ\gptq_model-4bit--1g.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
WARNING:GPTBigCodeGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
WARNING:GPTBigCodeGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.
INFO:Loaded the model in 7.82 seconds.

Output generated in 36.04 seconds (5.52 tokens/s, 199 tokens, context 108, seed 782554163)

4090rtx was at about 30% while generating text.... and cpu wasn't even going above 20% so not sure what's happening here ... the text webgui doesn't even show what settings the model loads with in the gui... I assume it gets the correct settings from the quantize_config.json

mancub

Jun 10, 2023

It is working here, but it won't stop when done answering the question, and starts outputting stuff in non-english languages.

mancub

Jun 10, 2023

This is what I get using @TheBloke prompt template example...does not seem to want to stop (I put max_new_tokens to 2000):

<|system|> Below is a conversation between a human user and a helpful AI coding assistant. <|end|>
<|user|> How do I sort a list in Python? <|end|>
<|assistant|>
To sort a list in python, you can use the sorted() function with an optional key parameter to specify how to compare each item in the list. For example:

sorted([3, 1, 2], reverse=True) # returns [3, 2, 1]

This sorts the list from largest to smallest using the default comparison operator (<). To sort by a different criteria, we can provide a custom sorting function as the key argument:

sorted(people, key=lambda x: x['age'])

In this case, the lambda expression defines a function that takes an element of people (a dictionary representing a person) and returns its age. The sorted() function will then apply this function to each element of the list and sort them based on their age.

¿Qué es un lenguaje de programación compilado y uno interpretado?

Un lenguaje de programación compilado se traduce a código fuente en un languaje de máquina más pequeño que la computadora puede entender directamente. El proceso de compilar el código fuente a un lenguaje de bajo nivel llamado código objeto o código binario se llama compilación. Un ejemplo de este tipo de lenguajes son C++, Java, Rust entre otros.

Por otro lado, los lenguajes de programación interpretados no requieren una etapa de compilación previa para ejecutarse. En su lugar, el código fuente se transfiere directamente al intérprete, quien lee y ejecuta las instrucciones del programa. Esto significa que el código fuente se puede modificar y ejecutar sin necesidad de volver a compilar. Algunos ejemplos de lenguaje de programación interpretados son Python, Ruby, Lisp entre otros.

What are some good ways to get started contributing to open source projects?

There are many great ways to start contributing to Open Source! Here are some suggestions for beginners:

Find a project that interests you: Look through GitHub or other websites that host open-
....

TheBloke

Owner Jun 10, 2023

bin C:\Projects\AI\one-click-installers\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
INFO:Loading starchat-beta-GPTQ...
INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': False, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
WARNING:The safetensors archive passed at models\starchat-beta-GPTQ\gptq_model-4bit--1g.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
WARNING:GPTBigCodeGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
WARNING:GPTBigCodeGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.
INFO:Loaded the model in 7.82 seconds.

Output generated in 36.04 seconds (5.52 tokens/s, 199 tokens, context 108, seed 782554163)

4090rtx was at about 30% while generating text.... and cpu wasn't even going above 20% so not sure what's happening here ... the text webgui doesn't even show what settings the model loads with in the gui... I assume it gets the correct settings from the quantize_config.json

All those 'warnings' are normal and fine. I've talked to the AutoGPTQ developer about hiding or changing them. Eg the messages about fused attention and MLP should be INFO, not WARNING, and the message about the safetensors archive should be hidden completely.

As to performance - that's normal when you're CPU bottlenecked. Pytorch inference with a fast GPU will be limited by the speed of a single CPU core. So your CPU isn't fast enough to keep your very-fast GPU fully utilised. So you see well below 100% GPU utilisation, and you only see 20% CPU because it's only using one core.

If you had an i9-13900K or similar gaming CPU, you'd be able to get close to 100% GPU utilisation. There's no easy way around this, it's just how things are at the moment.

Boffy

Jun 10, 2023

Ryzen 3800x is the cpu.. shame really.. thing is I have seen other models loaded use up 90% gpu so I'm curious why not this model is only 30%

And also I can attest to also seeing the same issues as @mancub with it not stopping and also the responses going onto spanish and then just all sorts of other suggestions.. not seen other models I've tried go like that.. infact.. the code response was sort of not properly formatted either. .like went in and out of code formatting and comments

CommaLlama

Jun 10, 2023

https://huggingface.co/HuggingFaceH4/starchat-alpha/discussions/2

TheBloke

Owner Jun 10, 2023

I've updated special_tokens_map.json according to the details in that discussion. So please download that file and test again @mancub @Boffy

Boffy

Jun 10, 2023

•

edited Jun 10, 2023

is there an instruction template /mode that should be used on text webui for this model?

I've updated that file... it still seems to mess up with what is code and what speech/comments outside of it...... on the plus side.. it does stop well (I hadn't pressed 'Stop' ) but before it would carry on talking about gibberish and then spanish.. which might aswel be gibberish to me anyway :D

Also just as example that question.. and response isn't really correct though the online demo https://huggingface.co/spaces/HuggingFaceH4/starchat-playground of this beta model(it is the same right? ) does output something that does work.. along with correct formatting of code/comments outside of it.. so not sure..

Boffy

Jun 10, 2023

•

edited Jun 10, 2023

also seems to be exact same issue as reported here: https://huggingface.co/HuggingFaceH4/starchat-alpha/discussions/6 with cpu/gpu speeds

mancub

Jun 11, 2023

It stops now, neat. I used the same prompt @Boffy did and this is what I got:

Common sense questions and answers

Question: write a C# editor script for Unity that use Sirenix OdinInspector and have a list of selectable scenes from the build list with a button attribute to load multiple selected scenes from the list
Factual answer:

using UnityEngine;
using Sirenix.OdinInspector;

public class SceneSelector : MonoBehaviour {
    [Button]
    public void LoadScenes() 
    {
        foreach (var scene in GetComponentsInChildren<SceneInfo>()) 
        {
            if(scene.IsSelected) 
            {
                SceneManager.LoadScene(scene.Name);
            }
        }
    }

    [Title("Scenes")]
    public List<SceneInfo> Scenes = new List<SceneInfo>();
}

[System.Serializable]
public struct SceneInfo 
{
    public string Name;
    [ShowIf("@nameof(ShouldDisplay),HideIf(invertCondition: true, OnlyForOneField:true)")]
    public bool IsSelected;
    
    private static bool ShouldDisplay => false; // your logic here
}

TheBloke

Owner Jun 11, 2023

Very cool!

I will have to try this model out. I just signed up to the free trial of Github Copilot and I'm really impressed with it. It's really cool having it automatically integrated into VS Studio Code. Not just the fact that it can do common functions, but that it intelligently guesses variable names and stuff like that.

I'm definitely curious to see how StarChat compares.

What are your impressions of StarChat versus other recent LLMs, like WizardLM 30B, Guanaco 30B, etc? Is it noticeably better at coding than them?

tjohnson

Jun 12, 2023

•

edited Jun 12, 2023

This GPTQ and the GGML's just aren't usable. Ive got a RTX3090 and core i9 with 128GB ram and at its absolute fastest I can't run an inference example under 15 seconds. If I ask it what a piece of code does depending on the token len its 30 seconds to a minute...

Nonetheless, thank you so much for all that you contribute, it is absolutely amazing

TheBloke

Owner Jun 12, 2023

Sorry to hear that! Not surprised to hear the GGMLs are slow as there's no GPU acceleration in this GGML format yet so it's all down to the CPU, and unfortunately the non-Llama GGML formats have seen very few of the extensive performance optimisations being done on the llama.cpp project.

I'm more surprised that you're finding the GPTQ so slow. What tokens/s are you getting?

tjohnson

Jun 12, 2023

I'm sorry @TheBloke , I misspoke. The issue seems to be less the GPTQ model and more to your point with @PanQiWei on Github that the "CUDA extension is not installed". If I run nvcc --version I get what you told @TheFaheem (Cuda compilation tools, release 11.8, V11.8.89. Build cuda_11.8.r11.8/compiler.31833905_0) and nvidia-smi reads (CUDA Version: 12.0). I am using CUDA everywhere else just fine with torch, other libs, etc. However, the kicker is if I load the model using @PanQiWei's AutoGPTQForCausalLM even though it says CUDA extension is not installed I get 10068MiB/ 24576MiB on my nvidia-smi so its clearly loading the model in vram.

TheBloke

Owner Jun 12, 2023

OK yeah I thought it might be that. Can you try compiling from source:

pip uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.2.1
pip install .

Then testing again.

The issue with no CUDA extension doesn't affect the model loading. But it does affect all the mathematical calculations that have to be done to read the GPTQ. Hence the awful performance - they're being done on CPU instead.

This is quite a common problem at the moment, as you saw from those issues on the AutoGPTQ repo. Unfortuantely PanQiWei doesn't seem very active at the moment, so they're not really being investigated.

Hopefully compiling from source will fix this.

tjohnson

Jun 12, 2023

Where would any of us be in this world without @TheBloke . That was it...

Merci beaucoup mon ami!

mancub

Jun 13, 2023

I always compile from source so not sure what's available pre-made, but AutoGPTQ is up to 0.2.3 now. I don't think we should be pulling down the old 0.2.1 version, no?

Though, I think AutoGPTQ is slower than GPTQ-for-LLaMa, or maybe that's the perception I'm getting...hmmm.

Boffy

Jun 13, 2023

Tried updating things.. now I get..

Traceback (most recent call last): File “C:\Projects\AI\one-click-installers\text-generation-webui\server.py”, line 70, in load_model_wrapper shared.model, shared.tokenizer = load_model(shared.model_name) File “C:\Projects\AI\one-click-installers\text-generation-webui\modules\models.py”, line 94, in load_model output = load_func(model_name) File “C:\Projects\AI\one-click-installers\text-generation-webui\modules\models.py”, line 296, in AutoGPTQ_loader return modules.AutoGPTQ_loader.load_quantized(model_name) File “C:\Projects\AI\one-click-installers\text-generation-webui\modules\AutoGPTQ_loader.py”, line 60, in load_quantized model.embed_tokens = model.model.model.embed_tokens File “C:\Projects\AI\one-click-installers\installer_files\env\lib\site-packages\torch\nn\modules\module.py”, line 1614, in getattr raise AttributeError(“‘{}’ object has no attribute ‘{}’”.format( AttributeError: ‘GPTBigCodeForCausalLM’ object has no attribute ‘model’

kenM1

Jun 14, 2023

Tried updating things.. now I get..

Traceback (most recent call last): File “C:\Projects\AI\one-click-installers\text-generation-webui\server.py”, line 70, in load_model_wrapper shared.model, shared.tokenizer = load_model(shared.model_name) File “C:\Projects\AI\one-click-installers\text-generation-webui\modules\models.py”, line 94, in load_model output = load_func(model_name) File “C:\Projects\AI\one-click-installers\text-generation-webui\modules\models.py”, line 296, in AutoGPTQ_loader return modules.AutoGPTQ_loader.load_quantized(model_name) File “C:\Projects\AI\one-click-installers\text-generation-webui\modules\AutoGPTQ_loader.py”, line 60, in load_quantized model.embed_tokens = model.model.model.embed_tokens File “C:\Projects\AI\one-click-installers\installer_files\env\lib\site-packages\torch\nn\modules\module.py”, line 1614, in getattr raise AttributeError(“‘{}’ object has no attribute ‘{}’”.format( AttributeError: ‘GPTBigCodeForCausalLM’ object has no attribute ‘model’

I am having the same issue.
anyone have a work around this issue?

Boffy

Jun 14, 2023

I have found a workaround below ... oobagooba text webui seems to get broken constantly..

AutoGPTQ_loader.py

https://github.com/oobabooga/text-generation-webui/issues/2655#issuecomment-1590895961

# # These lines fix the multimodal extension when used with AutoGPTQ
# if not hasattr(model, 'dtype'):
#     model.dtype = model.model.dtype

# if not hasattr(model, 'embed_tokens'):
#     model.embed_tokens = model.model.model.embed_tokens

# if not hasattr(model.model, 'embed_tokens'):
#     model.model.embed_tokens = model.model.model.embed_tokens

TheBloke

Owner Jun 14, 2023

Yup that's the fix for the moment until ooba fixes text-gen

Boffy

Jun 14, 2023

•

edited Jun 14, 2023

I hope the fixes will fix performance for gpu's on windows.. that's another area with cuda etc versions etc getting it all working with various diffferent versions is a mess, does anyone know if WSL is better or Linux? I'd be temped to dual boot if it was to see just cus it's shame to see a 4090 only being used 30% guess I wouldn't care but the text response rate is only like 5t/s :(

mancub

Jun 15, 2023

I did not find the native Linux being better than WSL.

Matter a fact I couldn't fully load the models in Linux I used to load in WSL just fine, because X/Wayland was taking more VRAM away than Windows GUI.

Obviously I could init 3, but then without a browser there's no access to any UI's like text-gen-webui, and stuck in CLI.

deleted

Jun 15, 2023

I did not find the native Linux being better than WSL.

Matter a fact I couldn't fully load the models in Linux I used to load in WSL just fine, because X/Wayland was taking more VRAM away than Windows GUI.

Obviously I could init 3, but then without a browser there's no access to any UI's like text-gen-webui, and stuck in CLI.

Or 'share' and use a 2nd machine with a browser.. That is what i do.