Problem with 7B tokenizer

#3
by vmajor - opened

13B model seems to work well (although I am trying to diagnose its seemingly random refusal to process inputs), but when attempting to use the 7B model I get this error:

return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Path to 7B tokenizer is correctly set. Both the model and tokenizer are locally hosted.

Never mind. I had downloaded the pointer file, not the actual tokenizer.model.
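
In case anyone else hits this: a Git LFS pointer file is just a short text file that begins with a version line, so a quick check along these lines (a rough sketch; adjust the path to your local copy) shows whether tokenizer.model is the real model or just the pointer:

import os

path = "tokenizer.model"  # adjust to wherever your local copy lives
size = os.path.getsize(path)
with open(path, "rb") as f:
    head = f.read(64)

if head.startswith(b"version https://git-lfs"):
    print(f"This is a Git LFS pointer ({size} bytes), not the real tokenizer - re-download it.")
else:
    print(f"Looks like an actual SentencePiece model ({size} bytes).")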

vmajor changed discussion status to closed

Glad you got it sorted.

13B model seems to work well (although I am trying to diagnose its seemingly random refusal to process inputs)

If using text-generation-webui, re-download and update to the latest version. There was a bug introduced in the last couple of days that caused Vicuna models to stop generating text very early. It's been fixed in the last 12 hours.

I have a standalone Python program based on GPTQ-for-LLaMa. It works very well (as does Alpaca-65b) for my needs, but it only works reliably when given prompts via input(); it seemingly randomly refuses to process prompts given to it programmatically, inside a Python loop. After two days of trying to make it obey the law I gave up, and now have several smaller Python programs driven by a bash script. Other models (every 7B and 13B GPTQ-compatible model I found on HuggingFace) that I tried do not have this problem, but their output quality is not good enough. I will try Alpacino 4bit.safetensors now.

Also, do you know what might be causing the failures in my Python loop? I have really run out of ideas on what to check. Even inserting a conditional retry loop and giving it time, or changing the seed on each failed generation, just made the model give me no output at all... almost as if it is self-aware and extremely stubborn.

Not really sure. Show the Python code?

import torch
import torch.nn as nn
import quant
from gptq import GPTQ
from utils import find_layers, DEV, set_seed, get_wikitext2, get_ptb, get_c4, get_ptb_new, get_c4_new, get_loaders
import transformers
from transformers import AutoTokenizer
import argparse
import warnings

# Suppress warnings from the specified modules
warnings.filterwarnings("ignore", module="safetensors")
warnings.filterwarnings("ignore", module="torch")

def get_llama(model):

    def skip(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip
    from transformers import LlamaForCausalLM
    model = LlamaForCausalLM.from_pretrained(model, torch_dtype='auto')
    model.seqlen = 2048
    return model


def load_quant(model, checkpoint, wbits, groupsize=-1, fused_mlp=True, eval=True, warmup_autotune=True):
    from transformers import LlamaConfig, LlamaForCausalLM
    config = LlamaConfig.from_pretrained(model)

    def noop(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    torch.set_default_dtype(torch.half)
    model = LlamaForCausalLM(config)
    torch.set_default_dtype(torch.float)
    if eval:
        model = model.eval()
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    quant.make_quant_linear(model, layers, wbits, groupsize)

    del layers

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint), strict=False)
    else:
        model.load_state_dict(torch.load(checkpoint), strict=False)

    quant.make_quant_attn(model)
    if eval and fused_mlp:
        quant.make_fused_mlp(model)

    if warmup_autotune:
        quant.autotune_warmup_linear(model, transpose=not (eval))
        if eval and fused_mlp:
            quant.autotune_warmup_fused(model)
    model.seqlen = 2048
    print('Done.')

    return model

def run_llama_inference(
    model_path,
    wbits=4,
    groupsize=-1,
    load_path="",
    text="",
    min_length=10,
    max_length=1024,
    top_p=0.7,
    temperature=0.8,
    device=0,
):

    if load_path:
        model = load_quant(model_path, load_path, wbits, groupsize)
    else:
        model = get_llama(model_path)
        model.eval()

    model.to(DEV)
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    input_ids = tokenizer.encode(text, return_tensors="pt").to(DEV)

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            do_sample=True,
            min_length=min_length,
            max_length=max_length,
            top_p=top_p,
            temperature=temperature,
        )
    return tokenizer.decode([el.item() for el in generated_ids[0]])

def main():
    parser = argparse.ArgumentParser(description="Summarize an article using Vicuna.")
    parser.add_argument('--text', required=True, help='The text to summarize.')
    args = parser.parse_args()

    model_path = "~/models/Vicuna-13B-quantized-128g"
    load_path = "~/models/Vicuna-13B-quantized-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
    wbits = 4
    groupsize = 128

    output = run_llama_inference(
        model_path,
        wbits=wbits,
        groupsize=groupsize,
        load_path=load_path,
        text=args.text,
    )

    with open("output.txt", "a", encoding="utf-8") as f:
        f.write(f"{args.text}\n{output}\n")

    print(f"Output: {output}")

if __name__ == "__main__":
    main()

Nice code! I like that.

OK, so using this code I think I finally diagnosed an issue that's been bugging me as well. When you say "refuses to process inputs", do you mean it stops generating really soon? If so, I noticed that too.

My standard test prompt is:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:

And my Vicuna 1.1 7B and 13B would consistently answer with "Once upon a time, in a land far, far away, there lived a herd of llama" and then just stop there.

I think I just figured it out! I had pad_token_id: -1 in config.json for some reason, when it should have been pad_token_id: 0.
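
If anyone else has a local copy and wants to check, something like this works (a minimal sketch; it assumes config.json is sitting in your local model directory):

import json

path = "config.json"  # in your local model directory
with open(path) as f:
    config = json.load(f)

if config.get("pad_token_id", 0) == -1:
    config["pad_token_id"] = 0  # token 0 is <unk> for these LLaMA-based models
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    print("Patched pad_token_id to 0")
else:
    print("pad_token_id is", config.get("pad_token_id"))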

I've fixed that in the repos and tested again with your code and now it's reliably answering correctly:

root@9f5e0b1e927a:~/gptq-llama# python do_inf.py --text "Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12/12 [00:39<00:00,  3.28s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12/12 [00:15<00:00,  1.30s/it]
Done.
Output: <s> Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:
Once upon a time, in a land far, far away, there lived a group of llamas. They lived in a beautiful valley surrounded by towering mountains. The llamas were happy and content, roaming freely through the lush green fields and forests.

One day, a group of explorers came to the valley. They were fascinated by the llamas and wanted to learn more about them. The llamas, in turn, were intrigued by the explorers and their strange clothing and tools.

As the days passed, the llamas and the explorers became good friends. They learned from each other and taught each other their ways of life. The llamas showed the explorers how to survive in the harsh climate, while the explorers taught the llamas about their culture and civilization.

One day, the explorers had to leave the valley and return to their own land. The llamas were sad to see them go, but they knew that they would always have a special place in their hearts. The llamas continued to roam freely in the valley, always remembering their friends from the land far, far away.

The end.</s>
root@9f5e0b1e927a:~/gptq-llama#

Please re-download config.json and test again with your code and let me know.

OK, I will try that! I am quantizing my own Alpacino 13B at the moment, so my workstation is going to be a little busy for a while. I also realized that I shared the working code - the one that takes a single input() and that I am now using as part of the bash script I mentioned. Let me carefully go over the code that does not work and give you that :) I tried all kinds of things in it: "prewarming" the model, a retry loop - nothing works, and feel free to remove both and observe the model still not working reliably. When it does summarise, it does it really well, but mostly it returns empty output and just skips ahead. The dependencies are all pip-installable:

import torch
import torch.nn as nn
import quant
from gptq import GPTQ
from utils import find_layers, DEV, set_seed, get_wikitext2, get_ptb, get_c4, get_ptb_new, get_c4_new, get_loaders
import transformers
from transformers import AutoTokenizer
import csv
import FinNews as fn
import requests
from bs4 import BeautifulSoup
import argparse
import time
from utils import set_seed
import random

def get_llama(model):

    def skip(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip
    from transformers import LlamaForCausalLM
    model = LlamaForCausalLM.from_pretrained(model, torch_dtype='auto')
    model.seqlen = 2048
    return model


def load_quant(model, checkpoint, wbits, groupsize=-1, fused_mlp=True, eval=True, warmup_autotune=True):
    from transformers import LlamaConfig, LlamaForCausalLM
    config = LlamaConfig.from_pretrained(model)

    def noop(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    torch.set_default_dtype(torch.half)
    model = LlamaForCausalLM(config)
    torch.set_default_dtype(torch.float)
    if eval:
        model = model.eval()
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    quant.make_quant_linear(model, layers, wbits, groupsize)

    del layers

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint), strict=False)
    else:
        model.load_state_dict(torch.load(checkpoint), strict=False)

    quant.make_quant_attn(model)
    if eval and fused_mlp:
        quant.make_fused_mlp(model)

    if warmup_autotune:
        quant.autotune_warmup_linear(model, transpose=not (eval))
        if eval and fused_mlp:
            quant.autotune_warmup_fused(model)
    model.seqlen = 2048
    print('Done.')

    return model

def run_llama_inference(
    model,
    tokenizer,
    wbits=4,
    groupsize=-1,
    texts=[],
    min_length=10,
    max_length=2048,
    top_p=0.7,
    temperature=0.8,
    device=0,
):
    model = model.to(DEV)

    # Dummy generation for warm-up
    dummy_input = tokenizer.encode("Dummy input for warm-up", return_tensors="pt").to(device)
    with torch.no_grad():
        _ = model.generate(dummy_input)

    answers = []

    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors="pt").to(DEV)

        answer = ""
        attempts = 0
        max_attempts = 5

        while attempts < max_attempts:
            with torch.no_grad():
                generated_ids = model.generate(
                    input_ids,
                    do_sample=True,
                    min_length=min_length,
                    max_length=max_length,
                    top_p=top_p,
                    temperature=temperature
                )

            output = tokenizer.decode([el.item() for el in generated_ids[0]])
            parts = output.split("Answer:")

            if len(parts) == 2:
                answer = parts[1].strip()
                if len(answer) > 10:  # Check if the answer has more than 10 characters
                    break

            attempts += 1
            sleep_time = random.uniform(0.1, 0.5)  # Random sleep time between 0.1 and 0.5 seconds
            time.sleep(sleep_time)

        answers.append(answer)

    return answers

def load_processed_articles(file_name):
    processed_articles = set()
    try:
        with open(file_name, 'r') as file:
            for line in file:
                processed_articles.add(line.strip())
    except FileNotFoundError:
        pass
    return processed_articles

def save_processed_articles(file_name, processed_articles):
    with open(file_name, 'w') as file:
        for article in processed_articles:
            file.write(f"{article}\n")


def get_first_paragraphs(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the main content of the article using CSS selector
    paragraphs = soup.select('div.group > p')

    if paragraphs:
        extracted_text = []
        for p in paragraphs[:5]:  # Take only the first five paragraphs
            text = p.get_text()
            extracted_text.append(text)
        return ' '.join(extracted_text)
    return ""

if __name__ == "__main__":
    model_path = "~/models/Vicuna-13B-quantized-128g"
    load_path = "~/models/Vicuna-13B-quantized-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
    wbits = 4
    groupsize = 128

    # Load the model
    model = load_quant(model_path, load_path, wbits, groupsize)
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

    # Fetch news articles from CNBC using FinNews library
    cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
    cnbc_news = cnbc_feed.get_news()

    # Load processed articles
    processed_articles_file = "processed_articles.txt"
    processed_articles = load_processed_articles(processed_articles_file)

    # Prepare the texts for inference
    texts = []
    articles_to_process = []

    for article in cnbc_news:
        article_id = article['id']

        if article_id in processed_articles:
            print(f"Article {article_id} already processed")
            continue

        url = article['link']
        first_paragraphs = get_first_paragraphs(url)
        if first_paragraphs:
            title = article['title']
            summary_prompt = "Question: Please provide a concise summary of the following news article, capturing the key information and stating company ticker symbols, and government entity abbreviations, whenever possible: "
            texts.append(summary_prompt + title + ". " + first_paragraphs + " Answer: ")
            articles_to_process.append(article)
        else:
            print(f"Could not extract content from {url}")

    # Run the inference for all texts
    summaries = run_llama_inference(
        model,
        tokenizer,
        wbits=wbits,
        groupsize=groupsize,
        texts=texts,
        min_length=10,
        max_length=1024,
        top_p=0.7,
        temperature=0.8,
    )

    # Write the results to the CSV file
    with open('cnbc_news_summaries.csv', 'w', newline='', encoding='utf-8') as csvfile:
        # Create a CSV writer object
        csv_writer = csv.writer(csvfile)

        # Write the header row
        csv_writer.writerow(['ID', 'Date', 'Title', 'Summary'])

        for idx, summary in enumerate(summaries):
            article = articles_to_process[idx]
            article_id = article['id']
            title = article['title']
            print("Title: ", title)

            if summary:
                # Write the row to the CSV file
                csv_writer.writerow([article_id, article['published'], title, summary])

                processed_articles.add(article_id)

                # Clear past attentions and hidden states
                if hasattr(model, 'past'):
                    del model.past
                torch.cuda.empty_cache()

            else:
                # Print an error message if there is no answer in the output
                print("No answer found in the output.")

    save_processed_articles(processed_articles_file, processed_articles)

OK I have it working (mostly!)

Firstly, I fixed a couple more things in my .json files that may or may not be affecting inference. Probably not, but just letting you know so you can re-pull those to be sure.

Secondly, I ran your loop code and found the same issue as you. I did some debugging and I believe the primary issue is that you weren't using the prompt template that Vicuna was trained with.

This is the prompt template I always use for Vicuna:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
prompt goes here
### Response:

So in your case, I modified the code to use this format:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Please provide a concise summary of the following news article, capturing the key information and stating company ticker symbols, and government entity abbreviations, whenever possible: <ARTICLE GOES HERE>
### Response:

And this gets MUCH better results.
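
If it helps, the template can be pulled into a small helper so every call site uses exactly the same wording (just a sketch; build_prompt is an illustrative name, not something already in your script):

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "{instruction}\n"
    "### Response:\n"
)

def build_prompt(instruction):
    # Wrap a bare instruction (e.g. the summarisation request plus article text)
    # in the template Vicuna was trained with.
    return PROMPT_TEMPLATE.format(instruction=instruction)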

It's still not perfect. I still get responses that end at the prompt, and so go round multiple times through your attempts loop, and some that still failed after 5 attempts. But you can see the difference from using this prompt template in the output files:

root@9f5e0b1e927a:~/gptq-llama# ll cnbc_news_summaries.csv*
-rw-r--r-- 1 root root 44674 Apr 28 11:20 cnbc_news_summaries.csv
-rw-r--r-- 1 root root 26698 Apr 28 09:23 cnbc_news_summaries.csv.orig

The orig file was the result I got running your code, and the new .csv is the result using the above prompt template. Nearly twice as much data was returned.

Below is my updated file, including some debug print statements so we can see what it's doing as it progresses.

import torch
import torch.nn as nn
import quant
from gptq import GPTQ
from utils import find_layers, DEV, set_seed, get_wikitext2, get_ptb, get_c4, get_ptb_new, get_c4_new, get_loaders
import transformers
from transformers import AutoTokenizer
import csv
import FinNews as fn
import requests
from bs4 import BeautifulSoup
import argparse
import time
from utils import set_seed
import random

def get_llama(model):

    def skip(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip
    from transformers import LlamaForCausalLM
    model = LlamaForCausalLM.from_pretrained(model, torch_dtype='auto')
    model.seqlen = 2048
    return model


def load_quant(model, checkpoint, wbits, groupsize=-1, fused_mlp=True, eval=True, warmup_autotune=True):
    from transformers import LlamaConfig, LlamaForCausalLM
    config = LlamaConfig.from_pretrained(model)

    def noop(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    torch.set_default_dtype(torch.half)
    model = LlamaForCausalLM(config)
    torch.set_default_dtype(torch.float)
    if eval:
        model = model.eval()
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    quant.make_quant_linear(model, layers, wbits, groupsize)

    del layers

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint), strict=False)
    else:
        model.load_state_dict(torch.load(checkpoint), strict=False)

    quant.make_quant_attn(model)
    if eval and fused_mlp:
        quant.make_fused_mlp(model)

    if warmup_autotune:
        quant.autotune_warmup_linear(model, transpose=not (eval))
        if eval and fused_mlp:
            quant.autotune_warmup_fused(model)
    model.seqlen = 2048
    print('Done.')

    return model

def run_llama_inference(
    model,
    tokenizer,
    wbits=4,
    groupsize=-1,
    texts=[],
    min_length=10,
    max_length=2048,
    top_p=0.7,
    temperature=0.8,
    device=0,
):
    model = model.to(DEV)

    # Dummy generation for warm-up
    dummy_input = tokenizer.encode("Dummy input for warm-up", return_tensors="pt").to(device)
    with torch.no_grad():
        _ = model.generate(dummy_input)

    answers = []

    for text in texts:
        #print("Input is: ", text)
        input_ids = tokenizer.encode(text, return_tensors="pt").to(DEV)

        answer = ""
        attempts = 0
        max_attempts = 5

        while attempts < max_attempts:
            with torch.no_grad():
                generated_ids = model.generate(
                    input_ids,
                    do_sample=True,
                    min_length=min_length,
                    max_new_tokens=max_length,
                    top_p=top_p,
                    temperature=temperature
                )

            output = tokenizer.decode([el.item() for el in generated_ids[0]])
            print("Raw output is: ", output)
            parts = output.split("### Response:")

            if len(parts) == 2:
                answer = parts[1].strip()
                if len(answer) > 10:  # Check if the answer has more than 10 characters
                    print("Answer has more than 10 chars")
                    break
                else:
                    print("Answer does not have more than 10 chars, going round again")

            attempts += 1
            sleep_time = random.uniform(0.1, 0.5)  # Random sleep time between 0.1 and 0.5 seconds
            time.sleep(sleep_time)

        answers.append(answer)

    return answers

def load_processed_articles(file_name):
    processed_articles = set()
    try:
        with open(file_name, 'r') as file:
            for line in file:
                processed_articles.add(line.strip())
    except FileNotFoundError:
        pass
    return processed_articles

def save_processed_articles(file_name, processed_articles):
    with open(file_name, 'w') as file:
        for article in processed_articles:
            file.write(f"{article}\n")


def get_first_paragraphs(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the main content of the article using CSS selector
    paragraphs = soup.select('div.group > p')

    if paragraphs:
        extracted_text = []
        for p in paragraphs[:5]:  # Take only the first five paragraphs
            text = p.get_text()
            extracted_text.append(text)
        return ' '.join(extracted_text)
    return ""

if __name__ == "__main__":
    model_path = "/workspace/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g"
    load_path = "/workspace/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order.pt"
    wbits = 4
    groupsize = 128

    # Load the model
    model = load_quant(model_path, load_path, wbits, groupsize)
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

    # Fetch news articles from CNBC using FinNews library
    cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
    cnbc_news = cnbc_feed.get_news()

    # Load processed articles
    processed_articles_file = "processed_articles.txt"
    processed_articles = load_processed_articles(processed_articles_file)

    # Prepare the texts for inference
    texts = []
    articles_to_process = []

    for article in cnbc_news:
        article_id = article['id']

        if article_id in processed_articles:
            print(f"Article {article_id} already processed")
            continue

        url = article['link']
        first_paragraphs = get_first_paragraphs(url)
        if first_paragraphs:
            title = article['title']
            summary_prompt = '''Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Please provide a concise summary of the following news article, capturing the key information and stating company ticker symbols, and government entity abbreviations, whenever possible: '''
            texts.append(summary_prompt + title + ". " + first_paragraphs + "\n### Response: ")
            articles_to_process.append(article)
        else:
            print(f"Could not extract content from {url}")

    # Run the inference for all texts
    summaries = run_llama_inference(
        model,
        tokenizer,
        wbits=wbits,
        groupsize=groupsize,
        texts=texts,
        min_length=10,
        max_length=1024,
        top_p=0.7,
        temperature=0.8,
    )

    # Write the results to the CSV file
    with open('cnbc_news_summaries.csv', 'w', newline='', encoding='utf-8') as csvfile:
        # Create a CSV writer object
        csv_writer = csv.writer(csvfile)

        # Write the header row
        csv_writer.writerow(['ID', 'Date', 'Title', 'Summary'])

        for idx, summary in enumerate(summaries):
            article = articles_to_process[idx]
            article_id = article['id']
            title = article['title']
            print("Title: ", title)

            if summary:
                # Write the row to the CSV file
                csv_writer.writerow([article_id, article['published'], title, summary])

                processed_articles.add(article_id)

                # Clear past attentions and hidden states
                if hasattr(model, 'past'):
                    del model.past
                torch.cuda.empty_cache()

            else:
                # Print an error message if there is no answer in the output
                print("No answer found in the output.")

    save_processed_articles(processed_articles_file, processed_articles)

I'm still confused as to why it sometimes returns no output at all. That will require some further investigation. But this is definitely better!

PS. I did all testing with my Vicuna 7B GPTQ, as that's what I already had downloaded. Might do even better on 13B.

One more tweak - I changed min_length to 50. With that:

root@9f5e0b1e927a:~/gptq-llama# ll cnbc*
-rw-r--r-- 1 root root 48834 Apr 28 11:39 cnbc_news_summaries.csv
-rw-r--r-- 1 root root 44674 Apr 28 11:20 cnbc_news_summaries.csv.better
-rw-r--r-- 1 root root 26698 Apr 28 09:23 cnbc_news_summaries.csv.orig

So another ~4 kB was returned compared to the previous run. There could also be some randomness in that, of course.
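
One caveat I'm not 100% sure about: for decoder-only models, transformers counts min_length over the prompt plus the generation, so with these long article prompts a value of 50 may already be satisfied by the prompt alone. min_new_tokens (available in recent transformers releases) is the argument that puts a floor on the newly generated tokens only. A sketch of the changed call, with the other arguments as in the script above:

generated_ids = model.generate(
    input_ids,
    do_sample=True,
    min_new_tokens=50,         # floor on newly generated tokens only
    max_new_tokens=max_length,
    top_p=top_p,
    temperature=temperature,
)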

I opened the latest run in Excel and it only failed to produce a summary for three articles.


So definitely progress. Just need to figure out why it sometimes chokes. Could be an issue in the GPTQ code I guess.

Wow, this is great. Thank you for looking into the problem and achieving such a huge improvement in performance. It is indeed odd that it is still refusing to answer occasionally, but it appears to be directly related to the prompt. I will do some experiments once the quantization of Alpacino 13b is done. Speaking of which, how long does it take you to quantize models? I saw on reddit that you used a cloud instance. I am finding that it is the SSD I/O that is getting hammered the most. WSL2 is so overwhelmed that none of the terminal commands that require access to the drive, such as that highly demanding command 'ls', get executed. I am using a native ext4 partition - just to preempt the sigh of horror.

I timed it the other day when making some 7Bs and it took 17 minutes start to finish. For a 13B I guess it's 25-30 mins, but not sure precisely.

It does seem to vary in speed on the cloud systems I use. The actual quantisation part is fairly consistent, and seems to scale linearly with the number of layers. The packing part, however, is quite variable: packing is currently done only on the CPU, and I've noticed variable CPU performance on the cloud pods I use. The GPU is always dedicated to the pod, but the CPU, I think, can be affected by other activity on the same host.

The other day I had a 7B that took over an hour, because the packing part took forever. It was taking so long I started another pod to do a second one, rather than run them sequentially on the one pod.

But those issues aside, I'd expect a 7B to always be under 20 mins and a 13B to be around 30 mins.

Let me know if you make any progress regarding the inference issues - I'd be interested to know what might be causing the model to occasionally not respond. I have heard from other people that there can be issues when the prompt is particularly long. But the ones that failed in your data don't seem hugely longer than the ones that worked. And it's very odd that the same prompt might fail two or three times, but then succeed on the next.
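
One cheap way to rule prompt length in or out would be to log the token count for every prompt and compare the failures against the successes afterwards. A rough sketch, reusing the tokenizer and model the script already loads and assuming the max_new_tokens of 1024 used above:

for text in texts:
    n_tokens = len(tokenizer.encode(text))
    # model.seqlen is 2048, so prompt tokens + 1024 new tokens must fit inside it
    flag = "OVER BUDGET" if n_tokens + 1024 > model.seqlen else "ok"
    print(f"{n_tokens:5d} prompt tokens  [{flag}]  {text[:60]!r}")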

OK, quantization has been running for several hours already. The longest-running thread that htop reports has been at it for almost 6 hours... my CPU is a 12-core Ryzen 9 3900XT.

One other thing I should note about the inference failures: they never happen when the prompt is provided via input(), i.e. with the original program I shared in this thread. So yes, it is strongly correlated with the prompt, but the same prompt will never fail to generate a good-to-perfect (for my needs) response when given through input().

Yeah that is odd! I will investigate that more then.

6 hours.. ouch!

Are there no GPTQs for Alpacino yet then? I'd be happy to do one. Link me the base model and I'll take a look.

There is one. I found it two hours into my own effort, so now I am persisting out of stubbornness... and because of the possibility that the existing one will not work for whatever reason: https://huggingface.co/gozfarb/alpacino-13b-4bit-128g

Source: https://huggingface.co/digitous/Alpacino13b

Just for the sake of clarity, and to help anyone who comes across this thread later: I now know why the quantization was taking so long:

  1. WSL2 and the host system ran out of storage space, simultaneously. I am still working on identifying where the temporary files generated by quantization are stored; they are not inside the directory from which the process is invoked (see the sketch at the end of this post).
  2. An old Intel SSD that I used as a paging-file drive died, causing an error on the SATA bus that also took the SSD hosting my /home ext4 location in WSL2 offline.

This caused significant panic before I traced the issue back to the failed SSD. My WSL2 /home directory was also missing because the SSD that hosts it was no longer visible to the BIOS, which made me worry that this SSD was also KIA. That was resolved by unplugging the dead SSD from both power and the SATA cable.

The missing paging drive cascaded into other errors, because the Windows host then moved the paging file to the boot drive, causing it to run out of space, but this is familiar territory by now and things are slowly getting back under control.
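
For anyone else chasing disappearing disk space during a run like this, here is a rough sketch of checking a few of the usual suspects (these are just default temp and cache locations, and may well not be where my space actually went):

import os
import tempfile

def dir_size_gb(path):
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / 1e9

# Default locations; adjust for your own setup.
for path in [tempfile.gettempdir(),
             os.path.expanduser("~/.cache/huggingface"),
             os.path.expanduser("~/.cache/torch")]:
    if os.path.isdir(path):
        print(f"{path}: {dir_size_gb(path):.1f} GB")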

So definitely progress. Just need to figure out why it sometimes chokes. Could be an issue in the GPTQ code I guess.

I added this back in and it fixed the failed summary on the first go around:

else:
    print(Fore.RED + "Answer does not have more than 10 chars, changing the seed and going around again")
    # Set a new random seed after an unsuccessful attempt
    random_seed = random.randint(1, 2000000000)
    set_seed(random_seed)

Fore.RED is from the colorama library. I found it challenging to review the outputs on the console when they are all the same colour, so now I colour-code them.
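
For completeness, that snippet assumes this import near the top of the script (random and set_seed are already imported there); colorama's init() mostly matters on Windows, but autoreset is handy everywhere:

from colorama import Fore, init

init(autoreset=True)  # reset the colour after each print so it does not bleed into later output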

Ahh nice, yeah that makes sense! Glad it's working now.
