metadata
datasets:
  - tiiuae/falcon-refinedweb
language:
  - en
inference: false

Falcon-7B-Instruct GPTQ

This repo contains an experimental GPTQ 4bit model for Falcon-7B-Instruct.

It is the result of quantising to 4bit using AutoGPTQ.

EXPERIMENTAL

Please note this is an experimental first model. Support for it is currently quite limited.

To use it you will require:

  1. AutoGPTQ, from the latest main branch and compiled with pip install .
  2. pip install einops

You can then use it immediately from Python code; see the example code below.
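
A quick way to confirm that both requirements are importable before loading the model is sketched below. It assumes the packages expose the module names `auto_gptq` (as used in the example code's import) and `einops`; the helper name `missing_modules` is made up for illustration.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Module names are assumptions based on the requirements listed above.
missing = missing_modules(["auto_gptq", "einops"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All requirements found.")
```

This only checks that the modules can be located; it does not verify that AutoGPTQ was built from the latest main branch.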

text-generation-webui

There is also provisional AutoGPTQ support in text-generation-webui.

However, at the time of writing, text-generation-webui needs one more commit before it can load this model.

I have opened a PR here; once this is merged, text-generation-webui will support this GPTQ model.

To get it working before the PR is merged, you will need to:

  1. Edit text-generation-webui/modules/AutoGPTQ_loader.py
  2. Make the following change:

Find the line that says:

'use_safetensors': use_safetensors,

And after it, add:

'trust_remote_code': shared.args.trust_remote_code,

Once you are done the file should look like this

  3. Save and close the file, then launch text-generation-webui as described below.
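
The effect of the edit can be pictured with the sketch below. It is not the actual contents of AutoGPTQ_loader.py: `build_params` and the `Namespace` stand-in for `shared.args` are hypothetical, and only the two dictionary entries come from the instructions above.

```python
from argparse import Namespace


def build_params(args, use_safetensors):
    """Hypothetical stand-in for the keyword arguments handed to the loader."""
    return {
        'use_safetensors': use_safetensors,
        # The added line: forward the --trust_remote_code flag to AutoGPTQ.
        'trust_remote_code': args.trust_remote_code,
    }


args = Namespace(trust_remote_code=True)  # stand-in for shared.args
params = build_params(args, use_safetensors=True)
print(params)
```

Without the added entry the flag never reaches AutoGPTQ, so the custom Falcon code is not loaded even when --trust_remote_code is given on the command line.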

How to download and use this model in text-generation-webui

  1. Launch text-generation-webui with the following command-line arguments: --autogptq --trust_remote_code
  2. Click the Model tab.
  3. Under Download custom model or LoRA, enter TheBloke/falcon-7B-instruct-GPTQ.
  4. Click Download.
  5. Wait until it says it's finished downloading.
  6. Click the Refresh icon next to Model in the top left.
  7. In the Model drop-down: choose the model you just downloaded, falcon-7B-instruct-GPTQ.
  8. Once it says it's loaded, click the Text Generation tab and enter a prompt!

About trust_remote_code

Please be aware that this command line argument causes Python code provided by Falcon to be executed on your machine.

This code is required at the moment because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then trust_remote_code will no longer be needed.

In this repo you can see two .py files - these are the files that get executed. They are copied from the base repo at Falcon-7B-Instruct.
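
If you want to review those files before enabling trust_remote_code, a small helper along these lines can list them from a local copy of the repo. The function name and the idea of scanning only the top level of the folder are assumptions for illustration.

```python
from pathlib import Path


def remote_code_files(model_dir):
    """Return the sorted names of the .py files at the top level of `model_dir`."""
    return sorted(p.name for p in Path(model_dir).glob("*.py"))


# Replace "." with the folder the model was downloaded to.
for name in remote_code_files("."):
    print("Review before trusting:", name)
```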

Simple Python example code

To run this code you need to install AutoGPTQ from source:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install . # This step requires CUDA toolkit installed

And install einops:

pip install einops

You can then run this example code:

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Download the model from HF and store it locally, then reference its location here:
quantized_model_dir = "/path/to/falcon7b-instruct-gptq"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

# trust_remote_code=True runs the Falcon .py files shipped in the model folder.
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_triton=False,
    use_safetensors=True,
    torch_dtype=torch.float32,
    trust_remote_code=True,
)

prompt = "Write a story about llamas"
prompt_template = f"### Instruction: {prompt}\n### Response:"

tokens = tokenizer(prompt_template, return_tensors="pt").to("cuda:0").input_ids
output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0]))
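
The prompt handling in the example can be factored into two small helpers, sketched below. The template string is taken from the example code; the helper names are hypothetical, and stripping the echoed prompt assumes the decoded text begins with the prompt exactly as it was sent, which may not hold if the tokenizer adds special tokens.

```python
def build_prompt(instruction):
    """Wrap an instruction in the template used by the example code."""
    return f"### Instruction: {instruction}\n### Response:"


def extract_response(decoded, prompt):
    """Return the text after the prompt if the decoded output echoes it."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()


prompt = build_prompt("Write a story about llamas")
# Stand-in for tokenizer.decode(output[0]); no model is loaded here.
decoded = prompt + " Once upon a time, there was a llama."
print(extract_response(decoded, prompt))
```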

Provided files

Falcon-7B-Instruct-GPTQ-4bit-128g.safetensors

This will work with AutoGPTQ as of commit 3cb1bf5 (3cb1bf5a6d43a06dc34c6442287965d1838303d3)

It was created with groupsize 64 to give higher inference quality, and without desc_act (act-order) to increase inference speed.

  • Falcon-7B-Instruct-GPTQ-4bit-128g.safetensors
    • Works only with latest AutoGPTQ CUDA, compiled from source as of commit 3cb1bf5
      • At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
    • Works with text-generation-webui using --autogptq --trust_remote_code
      • At this time it does NOT work with one-click-installers
    • Does not work with any version of GPTQ-for-LLaMa
    • Parameters: Groupsize = 64. No act-order.
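
To illustrate why a smaller group size tends to give higher quality, here is a sketch of plain round-to-nearest group-wise 4-bit quantisation. It is not the GPTQ algorithm and not AutoGPTQ code; the function names and the toy weights (small values with an occasional large outlier) are assumptions chosen to make the effect visible.

```python
import math


def quantize_groupwise(weights, group_size, bits=4):
    """Round-to-nearest quantisation with one (min, scale) pair per group."""
    levels = 2 ** bits - 1  # 15 steps between min and max for 4-bit
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a flat group
        restored.extend(lo + round((w - lo) / scale) * scale for w in group)
    return restored


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights: small values plus one large outlier every 128 entries.
weights = [4.0 if i % 128 == 0 else 0.1 * math.sin(i) for i in range(512)]

mse_128 = mse(weights, quantize_groupwise(weights, 128))
mse_64 = mse(weights, quantize_groupwise(weights, 64))
print(f"groupsize 128 error: {mse_128:.6f}")
print(f"groupsize 64 error:  {mse_64:.6f}")
```

With groups of 64, half of the groups contain no outlier and get a much finer quantisation step, so the overall error drops. The trade-off is that more groups means more per-group metadata to store.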

✨ Original model card: Falcon-7B-Instruct

Falcon-7B-Instruct is a 7B-parameter causal decoder-only model built by TII based on Falcon-7B and finetuned on a mixture of chat/instruct datasets. It is made available under the TII Falcon LLM License.

Paper coming soon 😊.

Why use Falcon-7B-Instruct?

💬 This is an instruct model, which may not be ideal for further finetuning. If you are interested in building your own instruct/chat model, we recommend starting from Falcon-7B.

🔥 Looking for an even more powerful model? Falcon-40B-Instruct is Falcon-7B-Instruct's big brother!

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

💥 Falcon LLMs require PyTorch 2.0 for use with transformers!
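
A version check along the following lines can catch an older PyTorch before the model is loaded. This is a minimal sketch: the helper name is hypothetical, and it only compares the major and minor numbers of a version string such as the one reported by `torch.__version__`.

```python
import re


def meets_minimum(version, minimum=(2, 0)):
    """Return True if a version string such as '2.0.1+cu118' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(meets_minimum("2.0.1+cu118"))  # True
print(meets_minimum("1.13.1"))       # False
```

In a real script you would pass `torch.__version__` instead of a literal string.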

Model Card for Falcon-7B-Instruct

Model Details

Model Description

Model Source

  • Paper: coming soon.

Uses

Direct Use

Falcon-7B-Instruct has been finetuned on a mixture of instruct and chat datasets.

Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

Bias, Risks, and Limitations

Falcon-7B-Instruct is mostly trained on English data, and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

Recommendations

We recommend that users of Falcon-7B-Instruct develop guardrails and take appropriate precautions for any production use.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Training Details

Training Data

Falcon-7B-Instruct was finetuned on a 250M-token mixture of instruct/chat datasets.

| Data source | Fraction | Tokens | Description |
|---|---|---|---|
| Baize | 65% | 164M | chat |
| GPT4All | 25% | 62M | instruct |
| GPTeacher | 5% | 11M | instruct |
| RefinedWeb-English | 5% | 13M | massive web crawl |
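
The figures in the table can be cross-checked with a few lines of arithmetic. The token counts below are copied from the table; the exact shares they imply differ slightly from the listed fractions, which appear to be rounded.

```python
# Token counts in millions, as listed in the table above.
tokens_m = {
    "Baize": 164,
    "GPT4All": 62,
    "GPTeacher": 11,
    "RefinedWeb-English": 13,
}

total = sum(tokens_m.values())
shares = {name: round(100 * n / total, 1) for name, n in tokens_m.items()}

print(f"Total: {total}M tokens")
for name, share in shares.items():
    print(f"{name}: {share}%")
```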

The data was tokenized with the Falcon-7B/40B tokenizer.

Evaluation

Paper coming soon.

See the OpenLLM Leaderboard for early results.

Note that this model variant is not optimized for NLP benchmarks.

Technical Specifications

For more information about pretraining, see Falcon-7B.

Model Architecture and Objective

Falcon-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences:

| Hyperparameter | Value | Comment |
|---|---|---|
| Layers | 32 | |
| d_model | 4544 | Increased to compensate for multiquery |
| head_dim | 64 | Reduced to optimise for FlashAttention |
| Vocabulary | 65024 | |
| Sequence length | 2048 | |
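
As a rough sanity check, the hyperparameters above can be turned into a head count and a back-of-the-envelope parameter estimate. The head count follows directly from the table. The parameter estimate is only a sketch: it assumes a single shared key/value head (multiquery), a 4x MLP expansion, tied input and output embeddings, and no bias or layer-norm parameters, none of which are stated in the table.

```python
d_model, head_dim, layers, vocab = 4544, 64, 32, 65024

# Number of query heads implied by the table.
n_heads = d_model // head_dim

# Per-layer weights under the assumptions stated above.
attn = 2 * d_model * d_model + 2 * d_model * head_dim  # Q and output projections, plus shared K and V
mlp = 2 * d_model * (4 * d_model)                      # up and down projections with 4x expansion
embedding = vocab * d_model                            # tied token embedding

total = layers * (attn + mlp) + embedding
print(f"heads: {n_heads}")
print(f"approx. parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands close to the 7B in the model's name.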

Compute Infrastructure

Hardware

Falcon-7B-Instruct was trained on AWS SageMaker, on 32 A100 40GB GPUs in P4d instances.

Software

Falcon-7B-Instruct was trained on a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).

Citation

Paper coming soon 😊.

License

Falcon-7B-Instruct is made available under the TII Falcon LLM License. Broadly speaking,

  • You can freely use our models for research and/or personal purposes;
  • You are allowed to share and build derivatives of these models, but you are required to give attribution and to share-alike with the same license;
  • For commercial use, you are exempt from royalty payments if the attributable revenues are below $1M/year; otherwise you should enter into a commercial agreement with TII.

Contact

falconllm@tii.ae