---
datasets:
- OpenAssistant/oasst1
pipeline_tag: text-generation
---

# Falcon-40b-chat-oasst1

Falcon-40b-chat-oasst1 is a chatbot-like model for dialogue generation. It was built by fine-tuning [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b) on the [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) dataset. This model was fine-tuned in 4-bit using 🤗 [peft](https://github.com/huggingface/peft) adapters, [transformers](https://github.com/huggingface/transformers), and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

- The training relied on Low-Rank Adaptation ([LoRA](https://arxiv.org/pdf/2106.09685.pdf)): instead of fine-tuning the entire model, only a small set of adapter weights is trained and then loaded into the base model.
- Training took approximately 10 hours and was executed on a workstation with a single NVIDIA A100-SXM 40GB GPU (via Google Colab).
- See attached [Notebook](https://huggingface.co/dfurman/falcon-40b-chat-oasst1/blob/main/finetune_falcon40b_oasst1_with_bnb_peft.ipynb) for the code (and hyperparams) used to train the model.

## Model Summary

- **Model Type:** Causal decoder-only
- **Language(s) (NLP):** English (primarily)
- **Base Model:** [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b) (License: [TII Falcon LLM License](https://huggingface.co/tiiuae/falcon-40b#license), commercial use permitted)
- **Dataset:** [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) (License: [Apache 2.0](https://huggingface.co/datasets/OpenAssistant/oasst1/blob/main/LICENSE), commercial use permitted)

### Model Date

May 30, 2023

## Quick Start

To prompt the chat model, use the following format:

```
: [Instruction]
:
```

### Example Dialogue 1

**Prompter**:
```
""": My name is Daniel. Write a short email to my closest friends inviting them to come to my home on Friday for a dinner party, I will make the food but tell them to BYOB.
:"""
```

**Falcon-40b-chat-oasst1**:
```
[Coming]
```

### Example Dialogue 2

**Prompter**:
```
: Create a list of five things to do in San Francisco.\n :
```

**Falcon-40b-chat-oasst1**:
```
[Coming]
```

### Direct Use

This model has been fine-tuned on conversation trees from [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) and should only be used on data of a similar nature.

### Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

## Bias, Risks, and Limitations

This model is mostly trained on English data and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

### Recommendations

We recommend that users of this model develop guardrails and take appropriate precautions for any production use.

## How to Get Started with the Model

### Setup

```python
# Install packages (notebook syntax; drop the leading "!" in a regular shell)
!pip install -q -U bitsandbytes loralib einops
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

# Import packages
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
```

### GPU Inference in 4-bit

This requires a GPU with at least 27GB memory.
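As a rough sanity check on that figure, the weight footprint at different precisions can be estimated with simple arithmetic. The sketch below is illustrative only: the helper name `estimate_weight_memory_gb` and the round 40-billion parameter count (taken from the model name) are assumptions, and it ignores activations, the KV cache, and quantization overhead.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the approximate memory for model weights alone, in GB (1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumption: roughly 40 billion parameters, inferred from the model name.
NUM_PARAMS = 40e9

# Compare the weight footprint at three common precisions.
for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the 4-bit weights alone come to roughly 20 GB, so the 27GB figure leaves headroom for everything this estimate leaves out.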
```python
# load the model
peft_model_id = "dfurman/falcon-40b-chat-oasst1"
config = PeftConfig.from_pretrained(peft_model_id)

# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the quantized base model onto GPU 0
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map={"": 0},
    use_auth_token=True,
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# attach the fine-tuned LoRA adapters to the base model
model = PeftModel.from_pretrained(model, peft_model_id)
```

```python
# run the model
prompt = """: My name is Daniel. Write a long email to my closest friends inviting them to come to my home on Friday for a dinner party, I will make the food but tell them to BYOB.
:"""

batch = tokenizer(
    prompt,
    padding=True,
    truncation=True,
    return_tensors='pt'
)
batch = batch.to('cuda:0')

with torch.cuda.amp.autocast():
    output_tokens = model.generate(
        input_ids=batch.input_ids,
        max_new_tokens=200,
        do_sample=True,  # needed for temperature and top_p to take effect
        temperature=0.7,
        top_p=0.7,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Inspect outputs
print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```

## Reproducibility

- See attached [Notebook](https://huggingface.co/dfurman/falcon-40b-chat-oasst1/blob/main/finetune_falcon40b_oasst1_with_bnb_peft.ipynb) for the code (and hyperparams) used to train the model.

### CUDA Info

- CUDA Version: 12.0
- GPU Name: NVIDIA A100-SXM
- Max Memory: {0: "37GB"}
- Device Map: {"": 0}

### Package Versions Employed

- `torch`: 2.0.1+cu118
- `transformers`: 4.30.0.dev0
- `peft`: 0.4.0.dev0
- `accelerate`: 0.19.0
- `bitsandbytes`: 0.39.0
- `einops`: 0.6.1
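
To compare a local environment against the versions listed above, something like the following may help. It is a minimal sketch using only the standard library's `importlib.metadata`; the names `EXPECTED_VERSIONS`, `installed_version`, and `compare_versions` are hypothetical helpers and are not part of the original notebook.

```python
from importlib import metadata

# Versions copied from the "Package Versions Employed" list.
EXPECTED_VERSIONS = {
    "torch": "2.0.1+cu118",
    "transformers": "4.30.0.dev0",
    "peft": "0.4.0.dev0",
    "accelerate": "0.19.0",
    "bitsandbytes": "0.39.0",
    "einops": "0.6.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        report[package] = (wanted, found, found == wanted)
    return report


# Print one line per package so mismatches are easy to spot.
for package, (wanted, found, ok) in compare_versions(EXPECTED_VERSIONS).items():
    status = "ok" if ok else "differs"
    print(f"{package}: expected {wanted}, found {found} ({status})")
```

The check uses exact string equality, so a newer release or a different CUDA build suffix is reported as "differs". That only means the environment is not identical to the one used here, not that the model will fail to load.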