|
--- |
|
license: apache-2.0 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
<p align="center" style="font-size:34px;"><b>Buddhi 7B</b></p> |
|
|
|
# Buddhi-7B vLLM Inference: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11_8W8FpKK-856QdRVJLyzbu9g-DMxNfg?usp=sharing) |
|
|
|
## Model Description |
|
|
|
|
|
|
Buddhi is a general-purpose chat model fine-tuned from Mistral 7B Instruct and optimised to handle an extended context length of up to 128,000 tokens using the [YaRN (Yet another RoPE extensioN)](https://arxiv.org/abs/2309.00071) technique. The extended window lets Buddhi retain context across long documents and conversations, making it well suited to tasks such as long-document summarization, detailed narrative generation, and question answering over large inputs.
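
You can verify the extended window locally by inspecting the model configuration. A minimal sketch, assuming the published config exposes the standard `max_position_embeddings` and `rope_scaling` fields (field names may differ for custom configs):

```python
from transformers import AutoConfig

# Load only the configuration; no weights are downloaded.
config = AutoConfig.from_pretrained("aiplanet/Buddhi-128K-Chat", trust_remote_code=True)

# 131072 (= 128K) is expected if the YaRN-extended window is in place.
print("max_position_embeddings:", getattr(config, "max_position_embeddings", None))
print("rope_scaling:", getattr(config, "rope_scaling", None))
```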
|
|
|
## Dataset Creation |
|
|
|
## Architecture |
|
|
|
### Hardware Requirements

> For 128K context length
> - 80 GB VRAM (A100 preferred)

> For 32K context length
> - 40 GB VRAM (A100 preferred)
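
These figures are dominated by the KV cache, which grows linearly with context length. A rough back-of-envelope sketch, assuming a Mistral-7B-style architecture (32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache; the exact budget also depends on weights, activations, and the serving framework's reservations:

```python
# Back-of-envelope KV-cache size for a Mistral-7B-style model (assumed values).
num_layers = 32          # transformer blocks
num_kv_heads = 8         # grouped-query attention KV heads
head_dim = 128           # per-head dimension
bytes_per_value = 2      # fp16

def kv_cache_gib(context_length: int) -> float:
    # 2x for keys and values, per layer, per KV head, per head dimension.
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_length
    return total_bytes / 1024**3

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx):.1f} GiB KV cache")
```

On top of the cache sit roughly 14 GB of fp16 weights plus activation and framework overhead, which is why the A100-class figures above leave generous headroom.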
|
|
|
### vLLM - For Faster Inference |
|
|
|
#### Installation |
|
|
|
```
!pip install vllm
!pip install flash_attn  # if Flash Attention 2 is supported by your system
```
|
Please check out the [Flash Attention 2](https://github.com/Dao-AILab/flash-attention) GitHub repository for more detailed installation instructions.
|
|
|
**Implementation**: |
|
|
|
```python
from vllm import LLM, SamplingParams

# Load the model; max_model_len=131072 enables the full 128K context window.
llm = LLM(
    model='aiplanet/Buddhi-128K-Chat',
    gpu_memory_utilization=0.99,
    max_model_len=131072
)

# Prompts follow the Mistral instruction format (see the prompt template below).
prompts = [
    """<s> [INST] Please tell me a joke. [/INST] """,
    """<s> [INST] What is Machine Learning? [/INST] """
]

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1000
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(generated_text)
    print("\n\n")
```
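
To actually exercise the long window, the same `generate` call can take a prompt that wraps a long document. A minimal sketch reusing the `llm` and `sampling_params` objects above; the file path and question are placeholders, not part of the original example:

```python
# Hypothetical example: summarise a long document within the 128K window.
with open("long_report.txt", "r", encoding="utf-8") as f:  # placeholder path
    document = f.read()

long_prompt = f"<s> [INST] Summarise the following report in five bullet points:\n\n{document} [/INST] "

long_outputs = llm.generate([long_prompt], sampling_params)
print(long_outputs[0].outputs[0].text)
```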
|
|
|
### Transformers - Basic Implementation |
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization so the model fits in a smaller VRAM budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = "aiplanet/Buddhi-128K-Chat"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="sequential",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

prompt = "<s> [INST] Please tell me a small joke. [/INST] "

tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **tokens,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)

# Strip the prompt so only the model's reply is printed.
decoded_output = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]
print(f"Output:\n{decoded_output[len(prompt):]}")
```
|
|
|
Output |
|
|
|
``` |
|
Output: |
|
Why don't scientists trust atoms? |
|
|
|
Because they make up everything. |
|
``` |
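
For interactive use, the same model can stream tokens as they are generated instead of returning only the final string. A small optional sketch using `transformers.TextStreamer`, reusing the `model` and `tokenizer` objects above (an illustration, not part of the original card):

```python
from transformers import TextStreamer

# Print tokens as they are generated; skip the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

tokens = tokenizer("<s> [INST] Explain YaRN in one paragraph. [/INST] ", return_tensors="pt").to("cuda")
model.generate(**tokens, max_new_tokens=200, do_sample=True, top_p=0.95, temperature=0.8, streamer=streamer)
```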
|
|
|
## Evaluation |
|
|
|
| Model                             | HellaSwag | ARC-Challenge | MMLU  | TruthfulQA | Winogrande |
|-----------------------------------|-----------|---------------|-------|------------|------------|
| Buddhi-128K-Chat                  | 82.78     | 57.51         | 57.39 | 55.44      | 78.37      |
| NousResearch/Yarn-Mistral-7b-128k | 80.58     | 58.87         | 60.64 | 42.46      | 72.85      |
|
|
|
|
|
## Prompt Template for Buddhi-128K-Chat
|
|
|
To leverage the instruction fine-tuning, wrap each user turn in `[INST]` and `[/INST]` tokens. Only the very first instruction should start with the begin-of-sentence token (`<s>`); subsequent instructions should not. The assistant's generation ends with the end-of-sentence token (`</s>`).
|
|
|
```
"<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
```
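
If the tokenizer bundles a Mistral-style chat template (an assumption, not verified here), the same format can also be produced programmatically instead of by hand:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aiplanet/Buddhi-128K-Chat", trust_remote_code=True)

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice."},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
]

# Renders the conversation with [INST]/[/INST] markers and BOS/EOS placement,
# assuming the model's tokenizer defines a chat template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```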
|
|
|
## Get in Touch |
|
|
|
You can schedule a 1:1 meeting with our DevRel & Community Team to get started with AI Planet Open Source LLMs and GenAI Stack. Schedule the call here: [https://calendly.com/jaintarun](https://calendly.com/jaintarun) |
|
|
|
Stay tuned for more updates and be a part of the coding evolution. Join us on this exciting journey as we make AI accessible to all at AI Planet! |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.39.2 |
|
- PyTorch 2.2.1+cu121
|
- Datasets 2.18.0 |
|
- Accelerate 0.27.2 |
|
- flash_attn 2.5.6 |
|
|
|
### Citation |
|
|
|
``` |
|
@misc{Chaitanya890_lucifertrj,
  author    = {Chaitanya Singhal and Tarun Jain},
  title     = {Buddhi-128k-Chat by AI Planet},
  year      = {2024},
  url       = {https://huggingface.co/aiplanet/Buddhi-128K-Chat},
  publisher = {Hugging Face}
}
|
``` |