|
--- |
|
dataset: Thermostatic/flowers |
|
license: other |
|
license_name: gemma-terms-of-use |
|
license_link: https://ai.google.dev/gemma/terms |
|
--- |
|
|
|
# Gemma Orchid 7b |
|
|
|
<div align="center"> |
|
|
|
![image/webp](https://cdn-uploads.huggingface.co/production/uploads/6455cc8d679315e4ef16fbec/7pqiroePJW0WWm6JxwBoO.webp) |
|
|
|
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl) |
|
</div> |
|
|
|
This model is the second checkpoint of a future project. Its capable of function calling as well as having a strong base in communicational skills. |
|
|
|
This model has been finetuned on roughly 80k samples so far. |
|
|
|
# Training |
|
|
|
+ Time to complete: ~20 hours |
|
+ Datasets: Thermostatic/flowers, Intel/orca_dpo_pairs, jondurbin/truthy-dpo-v0.1, glaiveai/glaive_function_calling_v2 |
|
+ Cost: ~$20 in H100 hours |
|
+ Evaluation loss: 0.69 |
|
+ Method: LoRa |
|
+ Prompt Format: ChatML |
|
|
|
Thermostatic/flowers is a blend of open source model generations formatted in ShareGPT. It also includes all of capybara. |
|
|
|
#### Running the model on a CPU |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") |
|
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b") |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
|
|
#### Running the model on a single / multi GPU |
|
|
|
|
|
```python |
|
# pip install accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") |
|
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto") |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
|
|
#### Running the model on a GPU using different precisions |
|
|
|
* _Using `torch.float16`_ |
|
|
|
```python |
|
# pip install accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") |
|
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto", torch_dtype=torch.float16) |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
* _Using `torch.bfloat16`_ |
|
|
|
```python |
|
# pip install accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") |
|
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto", torch_dtype=torch.bfloat16) |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
#### Quantized Versions through `bitsandbytes` |
|
|
|
* _Using 8-bit precision (int8)_ |
|
|
|
```python |
|
# pip install bitsandbytes accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
|
|
|
quantization_config = BitsAndBytesConfig(load_in_8bit=True) |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") |
|
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config) |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
* _Using 4-bit precision_ |
|
|
|
```python |
|
# pip install bitsandbytes accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
|
|
|
quantization_config = BitsAndBytesConfig(load_in_4bit=True) |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") |
|
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config) |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
|
|
#### Other optimizations |
|
|
|
* _Flash Attention 2_ |
|
|
|
First make sure to install `flash-attn` in your environment `pip install flash-attn` |
|
|
|
```diff |
|
model = AutoModelForCausalLM.from_pretrained( |
|
model_id, |
|
torch_dtype=torch.float16, |
|
+ attn_implementation="flash_attention_2" |
|
).to(0) |
|
``` |
|
|
|
### Inputs and outputs |
|
|
|
* **Input:** Text string, such as a question, a prompt, or a document to be |
|
summarized. |
|
* **Output:** Generated English-language text in response to the input, such |
|
as an answer to a question, or a summary of a document. |
|
|
|
## Evaluations |
|
|
|
In progress |
|
|
|
## GGUF + iMatrix |
|
|
|
In progress |