dfurman committed
Commit 79a5c2b
1 Parent(s): a9a7ab8

Update README.md

Files changed (1)
  1. README.md +48 -11
README.md CHANGED
@@ -16,9 +16,9 @@ base_model:
 
 # dfurman/Llama-3-8B-Orpo-v0.1
 
- ![](https://i.imgur.com/ZHwzQvI.png)
+ ![](https://raw.githubusercontent.com/daniel-furman/sft-demos/main/assets/llama_3.jpeg)
 
- This is an ORPO fine-tune of [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on 2k samples of [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k).
+ This is an ORPO fine-tune of [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on 4k samples of [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k).
 
 It's a successful fine-tune that follows the ChatML template!
 
@@ -36,28 +36,65 @@ TBD.
 
 You can find the experiment on W&B at [this address](https://wandb.ai/dryanfurman/huggingface/runs/rlytsd0k?nw=nwuserdryanfurman).
 
-
 ## 💻 Usage
 
+ <details>
+
+ <summary>Setup</summary>
+
 ```python
- !pip install -qU transformers accelerate
+ !pip install -qU transformers accelerate bitsandbytes
 
- from transformers import AutoTokenizer
+ from transformers import AutoTokenizer, BitsAndBytesConfig
 import transformers
 import torch
 
+ if torch.cuda.get_device_capability()[0] >= 8:
+     !pip install -qqq flash-attn
+     attn_implementation = "flash_attention_2"
+     torch_dtype = torch.bfloat16
+ else:
+     attn_implementation = "eager"
+     torch_dtype = torch.float16
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch_dtype,
+     bnb_4bit_use_double_quant=True,
+ )
+
 model = "dfurman/Llama-3-8B-Orpo-v0.1"
- messages = [{"role": "user", "content": "What is a large language model?"}]
 
 tokenizer = AutoTokenizer.from_pretrained(model)
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 pipeline = transformers.pipeline(
     "text-generation",
     model=model,
-     torch_dtype=torch.float16,
-     device_map="auto",
+     model_kwargs={
+         "torch_dtype": torch_dtype,
+         "quantization_config": bnb_config,
+         "device_map": "auto",
+         "attn_implementation": attn_implementation,
+     }
 )
+ ```
+
+ </details>
+
+ ### Run
+
+ ```python
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "What is a large language model?"},
+ ]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ print("***Prompt:\n", prompt)
 
 outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
- print(outputs[0]["generated_text"])
- ```
+ print("***Generation:\n", outputs[0]["generated_text"])
+ ```
+
+ ### Output
+
+ coming
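
For convenience, here is the Setup and Run code added in this commit assembled into a single script. This is a minimal sketch under two assumptions that are not in the diff itself: it targets a plain Python environment rather than a notebook, so the `!pip` magics are moved into shell comments, and a CUDA GPU is present for the `torch.cuda.get_device_capability()` check.

```python
# Dependencies (from the Setup snippet; install before running):
#   pip install -qU transformers accelerate bitsandbytes
#   pip install -qqq flash-attn   # only needed on GPUs with compute capability >= 8

import torch
import transformers
from transformers import AutoTokenizer, BitsAndBytesConfig

# Ampere (SM 8.x) and newer GPUs support bfloat16 and FlashAttention-2.
if torch.cuda.get_device_capability()[0] >= 8:
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16

# 4-bit NF4 quantization with nested (double) quantization, as added in this diff.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

model = "dfurman/Llama-3-8B-Orpo-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={
        "torch_dtype": torch_dtype,
        "quantization_config": bnb_config,
        "device_map": "auto",
        "attn_implementation": attn_implementation,
    },
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a large language model?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print("***Prompt:\n", prompt)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print("***Generation:\n", outputs[0]["generated_text"])
```

The switch from plain `torch_dtype`/`device_map` arguments to a `BitsAndBytesConfig` in `model_kwargs` loads the 8B weights in 4-bit NF4, which should bring them down to roughly 5 GB of VRAM versus ~16 GB in fp16.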
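
Since the card states the model follows the ChatML template, the `***Prompt` printout from the Run snippet should look roughly like this, assuming the tokenizer ships a standard ChatML chat template (expected shape only, not captured output):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is a large language model?<|im_end|>
<|im_start|>assistant
```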