This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
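For reference, an Unsloth + TRL fine-tune of this kind typically looks like the sketch below. This is a generic outline rather than the exact script used for this checkpoint: the base model, dataset, LoRA settings, and hyperparameters are illustrative assumptions, and the `SFTTrainer` arguments follow the API used in the Unsloth example notebooks (newer TRL releases move several of them into `SFTConfig`).

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Base model inferred from this repo's name; treat it as an assumption.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters (illustrative settings).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

# Hypothetical dataset name; replace with the sentiment dataset actually used.
dataset = load_dataset("your_sentiment_dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()
```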
To load the fine-tuned model and run inference (for example in Google Colab):

```python
from google.colab import userdata

# Read the Hugging Face token stored in Colab's secrets manager.
HF_KEY = userdata.get('HF_KEY')

from unsloth import FastLanguageModel
import torch

# Training-only imports, kept here for reference; not needed for inference:
# from transformers import TrainingArguments
# from trl import SFTTrainer
# from unsloth import is_bfloat16_supported

# Optional: reinstall xformers / triton if the runtime's build is broken.
# !pip uninstall -y xformers
# !pip install xformers
# !python -m xformers.info
# !pip install triton

# Load model directly
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)
```
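For context, `"nf4"` is the 4-bit NormalFloat data type introduced in the QLoRA paper, `bnb_4bit_use_double_quant=True` additionally quantizes the quantization constants themselves to save a little more memory, and `bnb_4bit_compute_dtype="float16"` means matrix multiplications are still performed in half precision.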
```python
# Load the fine-tuned model with the 4-bit quantization config.
model1 = AutoModelForCausalLM.from_pretrained(
    "harithapliyal/llama-3-8b-bnb-4bit-finetuned-SentAnalysis",
    quantization_config=bnb_config,
    token=HF_KEY,  # uses the Colab secret read above; only needed if the repo is gated
)
```
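The inference snippet below references a `tokenizer` and a `fine_tuned_prompt` template that are not defined elsewhere in this card. A minimal sketch of both follows, assuming the tokenizer ships with the checkpoint and that fine-tuning used the Alpaca-style template common in Unsloth notebooks; the actual template may differ.

```python
from transformers import AutoTokenizer

# Assumption: the checkpoint includes its own tokenizer files.
tokenizer = AutoTokenizer.from_pretrained(
    "harithapliyal/llama-3-8b-bnb-4bit-finetuned-SentAnalysis",
    token=HF_KEY,
)

# Assumed Alpaca-style template; swap in the template actually used
# during fine-tuning if it differs.
fine_tuned_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
```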
```python
FastLanguageModel.for_inference(model1)  # Enable native 2x faster inference

inputs = tokenizer(
    [
        fine_tuned_prompt.format(
            "Classify the sentiment of the following text.",  # instruction
            "I like play yoga under the rain",  # input
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
).to("cuda")

outputs = model1.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(outputs[0]))
```
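Note that `tokenizer.decode(outputs[0])` returns the full sequence, prompt included. To print only the model's completion, slice off the prompt tokens first:

```python
# Keep only the tokens generated after the prompt.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```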