FairMind/Phi-3-mini-4k-instruct-bnb-4bit-Ita


walid-iguider committed on May 2, 2024

Commit 9bf0270 (verified) · 1 parent: ed5db7d

Update README.md

Files changed (1)

README.md +43 -0

@@ -32,6 +32,49 @@ Here's a breakdown of the performance metrics:
 |:----------------------------|:----------------------|:----------------|:---------------------|:--------|
 | **Accuracy Normalized**     | 0.5841                | 0.4414        | 0.5365              | 0.5250  |
+---
+## How to Use
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
+# Loading this 4-bit checkpoint also requires the bitsandbytes and accelerate packages.
+model_id = "FairMind/Phi-3-mini-4k-instruct-bnb-4bit-Ita"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# bitsandbytes-quantized models are placed via device_map; calling model.to(device)
+# on them can raise an error, so the model is not moved manually.
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+generation_config = GenerationConfig(
+    penalty_alpha=0.6,  # Contrastive-search weight; only used when do_sample=False, so ignored here.
+    do_sample=True,  # Use sampling; greedy decoding is used otherwise.
+    top_k=5,  # Number of highest-probability vocabulary tokens kept for top-k filtering.
+    temperature=0.001,  # Scales the next-token logits; a value this low makes sampling nearly greedy.
+    repetition_penalty=1.7,  # Penalty for repeated tokens; 1.0 means no penalty.
+    max_new_tokens=64,  # Maximum number of tokens to generate, not counting the prompt.
+    eos_token_id=tokenizer.eos_token_id,  # ID of the end-of-sequence token.
+    pad_token_id=tokenizer.eos_token_id,  # ID of the padding token (reuses EOS).
+)
+def generate_answer(question):
+    messages = [
+        {"role": "user", "content": question},
+    ]
+    # add_generation_prompt=True ends the prompt with the assistant-turn marker.
+    model_inputs = tokenizer.apply_chat_template(
+        messages, add_generation_prompt=True, return_tensors="pt"
+    ).to(model.device)
+    outputs = model.generate(model_inputs, generation_config=generation_config)
+    # Decode only the newly generated tokens so the prompt is not echoed back.
+    new_tokens = outputs[0][model_inputs.shape[-1]:]
+    return tokenizer.decode(new_tokens, skip_special_tokens=True)
+question = "Quale è la torre più famosa di Parigi?"  # "Which is the most famous tower in Paris?"
+answer = generate_answer(question)
+print(answer)
+```
+---
 This model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
 [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
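The sampling settings in the `generation_config` above can be illustrated without downloading the model. What follows is a minimal plain-Python sketch, not the transformers implementation, of how one decoding step might transform next-token logits when `do_sample=True`. It applies a CTRL-style repetition penalty, then temperature scaling, then top-k filtering, and samples from the result. This processing order is an assumption on our part. The six-token vocabulary and its logit values are hypothetical. Since transformers only selects contrastive search when `do_sample=False`, `penalty_alpha` is not part of this path.

```python
import math
import random


def apply_repetition_penalty(logits, generated_ids, penalty):
    """CTRL-style penalty for tokens already in the sequence: positive logits
    are divided by `penalty`, negative ones are multiplied by it.
    (In transformers the penalized ids also include the prompt tokens.)"""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_temperature(logits, temperature):
    """Divide logits by the temperature; values near 0 sharpen the distribution."""
    return [x / temperature for x in logits]


def top_k_filter(logits, k):
    """Keep the k largest logits (ties at the threshold are kept), mask the rest with -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, generated_ids, *, repetition_penalty=1.7,
                      temperature=0.001, top_k=5, rng=None):
    """One hypothetical sampling step using the values from the README's GenerationConfig."""
    rng = rng if rng is not None else random.Random()
    logits = apply_repetition_penalty(logits, generated_ids, repetition_penalty)
    logits = apply_temperature(logits, temperature)
    logits = top_k_filter(logits, top_k)
    probs = softmax(logits)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs


if __name__ == "__main__":
    toy_logits = [3.0, 2.9, 1.0, 0.5, -0.2, -1.0]  # hypothetical 6-token vocabulary
    # Token 0 was already generated, so the repetition penalty lowers its logit.
    print(sample_next_token(toy_logits, generated_ids=[0])[0])  # → 1
    # Without a repetition penalty, the highest logit (token 0) wins.
    print(sample_next_token(toy_logits, generated_ids=[0], repetition_penalty=1.0)[0])  # → 0
```

With `temperature=0.001`, almost all of the probability mass collapses onto a single token, so each step behaves almost like greedy decoding. In this toy case, the repetition penalty is what moves the choice from token 0 to token 1.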