File size: 3,197 Bytes
7479fcd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
---
license: mit
language:
- en
tags:
- AWQ
- phi3
---

# Phi 3 mini 4k instruct - AWQ
- Model creator: [Microsoft](https://huggingface.co/microsoft)
- Original model: [Phi 3 mini 4k Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

<!-- description start -->
## Description

This repo contains AWQ model files for [Microsoft's Phi 3 mini 4k Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).


### About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

It is also now supported by continuous batching server [vLLM](https://github.com/vllm-project/vllm), allowing the use of AWQ models for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models, however using AWQ enables using much smaller GPUs which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.


## Model Details

The Phi-3-Mini-4K-Instruct is a 3.8B parameters, lightweight, state-of-the-art open model trained with the Phi-3 datasets that includes both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties.
The model belongs to the Phi-3 family with the Mini version in two variants [4K](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) and [128K](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) which is the context length (in tokens) that it can support.

The model has underwent a post-training process that incorporates both supervised fine-tuning and direct preference optimization for the instruction following and safety measures.
When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3 Mini-4K-Instruct showcased a robust and state-of-the-art performance among models with less than 13 billion parameters.

Resources and Technical Documentation:

+ [Phi-3 Microsoft Blog](https://aka.ms/phi3blog-april)
+ [Phi-3 Technical Report](https://aka.ms/phi3-tech-report)
+ [Phi-3 on Azure AI Studio](https://aka.ms/phi3-azure-ai)
+ Phi-3 GGUF: [4K](https://aka.ms/Phi3-mini-4k-instruct-gguf)
+ Phi-3 ONNX: [4K](https://aka.ms/Phi3-mini-4k-instruct-onnx)

## Prompt Format
<pre>
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
</pre>


## How to use

### using vLLM
```python
from vllm import LLM, SamplingParams

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=128)

# Create an LLM.
llm = LLM(model="Sreenington/Phi-3-mini-4k-instruct-AWQ", quantization="AWQ")

# Prompt template 
prompt = """
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
"""

outputs = llm.generate(prompt, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n Generated text:\n {generated_text!r}")
```