pankajmathur committed
Commit f5e84ff · 1 Parent(s): 230f27e

Update README.md

Files changed (1): README.md (+97 -3)
---
license: apache-2.0
---

**Model Name: Qwen2 orca_mini_v7_7b**
# Qwen2 orca_mini_v7_7b: trained with various SFT datasets

<img src="https://huggingface.co/pankajmathur/orca_mini_v5_8b/resolve/main/orca_minis_small.jpeg" width="auto" />

## NOTICE

With proper credit and attribution, you are granted permission to use this model as a foundational base for further full fine-tuning, DPO, PPO, or ORPO tuning, and any kind of merges.
I actively encourage users to customize and enhance the model according to their specific needs, as this version is designed to be a comprehensive, general-purpose model.
Dive in and innovate!

## Evaluation

Coming soon...
<br>

## Example Usage
Here is the ChatML prompt format:
```
<|im_start|>system
You are Orca Mini, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hello Orca Mini, what can you do for me?<|im_end|>
<|im_start|>assistant
```
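If you want to see exactly what the tokenizer feeds the model, you can render this template as a plain string. This is a minimal sketch; it assumes the repository's tokenizer ships with the ChatML chat template shown above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pankajmathur/orca_mini_v7_7b")
messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
]
# tokenize=False returns the rendered prompt string instead of token ids;
# add_generation_prompt=True appends the trailing "<|im_start|>assistant" turn
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```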
The example below shows how to use this model with the Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_slug = "pankajmathur/orca_mini_v7_7b"
model = AutoModelForCausalLM.from_pretrained(model_slug)
tokenizer = AutoTokenizer.from_pretrained(model_slug)
messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
]
# Render the ChatML template and append the assistant generation prompt
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(gen_input, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][gen_input.shape[-1]:], skip_special_tokens=True))
```
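Alternatively, a high-level `pipeline` can handle the chat template and decoding for you. This is a hedged sketch; it assumes a recent Transformers release in which text-generation pipelines accept chat-style message lists:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="pankajmathur/orca_mini_v7_7b")
messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
]
# The pipeline applies the chat template, generates, and returns the
# conversation with the assistant's reply appended as the last message
result = generator(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])
```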

**Quants**

GGUF: Coming soon

AWQ: Coming soon

### Processing Long Texts (based on the [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) suggestions)

To handle extensive inputs exceeding 32,768 tokens, we utilize [YARN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For deployment, we recommend vLLM. You can enable the long-context capabilities by following these steps:

1. **Install vLLM**: You can install vLLM by running the following command:

```bash
pip install "vllm>=0.4.3"
```

Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).

2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by adding the snippet below:
```json
{
    "architectures": [
        "Qwen2ForCausalLM"
    ],
    // ...
    "vocab_size": 152064,
    // add the following snippet
    "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
    }
}
```
This snippet enables YARN to support longer contexts.
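If you prefer to apply the change programmatically, a small script can patch the file in place. This is a minimal sketch; the local path below is a hypothetical download location, not part of the original instructions:

```python
import json

# Hypothetical path to the downloaded model weights
config_path = "orca_mini_v7_7b/config.json"

with open(config_path) as f:
    config = json.load(f)

# Add the YARN rope-scaling settings shown above
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```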

3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server with the following command:

```bash
python -u -m vllm.entrypoints.openai.api_server --model pankajmathur/orca_mini_v7_7b
```
Then you can access the Chat API with:
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "pankajmathur/orca_mini_v7_7b",
        "messages": [
            {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
            {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
        ]
    }'
```
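You can also call the same endpoint from Python with the `openai` client, since vLLM exposes an OpenAI-compatible API. This is a minimal sketch; the `api_key` value is a placeholder because the local server does not check it by default:

```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pankajmathur/orca_mini_v7_7b",
    messages=[
        {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
        {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
    ],
)
print(response.choices[0].message.content)
```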
**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.