qaihm-bot committed on
Commit
9a7f9da
1 Parent(s): 4dd4fc2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +93 -42
README.md CHANGED
@@ -17,9 +17,11 @@ tags:
17
 
18
  Llama 2 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to 4-bit weights and 16-bit activations, making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is the latency of Llama-PromptProcessor-Quantized, and the average time per additional token is the latency of Llama-TokenGenerator-KVCache-Quantized.
19
 
20
- This is based on the implementation of Llama-v2-7B-Chat found
21
- [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). More details on model performance
22
- accross various devices, can be found [here](https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized).
 
 
23
 
24
  ### Model Details
25
 
@@ -38,12 +40,6 @@ accross various devices, can be found [here](https://aihub.qualcomm.com/models/l
38
  - Use: Initiate conversation with prompt-processor and then token generator for subsequent iterations.
39
  - QNN-SDK: 2.19
40
 
41
-
42
- | Device | Chipset | Target Runtime | Inference Time (ms) | Peak Memory Range (MB) | Precision | Primary Compute Unit | Target Model
43
- | ---|---|---|---|---|---|---|---|
44
- | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 104.953 ms | 316 - 4785 MB | UINT16 | NPU | Llama-TokenGenerator-KVCache-Quantized
45
- | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 1917.811 ms | 0 - 1028 MB | UINT16 | NPU | Llama-PromptProcessor-Quantized
46
-
47
  ## Deploying Llama 2 on-device
48
 
49
  Large Language Models (LLMs) such as [Llama 2](https://llama.meta.com/llama2/) pose the following challenges for on-device deployment:
@@ -68,39 +64,113 @@ In order to export Llama 2, please ensure
68
  2. If you don't have enough memory, export.py will dump instructions to increase swap space accordingly
69
 
70
 
71
- ## Example & Usage
72
 
73
- Install the package via pip:
74
  ```bash
75
- pip install "qai_hub_models[llama_v2_7b_chat_quantized]"
76
  ```
77
 
78
 
79
- Once installed, run the following simple CLI demo:
80
 
81
  ```bash
82
  python -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo
83
  ```
84
- More details on the CLI tool can be found with the `--help` option. See
85
- [demo.py](demo.py) for sample usage of the model including pre/post processing
86
- scripts. Please refer to our [general instructions on using
87
- models](../../../#getting-started) for more usage instructions.
88
 
89
- ## Export for on-device deployment
90
 
91
- This repository contains export scripts that produce a model optimized for
92
- on-device deployment. This can be run as follows:
93
 
94
  ```bash
95
  python -m qai_hub_models.models.llama_v2_7b_chat_quantized.export
96
  ```
97
- Additional options are documented with the `--help` option. Note that the above
98
- script requires access to Deployment instructions for Qualcomm® AI Hub.
99
 
100
 
101
  ## License
102
  - The license for the original implementation of Llama-v2-7B-Chat can be found
103
  [here](https://github.com/facebookresearch/llama/blob/main/LICENSE).
 
104
 
105
  ## References
106
  * [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
@@ -110,23 +180,4 @@ script requires access to Deployment instructions for Qualcomm® AI Hub.
110
  * Join [our AI Hub Slack community](https://qualcomm-ai-hub.slack.com/join/shared_invite/zt-2d5zsmas3-Sj0Q9TzslueCjS31eXG2UA#/shared-invite/email) to collaborate, post questions and learn more about on-device AI.
111
  * For questions or feedback please [reach out to us](mailto:ai-hub-support@qti.qualcomm.com).
112
 
113
- ## Usage and Limitations
114
-
115
- Model may not be used for or in connection with any of the following applications:
116
-
117
- - Accessing essential private and public services and benefits;
118
- - Administration of justice and democratic processes;
119
- - Assessing or recognizing the emotional state of a person;
120
- - Biometric and biometrics-based systems, including categorization of persons based on sensitive characteristics;
121
- - Education and vocational training;
122
- - Employment and workers management;
123
- - Exploitation of the vulnerabilities of persons resulting in harmful behavior;
124
- - General purpose social scoring;
125
- - Law enforcement;
126
- - Management and operation of critical infrastructure;
127
- - Migration, asylum and border control management;
128
- - Predictive policing;
129
- - Real-time remote biometric identification in public spaces;
130
- - Recommender systems of social media platforms;
131
- - Scraping of facial images (from the internet or otherwise); and/or
132
- - Subliminal manipulation
 
17
 
18
  Llama 2 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to 4-bit weights and 16-bit activations, making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is the latency of Llama-PromptProcessor-Quantized, and the average time per additional token is the latency of Llama-TokenGenerator-KVCache-Quantized.
19
 
20
+ This model is an implementation of Llama-v2-7B-Chat found [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
21
+ This repository provides scripts to run Llama-v2-7B-Chat on Qualcomm® devices.
22
+ More details on model performance across various devices can be found
23
+ [here](https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized).
24
+
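The 4-bit weight quantization mentioned above can be put in rough numbers. The following is a back-of-the-envelope sketch, not a measurement: it assumes a nominal 7 billion parameters (taken from the "7B" in the model name) and counts weight storage only, ignoring activations, the KV cache, and any quantization scales or metadata. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count implied by the "7B" in the model name.
NOMINAL_PARAMS = 7e9

fp16_gb = weight_footprint_gb(NOMINAL_PARAMS, 16)
int4_gb = weight_footprint_gb(NOMINAL_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is the main reason 4-bit weights help the model fit within on-device memory.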
25
 
26
  ### Model Details
27
 
 
40
  - Use: Initiate conversation with prompt-processor and then token generator for subsequent iterations.
41
  - QNN-SDK: 2.19
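The "Use" item above describes a two-stage flow: the prompt processor handles the whole prompt once, then the token generator is called repeatedly. A minimal sketch of that control flow follows. Everything in it is hypothetical: the function `generate`, the callables passed to it, and the toy stand-in models are illustrations only, not the `qai_hub_models` API. It also assumes that the prompt processor hands a KV cache to the token generator, which the component name Llama-TokenGenerator-KVCache-Quantized suggests but this page does not spell out.

```python
from typing import Callable, List, Tuple

KVCache = List[int]  # hypothetical stand-in for the real cache tensors


def generate(
    prompt_tokens: List[int],
    prompt_processor: Callable[[List[int]], Tuple[int, KVCache]],
    token_generator: Callable[[int, KVCache], Tuple[int, KVCache]],
    max_new_tokens: int,
    eos_token: int,
) -> List[int]:
    """Run the prompt processor once, then loop the token generator."""
    # Stage 1: the whole prompt is processed in a single call.
    next_token, kv_cache = prompt_processor(prompt_tokens)
    output = [next_token]
    # Stage 2: one token per call, reusing and extending the cache.
    while len(output) < max_new_tokens and next_token != eos_token:
        next_token, kv_cache = token_generator(next_token, kv_cache)
        output.append(next_token)
    return output


# Toy stand-ins so the sketch runs without any model files.
def toy_prompt_processor(tokens: List[int]) -> Tuple[int, KVCache]:
    return sum(tokens) % 10, list(tokens)


def toy_token_generator(token: int, cache: KVCache) -> Tuple[int, KVCache]:
    return (token + 1) % 10, cache + [token]


print(generate([1, 2, 3], toy_prompt_processor, toy_token_generator, 4, eos_token=-1))
```

Splitting the work this way means the long prompt is paid for once, and each later step only processes a single new token.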
42
 
43
  ## Deploying Llama 2 on-device
44
 
45
  Large Language Models (LLMs) such as [Llama 2](https://llama.meta.com/llama2/) pose the following challenges for on-device deployment:
 
64
  2. If you don't have enough memory, export.py will dump instructions to increase swap space accordingly
65
 
66
 
 
67
 
68
+ | Device | Chipset | Target Runtime | Inference Time (ms) | Peak Memory Range (MB) | Precision | Primary Compute Unit | Target Model |
69
+ |---|---|---|---|---|---|---|---|
70
+ | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 104.953 | 316 - 4785 | UINT16 | NPU | Llama-TokenGenerator-KVCache-Quantized |
71
+ | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 1917.811 | 0 - 1028 | UINT16 | NPU | Llama-PromptProcessor-Quantized |
72
+
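The two latencies in the table can be combined into a rough end-to-end estimate, using the README's own definition: the prompt processor latency is the time to first token, and the token generator latency is the average time per additional token. The sketch below only does that arithmetic. The function names are hypothetical, and the result is an estimate for the listed device, not a guarantee for other devices or prompt lengths.

```python
# Latencies taken from the table above (Samsung Galaxy S23 Ultra).
TIME_TO_FIRST_TOKEN_MS = 1917.811  # Llama-PromptProcessor-Quantized
TIME_PER_EXTRA_TOKEN_MS = 104.953  # Llama-TokenGenerator-KVCache-Quantized


def estimated_response_ms(num_output_tokens: int) -> float:
    """First token costs the prompt pass; each later token costs one generator step."""
    if num_output_tokens < 1:
        raise ValueError("need at least one output token")
    return TIME_TO_FIRST_TOKEN_MS + (num_output_tokens - 1) * TIME_PER_EXTRA_TOKEN_MS


def decode_tokens_per_second() -> float:
    """Steady-state generation rate once the first token has been produced."""
    return 1000.0 / TIME_PER_EXTRA_TOKEN_MS


print(f"100 tokens: {estimated_response_ms(100) / 1000:.1f} s")
print(f"decode rate: {decode_tokens_per_second():.1f} tokens/s")
```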
73
+
74
+
75
+ ## Installation
76
+
77
+ This model can be installed as a Python package via pip.
78
+
79
  ```bash
80
+ pip install "qai-hub-models[llama_v2_7b_chat_quantized]"
81
  ```
82
 
83
 
84
+
85
+ ## Configure Qualcomm® AI Hub to run this model on a cloud-hosted device
86
+
87
+ Sign in to [Qualcomm® AI Hub](https://app.aihub.qualcomm.com/) with your
88
+ Qualcomm® ID. Once signed in, navigate to `Account -> Settings -> API Token`.
89
+
90
+ With this API token, you can configure your client to run models on
91
+ cloud-hosted devices.
92
+ ```bash
93
+ qai-hub configure --api_token API_TOKEN
94
+ ```
95
+ Navigate to [docs](https://app.aihub.qualcomm.com/docs/) for more information.
96
+
97
+
98
+
99
+ ## Demo off target
100
+
101
+ The package contains a simple end-to-end demo that downloads pre-trained
102
+ weights and runs this model on a sample input.
103
 
104
  ```bash
105
  python -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo
106
  ```
107
 
108
+ The above demo runs a reference implementation of pre-processing, model
109
+ inference, and post-processing.
110
+
111
+ **NOTE**: If you want to run this in a Jupyter Notebook or a Google Colab-like
112
+ environment, add the following to your cell (instead of the above).
113
+ ```
114
+ %run -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo
115
+ ```
116
+
117
+
118
+ ### Run model on a cloud-hosted device
119
 
120
+ In addition to the demo, you can also run the model on a cloud-hosted Qualcomm®
121
+ device. The export script does the following:
122
+ * Runs a performance check on a cloud-hosted device.
123
+ * Downloads compiled assets that can be deployed on-device for Android.
124
+ * Runs an accuracy check between PyTorch and on-device outputs.
125
 
126
  ```bash
127
  python -m qai_hub_models.models.llama_v2_7b_chat_quantized.export
128
  ```
129
+
130
+ ```
131
+ Profile Job summary of Llama-TokenGenerator-KVCache-Quantized
132
+ --------------------------------------------------
133
+ Device: Snapdragon X Elite CRD (11)
134
+ Estimated Inference Time: 118.14 ms
135
+ Estimated Peak Memory Range: 64.97-64.97 MB
136
+ Compute Units: NPU (34842) | Total (34842)
137
+
138
+ Profile Job summary of Llama-PromptProcessor-Quantized
139
+ --------------------------------------------------
140
+ Device: Snapdragon X Elite CRD (11)
141
+ Estimated Inference Time: 2302.57 ms
142
+ Estimated Peak Memory Range: 10.29-10.29 MB
143
+ Compute Units: NPU (31766) | Total (31766)
144
+
145
+
146
+ ```
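If you want to use these numbers in a script, the summary text can be parsed with the standard library. This is a minimal sketch that assumes the output keeps the layout shown in the sample above, which may change between releases. `parse_profile_summaries` is a hypothetical helper, not part of `qai_hub_models`.

```python
import re
from typing import Dict

# Sample text copied from the profile job summary shown above.
SAMPLE = """\
Profile Job summary of Llama-TokenGenerator-KVCache-Quantized
--------------------------------------------------
Device: Snapdragon X Elite CRD (11)
Estimated Inference Time: 118.14 ms
Estimated Peak Memory Range: 64.97-64.97 MB
Compute Units: NPU (34842) | Total (34842)

Profile Job summary of Llama-PromptProcessor-Quantized
--------------------------------------------------
Device: Snapdragon X Elite CRD (11)
Estimated Inference Time: 2302.57 ms
Estimated Peak Memory Range: 10.29-10.29 MB
Compute Units: NPU (31766) | Total (31766)
"""


def parse_profile_summaries(text: str) -> Dict[str, float]:
    """Map each model component name to its estimated inference time in ms."""
    times: Dict[str, float] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r"Profile Job summary of (.+)", line)
        if header:
            current = header.group(1)
            continue
        timing = re.match(r"Estimated Inference Time: ([\d.]+) ms", line)
        if timing and current is not None:
            times[current] = float(timing.group(1))
    return times


summaries = parse_profile_summaries(SAMPLE)
for name, ms in summaries.items():
    print(f"{name}: {ms} ms")
```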
147
 
148
 
149
+
150
+
151
+
152
+ ## Deploying compiled model to Android
153
+
154
+
155
+ The models can be deployed using multiple runtimes:
156
+ - TensorFlow Lite (`.tflite` export): [This
157
+ tutorial](https://www.tensorflow.org/lite/android/quickstart) provides a
158
+ guide to deploy the .tflite model in an Android application.
159
+
160
+
161
+ - QNN (`.so` export): This [sample
162
+ app](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/sample_app.html)
163
+ provides instructions on how to use the `.so` shared library in an Android application.
164
+
165
+
166
+ ## View on Qualcomm® AI Hub
167
+ Get more details on Llama-v2-7B-Chat's performance across various devices [here](https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized).
168
+ Explore all available models on [Qualcomm® AI Hub](https://aihub.qualcomm.com/)
169
+
170
  ## License
171
  - The license for the original implementation of Llama-v2-7B-Chat can be found
172
  [here](https://github.com/facebookresearch/llama/blob/main/LICENSE).
173
+ - The license for the compiled assets for on-device deployment can be found [here](https://github.com/facebookresearch/llama/blob/main/LICENSE)
174
 
175
  ## References
176
  * [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
 
180
  * Join [our AI Hub Slack community](https://qualcomm-ai-hub.slack.com/join/shared_invite/zt-2d5zsmas3-Sj0Q9TzslueCjS31eXG2UA#/shared-invite/email) to collaborate, post questions and learn more about on-device AI.
181
  * For questions or feedback please [reach out to us](mailto:ai-hub-support@qti.qualcomm.com).
182
 
183
+