bhushans committed on
Commit 55992e7
1 Parent(s): a41ec2c

adding export and demo instructions

Files changed (1)
  1. README.md +53 -0
README.md CHANGED
@@ -44,6 +44,59 @@ across various devices, can be found [here](https://aihub.qualcomm.com/models/l
  | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 104.953 ms | 316 - 4785 MB | UINT16 | NPU | Llama-TokenGenerator-KVCache-Quantized
  | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 1917.811 ms | 0 - 1028 MB | UINT16 | NPU | Llama-PromptProcessor-Quantized

+ ## Deploying Llama 2 on-device
+
+ Large Language Models (LLMs) such as [Llama 2](https://llama.meta.com/llama2/) pose the following challenges for on-device deployment:
+ 1. The model is too large to fit in device memory for inference
+ 2. Multi-Head Attention (MHA) has large activations, leading to fallback from accelerators
+ 3. Model load and inference times are high
+
+ We can tackle the above constraints with the following steps:
+ 1. Quantize weights to reduce on-disk model size, e.g., to int8 or int4 weights
+ 2. Quantize activations to reduce inference-time memory pressure
+ 3. Apply graph transformations to reduce inference-time memory pressure, e.g., Multi-Head to Split-Head Attention (MHA -> SHA)
+ 4. Apply graph transformations to convert or decompose operations into more accelerator-friendly ones, e.g., Linear to Conv (see the sketch below)
+ 5. For LLMs with 7B or more parameters, the above steps are still not enough on mobile,
+ so we go one step further and split the model into sub-parts.
+
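+ As a rough illustration of step 4, here is a minimal PyTorch sketch (assuming a
+ PyTorch model; this is not the actual transformation code used by this repository)
+ that replaces a `Linear` layer with an equivalent 1x1 `Conv2d`, an operation that
+ accelerators typically handle more efficiently:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ def linear_to_conv(linear: nn.Linear) -> nn.Conv2d:
+     """Build a 1x1 Conv2d that computes the same mapping as the given Linear layer."""
+     conv = nn.Conv2d(linear.in_features, linear.out_features,
+                      kernel_size=1, bias=linear.bias is not None)
+     with torch.no_grad():
+         # Linear weight is (out, in); Conv2d expects (out, in, 1, 1).
+         conv.weight.copy_(linear.weight[:, :, None, None])
+         if linear.bias is not None:
+             conv.bias.copy_(linear.bias)
+     return conv
+
+ linear = nn.Linear(4096, 4096)
+ conv = linear_to_conv(linear)
+ x = torch.randn(1, 4096)
+ # Conv2d consumes NCHW input, so reshape (N, C) -> (N, C, 1, 1) and flatten back.
+ assert torch.allclose(linear(x), conv(x[:, :, None, None]).flatten(1), atol=1e-4)
+ ```
+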
+ Here, we divide the model into four parts (see the sketch below) in order to
+ 1. Make the model exportable with low memory usage
+ 2. Avoid inference-time out-of-memory errors
+
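+ As a loose illustration (the class and function names below are hypothetical, not
+ the ones used by this repository), splitting a stack of decoder layers into
+ sub-models that can each be exported separately might look like this:
+
+ ```python
+ import torch.nn as nn
+
+ class SubModel(nn.Module):
+     """Hypothetical wrapper around a contiguous slice of decoder layers."""
+     def __init__(self, layers):
+         super().__init__()
+         self.layers = nn.ModuleList(layers)
+
+     def forward(self, hidden):
+         # Real Llama decoder layers also take attention masks, rotary position
+         # embeddings, and KV-cache state; this sketch only passes hidden states.
+         for layer in self.layers:
+             hidden = layer(hidden)
+         return hidden
+
+ def split_into_parts(layers, num_parts=4):
+     """Split decoder layers into `num_parts` roughly equal sub-models."""
+     per_part = (len(layers) + num_parts - 1) // num_parts
+     return [SubModel(layers[i:i + per_part]) for i in range(0, len(layers), per_part)]
+
+ # Toy usage: 32 placeholder "layers" split into 4 sub-models of 8 layers each.
+ parts = split_into_parts([nn.Identity() for _ in range(32)])
+ assert len(parts) == 4
+ ```
+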
+ In order to export Llama 2, please ensure that:
+ 1. The host machine has >40GB of memory (RAM + swap space)
+ 2. If there is not enough memory, export.py will print instructions on how to increase swap space accordingly
+
+
+ ## Example & Usage
+
+ Install the package via pip:
+ ```bash
+ pip install "qai_hub_models[llama_v2_7b_chat_quantized]"
+ ```
+
+
+ Once installed, run the following simple CLI demo:
+
+ ```bash
+ python -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo
+ ```
+ More details on the CLI tool can be found with the `--help` option. See
+ [demo.py](demo.py) for sample usage of the model including pre/post processing
+ scripts. Please refer to our [general instructions on using
+ models](../../../#getting-started) for more usage instructions.
+
+ ## Export for on-device deployment
+
+ This repository contains export scripts that produce a model optimized for
+ on-device deployment. This can be run as follows:
+
+ ```bash
+ python -m qai_hub_models.models.llama_v2_7b_chat_quantized.export
+ ```
+ Additional options are documented with the `--help` option. Note that the above
+ script requires access to Qualcomm® AI Hub (see the deployment instructions).
+

  ## License
  - The license for the original implementation of Llama-v2-7B-Chat can be found