Llama-2 7B: CTranslate2 int8 quantized version
Files changed:
- README.md (+77, -1)
- config.json (+6, -0)
- gitattributes (+35, -0)
- model.bin (+3, -0)
- tokenizer.model (+3, -0)
- vocabulary.json (+0, -0)
README.md CHANGED
@@ -1,3 +1,79 @@
---
tags:
- ctranslate2
---

CTranslate2 is the inference library that runs these models. Compared with GGML and GPTQ models, they run faster, retain more accuracy, and use less VRAM/RAM.

Instructions for running the model: https://github.com/BBC-Esq
- COMING SOON
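
Until those instructions are published, here is a minimal sketch of running this model with the `ctranslate2` Python API and the bundled `tokenizer.model`. The model directory path, device, and generation settings below are illustrative assumptions, not values prescribed by this repository:

```python
import ctranslate2
import sentencepiece as spm

# Directory containing model.bin, config.json, and tokenizer.model
# (assumed path; point it at wherever you downloaded this repository).
model_dir = "Llama-2-7b-chat-hf-ct2-int8"

# Load the int8 model; use device="cpu" if no CUDA GPU is available.
generator = ctranslate2.Generator(model_dir, device="cuda")
sp = spm.SentencePieceProcessor(f"{model_dir}/tokenizer.model")

# Tokenize the prompt, prepending the BOS token declared in config.json.
prompt = "What is CTranslate2?"
tokens = ["<s>"] + sp.encode(prompt, out_type=str)

# Generate a completion; max_length and sampling_topk are arbitrary examples.
results = generator.generate_batch([tokens], max_length=256, sampling_topk=10)
print(sp.decode(results[0].sequences_ids[0]))
```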

Learn more about CTranslate2:
- https://github.com/OpenNMT/CTranslate2
- https://opennmt.net/CTranslate2/index.html

<details>
<summary><b>Compatibility and Data Formats</b></summary>

| Format | Approximate Size vs. `float32` | Required Nvidia Compute Capability | Accuracy Summary |
|-----------------|----------------------------|-----------------|--------------------------|
| `float32` | 100% | 1.0 | Offers more precision and a wider range. Most un-quantized models use this. |
| `int16` | 51.37% | 1.0 | Same as `int8` but with a larger range. |
| `float16` | 50.00% | 5.3 (e.g. Nvidia 10 Series and higher) | Suitable for scientific computations; balances precision and memory. |
| `bfloat16` | 50.00% | 8.0 (e.g. Nvidia 30 Series and higher) | Often used in neural network training; larger exponent range than `float16`. |
| `int8_float32` | 27.47% | test manually (see below) | Combines low-precision integers with high-precision floats. Useful for mixed data. |
| `int8_float16` | 26.10% | test manually (see below) | Combines low-precision integers with medium-precision floats. Saves memory. |
| `int8_bfloat16` | 26.10% | test manually (see below) | Combines low-precision integers with reduced-precision floats. Efficient for neural nets. |
| `int8` | 25% | 1.0 | Lower precision, suitable for whole numbers within a specific range. Often used where memory is crucial. |

| Web Link | Description |
|----------|-------------|
| [CUDA GPUs Supported](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) | See what compute capability your Nvidia GPU supports. |
| [CTranslate2 Quantization](https://opennmt.net/CTranslate2/quantization.html#implicit-type-conversion-on-load) | Even if your GPU/CPU doesn't support the data type of the model you download, CTranslate2 will automatically run the model in a compatible way. |
| [Bfloat16 Floating-Point Format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format#bfloat16_floating-point_format) | Visualize data formats. |
| [Nvidia Floating-Point](https://docs.nvidia.com/cuda/floating-point/index.html) | Technical discussion. |

</details>
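
Since CTranslate2 converts unsupported types on load (see the quantization link above), you can also request the computation type explicitly when loading the model. A minimal sketch: `compute_type` is the relevant `ctranslate2.Generator` parameter, and the value chosen here is only an example:

```python
import ctranslate2

# Run the int8 weights with float16 arithmetic on a GPU that supports it.
# On unsupported hardware, CTranslate2 implicitly converts to the closest
# supported type on load instead of failing.
generator = ctranslate2.Generator(
    "Llama-2-7b-chat-hf-ct2-int8",  # assumed local model directory
    device="cuda",
    compute_type="int8_float16",
)
```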

<details>
<summary><b>Check Compatibility Manually</b></summary>

Open a command prompt and run the following commands (the CUDA Toolkit and cuDNN may also need to be installed; this has not been verified):

```bash
pip install ctranslate2
```

```bash
python
```

```python
import ctranslate2
```

Check GPU/CUDA compatibility:

```python
ctranslate2.get_supported_compute_types("cuda")
```

Check CPU compatibility:

```python
ctranslate2.get_supported_compute_types("cpu")
```

It will print out your CPU/GPU compatibility. For example, a system with an RTX 4090 GPU and an Intel i9-13900K CPU reports the following:

| | **CPU** | **GPU** |
|-----------------|---------|---------|
| **`float32`** | ✅ | ✅ |
| **`int16`** | ✅ | |
| **`float16`** | | ✅ |
| **`bfloat16`** | | ✅ |
| **`int8_float32`** | ✅ | ✅ |
| **`int8_float16`** | | ✅ |
| **`int8_bfloat16`** | | ✅ |
| **`int8`** | ✅ | ✅ |

</details>
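
To act on that output programmatically, here is a small sketch that picks a compute type from whatever the hardware reports; the helper function and its preference order are illustrative assumptions, not an official CTranslate2 recommendation:

```python
import ctranslate2

def pick_compute_type(device: str = "cuda") -> str:
    """Return the first supported type from an assumed preference order."""
    supported = ctranslate2.get_supported_compute_types(device)
    for candidate in ("int8_bfloat16", "int8_float16", "int8", "float16", "float32"):
        if candidate in supported:
            return candidate
    return "default"  # fall back to letting CTranslate2 decide

generator = ctranslate2.Generator(
    "Llama-2-7b-chat-hf-ct2-int8",  # assumed local model directory
    device="cuda",
    compute_type=pick_compute_type("cuda"),
)
```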

![Comparison of ctranslate2 and ggml](https://huggingface.co/ctranslate2-4you/Llama-2-7b-chat-hf-ct2-int8/resolve/main/comparison%20of%20ctranslate2%20and%20ggml.png)
config.json ADDED
@@ -0,0 +1,6 @@

```json
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-06,
  "unk_token": "<unk>"
}
```
gitattributes ADDED
@@ -0,0 +1,35 @@

```text
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
```
model.bin ADDED
@@ -0,0 +1,3 @@

```text
version https://git-lfs.github.com/spec/v1
oid sha256:eaf78562a5d1b37baeda6871df9f3eed17136506a4c161de5b2652c87882e3c7
size 6744404022
```
tokenizer.model ADDED
@@ -0,0 +1,3 @@

```text
version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723
```
vocabulary.json ADDED
The diff for this file is too large to render.