michaelfeil committed
Commit 70fd993
1 Parent(s): f7506c8

Upload togethercomputer/RedPajama-INCITE-Instruct-3B-v1 ctranslate fp16 weights

Files changed (1): README.md (+12 -5)
README.md CHANGED
@@ -1,7 +1,9 @@
 ---
 tags:
-- ctransate2
+- ctranslate2
 - int8
+- float16
+
 license: apache-2.0
 language:
 - en
@@ -32,18 +34,18 @@ inference:
 max_new_tokens: 128
 ---
 # # Fast-Inference with Ctranslate2
-Speedup inference by 2x-8x using int8 inference in C++
+Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.
 
 quantized version of [togethercomputer/RedPajama-INCITE-Instruct-3B-v1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1)
 ```bash
-pip install hf-hub-ctranslate2>=2.0.6 ctranslate2>=3.13.0
+pip install hf-hub-ctranslate2>=2.0.6
 ```
 Converted on 2023-05-19 using
 ```
 ct2-transformers-converter --model togethercomputer/RedPajama-INCITE-Instruct-3B-v1 --output_dir /home/michael/tmp-ct2fast-RedPajama-INCITE-Instruct-3B-v1 --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json .gitattributes --quantization float16
 ```
 
-Checkpoint compatible to [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2)
+Checkpoint compatible to [ctranslate2>=3.13.0](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2>=2.0.6](https://github.com/michaelfeil/hf-hub-ctranslate2)
 - `compute_type=int8_float16` for `device="cuda"`
 - `compute_type=int8` for `device="cpu"`
 
@@ -61,7 +63,7 @@ model = GeneratorCT2fromHfHub(
 tokenizer=AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Instruct-3B-v1")
 )
 outputs = model.generate(
-text=["How do you call a fast Flan-ingo?", "User: How are you doing?"],
+text=["How do you call a fast Flan-ingo?", "User: How are you doing? Bot:"],
 )
 print(outputs)
 ```
@@ -71,6 +73,11 @@ This is just a quantized version. Licence conditions are intended to be idential
 
 # Original description
 
+tags:
+- ctranslate2
+- int8
+- float16
+
 
 # RedPajama-INCITE-Instruct-3B-v1
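For reference, the README's usage snippet (only partially visible in the hunks above) assembles into the following runnable form. This is a sketch, not part of the commit: the repo id `michaelfeil/ct2fast-RedPajama-INCITE-Instruct-3B-v1` is an assumption inferred from the converter's `--output_dir`, and the `compute_type`/`device` pairing follows the README's bullets.

```python
# Sketch of the README's usage example, assembled to be self-contained.
# Assumption: the quantized weights live at
# "michaelfeil/ct2fast-RedPajama-INCITE-Instruct-3B-v1" (inferred from the
# --output_dir in the conversion command; not stated in this commit).
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
from transformers import AutoTokenizer

model = GeneratorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-RedPajama-INCITE-Instruct-3B-v1",
    device="cuda",                # per the README: device="cpu" also works...
    compute_type="int8_float16",  # ...paired with compute_type="int8"
    tokenizer=AutoTokenizer.from_pretrained(
        "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"
    ),
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "User: How are you doing? Bot:"],
)
print(outputs)  # one decoded completion per prompt
```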
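The "checkpoint compatible to ctranslate2" bullet also means the converted weights load with plain `ctranslate2`, without the helper library. A minimal sketch, assuming the checkpoint files have already been downloaded to a local directory (the path below is illustrative):

```python
# Sketch: CPU int8 inference with plain ctranslate2 (no hf-hub-ctranslate2).
# Assumption: the converted checkpoint was downloaded (e.g. with
# huggingface_hub.snapshot_download) to ./ct2fast-redpajama-3b; hypothetical path.
import ctranslate2
from transformers import AutoTokenizer

generator = ctranslate2.Generator(
    "./ct2fast-redpajama-3b", device="cpu", compute_type="int8"
)
tokenizer = AutoTokenizer.from_pretrained(
    "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"
)

prompt = "User: How are you doing? Bot:"
# CTranslate2 consumes token strings, not token ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens], max_length=128, include_prompt_in_result=False
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```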