An 8-bit version of NousResearch's Nous-Hermes-13B, quantized using CTranslate2.
## How to Use
The great thing about CTranslate2 is that the converted model is basically self-contained; the only external dependency is the tokenizer, for which you'll use a Hugging Face Transformers tokenizer. One quirk is that, depending on which inference/generation method you use, the converted model may expect tokens (strings) rather than token IDs (integers). To get started, use git or huggingface_hub to download this repo, then point ctranslate2 at the folder for inference.
Example:

```python
import ctranslate2
from transformers import AutoTokenizer

# Point it at the folder that contains all the files in this repo;
# here we're calling it nous-hermes-ct2.
model = ctranslate2.Generator("nous-hermes-ct2", device="cuda")
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-13b", use_fast=False)

# Get input IDs, then turn them back into tokens (strings).
input_ids = tokenizer(
    "### Instruction: What's the square root of 2?\n\n"
    "### Response:"
).input_ids
input_tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Generate the completion, which is an iterator
# (you can stream tokens as they come out!).
it = model.generate_tokens(input_tokens, max_length=100)
output_ids = [step.token_id for step in it]
decoded = tokenizer.decode(output_ids, skip_special_tokens=True)
print(decoded)
```
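Because `generate_tokens` returns an iterator, you can print text as it is produced instead of collecting all the IDs first. A minimal sketch, assuming the LLaMA-style SentencePiece tokenizer used by this model, which marks word boundaries with "▁" (the `detokenize` helper here is illustrative, not part of either library):

```python
def detokenize(tokens):
    """Join SentencePiece tokens, turning "▁" word markers into spaces."""
    return "".join(t.replace("\u2581", " ") for t in tokens).lstrip()

# Streaming usage (assumes `model` and `input_tokens` from the example above):
# for step in model.generate_tokens(input_tokens, max_length=100):
#     print(step.token.replace("\u2581", " "), end="", flush=True)
```

Note that streaming raw token strings this way skips the tokenizer's own decoding logic, so it is best treated as a rough preview; for the final text, decode the collected token IDs as in the main example.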
There are other methods for inference, including `generate_batch` (no streaming, supports batched inputs), `forward_batch` (performs a single forward pass of the model), and `score_batch` (computes token-level likelihood and perplexity). See the CTranslate2 documentation for details.
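As a sketch of the batched path: `generate_batch` takes a list of token sequences and returns one result per input, each carrying its generated token IDs in `sequences_ids`. The prompts below are illustrative, and `build_prompt` is a hypothetical helper for the Nous-Hermes instruction format; `model` and `tokenizer` are assumed to be set up as in the example above.

```python
def build_prompt(instruction):
    """Wrap an instruction in the Nous-Hermes prompt format."""
    return f"### Instruction: {instruction}\n\n### Response:"

# Batched generation (no streaming). Assumes `model` and `tokenizer`
# from the example above.
# prompts = [build_prompt("Name a prime number."), build_prompt("Name a color.")]
# batch_tokens = [
#     tokenizer.convert_ids_to_tokens(tokenizer(p).input_ids) for p in prompts
# ]
# results = model.generate_batch(
#     batch_tokens,
#     max_length=100,
#     include_prompt_in_result=False,  # return only the completions
# )
# for result in results:
#     # Each result can hold several hypotheses; take the first.
#     print(tokenizer.decode(result.sequences_ids[0], skip_special_tokens=True))
```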