An 8-bit version of NousResearch's Nous-Hermes-13B, quantized using CTranslate2.
## How to Use
The great thing about CTranslate2 is that the converted model is basically self-contained; the only external dependency is the tokenizer, for which you'll use a Hugging Face Transformers tokenizer. One quirk is that, depending on which inference/generation method you use, the converted model may expect tokens (strings) rather than token IDs (integers). To get started, use git or huggingface_hub to download this repo, then point ctranslate2 at the folder for inference.
Example:

```python
import ctranslate2
from transformers import AutoTokenizer

# Point it at the folder that contains all the files in this repo;
# here we're calling it nous-hermes-ct2.
model = ctranslate2.Generator("nous-hermes-ct2", device="cuda")
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-13b", use_fast=False)

# Get input IDs, then turn them back into tokens (strings).
input_ids = tokenizer(
    "### Instruction: What's the square root of 2?\n\n"
    "### Response:"
).input_ids
input_tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Generate the completion, which is an iterator
# (you can stream tokens as they come out!).
it = model.generate_tokens(input_tokens, max_length=100)
output_ids = [step.token_id for step in it]
decoded = tokenizer.decode(output_ids, skip_special_tokens=True)
print(decoded)
```
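Because `generate_tokens` returns an iterator, you can print text as it is produced instead of collecting all the IDs first. A minimal sketch, assuming the LLaMA-style SentencePiece tokenizer used by this model, which marks word boundaries with "▁" (the `detokenize` helper here is illustrative, not part of either library):

```python
def detokenize(tokens):
    """Join SentencePiece tokens, turning "▁" word markers into spaces."""
    return "".join(t.replace("\u2581", " ") for t in tokens).lstrip()

# Streaming usage (assumes `model` and `input_tokens` from the example above):
# for step in model.generate_tokens(input_tokens, max_length=100):
#     print(step.token.replace("\u2581", " "), end="", flush=True)
```

Note that streaming raw token strings this way skips the tokenizer's own decoding logic, so it is best treated as a rough preview; for the final text, decode the collected token IDs as in the main example.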
There are other methods for inference, including `generate_batch` (no streaming, supports batched inputs), `forward_batch` (performs a single forward pass of the model), and `score_batch` (computes token-level likelihood and perplexity). See the CTranslate2 documentation for details.
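As a sketch of the batched path: `generate_batch` takes a list of token sequences and returns one result per input, each carrying its generated token IDs in `sequences_ids`. The prompts below are illustrative, and `build_prompt` is a hypothetical helper for the Nous-Hermes instruction format; `model` and `tokenizer` are assumed to be set up as in the example above.

```python
def build_prompt(instruction):
    """Wrap an instruction in the Nous-Hermes prompt format."""
    return f"### Instruction: {instruction}\n\n### Response:"

# Batched generation (no streaming). Assumes `model` and `tokenizer`
# from the example above.
# prompts = [build_prompt("Name a prime number."), build_prompt("Name a color.")]
# batch_tokens = [
#     tokenizer.convert_ids_to_tokens(tokenizer(p).input_ids) for p in prompts
# ]
# results = model.generate_batch(
#     batch_tokens,
#     max_length=100,
#     include_prompt_in_result=False,  # return only the completions
# )
# for result in results:
#     # Each result can hold several hypotheses; take the first.
#     print(tokenizer.decode(result.sequences_ids[0], skip_special_tokens=True))
```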