TaylorAI/Llama2-7B-SFT-LIMA-ct2

andersonbcdefg committed on Jul 19, 2023

Commit eeed786 • 1 Parent(s): 73cfaae

Create README.md

Files changed (1)

README.md +63 -0

README.md ADDED

	@@ -0,0 +1,63 @@

+This is a quantized version of Llama2-7B fine-tuned on the LIMA (Less Is More for Alignment) dataset, available at `GAIR/lima` on the Hugging Face Hub.
+To get started with this model, install `transformers` (for the tokenizer) and `ctranslate2` (for the model). You'll
+also need `huggingface_hub` to download the weights.
+```
+pip install -U transformers ctranslate2 huggingface_hub
+```
+Next, download this repository from the Hub. You can download the files manually and place them in a folder, or use the `huggingface_hub` library
+to download them programmatically. Here, we put them in a local directory called `Llama2_TaylorAI`.
+```python
+from huggingface_hub import snapshot_download
+snapshot_download(repo_id="TaylorAI/Llama2-7B-SFT-LIMA-ct2", local_dir="Llama2_TaylorAI")
+```
+Then, you can perform inference as follows. Note that the model was trained with the separator `\n\n###\n\n` between the prompt/instruction
+and the model's response, so to get the expected result, append this separator to your prompt. The model was also trained to end its
+output with the suffix `@@@`, so you can stop generating tokens once this suffix appears, or split the completion on it and keep only the
+part before it. All of this is shown in the example below.
+```python
+from typing import Any
+from ctranslate2 import Generator
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("TaylorAI/Llama2-7B-SFT-LIMA-ct2")
+# point this wherever you stored this repository. if you have a GPU, use device="cuda", otherwise "cpu"
+model = Generator("Llama2_TaylorAI", device="cuda")
+# Unlike normal Transformers models, CTranslate2 operates on actual "tokens" (little subword strings), not token ids (integers)
+def tokenize_for_ct2(
+    prompt: str,
+    prompt_suffix: str,
+    tokenizer: Any,
+):
+    full_prompt = prompt + prompt_suffix
+    input_ids = tokenizer.encode(full_prompt)
+    input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
+    return input_tokens
+example_input = "What is the meaning of life?"
+example_input_tokens = tokenize_for_ct2(example_input, prompt_suffix="\n\n###\n\n", tokenizer=tokenizer)
+# the model returns an iterator, from which we can lazily stream tokens
+result = []
+it = model.generate_tokens(
+  example_input_tokens,
+  max_length=1024,
+  sampling_topp=0.9,
+  sampling_temperature=1.0,
+  repetition_penalty=1.5
+)
+stop_sequence = "@@@"
+for step in it:
+  result.append(step.token_id)
+  # decode everything generated so far and stop early once the suffix appears
+  output_so_far = tokenizer.decode(result, skip_special_tokens=True)
+  if output_so_far.endswith(stop_sequence):
+    break
+# keep only the text before the stop sequence
+output = tokenizer.decode(result, skip_special_tokens=True).split(stop_sequence)[0]
+print(output)
+```
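
The prompt separator and stop suffix described in the README can also be handled with a few small string helpers that do not depend on the model at all. The following is a minimal sketch, not part of the original README: the names `build_prompt`, `trim_completion`, and `collect_until_stop` are hypothetical, and `pieces` is assumed to be any iterable of already-decoded text fragments (for example, text decoded from the streamed tokens).

```python
from typing import Iterable

# Separator and stop suffix as described in the README above.
PROMPT_SEPARATOR = "\n\n###\n\n"
STOP_SEQUENCE = "@@@"


def build_prompt(instruction: str, separator: str = PROMPT_SEPARATOR) -> str:
    """Append the training-time separator to a raw instruction (hypothetical helper)."""
    return instruction + separator


def trim_completion(text: str, stop_sequence: str = STOP_SEQUENCE) -> str:
    """Keep only the text before the first stop sequence, if one is present."""
    return text.split(stop_sequence)[0]


def collect_until_stop(pieces: Iterable[str], stop_sequence: str = STOP_SEQUENCE) -> str:
    """Accumulate decoded fragments and stop consuming once the stop sequence appears.

    The stop sequence may be split across fragments, so the check runs on the
    accumulated text rather than on each fragment separately.
    """
    text = ""
    for piece in pieces:
        text += piece
        if stop_sequence in text:
            break
    return trim_completion(text, stop_sequence)


if __name__ == "__main__":
    print(repr(build_prompt("What is the meaning of life?")))
    print(collect_until_stop(["The answer", " is 42", "@", "@@", " ignored"]))
```

Checking the accumulated text instead of individual fragments matters because the tokenizer may split `@@@` across several tokens, which is also why the README example decodes the full result on every step.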