TBA
This model is suitable for inference with 8-bit precision (both weight and KV Cache applied) on a 48GB GPU at 20K context or 6.5 BPW on an 80GB GPU at 100K context.
Memory | BPW (weights & KV Cache) | Context Length |
---|---|---|
48GB | 8 | 20K tokens |
48GB | 6.5 | 32K tokens |
80GB | 6.5 | 100K tokens |
Custom code is used, so ensure safety before using 'trust_remote_code=True'.
This model has undergone minimal human preference alignment. Proper system prompts and prompt engineering are necessary to ensure desirable responses. The model references the safety settings of the Gemini model series and has received minimal safety alignment towards system prompts to avoid generating severely inappropriate responses without specific instructions. By default, it can engage in discussions on relatively sensitive topics without violating human ethical and moral principles. Of course, you can always specify "uncensored" in your system prompt to obtain "raw" responses that have not been aligned with human values.
Modified from Cohere models (CohereForAI/c4ai-command-r-v01, CohereForAI/aya-23-35B), users should follow their AUPs.
Tokenizer is different from cohere - and chat template is ChatML.
For more information, please refer to the SFT version: https://huggingface.co/CausalLM/35b-beta-long
- Downloads last month
- 2