---
license: llama2
language:
- ja
- en
---

## Model Overview

This model is a compiled version of [Watashiha-Llama-2-13B-Ogiri-sft](https://huggingface.co/watashiha/Watashiha-Llama-2-13B-Ogiri-sft), built to run on AWS inf2 (Inferentia2) instances.

The compilation followed this tutorial:
https://huggingface.co/docs/optimum-neuron/tutorials/llama2-13b-chatbot

* License: LLAMA 2 COMMUNITY LICENSE

## How to Use

1. Launch an **inf2.xlarge** instance on AWS EC2.

The model download is about 50 GB, so set the storage size to 256 GB or more.

Use the following AMI:
**Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20240102**

2. Run the following command to activate the pre-installed Python virtual environment:

```bash
source /opt/aws_neuron_venv_pytorch/bin/activate
```

3. Install **optimum** with the Neuron extras (quoted so the brackets are not expanded by the shell):

```bash
pip install "optimum[neuronx]"
```

4. Once the above steps are complete, run the following Python code:

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_name = "watashiha/Watashiha-Llama-2-13B-Ogiri-sft-neuron"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = NeuronModelForCausalLM.from_pretrained(model_name)

odai = "What happens when a clock is hungry?"
text = f"""
Below is a combination of instructions explaining the task and contextually relevant input. Write a response that appropriately fulfills the request.

Instructions:
The input sentence is a prompt for a comedy skit. Generate a funny punchline that aligns with the prompt.

Input:
{odai}

Response:
"""
text = text.lstrip()

token_ids = tokenizer.encode(text, return_tensors="pt")
input_len = token_ids.shape[1]
output_ids = model.generate(
    token_ids,
    max_length=input_len + 64,
    do_sample=True,
    top_p=0.9,
    top_k=50,
    temperature=0.8,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
"""
Below is a combination of instructions explaining the task and contextually relevant input. Write a response that appropriately fulfills the request.

Instructions:
The input sentence is a prompt for a comedy skit. Generate a funny punchline that aligns with the prompt.

Input:
What happens when a clock is hungry?

Response:
It takes time to get back on top!
"""
```
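The prompt template above can be factored into a small helper for reuse with other prompts. This is a sketch; `build_prompt` is a name introduced here, not part of the model's API:

```python
def build_prompt(odai: str) -> str:
    """Build the instruction-style prompt shown in the usage example."""
    text = f"""
Below is a combination of instructions explaining the task and contextually relevant input. Write a response that appropriately fulfills the request.

Instructions:
The input sentence is a prompt for a comedy skit. Generate a funny punchline that aligns with the prompt.

Input:
{odai}

Response:
"""
    # Strip the leading newline, matching text.lstrip() in the example above.
    return text.lstrip()


prompt = build_prompt("What happens when a clock is hungry?")
print(prompt)
```

Keeping the template in one place makes it easier to try different prompts while preserving the exact format the model was fine-tuned on.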
### Parameters for compilation

#### input_shapes
```
{
    "batch_size": 1,
    "sequence_length": 1024,
}
```

#### compiler_args
```
{
    "num_cores": 2,
    "auto_cast_type": 'bf16',
}
```
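These parameters follow the keyword arguments accepted by optimum-neuron's export API. A recompile from the original checkpoint could be sketched as below; this is an untested outline that must be run on an inf2 host with the Neuron SDK installed, and the output directory name is only an example:

```python
from optimum.neuron import NeuronModelForCausalLM

# export=True triggers Neuron compilation of the original checkpoint,
# using the input_shapes and compiler_args listed above.
compiled = NeuronModelForCausalLM.from_pretrained(
    "watashiha/Watashiha-Llama-2-13B-Ogiri-sft",
    export=True,
    batch_size=1,
    sequence_length=1024,
    num_cores=2,
    auto_cast_type="bf16",
)

# Save the compiled artifacts so they can be reloaded without recompiling.
compiled.save_pretrained("Watashiha-Llama-2-13B-Ogiri-sft-neuron")
```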