# Model Card

## Summary

- Base model: [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)
## Usage

To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers`, `accelerate`, and `torch` libraries installed.
```bash
pip install transformers==4.30.2
pip install einops==0.6.1
pip install accelerate==0.20.3
pip install torch==2.0.0
```
```python
import torch
from transformers import pipeline

# Load the model and tokenizer onto the first GPU.
generate_text = pipeline(
    model="shashank-mugiwara/thor",
    torch_dtype="auto",
    trust_remote_code=True,
    use_fast=True,
    device_map={"": "cuda:0"},
)

res = generate_text(
    "What is thor service?",
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True,
)
print(res[0]["generated_text"])
```
You can print a sample prompt after the preprocessing step to see how it is fed to the tokenizer:

```python
print(generate_text.preprocess("What is thor service?")["prompt_text"])
```
Alternatively, if the `h2oai_pipeline.py` helper from the model repository is stored alongside your notebook, you can construct the pipeline yourself from the loaded model and tokenizer:

```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "shashank-mugiwara/thor",
    use_fast=True,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "shashank-mugiwara/thor",
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True,
)
print(res[0]["generated_text"])
```
You may also construct the pipeline yourself from the loaded model and tokenizer, handling the preprocessing steps explicitly:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "shashank-mugiwara/thor"  # either local folder or huggingface model name
# Important: the prompt needs to follow the format the model was trained with.
prompt = "<|prompt|>What is thor service?</s><|answer|>"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
model.cuda().eval()

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

# The generation configuration can be modified to your needs.
tokens = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True,
)[0]

# Strip the prompt tokens and decode only the newly generated answer.
tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
```
## Quantization and sharding

You can load the model with quantization by specifying `load_in_8bit=True` or `load_in_4bit=True`. Sharding across multiple GPUs is also possible by setting `device_map="auto"`.
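As a minimal sketch (assuming the `bitsandbytes` package is installed, which 8-bit and 4-bit loading require), quantized, sharded loading might look like this:

```python
# Minimal sketch: 8-bit quantized loading, sharded across all visible GPUs.
# Assumes bitsandbytes is installed (pip install bitsandbytes).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "shashank-mugiwara/thor"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # or load_in_4bit=True for 4-bit quantization
    device_map="auto",   # shard layers across the available GPUs
    trust_remote_code=True,
)
```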