Frequently Asked Questions
Efficiency metrics were obtained with the configuration described in model/smash_config.json and after a hardware warmup. The smashed model is directly compared to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, ...). We recommend running the model directly in your use-case conditions to see whether the smashed model benefits you (a minimal benchmark sketch follows the example below).

You can run the smashed model with these steps:
pip install transformers quanto
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the smashed (QUANTO int8-quantized) model and the original tokenizer.
model = AutoModelForCausalLM.from_pretrained("PrunaAI/failspy-Phi-3-mini-128k-instruct-abliterated-v3-QUANTO-int8bit-smashed", trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("failspy/Phi-3-mini-128k-instruct-abliterated-v3")

# Tokenize a prompt, generate a completion, and decode it back to text.
input_ids = tokenizer("What is the color of prunes?", return_tensors='pt').to(model.device)["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.decode(outputs[0]))
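To check efficiency in your own setting, as recommended above, here is a minimal latency benchmark sketch. It assumes the model and tokenizer loaded in the previous example; the helper name measure_latency and the warmup/iteration counts are illustrative, not part of the Pruna tooling.

import time
import torch

def measure_latency(model, tokenizer, prompt="What is the color of prunes?",
                    warmup=3, iters=10, max_new_tokens=64):
    """Time generation after a hardware warmup, mirroring the evaluation setup."""
    input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)["input_ids"]
    # Warmup runs are discarded, as in the reported metrics.
    for _ in range(warmup):
        model.generate(input_ids, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model.generate(input_ids, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"Mean generation latency: {measure_latency(model, tokenizer):.3f}s")

Run the same function against the original base model to reproduce the smashed-vs-base comparison on your own hardware and batch sizes.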
The configuration info is in smash_config.json.
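If you want to inspect the compression settings without loading the model, a short sketch for fetching and printing the config is below. It assumes huggingface_hub is installed; the exact filename within the repo (smash_config.json vs. model/smash_config.json) is an assumption based on the paths mentioned above.

import json
from huggingface_hub import hf_hub_download

# Download smash_config.json from the model repo and pretty-print it.
path = hf_hub_download(
    repo_id="PrunaAI/failspy-Phi-3-mini-128k-instruct-abliterated-v3-QUANTO-int8bit-smashed",
    filename="smash_config.json",  # assumed location; adjust if stored under model/
)
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))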
The license of the smashed model follows the license of the original model. Please check the license of the original model failspy/Phi-3-mini-128k-instruct-abliterated-v3, which provided the base model, before using this model. The license of the pruna-engine is available on PyPI.