Frequently Asked Questions
Efficiency results are obtained with the configuration described in model/smash_config.json and after a hardware warmup. The smashed model is compared directly to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, ...). We recommend running the model directly in your use-case conditions to find out whether the smashed model benefits you; see the timing sketch after the example below.

You can run the smashed model with these steps:
pip install autoawq
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

# Load the AWQ 4-bit smashed model and the tokenizer of the original base model.
model = AutoAWQForCausalLM.from_quantized("PrunaAI/TinyLlama-TinyLlama_v1.1_math_code-AWQ-4bit-smashed", trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama_v1.1_math_code")

# Tokenize a prompt, generate a completion, and decode it back to text.
input_ids = tokenizer("What is the color of prunes?", return_tensors='pt').to(model.device)["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.decode(outputs[0]))
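To check whether the smashed model actually benefits your use case, a minimal timing sketch along these lines may help. It reuses the model and input_ids from the example above; the warmup step, run count, and token budget here are illustrative assumptions, not the card's benchmark setup:

import time

# One warmup generation, since the efficiency results above assume a hardware warmup.
model.generate(input_ids, max_new_tokens=216)

# Average latency over a few timed runs.
runs = 5
start = time.time()
for _ in range(runs):
    model.generate(input_ids, max_new_tokens=216)
print(f"average latency: {(time.time() - start) / runs:.2f}s")

Running the same sketch against the base model gives a comparison under identical conditions.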
The configuration details are in smash_config.json.
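To inspect that configuration programmatically, a small sketch like the following can download and print the file; the exact filename and its location inside the repository are assumptions based on the path mentioned above:

import json
from huggingface_hub import hf_hub_download

# Fetch the smash config from the model repository (in-repo path assumed).
path = hf_hub_download(
    repo_id="PrunaAI/TinyLlama-TinyLlama_v1.1_math_code-AWQ-4bit-smashed",
    filename="smash_config.json",
)
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))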
The license of the smashed model follows the license of the original model. Please check the license of the base model, TinyLlama/TinyLlama_v1.1_math_code, before using this model. The license of the pruna-engine is available on PyPI.
Base model
TinyLlama/TinyLlama_v1.1_math_code