Fix generation with latest transformers

#1
by kylesayrs - opened

Purpose

  • Fix model generation

Related Issues

Changes

  • The latest transformers release removed past_key_values.get_max_length() in favor of past_key_values.get_max_cache_shape() (a rough compatibility sketch follows this list)
  • Add support for decoding tensors of token ids, which is the typical output of model.generate() (also sketched below)
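For context, here is a minimal compatibility sketch of the first change. It is not the actual diff in modeling_deepseek.py, and the helper name _get_max_cache_length is hypothetical; it only illustrates falling back between the two cache APIs depending on the installed transformers version.

from typing import Optional

def _get_max_cache_length(past_key_values) -> Optional[int]:
    # Assumption: past_key_values is a transformers Cache instance or None.
    if past_key_values is None:
        return None
    # Newer transformers releases expose get_max_cache_shape().
    if hasattr(past_key_values, "get_max_cache_shape"):
        return past_key_values.get_max_cache_shape()
    # Older releases still provide get_max_length().
    if hasattr(past_key_values, "get_max_length"):
        return past_key_values.get_max_length()
    return None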

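The second change can be sketched the same way (again a hypothetical helper, assuming the remote-code tokenizer's decode path previously only accepted plain Python lists of ids):

import torch

def _to_token_id_list(token_ids):
    # generate() returns a torch.LongTensor; convert it to plain Python ints
    # before handing the ids to the tokenizer's underlying decoder.
    if isinstance(token_ids, torch.Tensor):
        token_ids = token_ids.tolist()
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    return token_ids
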
Testing

from transformers import AutoModelForCausalLM, AutoTokenizer

# Select model and load it.
MODEL_ID = "moonshotai/Moonlight-16B-A3B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Confirm generations of the model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")
kylesayrs changed pull request title from Fix DynamicCache with latest transformers to Fix generation with latest transformers
kylesayrs changed pull request status to open
Cannot merge
This branch has merge conflicts in the following files:
  • modeling_deepseek.py
