SPEED Llama Checkpoint

This repository contains a SPEED LoRA adapter for meta-llama/Llama-3.1-8B-Instruct.

It is not a standalone full-weight checkpoint. At load time, Transformers/PEFT needs access to the base model meta-llama/Llama-3.1-8B-Instruct, then applies the adapter in this repository.

The original SPEED source repository is not required on the inference server, but trust_remote_code=True is required because the checkpoint bundles custom SPEED modeling code.
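
Because the repository holds only adapter weights, it can help to confirm which base model the adapter expects before loading it. The sketch below assumes the standard PEFT layout, in which adapter_config.json sits at the repository root and records the base model under base_model_name_or_path. The helper name read_base_model is hypothetical and is not part of the bundled SPEED code.

```python
import json
from pathlib import Path


def read_base_model(repo_dir):
    """Return the base model ID recorded in a PEFT adapter_config.json.

    Assumes the standard PEFT layout (adapter_config.json at the repo root).
    This helper is illustrative and not part of the SPEED code.
    """
    config_path = Path(repo_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config["base_model_name_or_path"]
```

If the assumption holds, calling read_base_model() on a local snapshot of this repository should return meta-llama/Llama-3.1-8B-Instruct.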

SPEED Configuration

Setting                 Value
Base model              meta-llama/Llama-3.1-8B-Instruct
Model family            llama
Adapter checkpoint      true
Lower SPEED layers      24
Prompt prefill mode     lower
Upper prompt targets    bos,query,assistant
Context mode            0
Prefill attention       causal
Decode tokens           full-depth

Installation

Use a CUDA/PyTorch environment suitable for the base model.

pip install "transformers>=4.57,<5" "peft>=0.19,<1" huggingface_hub accelerate safetensors

Install PyTorch separately if your server needs a specific CUDA wheel.
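
To check that an existing environment matches the version ranges in the pip command above, something like the following may be useful. It is a minimal sketch: the function names are hypothetical, and the comparison looks only at the leading release numbers, so pre-release suffixes such as rc1 or .dev0 are ignored.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '4.57.1' or '0.19.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def in_range(installed, minimum, upper):
    """True if minimum <= installed < upper, comparing release numbers only."""
    return release_tuple(minimum) <= release_tuple(installed) < release_tuple(upper)


def check_environment():
    """Report whether transformers and peft fall inside the ranges given above."""
    required = {"transformers": ("4.57", "5"), "peft": ("0.19", "1")}
    report = {}
    for name, (minimum, upper) in required.items():
        try:
            report[name] = in_range(version(name), minimum, upper)
        except PackageNotFoundError:
            report[name] = False  # package is not installed at all
    return report
```

check_environment() returns a dictionary mapping each package name to True or False, which can be printed or asserted on at server start-up.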

Basic SPEED Inference

import sys
import torch
from huggingface_hub import snapshot_download

model_id = "jeongseokoh/Llama-3.1-8B-Instruct_SPEED-24-BoS-Query"
LOWER_K = 24
SPEED_UPPER_TARGETS = ('bos', 'query', 'assistant')

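# Download the adapter repository and get its local path.
repo_dir = snapshot_download(model_id)
# Make the bundled speed_inference module importable.
sys.path.insert(0, repo_dir)

from speed_inference import load_speed_model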

model, tokenizer = load_speed_model(
    repo_dir,
    dtype=torch.bfloat16,
    device_map="auto",
    speed_generate=True,
    speed_layers=LOWER_K,
    speed_attn='causal',
    speed_upper_targets=SPEED_UPPER_TARGETS,
)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        return_dict_in_generate=True,
    )

# Drop the prompt tokens so that only the newly generated text is decoded.
prompt_len = outputs["prompt_lengths"][0]
generated_ids = outputs["sequences"][0, prompt_len:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
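
The final lines of the example rely on the output exposing prompt_lengths and sequences. If that slicing is needed in several places, it could be wrapped in a small helper. The sketch below assumes only those two keys, as used in the example; the name completion_ids is hypothetical. It indexes the row first and then slices, so it works with nested lists as well as tensors.

```python
def completion_ids(outputs, row=0):
    """Return the token IDs generated after the prompt for one row.

    Assumes `outputs` exposes "prompt_lengths" and "sequences" as in the
    example; this helper is illustrative and not part of the SPEED code.
    """
    prompt_len = int(outputs["prompt_lengths"][row])
    return outputs["sequences"][row][prompt_len:]


# Toy stand-in for a generate() result: 3 prompt tokens, 2 generated tokens.
fake_outputs = {"prompt_lengths": [3], "sequences": [[11, 12, 13, 21, 22]]}
print(completion_ids(fake_outputs))  # prints [21, 22]
```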

Document or Long-Context Inference

question = "What are the key claims in the document?"
document = "..."  # long document text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question},
]

with torch.inference_mode():
    outputs = model.generate(
        speed_generate=True,
        messages=messages,
        context=document,
        lower_k=LOWER_K,
        speed_upper_targets=SPEED_UPPER_TARGETS,
        max_new_tokens=512,
        do_sample=False,
        return_dict_in_generate=True,
    )

prompt_len = outputs["prompt_lengths"][0]
print(tokenizer.decode(outputs["sequences"][0, prompt_len:], skip_special_tokens=True))

Important Notes

  • Use snapshot_download() and the bundled speed_inference.load_speed_model() entrypoint as shown above. The original SPEED source repository is not needed on the inference server.
  • For adapter checkpoints, do not pass SPEED-only arguments such as speed_generate directly to AutoModelForCausalLM.from_pretrained(model_id, ...); Transformers/PEFT may route that call through the base model class, which does not accept those arguments.
  • Always pass speed_generate=True for SPEED inference. A generate() call without it uses the normal generation path.
  • For adapter checkpoints, the base model meta-llama/Llama-3.1-8B-Instruct must be downloadable from the inference server.
  • pipeline("text-generation", ...) is not recommended because SPEED needs structured arguments such as messages, context, and lower_k.
  • vLLM serving is not covered by this upload artifact.
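
Since the SPEED path depends on speed_generate=True and the structured arguments listed above, one way to avoid omitting them is to build the generate() keyword arguments in a single place. The following is a minimal sketch, not part of the bundled code: speed_generate_kwargs is a hypothetical helper, and the argument names it emits are the ones used in the examples in this document.

```python
def speed_generate_kwargs(messages, lower_k, upper_targets, context=None, **generation):
    """Assemble keyword arguments for a SPEED generate() call.

    Hypothetical helper: it always sets speed_generate=True, even if the
    caller passed a different value, and adds `context` only when given.
    """
    if not messages:
        raise ValueError("messages must contain at least one chat turn")
    kwargs = dict(generation)  # sampling options such as max_new_tokens
    kwargs.update(
        speed_generate=True,
        messages=messages,
        lower_k=lower_k,
        speed_upper_targets=tuple(upper_targets),
        return_dict_in_generate=True,
    )
    if context is not None:
        kwargs["context"] = context
    return kwargs
```

A call would then look like model.generate(**speed_generate_kwargs(messages, LOWER_K, SPEED_UPPER_TARGETS, max_new_tokens=256)), with context=document added for long-context inference.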

Bundled Modeling Files

Only the modeling files needed for llama are bundled:

  • modeling_speed_llama.py