Update README.md
README.md
CHANGED
@@ -121,6 +121,60 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
| mathqa (0-shot) | 42.31 | 42.51 |
| **Overall** | **TODO** | **TODO** |

# Peak Memory Usage

We can use the following code to get a sense of peak memory usage during inference:

## Results

| Benchmark        | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq |
|------------------|----------------|------------------------------|
| Peak Memory (GB) | 8.91           | 5.70                         |

## Benchmark Peak Memory

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
model_id = "microsoft/Phi-4-mini-instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()

prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])

mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```
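To get both numbers from a single run, a minimal sketch along the lines below loads each checkpoint in turn and reports its peak reserved memory. It uses the same `transformers` and `torch.cuda` calls as the script above; the `peak_memory_gb` helper is just illustrative, and the exact figures will vary with GPU, driver, and generation length.

```
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def peak_memory_gb(model_id, prompt="Hey, are you conscious? Can you talk to me?"):
    # load the checkpoint and run a short generation, tracking peak reserved CUDA memory
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    )
    torch.cuda.reset_peak_memory_stats()
    # skips the chat template used above; good enough for a rough memory number
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=64)
    peak = torch.cuda.max_memory_reserved() / 1e9
    # free this model before the next one is loaded
    del model
    gc.collect()
    torch.cuda.empty_cache()
    return peak


for model_id in ["microsoft/Phi-4-mini-instruct", "pytorch/Phi-4-mini-instruct-float8dq"]:
    print(f"{model_id}: peak memory {peak_memory_gb(model_id):.2f} GB")
```

Note that these numbers come from `torch.cuda.max_memory_reserved()`, i.e. memory held by the CUDA caching allocator; `torch.cuda.max_memory_allocated()` would report the somewhat smaller peak of live tensor memory.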
|
| 177 |
+
|
| 178 |
# Model Performance

You need to install the vLLM nightly build to pick up some recent changes.
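Once that is installed, a quick way to check that the quantized checkpoint loads and generates under vLLM is an offline-inference sketch like the one below, using vLLM's standard `LLM` and `SamplingParams` API (the sampling settings here are placeholders rather than recommended values):

```
from vllm import LLM, SamplingParams

# use the float8 checkpoint from this repo, or "microsoft/Phi-4-mini-instruct" as a baseline
llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Hey, are you conscious? Can you talk to me?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```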