Inference Speed

#8
by ThisIsSoMe - opened

I ran the demo case the same way as in the Colab notebook. My GPU is a single 40GB A100, and answering the example question takes about 10 minutes. Is that normal?

Defog.ai org

Hi @ThisIsSoMe, could this be because you're downloading the model for the first time when running the script, causing it to take 10 minutes? How long do subsequent requests take to complete? If you could share the sample scripts you ran to reproduce the latency issue, we can look into it further.
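As a rough sketch (not from the original notebook; model_name and the short prompt are placeholders), timing the load and the generation separately would show whether the 10 minutes is the one-time download or the inference itself:

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "defog/sqlcoder2"  # placeholder; use your local path if already downloaded

t0 = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
print(f"load time: {time.time() - t0:.1f}s")  # includes the download on the first run

# Time a single short generation separately to get the per-request latency
inputs = tokenizer("SELECT", return_tensors="pt").to(model.device)
t0 = time.time()
model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(f"generation time: {time.time() - t0:.1f}s")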

Code:

import torch
import sqlparse
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
res = torch.cuda.is_available()
print(res)
model_name = "/root/paddlejob/workspace/env_run/sqlcoder2"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # load_in_8bit=True,
    # load_in_4bit=True,
    device_map="auto",
    use_cache=True
)
#model = model.to("cuda")
eos_token_id = tokenizer.eos_token_id
print("loaded model")

for i in tqdm(range(1)):
  question = "What is our total revenue by product in the last week?"
  prompt = """### Task
  Generate a SQL query to answer the following question:
  `{question}`

  ### Database Schema
  This query will run on a database whose schema is represented in this string:
  CREATE TABLE products (
    product_id INTEGER PRIMARY KEY, -- Unique ID for each product
    name VARCHAR(50), -- Name of the product
    price DECIMAL(10,2), -- Price of each unit of the product
    quantity INTEGER  -- Current quantity in stock
  );

  CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
    name VARCHAR(50), -- Name of the customer
    address VARCHAR(100) -- Mailing address of the customer
  );

  CREATE TABLE salespeople (
    salesperson_id INTEGER PRIMARY KEY, -- Unique ID for each salesperson
    name VARCHAR(50), -- Name of the salesperson
    region VARCHAR(50) -- Geographic sales region
  );

  CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
    product_id INTEGER, -- ID of product sold
    customer_id INTEGER,  -- ID of customer who made purchase
    salesperson_id INTEGER, -- ID of salesperson who made the sale
    sale_date DATE, -- Date the sale occurred
    quantity INTEGER -- Quantity of product sold
  );

  CREATE TABLE product_suppliers (
    supplier_id INTEGER PRIMARY KEY, -- Unique ID for each supplier
    product_id INTEGER, -- Product ID supplied
    supply_price DECIMAL(10,2) -- Unit price charged by supplier
  );

  -- sales.product_id can be joined with products.product_id
  -- sales.customer_id can be joined with customers.customer_id
  -- sales.salesperson_id can be joined with salespeople.salesperson_id
  -- product_suppliers.product_id can be joined with products.product_id

  ### SQL
  Given the database schema, here is the SQL query that answers `{question}`:
  """.format(question=question)

  inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
  #print(inputs)
  generated_ids = model.generate(
      **inputs,
      num_return_sequences=1,
      eos_token_id=eos_token_id,
      pad_token_id=eos_token_id,
      max_new_tokens=400,
      do_sample=False,
      num_beams=1,

  )
  outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
  torch.cuda.empty_cache()
  #torch.cuda.synchronize()
  # empty cache so that you can generate more results w/o memory crashing
  # particularly important on Colab – memory management is much more straightforward
  # when running on an inference service
  print(outputs[0])
  print(sqlparse.format(outputs[0].split("```sql")[-1], reindent=True))

(Two screenshots attached, including nvidia-smi output.)

It takes 10+ minutes to complete.

Defog.ai org
•
edited Jan 6

Hi @ThisIsSoMe, thanks for the code and screenshots. From what you shared, the part that takes very long is the inference loop inside tqdm. I'm guessing this might be because your environment is not actually using the GPU. I saw the nvidia-smi output showing 0MiB memory used and 0% utilization, but wasn't sure whether that was captured before or after the model was loaded. To rule that out, could you print the model's device after loading it (model.device) and check nvidia-smi while inference is running? 10 minutes sounds like roughly the time it would take on a CPU.
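A minimal sketch of those checks (assuming model, tokenizer, and prompt from the script above are already in scope):

import time

# Where did the weights end up? This should print a CUDA device, not "cpu".
print(model.device)
print(next(model.parameters()).device)

# Time one generation on its own; run `nvidia-smi` in another terminal while
# this executes and check that memory usage and GPU utilization are non-zero.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.time()
model.generate(**inputs, max_new_tokens=400, do_sample=False)
print(f"single generation took {time.time() - start:.1f}s")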
