Loading the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model takes far too long (nearly 2 hours)

#2
by TimVan1 - opened

Summary:

In an Ubuntu 20.04 environment with 3× RTX 3090 (24 GB) cards, loading the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model takes nearly 2 hours (about 7096 seconds). Even with the flash-attn and auto_gptq libraries installed, loading is still extremely slow!

Full description:

Environment

  • Hardware: RTX 3090 (24 GB) × 3
  • OS: Ubuntu 20.04.6 LTS
  • Python: 3.10
  • CUDA: 12.2
  • PyTorch: 2.3.1
  • Library versions:
    • auto_gptq==0.7.1
    • flash-attn==2.6.0 (installed locally from the cu122torch2.3cxx11abiFALSE-cp310 wheel)
    • optimum==1.21.2
    • transformers==4.42.4 (installed locally from GitHub source)
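
For reference, a minimal sanity check of this stack (nothing model-specific; it only prints what the installed libraries report):

import torch
import transformers

# Environment sanity check: library versions and GPU visibility.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
print("transformers:", transformers.__version__)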

Code

import time
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"  # the device to load the model onto

model_name_or_path = "/home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4"

# record the overall start time
start_time = time.time()

# record the model-load start time
model_load_start_time = time.time()

# load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

# record the model-load end time
model_load_end_time = time.time()
print("Model load time:", model_load_end_time - model_load_start_time)
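
Note that as written, this call never asks transformers for flash attention (there is no attn_implementation argument, and flash-attn is only used when explicitly requested), so installing the library by itself should not change the load path; device_map="auto" also shards the model across all three cards even though the Int4 checkpoint should fit on one. A variant worth trying (untested here; it assumes the checkpoint fits on a single 24 GB card):

# Untested variant: pin the quantized model to one GPU and explicitly
# request flash attention, to rule out cross-GPU sharding overhead.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map={"": 0},                       # one GPU instead of sharding over 3
    attn_implementation="flash_attention_2",  # flash-attn is only used when requested
    trust_remote_code=True
)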

Output log

2024-07-17 15:43:40 /home/ubuntu/miniconda3/envs/timvan/lib/python3.10/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
2024-07-17 15:43:40   warnings.warn(
2024-07-17 16:25:29 
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [19:17<38:34, 1157.34s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [41:19<20:54, 1254.48s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [41:20<00:00, 681.82s/it] 
Loading checkpoint shards: 100%|██████████| 3/3 [41:20<00:00, 826.72s/it]
2024-07-17 16:25:29 Some weights of the model checkpoint at /home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 were not used when initializing Qwen2MoeForCausalLM: ['model.layers.0.mlp.experts.0.down_proj.bias', 'model.layers.0.mlp.experts.0.gate_proj.bias', 'model.layers.0.mlp.experts.0.
.........
.........
2024-07-17 17:29:02 - This IS expected if you are initializing Qwen2MoeForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2024-07-17 17:29:02 - This IS NOT expected if you are initializing Qwen2MoeForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2024-07-17 17:41:54 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-17 17:41:55 The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2024-07-17 17:42:44 Model load time: 7096.050642490387
2024-07-17 17:42:44 Inference time: 49.09219837188721
2024-07-17 17:42:44 Total time: 7145.170483589172
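
If the three shards total only a few GB, as an Int4 checkpoint of this model should, then ~1200 s per shard works out to single-digit MB/s, which looks more like an I/O bottleneck (slow disk or network mount) than compute. A quick sketch to check raw read speed on the checkpoint files (it assumes .safetensors shards; adjust the glob if the checkpoint uses .bin):

import time
from pathlib import Path

# Rough I/O check: time a raw sequential read of each checkpoint shard.
shard_dir = Path("/home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")
for shard in sorted(shard_dir.glob("*.safetensors")):
    t0 = time.time()
    n_bytes = len(shard.read_bytes())  # whole-file read; shards are a few GB each
    dt = time.time() - t0
    print(f"{shard.name}: {n_bytes/1e6:.0f} MB in {dt:.1f} s ({n_bytes/1e6/dt:.0f} MB/s)")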

Where is the problem, and how can loading be sped up?
