Loading the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model takes far too long (nearly 2 hours)

#2
by TimVan1 - opened

Summary:

In an Ubuntu 20.04 environment with 3× RTX 3090 (24 GB) cards, loading the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model takes nearly 2 hours (about 7096 seconds). Even with the flash-attn and auto_gptq libraries installed, loading is still extremely slow!

Full description:

Environment

  • Hardware: RTX 3090 (24 GB) × 3
  • OS: Ubuntu 20.04.6 LTS
  • Python: 3.10
  • CUDA: 12.2
  • PyTorch: 2.3.1
  • Library versions:
    • auto_gptq==0.7.1
    • flash-attn==2.6.0 (installed locally from the cu122torch2.3cxx11abiFALSE-cp310 wheel)
    • optimum==1.21.2
    • transformers==4.42.4 (installed locally from GitHub source)
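
For reference, a minimal sanity check of this stack (nothing model-specific; it only prints what the installed libraries report):

import torch
import transformers

# Environment sanity check: library versions and GPU visibility.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
print("transformers:", transformers.__version__)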

Code

import time
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"  # the device to load the model onto

model_name_or_path = "/home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4"

# record the overall start time
start_time = time.time()

# record the model-load start time
model_load_start_time = time.time()

# load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

# record the model-load end time
model_load_end_time = time.time()
print("Model load time:", model_load_end_time - model_load_start_time)
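
Note that as written, this call never asks transformers for flash attention (there is no attn_implementation argument, and flash-attn is only used when explicitly requested), so installing the library by itself should not change the load path; device_map="auto" also shards the model across all three cards even though the Int4 checkpoint should fit on one. A variant worth trying (untested here; it assumes the checkpoint fits on a single 24 GB card):

# Untested variant: pin the quantized model to one GPU and explicitly
# request flash attention, to rule out cross-GPU sharding overhead.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map={"": 0},                       # one GPU instead of sharding over 3
    attn_implementation="flash_attention_2",  # flash-attn is only used when requested
    trust_remote_code=True
)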

Output log

2024-07-17 15:43:40 /home/ubuntu/miniconda3/envs/timvan/lib/python3.10/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
2024-07-17 15:43:40   warnings.warn(
2024-07-17 16:25:29 
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [19:17<38:34, 1157.34s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [41:19<20:54, 1254.48s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [41:20<00:00, 681.82s/it] 
Loading checkpoint shards: 100%|██████████| 3/3 [41:20<00:00, 826.72s/it]
2024-07-17 16:25:29 Some weights of the model checkpoint at /home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 were not used when initializing Qwen2MoeForCausalLM: ['model.layers.0.mlp.experts.0.down_proj.bias', 'model.layers.0.mlp.experts.0.gate_proj.bias', 'model.layers.0.mlp.experts.0.
.........
.........
2024-07-17 17:29:02 - This IS expected if you are initializing Qwen2MoeForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2024-07-17 17:29:02 - This IS NOT expected if you are initializing Qwen2MoeForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2024-07-17 17:41:54 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-17 17:41:55 The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
2024-07-17 17:42:44 Model load time: 7096.050642490387
2024-07-17 17:42:44 Inference time: 49.09219837188721
2024-07-17 17:42:44 Total time: 7145.170483589172
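
If the three shards total only a few GB, as an Int4 checkpoint of this model should, then ~1200 s per shard works out to single-digit MB/s, which looks more like an I/O bottleneck (slow disk or network mount) than compute. A quick sketch to check raw read speed on the checkpoint files (it assumes .safetensors shards; adjust the glob if the checkpoint uses .bin):

import time
from pathlib import Path

# Rough I/O check: time a raw sequential read of each checkpoint shard.
shard_dir = Path("/home/ubuntu/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")
for shard in sorted(shard_dir.glob("*.safetensors")):
    t0 = time.time()
    n_bytes = len(shard.read_bytes())  # whole-file read; shards are a few GB each
    dt = time.time() - t0
    print(f"{shard.name}: {n_bytes/1e6:.0f} MB in {dt:.1f} s ({n_bytes/1e6/dt:.0f} MB/s)")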

Where is the problem, and how can loading be sped up?
