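Debugging

Debugging multi-GPU network issues

When training or running inference with DistributedDataParallel on multiple GPUs, you may run into communication problems between processes and/or nodes. You can use the following script to diagnose network issues.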

wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py

For example, to test how two GPUs communicate, run:

python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

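If both processes can talk to each other and allocate GPU memory, each one prints an "OK" status.

For more GPUs or nodes, adjust the arguments in the command accordingly (for example --nproc_per_node and --nnodes).

The diagnostic script itself contains more details, including instructions for running it in a SLURM environment.

For an additional level of debugging, add the NCCL_DEBUG=INFO environment variable as follows: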

NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

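This dumps a lot of NCCL-related debug information, which you can search for online if it reports a problem. If you are not sure how to interpret the output, you can share the log file in an issue.

Underflow and overflow detection

This feature is currently available for PyTorch only.

For multi-GPU training, it requires DDP (torch.distributed.launch).

This feature can be used with any nn.Module-based model.

If you start getting loss=NaN, or the model behaves abnormally because of inf or nan in its activations or weights, you need to find where the first underflow or overflow happens and what led to it. Fortunately, you can detect this automatically by activating a special module.

If you are using Trainer, you just need to add: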

--debug underflow_overflow

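to the normal command-line arguments, or pass debug="underflow_overflow" when creating the TrainingArguments object.

If you are using your own training loop or another trainer, you can accomplish the same with: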

from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model)

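debug_utils.DebugUnderflowOverflow inserts hooks into the model that run immediately after each forward call and test the input and output variables as well as the corresponding module's weights. As soon as inf or nan is detected in at least one element of the activations or weights, the program asserts and prints a report like this one (caught with google/mt5-small under fp16 mixed precision):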

Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
                  encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
                  encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
             None output[1]
2.25e-01 1.00e+04 output[2]
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output

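The example output has been trimmed in the middle for brevity.

The second column shows the value of the absolute largest element, so if you look closely at the last few frames, the inputs and outputs are in the range of 1e4. When this training ran under fp16 mixed precision, the very last step therefore overflowed, since the largest finite fp16 number is 65504 (roughly 64K). To avoid overflows under fp16, activations must stay well below 1e4, because 1e4 * 1e4 = 1e8, so any matrix multiplication with large activations leads to a numerical overflow.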
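The fp16 limit can be checked with a few lines of standard-library Python. This is only an illustrative sketch: to_fp16 is a hypothetical helper (it is not part of Transformers or PyTorch) that rounds a Python float to half precision through the struct module's "e" format and maps overflow to inf.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; values too large for fp16 become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct refuses to pack values beyond the fp16 range
        return math.copysign(math.inf, x)


FP16_MAX = 65504.0  # largest finite fp16 value

print(to_fp16(FP16_MAX))   # still finite
print(to_fp16(1e4))        # 1e4 itself is representable
print(to_fp16(1e4 * 1e4))  # the product 1e8 is far beyond the limit and becomes inf
```

Real mixed-precision kernels may accumulate in higher precision before writing the result back, so this only illustrates the limit of the storage format, not the exact behaviour of a given matmul.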

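At the very start of the trace you can see at which batch number the problem occurred (here, Detected inf/nan during batch_number=0 means the problem occurred in the first batch).

Each reported frame starts with a header line that names the module the frame is reporting on. If we look at just this frame: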

                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output

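Here, encoder.block.2.layer.1.layer_norm identifies the layer norm inside layer.1 of block.2 of the encoder. The indices come from the module names and are zero-based, so this is the second layer of the third block. T5LayerNorm is the class of the module whose forward was called.

Let's look at the last few frames of that report: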

Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
[...]
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output

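The last frame reports on the Dropout.forward function: the first entry is its only input and the second is its only output. Its name, encoder.block.2.layer.1.dropout, shows that it was called through the dropout attribute of layer.1 in block.2, the layer that also contains DenseReluDense, and the report header shows that this happened during the very first batch. The absolute largest input element was 6.27e+04, and the absolute largest output element was inf.

You can see here that T5DenseGatedGeluDense.forward produced output activations whose absolute maximum was around 62.7K, very close to fp16's upper limit of 64K. In the next frame, Dropout zeroes some of the elements and rescales the remaining ones by 1 / (1 - p), which pushes the absolute maximum above 64K and causes an overflow (inf).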
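The arithmetic behind that last step can be sketched in plain Python. The sketch assumes a dropout probability of p = 0.1 (the report does not state the rate, so this value is an assumption) and uses a hypothetical to_fp16 helper built on the struct module to emulate fp16 rounding; neither helper is part of Transformers or PyTorch.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; values too large for fp16 become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dropout_survivor(value: float, p: float) -> float:
    """fp16 value of an element that survives dropout: it is scaled by 1 / (1 - p)."""
    return to_fp16(to_fp16(value) / (1.0 - p))


abs_max_input = 6.27e4  # abs max of input[0] in the last frame of the report
p = 0.1                 # assumed dropout probability, not taken from the report

print(to_fp16(abs_max_input))              # finite, just under the fp16 limit
print(dropout_survivor(abs_max_input, p))  # inf
```

With these numbers the rescaled value is about 6.97e4, which is above the largest finite fp16 value of 65504, so the output frame shows inf.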

As you can see, we need to look at the earlier frames, where the numbers start to become very large for fp16.

Let's match the report to the code in models/t5/modeling_t5.py:

class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states

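Now it is easy to see the dropout call and all the calls that precede it.

Because the detection happens in a forward hook, these reports are printed immediately after each forward returns.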
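To make the forward-hook mechanism concrete, here is a minimal sketch of the general idea using PyTorch's register_forward_hook. It is not the DebugUnderflowOverflow implementation: attach_inf_nan_hooks is a hypothetical helper that only checks module outputs, keeps no frame history, and raises instead of printing a report.

```python
import torch
from torch import nn


def attach_inf_nan_hooks(model: nn.Module) -> list:
    """Register a forward hook on every submodule that raises on inf/nan outputs."""

    def make_hook(name):
        def hook(module, inputs, output):
            # A module may return one tensor or a tuple mixing tensors and other objects.
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for i, out in enumerate(outputs):
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    raise RuntimeError(
                        f"inf/nan in output[{i}] of {name} ({type(module).__name__})"
                    )

        return hook

    # named_modules() also yields the root module under the empty name "".
    return [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
    ]


model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
handles = attach_inf_nan_hooks(model)
model(torch.ones(1, 4))  # finite activations pass through silently
```

Each hook runs right after its module's forward returns, which is why the first failing module is reported before any later module runs. Calling handle.remove() on each returned handle detaches the hooks again.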

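Going back to the full report: to act on it and fix the problem, we need to go back a few frames to where the numbers started to grow, and most likely switch to fp32 mode there so that the numbers don't overflow when multiplied or summed. There may be other solutions as well. For example, if amp is enabled, we can turn it off temporarily after moving the original forward into a helper wrapper, like so: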

import torch


def _forward(self, hidden_states):
    hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
    hidden_linear = self.wi_1(hidden_states)
    hidden_states = hidden_gelu * hidden_linear
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.wo(hidden_states)
    return hidden_states


def forward(self, hidden_states):
    if torch.is_autocast_enabled():
        # Run the whole block with autocast disabled. Inputs produced under
        # autocast may be fp16, so cast them to fp32 before use.
        with torch.autocast(device_type="cuda", enabled=False):
            return self._forward(hidden_states.float())
    else:
        return self._forward(hidden_states)

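Because the automatic detector only reports the inputs and outputs of full frames, once you know where to look you may also want to analyse the intermediate stages of a specific forward function. In that case, you can use the detect_overflow helper function to inject the detector where you want it, for example:

from torch import nn
from transformers.debug_utils import detect_overflow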


class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)

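As you can see, we added two of these, so now we can track whether inf or nan was detected in forwarded_states somewhere in between.

In fact, the detector already reports these, because each call in the example above is an nn.Module. But if you had some local direct calculations, this is how you would check them.

Additionally, if you instantiate the debugger in your own code, you can change the number of frames it prints from its default, for example: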

from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)

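Tracing absolute min and max values for specific batches

The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.

Suppose you want to watch the absolute min and max values of all the ingredients of each forward call for a given batch, and only for batches 1 and 3. Then you instantiate the class as: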

debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])

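Now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector.

Batches are 0-indexed.

This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward right to that area. Here is a sample truncated output for such a configuration: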

                  *** Starting batch number=1 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
                  decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
                  decoder T5Stack
     not a tensor output
                  lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
                   T5ForConditionalGeneration
     not a tensor output

                  *** Starting batch number=3 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]

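Here you get a huge number of frames dumped, as many as there are forward calls in your model, so it may or may not be what you want. Sometimes, though, it is easier to use for debugging than a normal debugger. For example, if a problem starts at batch number 150, you can dump the traces for batches 149 and 150 and compare where the numbers start to diverge.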
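Comparing two dumped batches by eye gets tedious, so a small script can help. The following is a rough sketch, not a Transformers utility: parse_trace and first_divergence are hypothetical helpers, the parser assumes the three-column text layout shown above, and the two sample traces are made up for illustration.

```python
import math


def parse_trace(text):
    """Return [((module_name, entry), abs_max), ...] in the order frames were dumped."""
    rows, module = [], None
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2 and tokens[0] != "None":
            module = tokens[0]  # frame header: "<module name> <class name>"
        elif len(tokens) == 3:
            try:
                float(tokens[0])            # first column: abs min
                abs_max = float(tokens[1])  # second column: abs max
            except ValueError:
                continue  # not a numeric row
            rows.append(((module, tokens[2]), abs_max))
    return rows


def first_divergence(good, bad, factor=10.0):
    """Return the first (module, entry) whose abs max differs by more than `factor`."""
    for (key_a, max_a), (key_b, max_b) in zip(good, bad):
        if key_a != key_b:
            raise ValueError(f"traces are not aligned: {key_a} vs {key_b}")
        if not (math.isfinite(max_a) and math.isfinite(max_b)):
            return key_b
        if max(max_a, max_b) > factor * min(max_a, max_b):
            return key_b
    return None


# Made-up sample traces for two consecutive batches (hypothetical values).
batch_149 = """
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 2.10e+02 input[0]
3.18e-04 1.35e+03 output
"""
batch_150 = """
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
"""

print(first_divergence(parse_trace(batch_149), parse_trace(batch_150)))
```

The factor of 10 is an arbitrary threshold chosen for the sketch; in practice you would tune it to the amount of batch-to-batch variation that is normal for your model.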

You can also specify the batch number after which to stop the training:

debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
