Fast tokenizers in the QA pipeline


We will now dive into the question-answering pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how to deal with very long contexts that end up being truncated. You can skip this section if you are not interested in the question answering task.

Using the question-answering pipeline

As we saw in Chapter 1, we can use the question-answering pipeline like this to get the answer to a question:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)
{'score': 0.97773,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

Unlike the other pipelines, which can't truncate and split texts that are longer than the maximum length accepted by the model (and thus may miss information at the end of a document), this pipeline can deal with very long contexts and will return the answer to the question even if it is at the end:

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)
{'score': 0.97149,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Let's see how it does all of this!

Using a model for question answering

Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the question-answering pipeline is distilbert-base-cased-distilled-squad (the "squad" in the name comes from the dataset on which the model was fine-tuned; we will talk more about the SQuAD dataset in Chapter 7):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we tokenize the question and the context as a pair, with the question first.

An example of tokenization of question and context

Models for question answering work a little differently from the models we have seen so far. Using the picture above as an example, the model has been trained to predict the index of the token where the answer starts (here 21) and the index of the token where the answer ends (here 24). This is why those models do not return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case we have only one input containing 66 tokens, we get:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([1, 66]) torch.Size([1, 66])

To convert those logits into probabilities, we will apply a softmax function, but before that we need to make sure we mask the indices that are not part of the context. Our input is [CLS] question [SEP] context [SEP], so we need to mask the tokens of the question as well as the [SEP] token. We will keep the [CLS] token, however, as some models use it to indicate that the answer is not in the context.

Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use -10000:

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly masked the logits corresponding to positions we do not want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

At this stage, we could take the argmax of the start and end probabilities, but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probability of each possible start_index and end_index where start_index <= end_index, then take the tuple (start_index, end_index) with the highest probability.

Assuming the events "The answer starts at start_index" and "The answer ends at end_index" to be independent, the probability that the answer starts at start_index and ends at end_index is:

$$\mathrm{start\_probabilities}[\mathrm{start\_index}] \times \mathrm{end\_probabilities}[\mathrm{end\_index}]$$

So, to compute all the scores, we just need to compute all the products start_probabilities[start_index] × end_probabilities[end_index] where start_index <= end_index.

First let's compute all the possible products:

scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we will mask the values where start_index > end_index by setting them to 0 (the other probabilities are all positive numbers). The torch.triu() function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:

scores = torch.triu(scores)

Now we just have to get the index of the maximum. Since PyTorch will return the index in the flattened tensor, we need to use the floor division // and modulus % operations to get the start_index and end_index:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

We are not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the first result in the previous section):

0.97773

✏️ Try it out! Compute the start and end indices of the five most likely answers.
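One possible sketch for this exercise (not the only approach) is to flatten the upper-triangular scores tensor computed above and use torch.topk on it:

top_scores, top_indices = scores.flatten().topk(5)
for score, idx in zip(top_scores, top_indices):
    # Recover the 2D (start, end) position from the flattened index
    start_index = idx.item() // scores.shape[1]
    end_index = idx.item() % scores.shape[1]
    print(start_index, end_index, score.item())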

We have the start_index and end_index of the answer in terms of tokens, so now we just need to convert them to character indices in the context. This is where the offsets will be super useful. We can grab them and use them like we did in the token classification task:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)
{'answer': 'Jax, PyTorch and TensorFlow',
 'start': 78,
 'end': 105,
 'score': 0.97773}

Great! That's the same as in our first example!

✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
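Here is one possible sketch (again, just one way of doing it), reusing the scores and offsets variables defined above to turn the five highest-scoring token spans into text answers:

top_scores, top_indices = scores.flatten().topk(5)
for score, idx in zip(top_scores, top_indices):
    start_index = idx.item() // scores.shape[1]
    end_index = idx.item() % scores.shape[1]
    # Map the token indices back to character positions in the context
    start_char, _ = offsets[start_index]
    _, end_char = offsets[end_index]
    print({"answer": context[start_char:end_char], "start": start_char, "end": end_char, "score": score.item()})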

Handling long contexts

If we try to tokenize the question and the long context we used as an example previously, we get a number of tokens that is higher than the maximum length used in the question-answering pipeline (which is 384):

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
461

So we will need to truncate our inputs at that maximum length. There are several ways we can do this, but we do not want to truncate the question, only the context. Since the context is the second sentence, we will use the "only_second" truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question where the answer is toward the end of the context, and when we truncate it that answer is not present:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))
"""
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP

[UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internal [SEP]
"""

This means the model will have a hard time picking the correct answer. To fix this, the question-answering pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we do not split the context at exactly the wrong place to make it possible to find the answer, it also includes some overlap between the chunks.

We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument. Here is an example, using a smaller sentence:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'

As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others) and there is an overlap of 2 tokens between each entry.
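For instance, here is a minimal sketch of that padding (padded_inputs is just an illustrative name); with padding=True the shorter last chunk is padded to the length of the longest one:

padded_inputs = tokenizer(
    sentence,
    truncation=True,
    return_overflowing_tokens=True,
    max_length=6,
    stride=2,
    padding=True,  # pad the last chunk with [PAD] tokens so all chunks have the same length
)

for ids in padded_inputs["input_ids"]:
    print(tokenizer.decode(ids))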

Let's take a closer look at the result of the tokenization:

print(inputs.keys())
dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

As expected, we get the input IDs and an attention mask. The last key, overflow_to_sample_mapping, is a map that tells us which sentence each of the results corresponds to. Here we have 7 results that all come from the (only) sentence we passed the tokenizer:

print(inputs["overflow_to_sample_mapping"])
[0, 0, 0, 0, 0, 0, 0]

This is more useful when we tokenize several sentences together. For instance, this:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

gets us:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

which means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.
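As a small illustrative sketch (chunks_per_sentence is just a name chosen here), the mapping can be used to group the chunks back by the sentence they came from:

# Group the decoded chunks by the index of the sentence they come from
chunks_per_sentence = {}
for sample_idx, ids in zip(inputs["overflow_to_sample_mapping"], inputs["input_ids"]):
    chunks_per_sentence.setdefault(sample_idx, []).append(tokenizer.decode(ids))

for sample_idx, chunks in chunks_per_sentence.items():
    print(f"Sentence {sample_idx}: {len(chunks)} chunks")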

Now let's go back to our long context. By default the question-answering pipeline uses a maximum length of 384, as we mentioned earlier, and a stride of 128, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing max_seq_len and stride arguments when calling the pipeline). We will thus use those parameters when tokenizing. We will also add padding (to have samples of the same length, so we can build tensors) as well as ask for the offsets:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

Those inputs will contain the input IDs and attention masks the model expects, as well as the offsets and the overflow_to_sample_mapping we just talked about. Since those two are not parameters used by the model, we will pop them out of the inputs (and we won't store the map, since it is not useful here) before converting them to tensors:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
torch.Size([2, 384])

Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([2, 384]) torch.Size([2, 384])

Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the softmax to convert our logits to probabilities:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is similar to what we did for the small context, but we repeat it for each of our two chunks. We attribute a score to all possible spans of answer, then take the span with the best score:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
[(0, 18, 0.33867), (173, 184, 0.97149)]

Those two candidates correspond to the best answers the model was able to find in each chunk. The model is way more confident the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to have our answer, but it is interesting to see what the model has picked in the first chunk).

✏️ Try it out! Adapt the code above to return the scores and spans of the five most likely answers (in total, not per chunk).
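One possible sketch for this exercise: collect the top candidates of each chunk together with the index of the chunk they come from, then sort everything by score and keep the five best (all_candidates is just an illustrative name):

all_candidates = []
for chunk_index, (start_probs, end_probs) in enumerate(zip(start_probabilities, end_probabilities)):
    chunk_scores = torch.triu(start_probs[:, None] * end_probs[None, :])
    top_scores, top_indices = chunk_scores.flatten().topk(5)
    for score, idx in zip(top_scores, top_indices):
        start_idx = idx.item() // chunk_scores.shape[1]
        end_idx = idx.item() % chunk_scores.shape[1]
        all_candidates.append((chunk_index, start_idx, end_idx, score.item()))

# Keep the five best spans across all chunks
all_candidates = sorted(all_candidates, key=lambda c: c[3], reverse=True)[:5]
print(all_candidates)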

The offsets we grabbed earlier are actually a list of offsets, with one list per chunk of text:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}

If we ignore the first result, we get the same result as our pipeline for this long context. Yay!

✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in top_k=5 when calling it.

This concludes our deep dive into the tokenizer's capabilities. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.