pradachan's picture
Upload folder using huggingface_hub
f71c233 verified
raw
history blame
39 kB
# LOGIC PRE-TRAINING OF LANGUAGE MODELS
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Pre-trained language models (PrLMs) have been shown useful for enhancing
a broad range of natural language understanding (NLU) tasks. However, the
capacity for capturing logic relations in challenging NLU still remains a bottleneck
even for state-of-the-art PrLM enhancement, which greatly stalled their reasoning
abilities. Thus we propose logic pre-training of language models, leading to the
logic reasoning ability equipped PrLM, PROPHET. To let logic pre-training perform
on a clear, accurate, and generalized knowledge basis, we introduce fact instead of
the plain language unit in previous PrLMs. The fact is extracted through syntactic
parsing in avoidance of unnecessary complex knowledge injection. Meanwhile, it
enables training logic-aware models to be conducted on a more general language
text. To explicitly guide the PrLM to capture logic relations, three pre-training
objectives are introduced: 1) logical connectives masking to capture sentence-level
logics, 2) logical structure completion to accurately capture facts from the original
context, 3) logical path prediction on a logical graph to uncover global logic
relationships among facts. We evaluate our model on a broad range of NLP and
NLU tasks, including natural language inference, relation extraction, and machine
reading comprehension with logical reasoning. Results show that the extracted fact
and the newly introduced pre-training tasks can help PROPHET achieve significant
performance in all the downstream tasks, especially in logic reasoning related tasks.
1 INTRODUCTION
Machine reasoning in natural language understanding (NLU) aims to teach machines to understand
human languages by building and analyzing the connections between the facts, events, and
observations using logical analysis techniques like deduction and induction, which is one of the
ultimate goals towards human-parity intelligence. Although pre-trained language models (PrLMs),
such as BERT (Devlin et al., 2018), GPT (Radford et al., 2018), XLNet (Yang et al., 2019) and
RoBERTa (Liu et al., 2019), have established state-of-the-art performance on various aspects in NLU,
they are still short in complex language understanding tasks that involve reasoning (Helwe et al.,
2021). The major reason behind this is that they are insufficiently capable of capturing logic relations
such as negation (Kassner & Schütze, 2019), factual knowledge (Poerner et al., 2019), events (Rogers
et al., 2020), and so on. Many previous studies (Sun et al., 2021; Xiong et al., 2019; Wang et al.,
2020) are then motivated to inject knowledge into pre-trained models like BERT and RoBERTa.
However, they too much rely on massive external knowledge sources and ignore that language itself
is a natural knowledge carrier as the basis of acquiring logic reasoning ability (Ouyang et al., 2021).
Taking the context in Figure 1 as an example, previous approaches tend to focus on entities such as
the definition of "government" and the concepts related to it like "governor", but overlook the exact
relations inherent in this example, thus failing to model the complex reasoning process.
Given the fact that PrLMs are the key supporting components in natural language understanding,
in this work, we propose a fundamental solution by empowering the PrLMs with the capacity of
capturing logic relations, which is necessary for logical reasoning. However, logical reasoning can
only be implemented on the basis of clear, accurate, and generalized knowledge. Therefore, we
leverage fact as the conceptual knowledge unit to serve the basis for logic relation extraction. Fact is
organized as a triplet, i.e., in the form of predicate-argument structures, to represent the meaning such
as "who-did-what-to-whom" and "who-is-what". Compared with existing studies that inject complex
knowledge like knowledge graphs, the knowledge structure based on fact is far less complicated and
more general in representing events and relations in languages.
-----
On top of the fact-based knowledge structure, we present PROPHET, a logic-aware pre-trained
language model to learn the logic-aware relations in a universal way from very large texts. In detail,
we introduce three novel pre-training objectives based on the newly introduced knowledge structure
basis fact: 1) logical connectives masking for learning sentence-level logic connection. 2) logical
structure completion task on top of facts for regularization, aligning extracted fact with the original
context. 3) logical path prediction to capture the logic relationship between facts. PROPHET is
evaluated on a broad range of language understanding tasks: natural language inference, semantic
similarity, machine reading comprehension, etc. Experimental results show that the fact is useful
as the carrier for knowledge modeling, and the newly introduced pre-training tasks can improve
PROPHET and achieves significant performance on downstream tasks.[1]
2 RELATED WORK
2.1 PRE-TRAINED LANGUAGE MODELS IN NLP
Large pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Radford et al., 2018)
have brought dramatic empirical improvements on almost every NLP task in the past few years.
A classical norm of pre-training is to train neural models on a large corpus with self-supervised
pre-training objectives. "Self-supervised" means that the supervision provided in the training process
is automatically generated from the raw text instead of manually generation. Designing effective
criteria for language modeling is one of the major topics in training pre-trained models, which decides
how the model captures the knowledge from large-scale unlabeled data. The most popular pre-training
objective used today is masked language modeling (MLM), initially used in BERT (Devlin et al.,
2018), which randomly masks out tokens, and the model is asked to uncover it given surrounding
context. Recent studies have investigated diverse variants of denoising strategies (Raffel et al., 2020;
Lewis et al., 2020), model architecture (Yang et al., 2019), and auxiliary objectives (Lan et al.,
2019; Joshi et al., 2020) to improve the model strength during pre-training. Although the existing
techniques have shown effectiveness in capturing syntactic and semantic information after large-scale
pre-training, they perform sensitivity to role reversal and struggles with pragmatic inference and
role-based event knowledge (Rogers et al., 2020), which are critical to the ultimate goal of complex
reasoning that requires to uncover logical structures. However, it is difficult for pre-trained language
models to capture the logical structure inherent in the texts since logical supervision is rarely available
during pre-training. Therefore, we are motivated to explicitly guide the model to capture such clues
via our newly introduced self-supervised tasks.
2.2 REASONING ABILITY FOR PRE-TRAINED LANGUAGE MODELS
There is a lot of work in the research line of enhancing reasoning abilities in pre-trained language
models via injecting knowledge. The existing approaches mainly design novel pre-training objectives
and leverage abundant knowledge sources such as WordNet (Miller, 1995).
Notably, ERNIE 3.0 (Sun et al., 2021) uses a broad range of pre-training objectives from word-aware,
structure-aware to knowledge-aware tasks, based on a 4TB corpus consisting of plain texts and a
large-scale knowledge graph. WKLM (Xiong et al., 2019) replaces entity mentions in the document
with other entities of the same type, and the objective is to distinguish the replaced entity from the
original ones. KEPLER (Wang et al., 2021b) encodes textual entity descriptions using embeddings
from a PrLM to take full advantage of the abundant textual information. K-Adapter (Wang et al., 2020)
designs neural adapters to distinguish the type of knowledge sources to capture various knowledge.
Our proposed method differs from previous studies in three aspects. Firstly, our model does not
require any external knowledge resources like previous methods that use WordNet, WikiData, etc.
We only use small-scale textual sources following the standard PrLMs like BERT (Devlin et al.,
2018), along with an off-the-shelf dependency parser to extract facts. Secondly, previous works only
consider triplet-level pre-training objectives. We proposed a multi-granularity pre-training strategy,
considering not only triplet-level information but also sentence-level and global knowledge to enhance
logic reasoning. Finally, we propose a new training mechanism apart from masked language modeling
(MLM), hoping to shed light on more logic pre-training strategies in this research line.
1Our codes have been uploaded as supplemental material, which will be open after the double review period.
-----
|Fact anarchists, participated, revolution revolution, opposite, movement they, met, suppression suppresion, after, stabilized government, was, stabilized|Logical Graph opposite participated movement anarchists revolution after coref met suppression stabalized they was same government stabalized|
|---|---|
|Despite concerns, anarchists participated in the Russian Revolution Text in opposition to the White movement. However, they met harsh suppression after the Bolshevik government was stabilized.||
Logical Graph
Fact
_they_ _was_
_government_ _stabalized_
Text
Figure 1: How the facts and logical graph constructed from raw text inputs. Edges in red denotes
additional edges added in the logical graph, while text with green indicates the sentence-level logical
connectives which will be mentioned in §4.
3 PRELIMINARIES
In this section, we will introduce the concept of fact and logical graph, which is the basis of PROPHET.
We will also describe extracting the fact for logical graph construction, as an example shown in
Figure 1.
3.1 FACT
Following Nakashole & Mitchell (2014) and Ouyang et al. (2021), we extract facts which are triplets
represented as T = {A1, P, A2}, where A1 and A2 are the arguments and P is the predicate between
them. It can well represent a broad range of facts, reflecting the notion of "who-did-what-to-whom"
and "who-is-what", etc.
We extract such facts in a syntactic way, which makes our approach generic and easy to apply. Given
a document, we first split the document into multiple sentences. For each sentence, we conduct
dependency parsing using StanfordCoreNLP (Manning et al., 2014).[2] For the analyzed dependencies,
basically, we consider verb phrases and some prepositions in the sentences as "predicates", and then
we search for their corresponding actors and actees as the "arguments".
3.2 LOGICAL GRAPH
A logical graph is an undirected (but is not required to be connected) graph that represents
logical dependency relation between components in facts. In logical graphs, nodes represent
argument/predicates in the fact, and edges indicate whether two nodes have relations in a fact.
Such a structure can well unveil and organize semantic information captured by facts. Besides, a
logical graph supports considerations among long-range dependencies via connecting arguments and
their relations in different facts across different spans.
We further show how to construct such graphs based on facts. Despite given relations in facts, we
design another two types of edges based on identical mentions and coreference information. (1) There
can be identical mentions in different sentences, resulting in repeated nodes in facts. We connect
nodes corresponding to the same non-pronoun arguments by edges with edge type same. (2) We
conduct coreference resolution on context using an off-to-shelf model to identify arguments in facts
that refer to the same one.[3] We add edges with type coref between them. The final logical graph is
denoted as S = (V, E), where V = Ai _P and i_ 1, 2 .
_∪_ _∈{_ _}_
[2https://stanfordnlp.github.io/CoreNLP/, we also tried to use OpenIE directly; however,](https://stanfordnlp.github.io/CoreNLP/)
the performance is not satisfactory.
[3https://github.com/huggingface/neuralcoref.](https://github.com/huggingface/neuralcoref)
-----
_anarchists, participated, revolution_
_revolution, opposite, movement_
_they, met, suppression_
_suppresion, after, stabilized_ Fact
_government, was, stabilized_
Fact
_.. However, they met harsh suppression_
Text
_after the Bolshevik government ..._
_after_ _stabilized_
_Text_
_Encoder_
Logical Graph
_participated_
_anarchists_ _revolution_
_[CLS]...suppression [MASK] the_
|Col1|Col2|Col3|Col4|Col5|Col6|Col7|
|---|---|---|---|---|---|---|
|pre-training with fact-aware logics ... Text Encoder sentence connective logical path ... masking fact unit alignment prediction|||||||
||||||||
_Bolshevik...[SEP]_
_...suppression after the Bolshevik..._
_movement_
_after_
_stabalized_
_suppression, after, stabilized?_
_( suppression, after, stabilized )_
_revolution, government,_ _?_
_revolution, government, 1_
coref
_they_
_met_
_suppression_
_was_
same
Text Fact Unit Node Pairs
_government_
_stabalized_
Figure 2: An illustration about pre-training methods used in PROPHET. The model takes the text,
extracted fact and the randomly sampled node pairs in the logical graph as the input. The model is
pre-trained with three novel objectives. One is the standard masked language modeling applied to
sententious connectives, the others are fact alignment and logical path prediction.
4 PROPHET
4.1 MODEL ARCHITECTURE
We follow BERT (Devlin et al., 2018) and use a multi-layer bidirectional Transformer (Vaswani
et al., 2017) as the model architecture of PROPHET. For keeping the focus on the newly introduced
techniques, we will not review the ubiquitous Transformer architecture in detail. We develop
PROPHET by using exactly the same model architecture as BERT-base, where the model consists of
12 transformer layers, with 768 hidden size, 12 attention heads, and 110M model parameters in total.
4.2 LOGIC-AWARE PRE-TRAINING TASKS
We describe three pre-training tasks used for pre-training PROPHET in this section. Figure 2 is an
illustration of PROPHET pre-training. The first task is logical connectives masking (LCM) generalized
from masked language modeling (Devlin et al., 2018) for logical connectives to learn sentence-level
representation. The second task is logical structure completion (LSC) for learning logic relationship
inside a fact, where we first randomly mask arguments in facts, and then predict those items. Finally,
a logical path prediction (LPP) task is proposed for recognizing the logical relations of randomly
selected node pairs.
**Logical Connective Masking** Logical connective masking is an extension of the masked language
modeling (MLM) pre-training objective in Devlin et al. (2018), with a particular focus on connective
indication tokens. We use the Penn Discourse TreeBank 2.0 (PDTB) (Prasad et al., 2008) to draw
the logical relations among sentences. Specifically, PDTB 2.0 contains relations that are manually
annotated on the 1 million Wall Street Journal (WSJ) corpus and are broadly characterized into
"Explicit" and "Implicit" connectives. We use the "Explicit" type (in total 100 such connectives),
which apparently presents in sentences such as discourse adverbial "instead" or subordinating
conjunction "because". Taking all the identified connectives and some randomly sampled other
tokens (for a total 15% of the tokens of the original context), we replace them with a [MASK] token
80% of the time, with a random token 10% of the time and leave them unchanged 10% of the time.
The MLM objective is to predict the original tokens of these sampled tokens, which has proven
effective in previous works (Devlin et al., 2018; Liu et al., 2019). In this way, the model learns
to recover the logical relations for two given sentences, which helps language understanding. The
objective of this task is denoted as _conn._
_L_
**Logical Structure Completion** To align representation between the context and the extracted fact,
we introduce a pre-training task of logical structure completion. The motivation here is to encourage
-----
the model to learn the structure-aware representation that encodes the "Who-did-What-to-Whom"like meanings for better language understanding. To speak in detail, we randomly select a specific
proportion λ of the total facts (λ = 20% in this work), from a given context. For each chosen fact, we
either ask the model to complete "Argument-Predicate-?" or "Argument-?-Argument" (the templates
are selected based on equal probability). We denote all the blanks that need to be completed as m[a]
and m[p], denoting arguments and predicates, respectively. In our implementation, this objective is the
same as masked language modeling to keep simplicity, by using the original loss following Devlin
et al. (2018).
log D(xi _m[a], m[p]),_ (1)
_|_
_i∈Xa∪p_
_Lalign = −_
where D is the discriminator to predicts a token from a large vocabulary.
**Logical Path Prediction** To learn representation from the constructed logical graph, thus endowing
the model with global logical reasoning ability, we propose the pre-training task of predicting whether
there exists a path between two selected nodes in the logical graph. In this way, the model learns to
look at logical relations across a long distance of arguments and predicates in different facts.
We randomly sample 20% nodes from logical graph to form set V _[′], there are in total C_ [2]v[′] [node pairs.]
_|_ _|_
We have a maximum number maxp of node pairs to predict. To avoid bias in the training process, we
try to make sure that _[max]2_ _[p]_ are positive samples and the rest are negative samples, thus balancing
positive-negative ratios. If the number of positive/negative samples is less than _[max]2_ _[p]_, we just keep
the original pairs. Formally, the pre-training objective of this task is calculated as below following
Guo et al. (2020):
[δ log σ[vi, vj] + (1 _δ) log(1_ _σ[vi, vj])],_ (2)
_−_ _−_
_vi,jX∈V_ _[′]_
_LP ath = −_
where δ is 1 when vi and vj have a connected path and 0 otherwise. [vi, vj] denotes the concatenation
of representations of vi and vj.
The final training objective is the weighted sum of the above mentioned three losses.
_L = Lconn + Lalign + LP ath._ (3)
4.3 PRE-TRAINING DETAILS
We use the English Wikipedia (1.1 million articles in total), we sample the train and valid datasets
with a split ratio of 19 : 1 on the original datasets. We omit the "Reference" and "Literature" part
in a document to ensure data quality. Following the previous practice (Devlin et al., 2018), we
limit the length of sentences in each batch as up to 512 tokens and the batch size is 128. We use
Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 1e − 6, and weight decay is set as
0.01. We pre-train our model for 500k steps. We use 8 NVIDIA V100 32G GPUs, with FP16 and
deepspeed for training acceleration. Initialized by the pre-trained weights of BERTbase, we continue
training our models for 200k steps.
5 EXPERIMENTS
5.1 TASKS AND DATASETS
Our experiments are conducted on a broad range of language understanding tasks, including natural
language inference, machine reading comprehension, semantic similarity, and text classification.
Some of these tasks are a part of GLUE (Wang et al., 2018) benchmark. We also extend our
experiments to DocRED (Yao et al., 2019), a widely used benchmark of document-level relation
extraction for generalizability. To verify our model’s reasoning abilities of logic, we perform
experiments on two recent logical reasoning datasets in the form of machine reading comprehension,
ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020).
-----
Model _Classification_ _Language Inference_ _Semantic Similarity_ _Avg._
CoLA SST-2 MNLI QNLI RTE MRPC QQP STS-B -
_In literature_
BERTbase 52.1 93.5 84.6/83.4 90.5 66.4 88.9 71.2 85.8 79.6
SemBERTbase 57.8 93.5 84.4/84.0 90.9 69.3 88.2 71.8 87.3 80.8
_Our implementation_
BERTbase 53.6 93.5 84.6/83.4 90.9 66.6 88.6 71.2 85.8 79.8
PROPHET 57.0 93.9 85.3/84.3 91.4 69.8 89.5 72.0 86.0 81.1
Table 1: Leaderboard results on GLUE benchmark. The number below each task denotes the number
of training examples. F1 scores are reported for QQP and MRPC, Spearman correlations are reported
for STS-B, and accuracy scores are reported for the other tasks.
Model _ReClor_ _LogiQA_
Dev Test Test-E Test-H Dev Test
Human Performance* - 63.0 57.1 67.2 - 86.0
_In literature_
FOCAL REASONER (Ouyang et al., 2021) 78.6 73.3 86.4 63.0 47.3 45.8
LReasoner (Wang et al., 2021a) 74.6 71.8 83.4 62.7 45.8 43.3
DAGN (Huang et al., 2021) 65.8 58.3 75.9 44.5 36.9 39.3
BERTlarge (Devlin et al., 2018) 53.8 49.8 72.0 32.3 34.1 31.0
XLNetlarge (Yang et al., 2019) 62.0 56.0 75.7 40.5 - -
RoBERTalarge (Liu et al., 2019) 62.6 55.6 75.5 40.0 35.0 35.3
DeBERTalarge (He et al., 2020) 74.4 68.9 83.4 57.5 44.4 41.5
_Our implementation_
BERTbase 51.2 47.3 71.6 28.2 33.8 32.1
PROPHET 53.4 48.8 72.4 32.2 35.2 34.1
Table 2: Accuracy on ReClor and LogiQA dataset. The public methods are based on large models.
5.2 RESULTS
Table 1 shows results on the GLUE benchmark datasets. We have the following observations from
the above results.
(1) PROPHET obtains substantial gains over the BERT baseline (continual trained for 200K steps
for a fair comparison), indicating that our model can work well in a general sense of language
understanding.
(2) PROPHET performs particularly well on language inference tasks including MNLI, QNLI, and
RTE,[4] which indicates our model’s ability to reasoning.
(3) Whether it is large-scale datasets such as QQP and MNLI or small datasets like COLA and SST-B,
our model demonstrates a consistent improvement, indicating its robustness.
(4) From Table 2, we can see that PROPHET improves the logical reasoning ability of BERT baseline
by a large margin. Especially, armed with our approach, the results on the two datasets for the
BERT-base model are comparable or even surpass those with BERT-large results.
In addition, we conducted experiments on a large-scale human-annotated dataset for document-level
relation extraction (Yao et al., 2019). The results are shown in Table 3.[5] From the table, we can see
that PROPHET still does well for relation extraction for documents by outperforming the baseline
4We exclude the problematic WNLI set.
5We only report the results for Ign F1 in the annotated setting as the distant supervision is too slow to train.
-----
_Dev_ _Test_
Model
F1 Intra-F1 Inter-F1 F1
BERTbase* (Devlin et al., 2018) 54.2 61.6 47.2 53.2
Two-Phase BERT* (Wang et al., 2019) 54.4 61.8 47.3 53.9
PROPHET 54.8 (↑0.6) 62.4 (↑0.8) 47.5 (↑0.3) 54.3 (↑1.1)
Table 3: Main results on the dev and test set for DocRED. * indicates that the results are taken from
Nan et al. (2020). Intra- and Inter-F1 indicates F1 scores for the intra- and inter-sentence relations
following the setting of Nan et al. (2020).
Model _Classification_ _Language Inference_ _Semantic Similarity_ _Avg._
CoLA SST-2 MNLI QNLI RTE MRPC QQP STS-B -
PROPHET 57.0 93.9 85.3/84.3 91.4 69.8 89.5 72.0 86.0 81.1
w/o LCM 53.6 93.5 85.1/84.0 90.9 68.2 88.6 71.2 85.8 80.1
w/o LSC 53.6 93.6 85.0/84.1 91.3 69.0 88.9 71.4 85.9 80.3
w/o LPP 52.1 93.0 84.6/83.4 90.9 66.4 88.6 71.2 85.8 79.6
Table 4: Ablation studies of PROPHET on the test set of GLUE dataset.
substantially. It even surpasses the two-phase BERT. Also, our model is especially good at coping
with inter-sentence relations compared with baseline models, which means that our model is indeed
capable of synthesizing the information across multiple sentences of a document, verifying the
effectiveness of leveraging sententious and global information.
6 ANALYSIS
6.1 ABLATION STUDY
To investigate the impacts of different objectives introduced, we evaluate three variants of PROPHET
as described in Section 4.2: 1) the w/o LCM model adopts a substitute without logical connectives
masking as the pre-training objective, 2) the w/o LSC model is such that it leaves out the logical
structure completion objective, and 3) the w/o LPP model only uses the objectives of connective
masking and structure completion. The results are shown in Table 4.
Based on the ablation studies, we come to the following conclusions. Firstly, all three components
contribute to the performance as removing any one of them causes a performance drop on the average
score. Especially, the average point drops the most as we remove the logical path prediction objective,
which sheds light on the importance of modeling chain-like relations of events. Secondly, we can
see that logical path prediction contributes the most to the reasoning abilities as the performance on
language inference improves the most when we add the sententious connective masking objective
and the task of logical path prediction.
6.2 COMPARISON BETWEEN FACT AND ENTITY-LIKE KNOWLEDGE
We also replace the injected fact with common practice using entity-like knowledge, which is
using named entities. In detail, we change the arguments in facts into named entities recognized
by StanfordCoreNLP,[6] and leave the predicates extracted unchanged, resulting in the form of <
_NE1, predicate, NE2 > (NE stands for named entity). If a fact is not recognized with any named_
entities, we just leave it out.
The results are shown in Table 5. We can see that the performance is hurt a lot, even worse than
vanilla BERT. This is quite intuitive as the number of named entities is far less than our obtained fact,
[6https://stanfordnlp.github.io/CoreNLP/](https://stanfordnlp.github.io/CoreNLP/)
-----
Model _Classification_ _Language Inference_ _Semantic Similarity_ _Avg._
CoLA SST-2 MNLI QNLI RTE MRPC QQP STS-B -
PROPHET 57.0 93.6 85.0/84.1 91.4 69.8 89.2 71.4 86.0 81.0
w/ named entities 50.4 93.2 84.9/84.2 90.8 68.7 88.4 71.0 84.9 79.3
Table 5: Results on GLUE test set when replacing facts with named entities and key the relations
unchanged.
missing a lot of information inherent in the context. Whereas our introduced fact can well capture the
knowledge used in the reasoning process, providing a fundamental reasoning basis.
6.3 ATTENTION MATRIX HEATMAP
We plot the attention matrix in token level to see how our model interprets the context using heatmap
shown in Figure 3.
Figure 3: Heatmap of the attention matrix of vanilla BERT and our implemented PROPHET for the
sentence "However, they met harsh suppression after the Bolthevik government was stabilized.".
Weights are selected from the first head of the last attention layer.
From the figure, we can see that the vanilla BERT attends to delimiters, particularly punctuation
as suggested in Clark et al. (2019). In comparison, our model exhibits quite different attention
distribution. Firstly, our model clearly decreases the influences introduced by punctuation. Secondly,
our model pays more attention to tokens representing discourse-level information, such as "however"
and "after", which is consistent with our motivation. It also well captures the relations of pronouns.
The event characteristics are also illustrated as seen from the "after suppression" phrases.
6.4 EFFECT OF DIFFERENT CONTEXT LENGTH
We group samples into ten subsets according to an equal amount of samples (around 1000 samples
per interval) by context length since the majority of the samples concentrate on the interval of under
60. The statistics of MNLI-matched and MNLI-mismatched dev sets are shown in Table 6. Then we
calculate the accuracy of the baseline and PROPHET per group for both the matched and mismatched
set, as shown in Figure. 4. We observe that the performance of the baseline groups drops dramatically
when encountered with long contexts, especially for those longer than 45 tokens, while our model
performs more robustly on those intervals (the slope of the dashed line is more gentle).
6.5 CASE STUDY
We also give a case study to demonstrate that PROPHET could enhance the reasoning process in
language understanding. Given two sentences, we use PROPHET and BERT-base to predict whether
-----
Table 6: Distribution of context length on dev set of MNLI-matched and MNLI- mismatched dataset.
Dataset [0, 29) [30, 59) [60, 89) [90, 119) [120, 149) [150, 179) [180, 209) [210, 239)
MNLI-matched 39.2% 49.8% 9.6% 1.0% 0.12% 0.12% 0.02% 0.06%
MNLI-mismatched 33.7% 55.0% 9.7% 1.7% 0.3% 0.1% 0.1% 0%
89
88
87
86
85
84
83
82
81
80
87
86.5
86
85.5
85
84.5
84
83.5
83
82.5
82
ours
vanilla BERT
vanilla BERT
ours
(0,16] (16,21] (21,25] (25,29] (29,33] (33,38] (38,44] (44,51] (51,62] (62, ∞)
Context Length
(0,18] (18,24] (24,28] (28,32] (32,36] (36,40] (40,45] (45,51] (51,61] (61, ∞
Context Length
Figure 4: Accuracy of different context length on MNLI-match (left) and MNLI-mismatch (right)
dev set. There are approximate 1000 samples in each intervals.
the sentences are entailed or not. Results are shown in Figure 5. To see the language understanding
ability of our model, we made two subtle changes in the original training sample. Firstly, we change
the entity referred to in the sentence. We can see that PROPHET learns better alignment relations
between entities than BERT-base model. Additionally, we add a negation in the sentence. Although
this change is small, it completely changes the semantic of the sentence, and leads to a reversal of the
ground-truth labels. We can see that PROPHET is good at all the samples given, indicating that it is
not only good at reasoning in language understanding, but also is more robust than baseline models.
|Input Label Prediction|Unchanged|Entity change|Negation|
|---|---|---|---|
||Sentence 1: Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian.. Sentence 2: The French railway company is called SNCF.. not entailment Prophet: not entailment √ BERT-base: not entailment √|Sentence 1: Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian.. Sentence 2: The French railway company is called SBB.. not entailment Prophet: not entailment √ BERT-base: entailment ×|Sentence 1: Note that SBB, CFF and FFS stand out for the main railway company, in German, French and Italian.. Sentence 2: The French railway company is not called SNCF.. entailment Prophet: nentailment √ BERT-base: not entailment ×|
Figure 5: We take an example from RTE dataset, and use PROPHET and BERT-base to predict the
label of the relations among two given sentences.
7 CONCLUSION
In this paper, we leverage fact in a newly pre-trained language model PROPHET to capture logic
relations essentially, in consideration of the fundamental role of PrLM serving in NLP and NLU tasks.
We introduce three novel pre-training tasks and show that PROPHET achieves significant improvement
over various logic reasoning involved NLP and NLU downstream tasks, including language inference,
sentence classification, semantic similarity, and machine reading comprehension. Further analysis
shows that our model can well interpret the inner logical structure of the context to aid the reasoning
process.
-----
REFERENCES
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at?
an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan,
Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with
data flow. arXiv preprint arXiv:2009.08366, 2020.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert
with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
Chadi Helwe, Chloé Clavel, and Fabian M Suchanek. Reasoning with transformer-based models:
Deep learning, but shallow reasoning. In 3rd Conference on Automated Knowledge Base
_Construction, 2021._
Yinya Huang, Meng Fang, Yu Cao, Liwei Wang, and Xiaodan Liang. DAGN: Discourse-aware graph
network for logical reasoning. In NAACL, 2021.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy.
SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the
_Association for Computational Linguistics, 8:64–77, 2020._
Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models:
Birds can talk, but cannot fly. arXiv preprint arXiv:1911.03343, 2019.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
_arXiv:1412.6980, 2014._
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut.
ALBERT: A lite BERT for self-supervised learning of language representations. In International
_[Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=](https://openreview.net/pdf?id=H1eA7AEtvS)_
[H1eA7AEtvS.](https://openreview.net/pdf?id=H1eA7AEtvS)
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension. In Proceedings of the 58th Annual
_Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020._
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: A
challenge dataset for machine reading comprehension with logical reasoning. In Christian Bessiere
(ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence,
_IJCAI-20, pp. 3622–3628. International Joint Conferences on Artificial Intelligence Organization,_
[7 2020. doi: 10.24963/ijcai.2020/501. URL https://doi.org/10.24963/ijcai.2020/501.](https://doi.org/10.24963/ijcai.2020/501)
Main track.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining
Approach. arXiv e-prints, art. arXiv:1907.11692, July 2019.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David
McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd
_annual meeting of the association for computational linguistics: system demonstrations, pp. 55–60,_
2014.
George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):
39–41, 1995.
-----
Ndapandula Nakashole and Tom Mitchell. Language-aware truth assessment of fact candidates. In
_Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume_
_1: Long Papers), pp. 1009–1019, 2014._
Guoshun Nan, Zhijiang Guo, Ivan Sekuli´c, and Wei Lu. Reasoning with latent structure refinement
for document-level relation extraction. arXiv preprint arXiv:2005.06312, 2020.
Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. Fact-driven logical reasoning. arXiv preprint
_arXiv:2105.10334, 2021._
Nina Poerner, Ulli Waltinger, and Hinrich Schütze. Bert is not a knowledge base (yet): Factual
knowledge vs. name-based reasoning in unsupervised qa. arXiv preprint arXiv:1911.03681, 2019.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and
Bonnie L Webber. The penn discourse treebank 2.0. In LREC. Citeseer, 2008.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training. 2018.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21:1–67, 2020.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology: What we know about
how bert works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020.
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi
Chen, Yanbin Zhao, Yuxiang Lu, et al. Ernie 3.0: Large-scale knowledge enhanced pre-training
for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
_processing systems, pp. 5998–6008, 2017._
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue:
A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint
_arXiv:1804.07461, 2018._
Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. Fine-tune bert for
docred with two-step process. arXiv preprint arXiv:1909.11898, 2019.
Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang,
Ming Zhou, et al. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv
_preprint arXiv:2002.01808, 2020._
Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and
Nan Duan. Logic-driven context extension and data augmentation for logical reasoning of text.
_arXiv preprint arXiv:2105.03659, 2021a._
Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian
Tang. Kepler: A unified model for knowledge embedding and pre-trained language representation.
_Transactions of the Association for Computational Linguistics, 9:176–194, 2021b._
Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia:
Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637, 2019.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le.
Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural
_information processing systems, 32, 2019._
Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie
Zhou, and Maosong Sun. Docred: A large-scale document-level relation extraction dataset. arXiv
_preprint arXiv:1906.06127, 2019._
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset
requiring logical reasoning. In International Conference on Learning Representations (ICLR),
April 2020.
-----