qolina commited on
Commit
52aeca4
1 Parent(s): 420f561

add example files

examples/BERT_body.txt ADDED
@@ -0,0 +1 @@
 
 
+ Language model pre-training has been shown to be effective for improving many natural language processing tasks ( Dai and Le , 2015 ; Peters et al. , 2018a ; Radford et al. , 2018 ; Howard and Ruder , 2018 ) . These include sentence-level tasks such as natural language inference ( Bowman et al. , 2015 ; Williams et al. , 2018 ) and paraphrasing ( Dolan and Brockett , 2005 ) , which aim to predict the relationships between sentences by analyzing them holistically , as well as token-level tasks such as named entity recognition and question answering , where models are required to produce fine-grained output at the token level ( Tjong Kim Sang and De Meulder , 2003 ; Rajpurkar et al. , 2016 ) . There are two existing strategies for applying pre-trained language representations to downstream tasks : feature-based and fine-tuning . The feature-based approach , such as ELMo ( Peters et al. , 2018a ) , uses task-specific architectures that include the pre-trained representations as additional features . The fine-tuning approach , such as the Generative Pre-trained Transformer ( OpenAI GPT ) ( Radford et al. , 2018 ) , introduces minimal task-specific parameters , and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters . The two approaches share the same objective function during pre-training , where they use unidirectional language models to learn general language representations . We argue that current techniques restrict the power of the pre-trained representations , especially for the fine-tuning approaches . The major limitation is that standard language models are unidirectional , and this limits the choice of architectures that can be used during pre-training . For example , in OpenAI GPT , the authors use a left-to-right architecture , where every token can only attend to previous tokens in the self-attention layers of the Transformer ( Vaswani et al. , 2017 ) .
Such restrictions are sub-optimal for sentence-level tasks , and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering , where it is crucial to incorporate context from both directions . In this paper , we improve the fine-tuning based approaches by proposing BERT : Bidirectional Encoder Representations from Transformers . BERT alleviates the previously mentioned unidirectionality constraint by using a `` masked language model '' ( MLM ) pre-training objective , inspired by the Cloze task ( Taylor , 1953 ) . The masked language model randomly masks some of the tokens from the input , and the objective is to predict the original vocabulary id of the masked word based only on its context . Unlike left-to-right language model pre-training , the MLM objective enables the representation to fuse the left and the right context , which allows us to pre-train a deep bidirectional Transformer . In addition to the masked language model , we also use a `` next sentence prediction '' task that jointly pre-trains text-pair representations .
arXiv:1810.04805v2 [ cs.CL ] 24 May 2019
The contributions of our paper are as follows :
• We demonstrate the importance of bidirectional pre-training for language representations . Unlike Radford et al. ( 2018 ) , which uses unidirectional language models for pre-training , BERT uses masked language models to enable pre-trained deep bidirectional representations . This is also in contrast to Peters et al. ( 2018a ) , which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs .
• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures . BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks , outperforming many task-specific architectures .
• BERT advances the state of the art for eleven NLP tasks . The code and pre-trained models are available at https://github.com/google-research/bert .
There is a long history of pre-training general language representations , and we briefly review the most widely-used approaches in this section . Learning widely applicable representations of words has been an active area of research for decades , including non-neural ( Brown et al. , 1992 ; Ando and Zhang , 2005 ; Blitzer et al. , 2006 ) and neural ( Pennington et al. , 2014 ) methods . Pre-trained word embeddings are an integral part of modern NLP systems , offering significant improvements over embeddings learned from scratch ( Turian et al. , 2010 ) . To pre-train word embedding vectors , left-to-right language modeling objectives have been used ( Mnih and Hinton , 2009 ) , as well as objectives to discriminate correct from incorrect words in left and right context . These approaches have been generalized to coarser granularities , such as sentence embeddings ( Logeswaran and Lee , 2018 ) or paragraph embeddings ( Le and Mikolov , 2014 ) . To train sentence representations , prior work has used objectives to rank candidate next sentences ( Jernite et al. , 2017 ; Logeswaran and Lee , 2018 ) , left-to-right generation of next sentence words given a representation of the previous sentence , or denoising autoencoder derived objectives ( Hill et al. , 2016 ) . ELMo and its predecessor ( Peters et al. , 2017 , 2018a ) generalize traditional word embedding research along a different dimension . They extract context-sensitive features from a left-to-right and a right-to-left language model . The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations . When integrating contextual word embeddings with existing task-specific architectures , ELMo advances the state of the art for several major NLP benchmarks ( Peters et al.
, 2018a ) including question answering ( Rajpurkar et al. , 2016 ) , sentiment analysis ( Socher et al. , 2013 ) , and named entity recognition ( Tjong Kim Sang and De Meulder , 2003 ) . Melamud et al. ( 2016 ) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs . Similar to ELMo , their model is feature-based and not deeply bidirectional . Fedus et al. ( 2018 ) show that the cloze task can be used to improve the robustness of text generation models . As with the feature-based approaches , the first works in this direction only pre-trained word embedding parameters from unlabeled text ( Collobert and Weston , 2008 ) . More recently , sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task ( Dai and Le , 2015 ; Howard and Ruder , 2018 ; Radford et al. , 2018 ) . The advantage of these approaches is that few parameters need to be learned from scratch . At least partly due to this advantage , OpenAI GPT ( Radford et al. , 2018 ) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark ( Wang et al. , 2018a ) . Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models ( Howard and Ruder , 2018 ; Radford et al. , 2018 ; Dai and Le , 2015 ) .
Figure 1 : Overall pre-training and fine-tuning procedures for BERT . Apart from output layers , the same architectures are used in both pre-training and fine-tuning . The same pre-trained model parameters are used to initialize models for different downstream tasks . During fine-tuning , all parameters are fine-tuned . [ CLS ] is a special symbol added in front of every input example , and [ SEP ] is a special separator token ( e.g. , separating questions/answers ) .
There has also been work showing effective transfer from supervised tasks with large datasets , such as natural language inference ( Conneau et al. , 2017 ) and machine translation ( McCann et al. , 2017 ) . Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models , where an effective recipe is to fine-tune models pre-trained with ImageNet ( Deng et al. , 2009 ; Yosinski et al. , 2014 ) . We introduce BERT and its detailed implementation in this section . There are two steps in our framework : pre-training and fine-tuning . During pre-training , the model is trained on unlabeled data over different pre-training tasks . For fine-tuning , the BERT model is first initialized with the pre-trained parameters , and all of the parameters are fine-tuned using labeled data from the downstream tasks . Each downstream task has separate fine-tuned models , even though they are initialized with the same pre-trained parameters . The question-answering example in Figure 1 will serve as a running example for this section . A distinctive feature of BERT is its unified architecture across different tasks . There is minimal difference between the pre-trained architecture and the final downstream architecture .
Model Architecture BERT 's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. ( 2017 ) and released in the tensor2tensor library .
Because the use of Transformers has become common and our implementation is almost identical to the original , we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. ( 2017 ) as well as excellent guides such as `` The Annotated Transformer . '' In this work , we denote the number of layers ( i.e. , Transformer blocks ) as L , the hidden size as H , and the number of self-attention heads as A . We primarily report results on two model sizes : BERT BASE ( L=12 , H=768 , A=12 , Total Parameters=110M ) and BERT LARGE ( L=24 , H=1024 , A=16 , Total Parameters=340M ) . BERT BASE was chosen to have the same model size as OpenAI GPT for comparison purposes . Critically , however , the BERT Transformer uses bidirectional self-attention , while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left .
Input/Output Representations To make BERT handle a variety of downstream tasks , our input representation is able to unambiguously represent both a single sentence and a pair of sentences ( e.g. , Question , Answer ) in one token sequence . Throughout this work , a `` sentence '' can be an arbitrary span of contiguous text , rather than an actual linguistic sentence . A `` sequence '' refers to the input token sequence to BERT , which may be a single sentence or two sentences packed together . We use WordPiece embeddings ( Wu et al. , 2016 ) with a 30,000 token vocabulary . The first token of every sequence is always a special classification token ( [ CLS ] ) . The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks . Sentence pairs are packed together into a single sequence . We differentiate the sentences in two ways . First , we separate them with a special token ( [ SEP ] ) . Second , we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B .
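The packing scheme just described can be sketched in a few lines of Python. This is only an illustration: the function name `pack_pair` and the use of integer segment ids 0 and 1 for sentences A and B are assumptions made here, not names taken from the paper or from the released code.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Pack one or two token lists into a single BERT-style sequence.

    Returns the token sequence and a parallel list of segment ids
    (0 = sentence A, 1 = sentence B). Hypothetical helper for illustration.
    """
    # Every sequence starts with [CLS]; sentence A is closed by [SEP].
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Sentence B and its closing [SEP] share the B segment id.
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
```

A single sentence is handled by leaving `tokens_b` unset, which yields a sequence whose tokens all carry the A segment id.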
As shown in Figure 1 , we denote input embedding as E , the final hidden vector of the special [ CLS ] token as $C \in \mathbb{R}^{H}$ , and the final hidden vector for the $i$-th input token as $T_i \in \mathbb{R}^{H}$ . For a given token , its input representation is constructed by summing the corresponding token , segment , and position embeddings . A visualization of this construction can be seen in Figure 2 . Unlike Peters et al. ( 2018a ) and Radford et al. ( 2018 ) , we do not use traditional left-to-right or right-to-left language models to pre-train BERT . Instead , we pre-train BERT using two unsupervised tasks , described in this section . This step is presented in the left part of Figure 1 .
Task # 1 : Masked LM Intuitively , it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model . Unfortunately , standard conditional language models can only be trained left-to-right or right-to-left , since bidirectional conditioning would allow each word to indirectly `` see itself '' , and the model could trivially predict the target word in a multi-layered context . ( In the literature , the bidirectional Transformer is often referred to as a `` Transformer encoder '' while the left-context-only version is referred to as a `` Transformer decoder '' since it can be used for text generation . ) In order to train a deep bidirectional representation , we simply mask some percentage of the input tokens at random , and then predict those masked tokens . We refer to this procedure as a `` masked LM '' ( MLM ) , although it is often referred to as a Cloze task in the literature ( Taylor , 1953 ) . In this case , the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary , as in a standard LM . In all of our experiments , we mask 15 % of all WordPiece tokens in each sequence at random . In contrast to denoising auto-encoders ( Vincent et al.
, 2008 ) , we only predict the masked words rather than reconstructing the entire input . Although this allows us to obtain a bidirectional pre-trained model , a downside is that we are creating a mismatch between pre-training and fine-tuning , since the [ MASK ] token does not appear during fine-tuning . To mitigate this , we do not always replace `` masked '' words with the actual [ MASK ] token . The training data generator chooses 15 % of the token positions at random for prediction . If the i-th token is chosen , we replace the i-th token with ( 1 ) the [ MASK ] token 80 % of the time , ( 2 ) a random token 10 % of the time , ( 3 ) the unchanged i-th token 10 % of the time . Then , $T_i$ will be used to predict the original token with cross entropy loss . We compare variations of this procedure in Appendix C.2 . Many important downstream tasks such as Question Answering ( QA ) and Natural Language Inference ( NLI ) are based on understanding the relationship between two sentences , which is not directly captured by language modeling . In order to train a model that understands sentence relationships , we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus . Specifically , when choosing the sentences A and B for each pre-training example , 50 % of the time B is the actual next sentence that follows A ( labeled as IsNext ) , and 50 % of the time it is a random sentence from the corpus ( labeled as NotNext ) . As we show in Figure 1 , C is used for next sentence prediction ( NSP ) . Despite its simplicity , we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI .
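The 80 % / 10 % / 10 % replacement rule for the masked LM described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data generator: the function name `mask_tokens`, the rounding of the 15 % budget, and the choice to never select [CLS] or [SEP] are all assumptions made for this example.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked-LM training.

    labels[i] holds the original token at each position chosen for
    prediction and None elsewhere. Hypothetical helper for illustration.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    # Assumption: special tokens are never chosen for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    budget = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    for i in rng.sample(candidates, budget):
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                # 80 %: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10 %: replace with a random token
        # remaining 10 %: keep the original token unchanged
    return corrupted, labels
```

Because the label is recorded for every chosen position, including the ones left unchanged, the model cannot tell from the input alone which positions it will be asked to predict.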
Figure 2 : BERT input representation . The input embeddings are the sum of the token embeddings , the segmentation embeddings and the position embeddings . Example input : [ CLS ] my dog is cute [ SEP ] he likes play ##ing [ SEP ] , with segment embedding E_A for [ CLS ] my dog is cute [ SEP ] , segment embedding E_B for he likes play ##ing [ SEP ] , and position embeddings E_0 through E_10 .
The NSP task is closely related to representation-learning objectives used in Jernite et al. ( 2017 ) and Logeswaran and Lee ( 2018 ) . However , in prior work , only sentence embeddings are transferred to downstream tasks , where BERT transfers all parameters to initialize end-task model parameters .
Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training . For the pre-training corpus we use the BooksCorpus ( 800M words ) and English Wikipedia ( 2,500M words ) . For Wikipedia we extract only the text passages and ignore lists , tables , and headers . It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark ( Chelba et al. , 2013 ) in order to extract long contiguous sequences . Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks , whether they involve single text or text pairs , by swapping out the appropriate inputs and outputs . For applications involving text pairs , a common pattern is to independently encode text pairs before applying bidirectional cross attention , such as Parikh et al. ( 2016 ) ; Seo et al. ( 2017 ) . BERT instead uses the self-attention mechanism to unify these two stages , as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences .
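The input representation referred to in Figure 2 (the element-wise sum of token, segment, and position embeddings) can be sketched with plain Python lists. The function name `embed` and the tiny hand-written lookup tables are assumptions for illustration only; a real model would use learned tables with H = 768 or 1024 columns.

```python
def embed(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum token, segment, and position embeddings for each input position.

    Each *_table is a list of equal-length vectors indexed by id.
    Hypothetical helper for illustration.
    """
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        # Element-wise sum of the three lookups for this position.
        out.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[pos])
        ])
    return out


# Toy tables with hidden size 2, chosen so the sums are easy to read.
tok_table = [[1, 1], [2, 2]]
seg_table = [[10, 10], [20, 20]]
pos_table = [[100, 100], [200, 200]]
vectors = embed([1, 0], [0, 1], tok_table, seg_table, pos_table)
```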
For each task , we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end . At the input , sentence A and sentence B from pre-training are analogous to ( 1 ) sentence pairs in paraphrasing , ( 2 ) hypothesis-premise pairs in entailment , ( 3 ) question-passage pairs in question answering , and ( 4 ) a degenerate text-∅ pair in text classification or sequence tagging . At the output , the token representations are fed into an output layer for token-level tasks , such as sequence tagging or question answering , and the [ CLS ] representation is fed into an output layer for classification , such as entailment or sentiment analysis . Compared to pre-training , fine-tuning is relatively inexpensive . All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU , or a few hours on a GPU , starting from the exact same pre-trained model . We describe the task-specific details in the corresponding subsections of Section 4 . More details can be found in Appendix A.5 . In this section , we present BERT fine-tuning results on 11 NLP tasks . The General Language Understanding Evaluation ( GLUE ) benchmark ( Wang et al. , 2018a ) is a collection of diverse natural language understanding tasks . Detailed descriptions of GLUE datasets are included in Appendix B.1 . To fine-tune on GLUE , we represent the input sequence ( for single sentence or sentence pairs ) as described in Section 3 , and use the final hidden vector $C \in \mathbb{R}^{H}$ corresponding to the first input token ( [ CLS ] ) as the aggregate representation . The only new parameters introduced during fine-tuning are classification layer weights $W \in \mathbb{R}^{K \times H}$ , where K is the number of labels . We compute a standard classification loss with C and W , i.e. , $\log(\mathrm{softmax}(C W^{T}))$ .
Table 1 : GLUE Test results , scored by the evaluation server ( https://gluebenchmark.com/leaderboard ) . The number below each task denotes the number of training examples .
The `` Average '' column is slightly different than the official GLUE score , since we exclude the problematic WNLI set . BERT and OpenAI GPT are single-model , single task . F1 scores are reported for QQP and MRPC , Spearman correlations are reported for STS-B , and accuracy scores are reported for the other tasks . We exclude entries that use BERT as one of their components .
We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks . For each task , we selected the best fine-tuning learning rate ( among 5e-5 , 4e-5 , 3e-5 , and 2e-5 ) on the Dev set . Additionally , for BERT LARGE we found that fine-tuning was sometimes unstable on small datasets , so we ran several random restarts and selected the best model on the Dev set . With random restarts , we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization . Results are presented in Table 1 . Both BERT BASE and BERT LARGE outperform all systems on all tasks by a substantial margin , obtaining 4.5 % and 7.0 % respective average accuracy improvement over the prior state of the art . Note that BERT BASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking . For the largest and most widely reported GLUE task , MNLI , BERT obtains a 4.6 % absolute accuracy improvement . On the official GLUE leaderboard , BERT LARGE obtains a score of 80.5 , compared to OpenAI GPT , which obtains 72.8 as of the date of writing . We find that BERT LARGE significantly outperforms BERT BASE across all tasks , especially those with very little training data . The effect of model size is explored more thoroughly in Section 5.2 . The Stanford Question Answering Dataset ( SQuAD v1.1 ) is a collection of 100k crowdsourced question/answer pairs ( Rajpurkar et al. , 2016 ) .
Given a question and a passage from Wikipedia containing the answer , the task is to predict the answer text span in the passage . As shown in Figure 1 , in the question answering task , we represent the input question and passage as a single packed sequence , with the question using the A embedding and the passage using the B embedding . We only introduce a start vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$ during fine-tuning . The probability of word i being the start of the answer span is computed as a dot product between $T_i$ and S followed by a softmax over all of the words in the paragraph : $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$ . The analogous formula is used for the end of the answer span . The score of a candidate span from position i to position j is defined as $S \cdot T_i + E \cdot T_j$ , and the maximum scoring span where $j \geq i$ is used as a prediction . The training objective is the sum of the log-likelihoods of the correct start and end positions . We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32 . Table 2 shows top leaderboard entries as well as results from top published systems ( Seo et al. , 2017 ; Clark and Gardner , 2018 ; Peters et al. , 2018a ; Hu et al. , 2018 ) . The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available , and are allowed to use any public data when training their systems . We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA ( Joshi et al. , 2017 ) before fine-tuning on SQuAD . Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system . In fact , our single BERT model outperforms the top ensemble system in terms of F1 score . Without TriviaQA fine-tuning data , we only lose 0.1-0.4 F1 , still outperforming all existing systems by a wide margin .
The TriviaQA data we used consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents , that contain at least one of the provided possible answers . The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph , making the problem more realistic . We use a simple approach to extend the SQuAD v1.1 BERT model for this task . We treat questions that do not have an answer as having an answer span with start and end at the [ CLS ] token . The probability space for the start and end answer span positions is extended to include the position of the [ CLS ] token . For prediction , we compare the score of the no-answer span : $s_{\text{null}} = S \cdot C + E \cdot C$ to the score of the best non-null span $\hat{s}_{i,j} = \max_{j \geq i} ( S \cdot T_i + E \cdot T_j )$ . We predict a non-null answer when $\hat{s}_{i,j} > s_{\text{null}} + \tau$ , where the threshold $\tau$ is selected on the dev set to maximize F1 . We did not use TriviaQA data for this model . We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48 . The results compared to prior leaderboard entries and top published work ( Sun et al. , 2018 ; Wang et al. , 2018b ) are shown in Table 3 , excluding systems that use BERT as one of their components . We observe a +5.1 F1 improvement over the previous best system . The Situations With Adversarial Generations ( SWAG ) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference ( Zellers et al. , 2018 ) . Given a sentence , the task is to choose the most plausible continuation among four choices . When fine-tuning on the SWAG dataset , we construct four input sequences , each containing the concatenation of the given sentence ( sentence A ) and a possible continuation ( sentence B ) .
The only task-specific parameter introduced is a vector whose dot product with the [ CLS ] token representation C denotes a score for each choice , which is normalized with a softmax layer . We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16 . Results are presented in Table 4 . BERT LARGE outperforms the authors ' baseline ESIM+ELMo system by +27.1 % and OpenAI GPT by 8.3 % . In this section , we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance . Additional ablation studies can be found in Appendix C .
Table 5 : Ablation over the pre-training tasks using the BERT BASE architecture . `` No NSP '' is trained without the next sentence prediction task . `` LTR & No NSP '' is trained as a left-to-right LM without the next sentence prediction , like OpenAI GPT . `` + BiLSTM '' adds a randomly initialized BiLSTM on top of the `` LTR & No NSP '' model during fine-tuning .
We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data , fine-tuning scheme , and hyperparameters as BERT BASE : No NSP : A bidirectional model which is trained using the `` masked LM '' ( MLM ) but without the `` next sentence prediction '' ( NSP ) task . LTR & No NSP : A left-context-only model which is trained using a standard Left-to-Right ( LTR ) LM , rather than an MLM . The left-only constraint was also applied at fine-tuning , because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance . Additionally , this model was pre-trained without the NSP task . This is directly comparable to OpenAI GPT , but using our larger training dataset , our input representation , and our fine-tuning scheme . We first examine the impact brought by the NSP task . In Table 5 , we show that removing NSP hurts performance significantly on QNLI , MNLI , and SQuAD 1.1 .
Next , we evaluate the impact of training bidirectional representations by comparing `` No NSP '' to `` LTR & No NSP '' . The LTR model performs worse than the MLM model on all tasks , with large drops on MRPC and SQuAD . For SQuAD it is intuitively clear that an LTR model will perform poorly at token predictions , since the token-level hidden states have no right-side context . In order to make a good faith attempt at strengthening the LTR system , we added a randomly initialized BiLSTM on top . This does significantly improve results on SQuAD , but the results are still far worse than those of the pre-trained bidirectional models . The BiLSTM hurts performance on the GLUE tasks . We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models , as ELMo does . However : ( a ) this is twice as expensive as a single bidirectional model ; ( b ) this is non-intuitive for tasks like QA , since the RTL model would not be able to condition the answer on the question ; ( c ) this is strictly less powerful than a deep bidirectional model , since the latter can use both left and right context at every layer . In this section , we explore the effect of model size on fine-tuning task accuracy . We trained a number of BERT models with a differing number of layers , hidden units , and attention heads , while otherwise using the same hyperparameters and training procedure as described previously . Results on selected GLUE tasks are shown in Table 6 . In this table , we report the average Dev Set accuracy from 5 random restarts of fine-tuning . We can see that larger models lead to a strict accuracy improvement across all four datasets , even for MRPC which only has 3,600 labeled training examples , and is substantially different from the pre-training tasks .
It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature . For example , the largest Transformer explored in Vaswani et al. ( 2017 ) is ( L=6 , H=1024 , A=16 ) with 100M parameters for the encoder , and the largest Transformer we have found in the literature is ( L=64 , H=512 , A=2 ) with 235M parameters ( Al-Rfou et al. , 2018 ) . By contrast , BERT BASE contains 110M parameters and BERT LARGE contains 340M parameters . It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling , which is demonstrated by the LM perplexity of held-out training data shown in Table 6 . However , we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks , provided that the model has been sufficiently pre-trained . Peters et al. ( 2018b ) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and Melamud et al. ( 2016 ) mentioned in passing that increasing hidden dimension size from 200 to 600 helped , but increasing further to 1,000 did not bring further improvements . Both of these prior works used a feature-based approach ; we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters , the task-specific models can benefit from the larger , more expressive pre-trained representations even when downstream task data is very small . All of the BERT results presented so far have used the fine-tuning approach , where a simple classification layer is added to the pre-trained model , and all parameters are jointly fine-tuned on a downstream task .
However , the feature-based approach , where fixed features are extracted from the pre-trained model , has certain advantages . First , not all tasks can be easily represented by a Transformer encoder architecture , and therefore require a task-specific model architecture to be added . Second , there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation . In this section , we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition ( NER ) task ( Tjong Kim Sang and De Meulder , 2003 ) . In the input to BERT , we use a case-preserving WordPiece model , and we include the maximal document context provided by the data . Following standard practice , we formulate this as a tagging task but do not use a CRF layer in the output . We use the representation of the first sub-token as the input to the token-level classifier over the NER label set .
Table 6 : Ablation over BERT model size . # L = the number of layers ; # H = hidden size ; # A = number of attention heads . `` LM ( ppl ) '' is the masked LM perplexity of held-out training data .
Table 7 ( partial ) :
System | Dev F1 | Test F1
ELMo ( Peters et al. , 2018a ) | 95.7 | 92.2
CVT | - | 92.6
CSE ( Akbik et al. , 2018 ) |
To ablate the fine-tuning approach , we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT . These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer . Results are presented in Table 7 . BERT LARGE performs competitively with state-of-the-art methods . The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer , which is only 0.3 F1 behind fine-tuning the entire model . This demonstrates that BERT is effective for both fine-tuning and feature-based approaches .
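The best feature-based variant mentioned above concatenates each token's representations from the top four hidden layers. A minimal sketch of that concatenation is shown below; the function name `concat_top_layers`, the nested-list layout `hidden_states[layer][token]`, and the bottom-to-top ordering inside the concatenated vector are assumptions made for this example.

```python
def concat_top_layers(hidden_states, num_layers=4):
    """Concatenate each token's vectors from the last `num_layers` layers.

    hidden_states[l][t] is the hidden vector (list of floats) of token t
    at layer l, with the last entry being the top layer.
    Hypothetical helper for illustration.
    """
    top = hidden_states[-num_layers:]
    num_tokens = len(top[0])
    # For each token, flatten its vectors across the selected layers.
    return [
        [x for layer in top for x in layer[t]]
        for t in range(num_tokens)
    ]


# Toy example: 6 layers, 2 tokens, hidden size 3; layer l is filled with l.
hidden_states = [[[float(l)] * 3 for _ in range(2)] for l in range(6)]
features = concat_top_layers(hidden_states)
```

With hidden size H, each token's feature vector has length 4H, and no BERT parameter is updated when these features feed a separate classifier.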
Recent empirical improvements due to transfer learning with language models have demonstrated that rich , unsupervised pre-training is an integral part of many language understanding systems . In particular , these results enable even low-resource tasks to benefit from deep unidirectional architectures . Our major contribution is further generalizing these findings to deep bidirectional architectures , allowing the same pre-trained model to successfully tackle a broad set of NLP tasks .
Masked LM and the Masking Procedure
Assuming the unlabeled sentence is my dog is hairy , and during the random masking procedure we chose the 4-th token ( which corresponds to hairy ) , our masking procedure can be further illustrated by :
• 80 % of the time : Replace the word with the [ MASK ] token , e.g. , my dog is hairy → my dog is [ MASK ]
• 10 % of the time : Replace the word with a random word , e.g. , my dog is hairy → my dog is apple
• 10 % of the time : Keep the word unchanged , e.g. , my dog is hairy → my dog is hairy . The purpose of this is to bias the representation towards the actual observed word .
The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words , so it is forced to keep a distributional contextual representation of every input token . Additionally , because random replacement only occurs for 1.5 % of all tokens ( i.e. , 10 % of 15 % ) , this does not seem to harm the model 's language understanding capability . In Section C.2 , we evaluate the impact of this procedure . Compared to standard language model training , the masked LM only makes predictions on 15 % of tokens in each batch , which suggests that more pre-training steps may be required for the model to converge . In Section C.1 we demonstrate that MLM does converge marginally slower than a left-to-right model ( which predicts every token ) , but the empirical improvements of the MLM model far outweigh the increased training cost . 
Next Sentence Prediction
The next sentence prediction task can be illustrated in the following examples . To generate each training input sequence , we sample two spans of text from the corpus , which we refer to as `` sentences '' even though they are typically much longer than single sentences ( but can be shorter also ) . The first sentence receives the A embedding and the second receives the B embedding . 50 % of the time B is the actual next sentence that follows A and 50 % of the time it is a random sentence , which is done for the `` next sentence prediction '' task . They are sampled such that the combined length is ≤ 512 tokens . The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15 % , and no special consideration is given to partial word pieces . We train with a batch size of 256 sequences ( 256 sequences * 512 tokens = 128,000 tokens/batch ) for 1,000,000 steps , which is approximately 40 epochs over the 3.3 billion word corpus . We use Adam with a learning rate of 1e-4 , β 1 = 0.9 , β 2 = 0.999 , L2 weight decay of 0.01 , learning rate warmup over the first 10,000 steps , and linear decay of the learning rate . We use a dropout probability of 0.1 on all layers . We use a gelu activation ( Hendrycks and Gimpel , 2016 ) rather than the standard relu , following OpenAI GPT . The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood . Training of BERT BASE was performed on 4 Cloud TPUs in Pod configuration ( 16 TPU chips total ) . Training of BERT LARGE was performed on 16 Cloud TPUs ( 64 TPU chips total ) . Each pretraining took 4 days to complete . Longer sequences are disproportionately expensive because attention is quadratic in the sequence length . To speed up pre-training in our experiments , we pre-train the model with a sequence length of 128 for 90 % of the steps . 
Then , we train the remaining 10 % of the steps with a sequence length of 512 to learn the positional embeddings . For fine-tuning , most model hyperparameters are the same as in pre-training , with the exception of the batch size , learning rate , and number of training epochs . The dropout probability was always kept at 0.1 . The optimal hyperparameter values are task-specific , but we found the following range of possible values to work well across all tasks :
• Batch size : 16 , 32
• Learning rate ( Adam ) : 5e-5 , 3e-5 , 2e-5
• Number of epochs : 2 , 3 , 4
We also observed that large data sets ( e.g. , 100k+ labeled training examples ) were far less sensitive to hyperparameter choice than small data sets . Fine-tuning is typically very fast , so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set .
Here we study the differences among recent popular representation learning models , including ELMo , OpenAI GPT and BERT . The comparisons between the model architectures are shown visually in Figure 3 . Note that in addition to the architecture differences , BERT and OpenAI GPT are fine-tuning approaches , while ELMo is a feature-based approach . The most comparable existing pre-training method to BERT is OpenAI GPT , which trains a left-to-right Transformer LM on a large text corpus . In fact , many of the design decisions in BERT were intentionally made to make it as close to GPT as possible so that the two methods could be minimally compared . The core argument of this work is that the bi-directionality and the two pre-training tasks presented in Section 3.1 account for the majority of the empirical improvements , but we do note that there are several other differences between how BERT and GPT were trained :
• GPT is trained on the BooksCorpus ( 800M words ) ; BERT is trained on the BooksCorpus ( 800M words ) and Wikipedia ( 2,500M words ) . 
• GPT was trained for 1M steps with a batch size of 32,000 words ; BERT was trained for 1M steps with a batch size of 128,000 words .
• GPT used the same learning rate of 5e-5 for all fine-tuning experiments ; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set .
To isolate the effect of these differences , we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the two pre-training tasks and the bidirectionality they enable . The illustration of fine-tuning BERT on different tasks can be seen in Figure 4 . Our task-specific models are formed by incorporating BERT with one additional output layer , so a minimal number of parameters need to be learned from scratch . Among the tasks , ...
Our GLUE results in Table 1 are obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised . The GLUE benchmark includes the following datasets , the descriptions of which were originally summarized in Wang et al . ( 2018a ) :
MNLI Multi-Genre Natural Language Inference is a large-scale , crowdsourced entailment classification task ( Williams et al. , 2018 ) . Given a pair of sentences , the goal is to predict whether the second sentence is an entailment , contradiction , or neutral with respect to the first one .
QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent .
QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset ( Rajpurkar et al. , 2016 ) which has been converted to a binary classification task ( Wang et al. , 2018a ) . The positive examples are ( question , sentence ) pairs which do contain the correct answer , and the negative examples are ( question , sentence ) pairs from the same paragraph which do not contain the answer .
SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment ( Socher et al. , 2013 ) .
CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task , where the goal is to predict whether an English sentence is linguistically `` acceptable '' or not ( Warstadt et al. , 2018 ) .
STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources ( Cer et al. , 2017 ) . They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning .
MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources , with human annotations for whether the sentences in the pair are semantically equivalent ( Dolan and Brockett , 2005 ) .
RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI , but with much less training data ( Bentivogli et al. , 2009 ) .
WNLI Winograd NLI is a small natural language inference dataset ( Levesque et al. , 2011 ) . The GLUE webpage notes that there are issues with the construction of this dataset , and every trained system that 's been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class . We therefore exclude this set to be fair to OpenAI GPT . For our GLUE submission , we always predicted the majority class . 
C.1 Effect of Number of Training Steps
Figure 5 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for k steps . This allows us to answer the following questions :
1 . Question : Does BERT really need such a large amount of pre-training ( 128,000 words/batch * 1,000,000 steps ) to achieve high fine-tuning accuracy ? Answer : Yes , BERT BASE achieves almost 1.0 % additional accuracy on MNLI when trained for 1M steps compared to 500k steps .
2 . Question : Does MLM pre-training converge slower than LTR pre-training , since only 15 % of words are predicted in each batch rather than every word ? Answer : The MLM model does converge slightly slower than the LTR model . However , in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately .
In Section 3.1 , we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the masked language model ( MLM ) objective . The following is an ablation study to evaluate the effect of different masking strategies . Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning , as the [ MASK ] symbol never appears during the fine-tuning stage . We report the Dev results for both MNLI and NER . For NER , we report both fine-tuning and feature-based approaches , as we expect the mismatch will be amplified for the feature-based approach as the model will not have the chance to adjust the representations . The results are presented in Table 8 . In the table , MASK means that we replace the target token with the [ MASK ] symbol for MLM ; SAME means that we keep the target token as is ; RND means that we replace the target token with another random token . The numbers in the left part of the table represent the probabilities of the specific strategies used during MLM pre-training ( BERT uses 80 % , 10 % , 10 % ) . The right part of the table represents the Dev set results . 
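The MASK / SAME / RND strategies above can be made concrete with a short Python sketch . It is a simplified illustration , not the procedure used to produce Table 8 : the function name and arguments are hypothetical , each position is selected independently with probability 0.15 ( an assumption ; the text only states a uniform 15 % masking rate ) , and the random replacement is drawn from a toy vocabulary , so it may occasionally coincide with the original token .

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    rng: random.Random,
    select_prob: float = 0.15,
    p_mask: float = 0.8,
    p_same: float = 0.1,
) -> Tuple[List[str], List[Optional[str]]]:
    """Corrupt a token sequence for masked-LM training.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict and None everywhere else.
    The remaining probability mass (1 - p_mask - p_same) is the RND strategy.
    """
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected as a prediction target
        labels[i] = token
        r = rng.random()
        if r < p_mask:
            corrupted[i] = MASK  # MASK strategy
        elif r < p_mask + p_same:
            pass  # SAME strategy: keep the observed token
        else:
            corrupted[i] = rng.choice(vocab)  # RND strategy
    return corrupted, labels


# Toy usage with a hypothetical three-word vocabulary.
rng = random.Random(0)
corrupted, labels = mask_tokens(
    ["my", "dog", "is", "hairy"], vocab=["apple", "run", "blue"], rng=rng
)
```

Changing `p_mask` and `p_same` reproduces the other rows of the ablation in spirit ; for instance `p_mask=1.0, p_same=0.0` corresponds to the MASK-only strategy .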
For the feature-based approach , we concatenate the last 4 layers of BERT as the features , which was shown to be the best approach in Section 5.3 . From the table it can be seen that fine-tuning is surprisingly robust to different masking strategies . However , as expected , using only the MASK strategy was problematic when applying the feature-based approach to NER . Interestingly , using only the RND strategy performs much worse than our strategy as well .
https://github.com/tensorflow/tensor2tensor
http://nlp.seas.harvard.edu/2018/04/03/attention.html
In all cases we set the feed-forward/filter size to be 4H , i.e. , 3072 for the H = 768 and 4096 for the H = 1024 .
The final model achieves 97 % -98 % accuracy on NSP .
The vector C is not a meaningful sentence representation without fine-tuning , since it was trained with NSP .
For example , the BERT SQuAD model can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0 % .
See ( 10 ) in https://gluebenchmark.com/faq .
The GLUE data set distribution does not include the Test labels , and we only made a single GLUE evaluation server submission for each of BERT BASE and BERT LARGE .
https://gluebenchmark.com/leaderboard
QANet is described in Yu et al . ( 2018 ) , but the system has improved substantially after publication .
https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-globalavailability.html
Note that we only report single-task fine-tuning results in this paper . A multitask fine-tuning approach could potentially push the performance even further . For example , we did observe substantial improvements on RTE from multitask training with MNLI .
https://gluebenchmark.com/faq
examples/N18-3011_body.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ The goal of this work is to facilitate algorithmic discovery in the scientific literature . Despite notable advances in scientific search engines , data mining and digital libraries ( e.g. , Wu et al. , 2014 ) , researchers remain unable to answer simple questions such as : What is the percentage of female subjects in depression clinical trials ? Which of my co-authors published one or more papers on coreference resolution ? Which papers discuss the effects of Ranibizumab on the Retina ? In this paper , we focus on the problem of extracting structured data from scientific documents , which can later be used in natural language interfaces ( e.g. , Iyer et al. , 2017 ) or to improve ranking of results in academic search ( e.g. , Xiong et al. , 2017 ) . Figure 1 : Part of the literature graph . We describe methods used in a scalable deployed production system for extracting structured information from scientific documents into the literature graph ( see Fig . 1 ) . The literature graph is a directed property graph which summarizes key information in the literature and can be used to answer the queries mentioned earlier as well as more complex queries . For example , in order to compute the Erdős number of an author X , the graph can be queried to find the number of nodes on the shortest undirected path between author X and Paul Erdős such that all edges on the path are labeled `` authored '' . We reduce literature graph construction to familiar NLP tasks such as sequence labeling , entity linking and relation extraction , and address some of the impractical assumptions commonly made in the standard formulations of these tasks . For example , most research on named entity recognition tasks reports results on large labeled datasets such as CoNLL-2003 and ACE-2005 ( e.g. , Lample et al. , 2016 ) , and assumes that entity types in the test set match those labeled in the training set ( including work on domain adaptation , e.g. , Daumé , 2007 ) . 
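The Erdős-number query mentioned above can be illustrated with a breadth-first search over `` authored '' edges treated as undirected . The sketch below is an assumption-laden toy , not the query engine of the deployed system : the function name and the edge-list input format are hypothetical , and the Erdős number is taken to be the number of paper nodes on the shortest author-to-author path ( i.e. , half the number of edges ) , which is one reading of the description in the text .

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Set, Tuple

Node = Tuple[str, str]  # (node type, identifier), e.g. ("author", "X")


def erdos_number(
    authored_edges: Iterable[Tuple[str, str]], source: str, target: str
) -> Optional[int]:
    """Collaboration distance between two authors, or None if unconnected.

    authored_edges holds (author, paper) pairs; each is one 'authored' edge.
    """
    adjacency: Dict[Node, Set[Node]] = defaultdict(set)
    for author, paper in authored_edges:
        a, p = ("author", author), ("paper", paper)
        adjacency[a].add(p)  # follow the edge in both directions,
        adjacency[p].add(a)  # i.e. treat it as undirected

    start, goal = ("author", source), ("author", target)
    if start == goal:
        return 0
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                if neighbour == goal:
                    # Paths alternate author, paper, author, ...;
                    # two edges correspond to one co-authorship hop.
                    return distance[neighbour] // 2
                queue.append(neighbour)
    return None


# Toy graph with made-up authors and papers.
edges = [
    ("Paul Erdős", "p1"), ("A", "p1"),
    ("A", "p2"), ("X", "p2"),
    ("Y", "p3"),
]
```

With these toy edges , author A shares a paper with Paul Erdős and author X shares a paper with A , so X is two hops away , while Y is not connected at all .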
These assumptions , while useful for developing and benchmarking new methods , are unrealistic for many domains and applications . The paper also serves as an overview of the approach we adopt at www.semanticscholar.org in a step towards more intelligent academic search engines ( Etzioni , 2011 ) . In the next section , we start by describing our symbolic representation of the literature . Then , we discuss how we extract metadata associated with a paper such as authors and references , then how we extract the entities mentioned in paper text . Before we conclude , we briefly describe other research challenges we are actively working on in order to improve the quality of the literature graph . The literature graph is a property graph with directed edges . Unlike Resource Description Framework ( RDF ) graphs , nodes and edges in property graphs have an internal structure which is more suitable for representing complex data types such as papers and entities . In this section , we describe the attributes associated with nodes and edges of different types in the literature graph . Papers . We obtain metadata and PDF files of papers via partnerships with publishers ( e.g. , Springer , Nature ) , catalogs ( e.g. , DBLP , MEDLINE ) , pre-publishing services ( e.g. , arXiv , bioRxiv ) , as well as web-crawling . Paper nodes are associated with a set of attributes such as 'title ' , 'abstract ' , 'full text ' , 'venues ' and 'publication year ' . While some of the paper sources provide these attributes as metadata , it is often necessary to extract them from the paper PDF ( details in §3 ) . We deterministically remove duplicate papers based on string similarity of their metadata , resulting in 37M unique paper nodes . Papers in the literature graph cover a variety of scientific disciplines , including computer science , molecular biology , microbiology and neuroscience . Authors . 
Each node of this type represents a unique author , with attributes such as 'first name ' and 'last name ' . The literature graph has 12M nodes of this type . Entities . Each node of this type represents a unique scientific concept discussed in the literature , with attributes such as 'canonical name ' , 'aliases ' and 'description ' . Our literature graph has 0.4M nodes of this type . We describe how we populate entity nodes in §4.3 . Entity mentions . Each node of this type represents a textual reference of an entity in one of the papers , with attributes such as 'mention text ' , 'context ' , and 'confidence ' . We describe how we populate the 237M mentions in the literature graph in §4.1 . Citations . We instantiate a directed citation edge between paper nodes , p 1 → p 2 , for each p 2 referenced in p 1 . Citation edges have attributes such as 'from paper id ' , 'to paper id ' and 'contexts ' ( the textual contexts where p 2 is referenced in p 1 ) . While some of the paper sources provide these attributes as metadata , it is often necessary to extract them from the paper PDF as detailed in §3 . Authorship . We instantiate a directed authorship edge between an author node and a paper node , a → p , for each author of that paper . Entity linking edges . We instantiate a directed edge from an extracted entity mention node to the entity it refers to . Mention-mention relations . We instantiate a directed edge between a pair of mentions in the same sentential context if the textual relation extraction model predicts one of a predefined list of relation types between them . We encode a symmetric relation between m 1 and m 2 as two directed edges m 1 → m 2 and m 2 → m 1 . Entity-entity relations . While mention-mention edges represent relations between mentions in a particular context , entity-entity edges represent relations between abstract entities . 
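The node and edge types listed above map naturally onto a small property-graph data structure . The following Python sketch is illustrative only : the class and method names are hypothetical and say nothing about how the production graph is stored . It shows nodes and edges carrying attribute dictionaries , and a symmetric relation encoded as two directed edges .

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: str
    node_type: str  # e.g. "paper", "author", "entity", "mention"
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str  # node_id of the tail
    target: str  # node_id of the head
    edge_type: str  # e.g. "citation", "authorship", "entity_link"
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, edge_type: str, **attributes: Any) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, edge_type, dict(attributes)))

    def add_symmetric_relation(self, m1: str, m2: str, relation: str) -> None:
        """Encode a symmetric relation as two directed edges."""
        self.add_edge(m1, m2, relation)
        self.add_edge(m2, m1, relation)


# Toy usage with made-up identifiers and attribute values.
graph = PropertyGraph()
graph.add_node(Node("p1", "paper", {"title": "Paper one"}))
graph.add_node(Node("p2", "paper", {"title": "Paper two"}))
graph.add_node(Node("m1", "mention", {"mention text": "ILP"}))
graph.add_node(Node("m2", "mention", {"mention text": "logic programs"}))
graph.add_edge("p1", "p2", "citation", contexts=["... as shown in [2] ..."])
graph.add_symmetric_relation("m1", "m2", "related_to")
```

Keeping attributes on both nodes and edges is what separates this layout from a plain triple store : the citation edge above carries its own list of textual contexts .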
These relations may be imported from an existing knowledge base ( KB ) or inferred from other edges in the graph . In the previous section , we described the overall structure of the literature graph . Next , we discuss how we populate paper nodes , author nodes , authorship edges , and citation edges . Although some publishers provide sufficient metadata about their papers , many papers are provided with incomplete metadata . Also , papers obtained via web-crawling are not associated with any metadata . To fill in this gap , we built the ScienceParse system to predict structured data from the raw PDFs using recurrent neural networks ( RNNs ) . For each paper , the system extracts the paper title , list of authors , and list of references ; each reference consists of a title , a list of authors , a venue , and a year . Preparing the input layer . We split each PDF into individual pages , and feed each page to Apache 's PDFBox library to convert it into a sequence of tokens , where each token has features , e.g. , 'text ' , 'font size ' , 'space width ' , 'position on the page ' . We normalize the token-level features before feeding them as inputs to the model . For each of the 'font size ' and 'space width ' features , we compute three normalized values ( with respect to current page , current document , and the whole training corpus ) , each value ranging between -0.5 and +0.5 . The token 's 'position on the page ' is given in XY coordinate points . We scale the values linearly to range from ( -0.5 , -0.5 ) at the top-left corner of the page to ( 0.5 , 0.5 ) at the bottom-right corner . In order to capture case information , we add seven numeric features to the input representation of each token : whether the first/second letter is uppercase/lowercase , the fraction of uppercase/lowercase letters and the fraction of digits . To help the model make correct predictions for metadata which tend to appear at the beginning ( e.g. 
, titles and authors ) or at the end of papers ( e.g. , references ) , we provide the current page number as two discrete variables ( relative to the beginning and end of the PDF file ) with values 0 , 1 and 2+ . These features are repeated for each token on the same page . For the k-th token in the sequence , we compute the input representation $i_k$ by concatenating the numeric features , an embedding of the 'font size ' , and the word embedding of the lowercased token . Word embeddings are initialized with GloVe ( Pennington et al. , 2014 ) . Model . The input token representations are passed through one fully-connected layer and then through a two-layer bidirectional LSTM :
$$\overrightarrow{g}_k = \mathrm{LSTM}(W i_k, \overrightarrow{g}_{k-1}), \qquad g_k = [\overrightarrow{g}_k ; \overleftarrow{g}_k],$$
$$\overrightarrow{h}_k = \mathrm{LSTM}(g_k, \overrightarrow{h}_{k-1}), \qquad h_k = [\overrightarrow{h}_k ; \overleftarrow{h}_k],$$
where $W$ is a weight matrix , and $\overleftarrow{g}_k$ and $\overleftarrow{h}_k$ are defined similarly to $\overrightarrow{g}_k$ and $\overrightarrow{h}_k$ but process token sequences in the opposite direction . Following Collobert et al . ( 2011 ) , we feed the output of the second layer $h_k$ into a dense layer to predict unnormalized label weights for each token and learn label bigram feature weights ( often described as a conditional random field layer when used in neural architectures ) to account for dependencies between labels . Training . The ScienceParse system is trained on a snapshot of the data at PubMed Central . It consists of 1.4M PDFs and their associated metadata , which specify the correct titles , authors , and bibliographies . We use a heuristic labeling process that finds the strings from the metadata in the tokenized PDFs to produce labeled tokens . This labeling process succeeds for 76 % of the documents . The remaining documents are not used in the training process . During training , we only use pages which have at least one token with a label that is not `` none '' . Decoding . At test time , we use Viterbi decoding to find the most likely global sequence , with no further constraints . To get the title , we use the longest continuous sequence of tokens with the `` title '' label . 
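The title rule just described ( take the longest continuous run of tokens labeled `` title '' ) is easy to state in code . The sketch below is a minimal illustration with a hypothetical function name and toy labels ; it assumes the per-token labels have already been produced by the decoder , and it breaks ties in favour of the earliest run , a choice the text does not specify .

```python
from itertools import groupby
from typing import List, Sequence


def longest_labeled_span(
    tokens: Sequence[str], labels: Sequence[str], target: str = "title"
) -> List[str]:
    """Return the longest contiguous run of tokens whose label equals target.

    Ties are resolved in favour of the earliest run (an assumption).
    """
    best: List[str] = []
    # groupby splits the sequence into maximal runs of identical labels.
    for label, run in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        run_tokens = [token for _, token in run]
        if label == target and len(run_tokens) > len(best):
            best = run_tokens
    return best


# Toy decoder output: two separate "title" runs, of length 1 and 3.
tokens = ["Review", ":", "Efficacy", "of", "catheters", "Jane", "Doe"]
labels = ["title", "none", "title", "title", "title", "author", "author"]
title = longest_labeled_span(tokens, labels)
```

On this toy input the shorter run `Review` is discarded in favour of the three-token run , which mirrors how a prefix separated from the main title by an unlabeled token would be dropped .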
Since there can be multiple authors , we use all continuous sequences of tokens with the `` author '' label as authors , but require that all authors of a paper are mentioned on the same page . If the author labels are predicted on multiple pages , we use the page with the largest number of authors . Results . We run our final tests on a held-out set from PubMed Central , consisting of about 54K documents . The results are detailed in Table 1 . We use a conservative evaluation where an instance is correct if it exactly matches the gold annotation , with no credit for partial matching . To give an example of the type of errors our model makes , consider the paper ( Wang et al. , 2013 ) titled `` Clinical review : Efficacy of antimicrobial-impregnated catheters in external ventricular drainage - a systematic review and meta-analysis . '' The title we extract for this paper omits the first part `` Clinical review : '' . This is likely to be a result of the pattern `` Foo : Bar Baz '' appearing in many training examples with only `` Bar Baz '' labeled as the title . In the previous section , we described how we populate the backbone of the literature graph , i.e. , paper nodes , author nodes and citation edges . Next , we discuss how we populate mentions and entities in the literature graph using entity extraction and linking on the paper text . In order to focus on more salient entities in a given paper , we only use the title and abstract . We experiment with three approaches for entity extraction and linking :
I . Statistical : uses one or more statistical models for predicting mention spans , then uses another statistical model to link mentions to candidate entities in a KB .
II . Hybrid : defines a small number of hand-engineered , deterministic rules for string-based matching of the input text to candidate entities in the KB , then uses a statistical model to disambiguate the mentions .
III . 
Off-the-shelf : uses existing libraries , namely TagMe ( Ferragina and Scaiella , 2010 ) and MetaMap Lite ( Demner-Fushman et al. , 2017 ) , with minimal post-processing to extract and link entities to the KB .
Table 2 : Document-level evaluation of three approaches in two scientific areas : computer science ( CS ) and biomedical ( Bio ) .
We evaluate the performance of each approach in two broad scientific areas : computer science ( CS ) and biomedical research ( Bio ) . For each unique ( paper ID , entity ID ) pair predicted by one of the approaches , we ask human annotators to label each mention extracted for this entity in the paper . We use CrowdFlower to manage human annotations and only include instances where three or more annotators agree on the label . If one or more of the entity mentions in that paper is judged to be correct , the pair ( paper ID , entity ID ) counts as one correct instance . Otherwise , it counts as an incorrect instance . We report 'yield ' in lieu of 'recall ' due to the difficulty of doing a scalable comprehensive annotation . Table 2 shows the results based on 500 papers using v1.1.2 of our entity extraction and linking components . In both domains , the statistical approach gives the highest precision and the lowest yield . The hybrid approach consistently gives the highest yield , but sacrifices precision . The TagMe off-the-shelf library used for the CS domain gives surprisingly good results , with precision within 1 point of the statistical models . However , the MetaMap Lite off-the-shelf library we used for the biomedical domain suffered a huge loss in precision . Our error analysis showed that each of the approaches is able to predict entities not predicted by the other approaches , so we decided to pool their outputs in our deployed system , which gives significantly higher yield than any individual approach while maintaining reasonably high precision . 
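The document-level counting rule above ( a ( paper ID , entity ID ) pair is correct if at least one of its mentions is judged correct ) can be written down directly . The Python sketch below uses a hypothetical function name and made-up judgments . Note one explicit assumption : the text does not define 'yield ' precisely , so here it is taken to be the number of correct pairs .

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (paper ID, entity ID)


def document_level_scores(judgments: Dict[Pair, List[bool]]) -> Tuple[float, int]:
    """Return (precision, yield) over predicted (paper, entity) pairs.

    judgments maps each predicted pair to the per-mention annotator verdicts.
    A pair is correct when any of its mentions was judged correct.
    Yield is assumed here to be the count of correct pairs.
    """
    correct = sum(1 for verdicts in judgments.values() if any(verdicts))
    total = len(judgments)
    precision = correct / total if total else 0.0
    return precision, correct


# Made-up judgments for four predicted pairs.
toy_judgments = {
    ("paper-1", "entity-a"): [False, True],   # one good mention is enough
    ("paper-1", "entity-b"): [False],
    ("paper-2", "entity-a"): [True, True],
    ("paper-2", "entity-c"): [False, False],
}
precision, yield_count = document_level_scores(toy_judgments)
```

Pooling the outputs of several approaches , as done in the deployed system , amounts in this sketch to merging their judgment dictionaries before scoring .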
Given the token sequence $t_1, \ldots, t_N$ in a sentence , we need to identify spans which correspond to entity mentions . We use the BILOU scheme to encode labels at the token level . Unlike most formulations of named entity recognition ( NER ) problems , we do not identify the entity type ( e.g. , protein , drug , chemical , disease ) for each mention , since the output mentions are further grounded in a KB with further information about the entity ( including its type ) , using an entity linking module . Model . First , we construct the token embedding $x_k = [c_k ; w_k]$ for each token $t_k$ in the input sequence , where $c_k$ is a character-based representation computed using a convolutional neural network ( CNN ) with a filter of size 3 characters , and $w_k$ are learned word embeddings initialized with the GloVe embeddings ( Pennington et al. , 2014 ) . We also compute context-sensitive word embeddings , denoted as $\mathrm{lm}_k = [\overrightarrow{\mathrm{lm}}_k ; \overleftarrow{\mathrm{lm}}_k]$ , by concatenating the projected outputs of forward and backward recurrent neural network language models ( RNN-LM ) at position $k$ . The language model ( LM ) for each direction is trained independently and consists of a single-layer long short-term memory ( LSTM ) network followed by a linear projection layer . While training the LM parameters , $\overrightarrow{\mathrm{lm}}_k$ is used to predict $t_{k+1}$ and $\overleftarrow{\mathrm{lm}}_k$ is used to predict $t_{k-1}$ . We fix the LM parameters during training of the entity extraction model . See and for more details . Given the $x_k$ and $\mathrm{lm}_k$ embeddings for each token $k \in \{1, \ldots, N\}$ , we use a two-layer bidirectional LSTM to encode the sequence with $x_k$ and $\mathrm{lm}_k$ feeding into the first and second layer , respectively . That is ,
$$\overrightarrow{g}_k = \mathrm{LSTM}(x_k, \overrightarrow{g}_{k-1}), \qquad g_k = [\overrightarrow{g}_k ; \overleftarrow{g}_k],$$
$$\overrightarrow{h}_k = \mathrm{LSTM}([g_k ; \mathrm{lm}_k], \overrightarrow{h}_{k-1}), \qquad h_k = [\overrightarrow{h}_k ; \overleftarrow{h}_k],$$
where $\overleftarrow{g}_k$ and $\overleftarrow{h}_k$ are defined similarly to $\overrightarrow{g}_k$ and $\overrightarrow{h}_k$ but process token sequences in the opposite direction . 
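For readers unfamiliar with the BILOU scheme used for the token-level labels above , the following Python sketch converts mention spans into tags : B ( begin ) , I ( inside ) , L ( last ) , O ( outside ) and U ( unit-length ) . The function name and the span format ( token offsets with exclusive end ) are hypothetical choices made for illustration . Because the model does not predict entity types , the tags carry no type suffix .

```python
from typing import List, Sequence, Tuple


def bilou_encode(n_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Encode non-overlapping mention spans as untyped BILOU tags.

    Each span is (start, end) in token offsets, with end exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span ({start}, {end})")
        if end - start == 1:
            tags[start] = "U"  # single-token mention
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "L"
    return tags


# Toy sentence of six tokens with a one-token and a three-token mention.
tags = bilou_encode(6, [(0, 1), (2, 5)])
```

Compared with a plain BIO encoding , the explicit L and U tags give the bigram label weights more to work with , since the end of a mention becomes its own label .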
Similar to the model described in §3 , we feed the output of the second LSTM into a dense layer to predict unnormalized label weights for each token and learn label bigram feature weights to account for dependencies between labels . Results . We use the standard data splits of the SemEval-2017 Task 10 on entity ( and relation ) extraction from scientific papers ( Augenstein et al. , 2017 ) . Table 3 compares three variants of our entity extraction model . The first line omits the LM embeddings $\mathrm{lm}_k$ , while the second line is the full model ( including LM embeddings ) showing a large improvement of 4.2 F1 points . The third line shows that creating an ensemble of 15 models further improves the results by 1.1 F1 points .
Table 3 : Results of the entity extraction model on the development set of SemEval-2017 Task 10 .
Description | F1
Without LM | 49.9
With LM | 54.1
Avg . of 15 models with LM | 55.2
Model instances . In the deployed system , we use three instances of the entity extraction model with a similar architecture , but trained on different datasets . Two instances are trained on the BC5CDR ( Li et al. , 2016 ) and the CHEMDNER datasets ( Krallinger et al. , 2015 ) to extract key entity mentions in the biomedical domain such as diseases , drugs and chemical compounds . The third instance is trained on mention labels induced from Wikipedia articles in the computer science domain . The outputs of all model instances are pooled together and combined with the rule-based entity extraction module , then fed into the entity linking model ( described below ) . In this section , we describe the construction of entity nodes and entity-entity edges . Unlike other knowledge extraction systems such as the Never-Ending Language Learner ( NELL ) and OpenIE 4 , we use existing knowledge bases ( KBs ) of entities to reduce the burden of identifying coherent concepts . Grounding the entity mentions in a manually-curated KB also increases user confidence in automated predictions . 
We use two KBs : UMLS : The UMLS metathesaurus integrates information about concepts in specialized ontologies in several biomedical domains , and is funded by the U.S. National Library of Medicine . DBpedia : DBpedia provides access to structured information in Wikipedia . Rather than including all Wikipedia pages , we used a short list of Wikipedia categories about CS and included all pages up to depth four in their trees in order to exclude irrelevant entities , e.g. , `` Lord of the Rings '' in DBpedia . Given a text span s identified by the entity extraction model in §4.2 ( or with heuristics ) and a reference KB , the goal of the entity linking model is to associate the span with the entity it refers to . A span and its surrounding words are collectively referred to as a mention . We first identify a set of candidate entities that a given mention may refer to . Then , we rank the candidate entities based on a score computed using a neural model trained on labeled data . For example , given the string `` . . . database of facts , an ILP system will . . . `` , the entity extraction model identifies the span `` ILP '' as a possible entity and the entity linking model associates it with `` Inductive_Logic_Programming '' as the referent entity ( from among other candidates like `` Integer_Linear_Programming '' or `` Instruction-level_Parallelism '' ) . Datasets . We used two datasets : i ) a biomedical dataset formed by combining MSH ( Jimeno-Yepes et al. , 2011 ) and BC5CDR ( Li et al. , 2016 ) with UMLS as the reference KB , and ii ) a CS dataset we curated using Wikipedia articles about CS concepts with DBpedia as the reference KB . Candidate selection . In a preprocessing step , we build an index which maps any token used in a labeled mention or an entity name in the KB to associated entity IDs , along with the frequency this token is associated with that entity . This is similar to the index used in previous entity linking systems ( e.g. , Bhagavatula et al. 
, 2015 ) to estimate the probability that a given mention refers to an entity . At train and test time , we use this index to find candidate entities for a given mention by looking up the tokens in the mention . This method also serves as our baseline in Table 4 by selecting the entity with the highest frequency for a given mention . Scoring candidates . Given a mention ( m ) and a candidate entity ( e ) , the neural model constructs a vector encoding of the mention and the entity . We encode the mention and entity using the functions f and g , respectively , as follows : $f(m) = [ v_{m.surface} ; \mathrm{avg}( v_{m.lc} , v_{m.rc} ) ]$ , $g(e) = [ v_{e.name} ; v_{e.def} ]$ , where m.surface , m.lc and m.rc are the mention 's surface form , left and right contexts , and e.name and e.def are the candidate entity 's name and definition , respectively . $v_{text}$ is a bag-of-words sum encoder for text . We use the same encoder for the mention surface form and the candidate name , and another encoder for the mention contexts and entity definition . Additionally , we include numerical features to estimate the confidence of a candidate entity based on the statistics collected in the index described earlier . We compute two scores based on the word overlap of ( i ) mention 's context and candidate 's definition and ( ii ) mention 's surface span and the candidate entity 's name . Finally , we feed the concatenation of the cosine similarity between $f(m)$ and $g(e)$ and the intersection-based scores into an affine transformation followed by a sigmoid nonlinearity to compute the final score for the pair ( m , e ) . Results . We use the Bag of Concepts F1 metric ( Ling et al. , 2015 ) for comparison . Table 4 compares the performance of the most-frequent-entity baseline and our neural model described above . Table 4 : The Bag of Concepts F1 score of the baseline and neural model on the two curated datasets . In the previous sections , we discussed how we construct the main components of the literature graph . 
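The candidate selection and scoring steps described for entity linking can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the Jaccard form of the word-overlap scores, the dictionary layout of mentions and entities, and the hand-set weights are hypothetical, and the index-based numerical confidence features are left out for brevity.

```python
import math
from collections import Counter, defaultdict


def build_index(labeled_mentions, kb_names):
    """Map each token to a Counter of {entity_id: frequency} (hypothetical helper)."""
    index = defaultdict(Counter)
    for text, entity_id in labeled_mentions:      # tokens seen in labeled mentions
        for token in text.lower().split():
            index[token][entity_id] += 1
    for entity_id, name in kb_names.items():      # tokens seen in KB entity names
        for token in name.lower().split():
            index[token][entity_id] += 1
    return index


def find_candidates(index, mention_text):
    """Look up every mention token and add up the per-entity frequencies."""
    counts = Counter()
    for token in mention_text.lower().split():
        counts.update(index.get(token, Counter()))
    return counts


def most_frequent_baseline(index, mention_text):
    """Baseline: pick the candidate with the highest frequency, or None."""
    counts = find_candidates(index, mention_text)
    return counts.most_common(1)[0][0] if counts else None


def bow_sum(text, embeddings, dim):
    """Bag-of-words sum encoder; unknown tokens contribute a zero vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        for i, x in enumerate(embeddings.get(token, [0.0] * dim)):
            vec[i] += x
    return vec


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap(a, b):
    """Word-overlap score; the Jaccard form is an assumed choice."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def score(mention, entity, name_emb, ctx_emb, dim, weights, bias):
    """Sigmoid of an affine map over [cosine(f(m), g(e)), two overlap scores]."""
    lc = bow_sum(mention["lc"], ctx_emb, dim)
    rc = bow_sum(mention["rc"], ctx_emb, dim)
    avg_ctx = [(l + r) / 2 for l, r in zip(lc, rc)]
    # List concatenation plays the role of [ . ; . ] in f(m) and g(e).
    f_m = bow_sum(mention["surface"], name_emb, dim) + avg_ctx
    g_e = bow_sum(entity["name"], name_emb, dim) + bow_sum(entity["def"], ctx_emb, dim)
    features = [
        cosine(f_m, g_e),
        overlap(mention["lc"] + " " + mention["rc"], entity["def"]),
        overlap(mention["surface"], entity["name"]),
    ]
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

In a trained system the weights, bias and both embedding tables would be learned from labeled data; here they are plain arguments so the data flow stays visible.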
In this section , we briefly describe several other related challenges we are actively working on . Author disambiguation . Despite initiatives to have global author IDs such as ORCID and ResearcherID , most publishers provide author information as names ( e.g. , arXiv ) . However , author names cannot be used as a unique identifier since several people often share the same name . Moreover , different venues and sources use different conventions in reporting the author names , e.g. , `` first initial , last name '' vs. `` last name , first name '' . Inspired by Culotta et al . ( 2007 ) , we train a supervised binary classifier for merging pairs of author instances and use it to incrementally create author clusters . We only consider merging two author instances if they have the same last name and share the first initial . If the first name is spelled out ( rather than abbreviated ) in both author instances , we also require that the first name matches . Ontology matching . Popular concepts are often represented in multiple KBs . For example , the concept of `` artificial neural networks '' is represented as entity ID D016571 in the MESH ontology , and represented as page ID '21523' in DBpedia . Ontology matching is the problem of identifying semantically-equivalent entities across KBs or ontologies . Limited KB coverage . The convenience of grounding entities in a hand-curated KB comes at the cost of limited coverage . Introduction of new concepts and relations in the scientific literature occurs at a faster pace than KB curation , resulting in a large gap in KB coverage of scientific concepts . In order to close this gap , we need to develop models which can predict textual relations as well as detailed concept descriptions in scientific papers . For the same reasons , we also need to augment the relations imported from the KB with relations extracted from text . Our approach to address both entity and relation coverage is based on distant supervision ( Mintz et al. 
, 2009 ) . In short , we train two models for identifying entity definitions and relations expressed in natural language in scientific documents , and automatically generate labeled data for training these models using known definitions and relations in the KB . We note that the literature graph currently lacks coverage for important entity types ( e.g. , affiliations ) and domains ( e.g. , physics ) . Covering affiliations requires small modifications to the metadata extraction model followed by an algorithm for matching author names with their affiliations . In order to cover additional scientific domains , more agreements need to be signed with publishers . Figure and table extraction . Non-textual components such as charts , diagrams and tables provide key information in many scientific documents , but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction . In Siegel et al . ( 2018 ) , we induced high-quality training labels for the task of figure extraction in a large number of scientific documents , with no human intervention . To accomplish this we leveraged the auxiliary data provided in two large web collections of scientific documents ( arXiv and PubMed ) to locate figures and their associated captions in the rasterized PDF . We use the resulting dataset to train a deep neural network for end-to-end figure detection , yielding a model that can be more easily extended to new domains compared to previous work . Understanding and predicting citations . The citation edges in the literature graph provide a wealth of information ( e.g. , at what rate a paper is being cited and whether it is accelerating ) , and open the door for further research to better understand and predict citations . 
For example , in order to allow users to better understand what impact a paper had and effectively navigate its citations , we experimented with methods for classifying a citation as important or incidental , as well as more fine-grained classes ( Valenzuela et al. , 2015 ) . The citation information also enables us to develop models for estimating the potential of a paper or an author . In Weihs and Etzioni ( 2017 ) , we predict citation-based metrics such as an author 's h-index and the citation rate of a paper in the future . Also related is the problem of predicting which papers should be cited in a given draft ( Bhagavatula et al. , 2018 ) , which can help improve the quality of a paper draft before it is submitted for peer review , or be used to supplement the list of references after a paper is published . In this paper , we discuss the construction of a graph , providing a symbolic representation of the scientific literature . We describe deployed models for identifying authors , references and entities in the paper text , and provide experimental results to evaluate the performance of each model . Three research directions follow from this work and other similar projects , e.g. , Hahn-Powell et al . ( 2017 ) ; Wu et al . ( 2014 ) : i ) improving quality and enriching content of the literature graph ( e.g. , ontology matching and knowledge base population ) ; ii ) aggregating domain-specific extractions across many papers to enable a better understanding of the literature as a whole ( e.g. , identifying demographic biases in clinical trial participants and summarizing empirical results on important tasks ) ; and iii ) exploring the literature via natural language interfaces . 
In order to help future research efforts , we make the following resources publicly available : metadata for over 20 million papers , a meaningful citations dataset , models for figure and table extraction , models for predicting citations in a paper draft and models for extracting paper metadata , among other resources . Due to space constraints , we opted not to discuss our relation extraction models in this draft . The ScienceParse libraries can be found at http://allenai.org/software/ . https://pdfbox.apache.org . We also experimented with a `` pure '' rules-based approach which disambiguates deterministically but the hybrid approach consistently gave better results . The TagMe APIs are described at https://sobigdata.d4science.org/web/tagme/tagme-help . We use v3.4 ( L0 ) of MetaMap Lite , available at https://metamap.nlm.nih.gov/MetaMapLite.shtml . http://rtw.ml.cmu.edu/rtw/ . https://github.com/allenai/openie-standalone . Variants of this problem are also known as deduplication or record linkage .
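The pre-filter from the author disambiguation passage (same last name, shared first initial, and matching first names when both are spelled out) can be written as a small predicate. The sketch below is only an illustration: the name `may_merge`, the `(first, last)` tuple layout, and the case-insensitive comparison are assumptions, and the supervised pairwise classifier that makes the actual merge decision is not shown.

```python
def may_merge(author_a, author_b):
    """Return True if two author instances are eligible for a merge decision.

    Each author is a (first_name, last_name) tuple; a first name may be
    abbreviated to an initial such as "W." (hypothetical input layout).
    """
    first_a, last_a = author_a
    first_b, last_b = author_b

    # Rule 1: the last names must be the same.
    if last_a.strip().lower() != last_b.strip().lower():
        return False

    # Normalize first names: drop a trailing period and ignore case.
    fa = first_a.strip().rstrip(".").lower()
    fb = first_b.strip().rstrip(".").lower()
    if not fa or not fb:
        return False

    # Rule 2: the first initials must be shared.
    if fa[0] != fb[0]:
        return False

    # Rule 3: if both first names are spelled out, they must match exactly.
    if len(fa) > 1 and len(fb) > 1:
        return fa == fb

    return True
```

Pairs that pass this check would then be handed to the classifier, so the predicate only narrows the set of comparisons and never merges anything by itself.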
examples/N18-3011_ref.txt ADDED
@@ -0,0 +1,27 @@
1
+ Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).
2
+ Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. McCallum. 2017. Semeval 2017 task 10 (scienceie): Extracting keyphrases and relations from scientific publications. In ACL workshop (SemEval).
3
+ Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-based citation recommendation. In NAACL.
4
+ Chandra Bhagavatula, Thanapon Noraset, and Doug Downey. 2015. TabEL: entity linking in web tables. In ISWC.
5
+ Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. In JMLR.
6
+ Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew D. McCallum. 2007. Author disambiguation using error-driven machine learning with a ranking loss function. In IIWeb Workshop.
7
+ Hal Daumé. 2007. Frustratingly easy domain adaptation. In ACL.
8
+ Dina Demner-Fushman, Willie J. Rogers, and Alan R. Aronson. 2017. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. In JAMIA.
9
+ Oren Etzioni. 2011. Search needs a shake-up. Nature 476 7358:25-6.
10
+ Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In CIKM.
11
+ Gus Hahn-Powell, Marco Antonio Valenzuela-Escarcega, and Mihai Surdeanu. 2017. Swanson linking revisited: Accelerating literature-based discovery across domains using a conceptual influence graph. In ACL.
12
+ Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation.
13
+ Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke S. Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In ACL.
14
+ Antonio J. Jimeno-Yepes, Bridget T. McInnes, and Alan R. Aronson. 2011. Exploiting mesh indexing in medline to generate a data set for word sense disambiguation. BMC bioinformatics 12(1):223.
15
+ Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia. 2015. CHEMDNER: The drugs and chemical names extraction challenge. In J. Cheminformatics.
16
+ Guillaume Lample, Miguel Ballesteros, Sandeep K Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
17
+ Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database: the journal of biological databases and curation 2016.
18
+ Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics 3:315-328.
19
+ Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.
20
+ Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
21
+ Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
22
+ Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting scientific figures with distantly supervised neural networks. In JCDL.
23
+ Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying meaningful citations. In AAAI Workshop (Scholarly Big Data).
24
+ Xiang Wang, Yan Dong, Xiang qian Qi, Yi-Ming Li, Cheng-Guang Huang, and Lijun Hou. 2013. Clinical review: Efficacy of antimicrobial-impregnated catheters in external ventricular drainage - a systematic review and meta-analysis. In Critical care.
25
+ Luca Weihs and Oren Etzioni. 2017. Learning to predict citation-based impact measures. In JCDL.
26
+ Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander Ororbia, Douglas Jordan, and C. Lee Giles. 2014. CiteSeerX: AI in a digital library search engine. In AAAI.
27
+ Chenyan Xiong, Russell Power, and Jamie Callan. 2017. Explicit semantic ranking for academic search via knowledge graph embedding. In WWW.