michiyasunaga committed
Commit
5b245d3
0 Parent(s):
.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ license: apache-2.0
+ language: en
+ datasets:
+ - wikipedia
+ - bookcorpus
+ tags:
+ - bert
+ - exbert
+ - linkbert
+ - feature-extraction
+ - fill-mask
+ - question-answering
+ - text-classification
+ - token-classification
+ ---
+
+ ## LinkBERT-base
+
+ LinkBERT-base is a model pretrained on English Wikipedia articles along with hyperlink information. It was introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).
+
+
+ ## Model description
+
+ LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing **document links**, such as hyperlinks and citation links, so that it can incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to single documents.
+
+ LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance on general language understanding tasks (e.g. text classification), and is particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).
+
+
+ ## Intended uses & limitations
+
+ The model can be fine-tuned on a downstream task such as question answering, sequence classification, or token classification.
+ You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text).
+
+
+ ### How to use
+
+ To use the model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # Load the LinkBERT-base tokenizer and encoder from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-base')
+ model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-base')
+
+ # Tokenize the input text and run it through the encoder
+ inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+ outputs = model(**inputs)
+ last_hidden_states = outputs.last_hidden_state
+ ```
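+
+ The resulting `last_hidden_states` tensor has shape `(batch_size, sequence_length, 768)`, where 768 is the `hidden_size` declared in this repository's `config.json`; these per-token embeddings can then be pooled (for example, by taking the `[CLS]` token) to obtain a single vector per input.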
+
+ For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or follow any other BERT fine-tuning codebase.
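+
+ Because LinkBERT keeps the standard BERT architecture and tokenizer interface, any Hugging Face `transformers` fine-tuning recipe applies unchanged. The snippet below is a minimal, illustrative sketch only; the task (GLUE SST-2), hyperparameters, and output directory are placeholder assumptions, not settings from the paper or this repository:
+
+ ```python
+ from datasets import load_dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           TrainingArguments, Trainer)
+
+ # Load a placeholder classification dataset and the LinkBERT-base checkpoint
+ dataset = load_dataset("glue", "sst2")
+ tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-base")
+ model = AutoModelForSequenceClassification.from_pretrained("michiyasunaga/LinkBERT-base", num_labels=2)
+
+ # Tokenize the sentences; fixed-length padding keeps the example free of custom collators
+ def tokenize(batch):
+     return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)
+ encoded = dataset.map(tokenize, batched=True)
+
+ # Standard Trainer loop, exactly as one would fine-tune BERT-base
+ args = TrainingArguments(output_dir="linkbert-base-sst2", num_train_epochs=3,
+                          per_device_train_batch_size=32, learning_rate=2e-5)
+ Trainer(model=model, args=args,
+         train_dataset=encoded["train"], eval_dataset=encoded["validation"]).train()
+ ```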
+
+
+ ## Evaluation results
+
+ When fine-tuned on downstream tasks, LinkBERT achieves the following results.
+
+ **General benchmarks ([MRQA](https://github.com/mrqa/MRQA-Shared-Task-2019) and [GLUE](https://gluebenchmark.com/)):**
+
+ |                    | HotpotQA (F1) | TriviaQA (F1) | SearchQA (F1) | NaturalQ (F1) | NewsQA (F1) | SQuAD (F1) | GLUE (avg score) |
+ | ------------------ | ------------- | ------------- | ------------- | ------------- | ----------- | ---------- | ---------------- |
+ | BERT-base          | 76.0          | 70.3          | 74.2          | 76.5          | 65.7        | 88.7       | 79.2             |
+ | **LinkBERT-base**  | **78.2**      | **73.9**      | **76.8**      | **78.3**      | **69.3**    | **90.1**   | **79.6**         |
+ | BERT-large         | 78.1          | 73.7          | 78.3          | 79.0          | 70.9        | 91.1       | 80.7             |
+ | **LinkBERT-large** | **80.8**      | **78.2**      | **80.5**      | **81.0**      | **72.6**    | **92.7**   | **81.1**         |
+
+
+ ## Citation
+
+ If you find LinkBERT useful in your project, please cite the following:
+
+ ```bibtex
+ @InProceedings{yasunaga2022linkbert,
+   author =    {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
+   title =     {LinkBERT: Pretraining Language Models with Document Links},
+   year =      {2022},
+   booktitle = {Association for Computational Linguistics (ACL)},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.9.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 28996
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:750fd38a2e83812c6259263189a27f7e3673f2e9efaec1fb082e06d65619756b
+ size 433334839
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "bert-base-cased", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff