---
tags:
- feature-extraction
- endpoints-template
license: apache-2.0
library_name: generic
---

# Coreference Resolution for Long Documents
A modified coreference resolution model from [BERT for Coreference Resolution: Baselines and Analysis](https://aclanthology.org/D19-1588/) that handles long documents (up to ~40K words) efficiently (500K words/s on an NVIDIA Tesla V100). The checkpoint is based on AllenNLP's coref-spanbert-large-2021.03.10. This modified model was used in [DAPR: A Benchmark on Document-Aware Passage Retrieval](https://arxiv.org/abs/2305.13915).

## Usage
### API call
One can call the Hugging Face Inference Endpoints API directly (this requires an access token from https://huggingface.co/settings/tokens, and the initial model loading takes around 6 minutes):
```python
import requests
import time

API_URL = "https://api-inference.huggingface.co/models/kwang2049/long-coref"
headers = {"Authorization": "Bearer ${YOUR_HUGGINGFACE_ACCESS_TOKEN}"}


def query(payload):
    while True:
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code == 503:
            # The model is still loading; wait and retry.
            print(response.json()["error"])
            time.sleep(5)
            continue
        elif response.status_code == 200:
            return response.json()
        else:
            error_message = f"{response.status_code}: {response.json()['error']}."
            raise requests.HTTPError(error_message)


doc = [
    "The Half Moon is a public house and music venue in Putney, London. It is one of the city's longest running live music venues, and has hosted live music every night since 1963.",
    "The pub is on the south side of the Lower Richmond road, in the London Borough of Wandsworth."
]

PARAGRAPH_DELIMITER = "\n\n"

output = query(
    {
        "inputs": PARAGRAPH_DELIMITER.join(doc),
    }
)
print(output)
# {
#     'pargraph_sentences': ..., 
#     'top_spans': ..., 
#     'antecedents': ...
# }
```
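The returned spans and antecedent links can be post-processed into coreference clusters. Below is a minimal sketch of that grouping step; it assumes `top_spans` is a list of `[start, end]` token spans and that each entry of `antecedents` gives the index of the corresponding span's antecedent (`-1` for none). These field layouts are assumptions for illustration, so check them against the actual response before relying on this.

```python
# Hypothetical sketch: group predicted spans into coreference clusters.
# Assumes top_spans is a list of [start, end] token indices and
# antecedents[i] is the index of span i's antecedent (-1 if none).
def build_clusters(top_spans, antecedents):
    cluster_of = {}  # span index -> cluster id
    clusters = []    # list of clusters, each a list of spans
    for i, ant in enumerate(antecedents):
        if ant == -1:
            continue  # no antecedent: span starts no link here
        if ant in cluster_of:
            cid = cluster_of[ant]
        else:
            # Antecedent seen for the first time: open a new cluster.
            cid = len(clusters)
            clusters.append([top_spans[ant]])
            cluster_of[ant] = cid
        clusters[cid].append(top_spans[i])
        cluster_of[i] = cid
    return clusters

# Toy example: the third span refers back to the first one.
spans = [[0, 2], [5, 6], [10, 10]]
ants = [-1, -1, 0]
print(build_clusters(spans, ants))  # [[[0, 2], [10, 10]]]
```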
### Local run
One can also clone the repo and run the model on a local machine:

```bash
# Clone the repo
git lfs install
git clone https://huggingface.co/kwang2049/long-coref
cd long-coref && pip install -r requirements.txt && python local_run.py
```

## Evaluation
The evaluation results on OntoNotes v5 are:

| Model | Precision | Recall | F1 | Input Length |
| --- | --- | --- | --- | --- |
| AllenNLP's original implementation | 79.2 | 78.4 | 78.8 | <= 2K words |
| This modification | 78.9 | 67.0 | 72.4 | <= 40K words |

## Citation
If you use the repo, feel free to cite our publication [DAPR: A Benchmark on Document-Aware Passage Retrieval](https://arxiv.org/abs/2305.13915):
```bibtex 
@article{wang2023dapr,
    title = "DAPR: A Benchmark on Document-Aware Passage Retrieval",
    author = "Kexin Wang and Nils Reimers and Iryna Gurevych", 
    journal= "arXiv preprint arXiv:2305.13915",
    year = "2023",
    url = "https://arxiv.org/abs/2305.13915",
}
```