---
tags:
- feature-extraction
- endpoints-template
license: bsd-3-clause
library_name: generic
---

# Coreference Resolution for Long Documents

This is a modified version of the coreference resolution model from [BERT for Coreference Resolution: Baselines and Analysis](https://aclanthology.org/D19-1588/) that handles long documents (~40K words) efficiently (500K words/s on an NVIDIA Tesla V100). The checkpoint is based on AllenNLP's coref-spanbert-large-2021.03.10. The modified model was used in [DAPR: A Benchmark on Document-Aware Passage Retrieval](https://arxiv.org/abs/2305.13915).

## Usage

### API call

One can call Hugging Face's Inference Endpoints API directly. This requires an access token from https://huggingface.co/settings/tokens, and loading the model takes around 6 minutes:

```python
import time

import requests

API_URL = "https://api-inference.huggingface.co/models/kwang2049/long-coref"
# Replace the placeholder with your own access token.
headers = {"Authorization": "Bearer ${YOUR_HUGGINGFACE_ACCESS_TOKEN}"}


def query(payload):
    while True:
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code == 503:
            # The model is still loading: report the message, wait, and retry.
            print(response.json()["error"])
            time.sleep(5)
            continue
        elif response.status_code == 200:
            return response.json()
        else:
            error_message = f"{response.status_code}: {response.json()['error']}."
            raise requests.HTTPError(error_message)


doc = [
    "The Half Moon is a public house and music venue in Putney, London. It is one of the city's longest running live music venues, and has hosted live music every night since 1963.",
    "The pub is on the south side of the Lower Richmond road, in the London Borough of Wandsworth."
]

# Paragraphs are joined with a blank line before being sent.
PARAGRAPH_DELIMITER = "\n\n"

output = query(
    {
        "inputs": PARAGRAPH_DELIMITER.join(doc),
    }
)
print(output)
# {
#    'pargraph_sentences': ...,
#    'top_spans': ...,
#    'antecedents': ...
# }
```

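
The `query` helper above retries indefinitely while the endpoint answers with 503. Since loading takes around 6 minutes, a bounded wait may be preferable in scripts. The following is a minimal sketch of that idea, not part of the repo: `poll_until_ready`, its parameters, the ten-minute default, and the `send` callable are all hypothetical names and values chosen for illustration.

```python
import time


def poll_until_ready(send, max_wait_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call send() until it stops returning 503 or max_wait_s has elapsed.

    `send` is a hypothetical zero-argument callable returning
    (status_code, body); it is an assumption made for this sketch.
    """
    deadline = clock() + max_wait_s
    while True:
        status, body = send()
        if status != 503:
            # Any other status (200 or an error) is handed back to the caller.
            return status, body
        if clock() >= deadline:
            raise TimeoutError(f"still loading after {max_wait_s} seconds")
        sleep(interval_s)
```

Here `send` could wrap `requests.post(API_URL, headers=headers, json=payload)` and return `(response.status_code, response.json())`; the caller then decides how to treat non-200 statuses.
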
### Local run

One can also run the code from the repo on a local machine:

```bash
# Clone the repo
git lfs install
git clone https://huggingface.co/kwang2049/long-coref
cd long-coref
# Install the dependencies and run the example script
pip install -r requirements.txt
python local_run.py
```

## Evaluation

The evaluation results on OntoNotes v5 are:

| Model | Precision | Recall | F1 | Input Length |
| --- | --- | --- | --- | --- |
| AllenNLP's original implementation | 79.2 | 78.4 | 78.8 | <= 2K words |
| This modification | 78.9 | 67.0 | 72.4 | <= 40K words |

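
As a quick sanity check on the table, F1 is conventionally the harmonic mean of precision and recall. The sketch below assumes that definition applies directly to the reported scores; coreference benchmarks often average several metrics, so small differences in the last digit are to be expected. `f1_score` is a hypothetical helper, not part of the repo.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores taken from the table above.
print(round(f1_score(79.2, 78.4), 1))  # 78.8
print(round(f1_score(78.9, 67.0), 1))  # 72.5 (the table reports 72.4)
```

The numbers show that the drop in F1 for the long-document variant comes almost entirely from recall, while precision stays close to the original implementation.
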
## Citation

If you use the repo, feel free to cite our publication [DAPR: A Benchmark on Document-Aware Passage Retrieval](https://arxiv.org/abs/2305.13915):

```bibtex
@article{wang2023dapr,
    title = "DAPR: A Benchmark on Document-Aware Passage Retrieval",
    author = "Kexin Wang and Nils Reimers and Iryna Gurevych",
    journal = "arXiv preprint arXiv:2305.13915",
    year = "2023",
    url = "https://arxiv.org/abs/2305.13915",
}
```