---
language: 
- ru
- en
- ru-RU
tags:
- xlm-roberta-large
datasets:
- IlyaGusev/headline_cause
license: apache-2.0
widget:
- text: "Песков опроверг свой перевод на удаленку</s>Дмитрий Песков перешел на удаленку"
---

# XLM-RoBERTa HeadlineCause Full

## Model description

This model was trained to predict the presence of causal relations between two headlines. It covers the Full task with 7 possible labels: titles are almost the same, A causes B, B causes A, A refutes B, B refutes A, A is linked with B in another way, A is not linked to B. English and Russian are supported.
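The mapping from class indices to these label names is stored in the checkpoint's config, so it can be inspected directly; a minimal sketch (nothing here is assumed beyond the model id above):

```python
from transformers import AutoConfig

# Read the id -> label mapping straight from the published checkpoint.
config = AutoConfig.from_pretrained("IlyaGusev/xlm_roberta_large_headline_cause_full")
print(config.id2label)  # seven entries, one per class
```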

You can use the hosted inference API to infer a label for a headline pair. To do this, separate the two headlines with the `</s>` token.
For example:
```
Песков опроверг свой перевод на удаленку</s>Дмитрий Песков перешел на удаленку
```
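
The same pair can also be sent over HTTP; a minimal sketch, assuming the standard Inference API endpoint pattern and a valid access token (`YOUR_HF_TOKEN` is a placeholder):

```python
import requests

# Hypothetical sketch of calling the hosted Inference API over HTTP.
# The endpoint pattern and {"inputs": ...} payload follow the standard
# Hugging Face Inference API; YOUR_HF_TOKEN is a placeholder.
API_URL = "https://api-inference.huggingface.co/models/IlyaGusev/xlm_roberta_large_headline_cause_full"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
payload = {"inputs": "Песков опроверг свой перевод на удаленку</s>Дмитрий Песков перешел на удаленку"}
print(requests.post(API_URL, headers=headers, json=payload).json())
```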

## Intended uses & limitations

#### How to use

```python
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline


def get_batch(data, batch_size):
    # Yield consecutive batch_size-sized slices of the input list.
    start_index = 0
    while start_index < len(data):
        end_index = start_index + batch_size
        yield data[start_index:end_index]
        start_index = end_index


def pipe_predict(data, pipe, batch_size=64):
    # Run the pipeline batch by batch; each item is a (text_a, text_b) pair.
    # Recent transformers versions expect text pairs as {"text", "text_pair"}
    # dicts, so convert the tuples here.
    raw_preds = []
    for batch in tqdm(get_batch(data, batch_size)):
        batch = [{"text": first, "text_pair": second} for first, second in batch]
        raw_preds += pipe(batch)
    return raw_preds


MODEL_NAME = TOKENIZER_NAME = "IlyaGusev/xlm_roberta_large_headline_cause_full"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, do_lower_case=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()
# return_all_scores=True makes the pipeline return scores for all 7 classes.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt", return_all_scores=True)
texts = [
    (
        "Judge issues order to allow indoor worship in NC churches",
        "Some local churches resume indoor services after judge lifted NC governor’s restriction"
    ),
    (
        "Gov. Kevin Stitt defends $2 million purchase of malaria drug touted by Trump",
        "Oklahoma spent $2 million on malaria drug touted by Trump"
    ),
    (
        "Песков опроверг свой перевод на удаленку",
        "Дмитрий Песков перешел на удаленку"
    )
]
pipe_predict(texts, pipe)
```
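
With `return_all_scores=True`, each prediction comes back as a list of `{"label", "score"}` dicts covering all seven classes. A short post-processing sketch, picking the top-scoring label per pair:

```python
preds = pipe_predict(texts, pipe)
for (text_a, text_b), scores in zip(texts, preds):
    # `scores` holds one dict per class; the prediction is the argmax.
    best = max(scores, key=lambda item: item["score"])
    print(f"{text_a} | {text_b} -> {best['label']} ({best['score']:.3f})")
```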

#### Limitations and bias

The model is intended to be used on news headlines. No other limitations are known.

## Training data

* HuggingFace dataset: [IlyaGusev/headline_cause](https://huggingface.co/datasets/IlyaGusev/headline_cause) (a loading sketch follows below)
* GitHub: [IlyaGusev/HeadlineCause](https://github.com/IlyaGusev/HeadlineCause)
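
For quick experiments the dataset can be loaded directly with `datasets`; a sketch, assuming the Full-task configuration names `en_full` and `ru_full` (check the dataset card if they differ):

```python
from datasets import load_dataset

# Config names are assumptions based on the Full-task naming; verify on the dataset page.
en_full = load_dataset("IlyaGusev/headline_cause", "en_full")
ru_full = load_dataset("IlyaGusev/headline_cause", "ru_full")
print(en_full["train"][0])
```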

## Training procedure

* Notebook: [HeadlineCause](https://colab.research.google.com/drive/1NAnD0OJ0TnYCJRsHpYUyYkjr_yi8ObcA)
* Stand-alone script: [train.py](https://github.com/IlyaGusev/HeadlineCause/blob/main/headline_cause/train.py)

## Eval results

Evaluation results can be found in the [arXiv paper](https://arxiv.org/pdf/2108.12626.pdf).

### BibTeX entry and citation info

```bibtex
@misc{gusev2021headlinecause,
      title={HeadlineCause: A Dataset of News Headlines for Detecting Causalities}, 
      author={Ilya Gusev and Alexey Tikhonov},
      year={2021},
      eprint={2108.12626},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```