---
language:
- ru
- en
- ru-RU
tags:
- xlm-roberta-large
datasets:
- IlyaGusev/headline_cause
license: apache-2.0
widget:
- text: "Песков опроверг свой перевод на удаленку</s>Дмитрий Песков перешел на удаленку"
---

# XLM-RoBERTa HeadlineCause Full

## Model description

This model was trained to predict the presence of causal relations between two headlines. It is intended for the Full task with 7 possible labels: titles are almost the same, A causes B, B causes A, A refutes B, B refutes A, A is linked with B in another way, A is not linked to B. English and Russian are supported.

You can use the hosted inference API to infer a label for a headline pair. To do this, you should separate the headlines with the ```</s>``` token.
For example:
```
Песков опроверг свой перевод на удаленку</s>Дмитрий Песков перешел на удаленку
```
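
You can also call the hosted Inference API over HTTP instead of using the widget. The sketch below follows the standard Hugging Face Inference API request format, which may change over time; `hf_token` is a placeholder for your own access token.

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/IlyaGusev/xlm_roberta_large_headline_cause_full"
hf_token = "hf_..."  # placeholder: your Hugging Face access token

def query_pair(headline_a, headline_b):
    # Join the two headlines with the </s> separator, as in the widget example above
    payload = {"inputs": f"{headline_a}</s>{headline_b}"}
    response = requests.post(API_URL, headers={"Authorization": f"Bearer {hf_token}"}, json=payload)
    return response.json()  # per-class {"label": ..., "score": ...} dicts

print(query_pair("Песков опроверг свой перевод на удаленку", "Дмитрий Песков перешел на удаленку"))
```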

## Intended uses & limitations

#### How to use

```python
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline


def get_batch(data, batch_size):
    # Yield successive batch_size-sized chunks of data
    start_index = 0
    while start_index < len(data):
        end_index = start_index + batch_size
        batch = data[start_index:end_index]
        yield batch
        start_index = end_index


def pipe_predict(data, pipe, batch_size=64):
    # Run the pipeline batch by batch and collect the raw per-class scores
    raw_preds = []
    for batch in tqdm(get_batch(data, batch_size)):
        raw_preds += pipe(batch)
    return raw_preds


MODEL_NAME = TOKENIZER_NAME = "IlyaGusev/xlm_roberta_large_headline_cause_full"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, do_lower_case=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt", return_all_scores=True)
# Each (A, B) tuple is tokenized as a text pair
texts = [
    (
        "Judge issues order to allow indoor worship in NC churches",
        "Some local churches resume indoor services after judge lifted NC governor’s restriction"
    ),
    (
        "Gov. Kevin Stitt defends $2 million purchase of malaria drug touted by Trump",
        "Oklahoma spent $2 million on malaria drug touted by Trump"
    ),
    (
        "Песков опроверг свой перевод на удаленку",
        "Дмитрий Песков перешел на удаленку"
    )
]
pipe_predict(texts, pipe)
```
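
`pipe_predict` returns raw scores for all 7 classes under generic `LABEL_0` ... `LABEL_6` names. Continuing the example above, the helper below picks the most probable class per pair and maps it to a readable name; note that the index-to-name mapping shown here is an assumption based on the label order in the description above, so verify it against the dataset card before relying on it.

```python
# Assumed index-to-name mapping, following the label order in the model
# description above; verify against the HeadlineCause dataset card.
LABEL_NAMES = [
    "titles are almost the same",
    "A causes B",
    "B causes A",
    "A refutes B",
    "B refutes A",
    "A is linked with B in another way",
    "A is not linked to B",
]

def best_label(scores):
    # scores: list of {"label": "LABEL_i", "score": float} dicts for one pair
    top = max(scores, key=lambda item: item["score"])
    return LABEL_NAMES[int(top["label"].split("_")[-1])], top["score"]

for pair, scores in zip(texts, pipe_predict(texts, pipe)):
    print(pair, best_label(scores))
```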

#### Limitations and bias

The models are intended to be used on news headlines. No other limitations are known.

## Training data

* HuggingFace dataset: [IlyaGusev/headline_cause](https://huggingface.co/datasets/IlyaGusev/headline_cause)
* GitHub: [IlyaGusev/HeadlineCause](https://github.com/IlyaGusev/HeadlineCause)

## Training procedure

* Notebook: [HeadlineCause](https://colab.research.google.com/drive/1NAnD0OJ0TnYCJRsHpYUyYkjr_yi8ObcA)
* Stand-alone script: [train.py](https://github.com/IlyaGusev/HeadlineCause/blob/main/headline_cause/train.py)
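
For reference, the sketch below shows the general shape of the fine-tuning setup: `xlm-roberta-large` with a 7-class classification head trained on headline pairs. It uses a tiny in-memory toy dataset and placeholder hyperparameters; the actual data preparation and hyperparameters are in the linked notebook and `train.py`.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=7)

# Toy placeholder data: headline pairs with integer labels in [0, 6]
toy_data = Dataset.from_dict({
    "left_title": ["Oklahoma spent $2 million on malaria drug touted by Trump"],
    "right_title": ["Gov. Kevin Stitt defends $2 million purchase of malaria drug touted by Trump"],
    "label": [2],
})

def tokenize(batch):
    # Encode headline pairs as text pairs, matching the </s>-separated inference format
    return tokenizer(batch["left_title"], batch["right_title"], truncation=True, max_length=96)

toy_data = toy_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="headline_cause_full",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=1e-5,  # placeholder hyperparameters, not the values from train.py
)
trainer = Trainer(model=model, args=args, train_dataset=toy_data, tokenizer=tokenizer)
trainer.train()
```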

## Eval results

Evaluation results can be found in the [arXiv paper](https://arxiv.org/pdf/2108.12626.pdf).

### BibTeX entry and citation info

```bibtex
@misc{gusev2021headlinecause,
    title={HeadlineCause: A Dataset of News Headlines for Detecting Causalities},
    author={Ilya Gusev and Alexey Tikhonov},
    year={2021},
    eprint={2108.12626},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```