	---
	language:
	- ru
	- en
	- ru-RU
	tags:
	- xlm-roberta-large
	datasets:
	- IlyaGusev/headline_cause
	license: apache-2.0
	---

	# XLM-RoBERTa HeadlineCause Full

	## Model description

	This model was trained to detect causal relations between two news headlines. It targets the Full task, which has 7 possible labels: the titles are almost the same, A causes B, B causes A, A refutes B, B refutes A, A is linked with B in another way, and A is not linked to B. English and Russian are supported.

	You can use the hosted Inference API to infer a label for a headline pair. To do this, separate the two headlines with the ```</s>``` token.
	For example:
	```
	Песков опроверг свой перевод на удаленку</s>Дмитрий Песков перешел на удаленку
	```
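If you build these query strings in code, a small helper keeps the separator consistent. This is a minimal sketch: `join_headlines` and `SEP_TOKEN` are hypothetical names introduced here for illustration and are not part of the model repository. Stripping surrounding whitespace is also an assumption, not something the model card requires.

```python
SEP_TOKEN = "</s>"  # separator token named in the model card


def join_headlines(first: str, second: str, sep: str = SEP_TOKEN) -> str:
    """Join two headlines into one string, separated by `sep`.

    Hypothetical helper: whitespace around each headline is stripped
    (an assumption) so the separator sits directly between the texts.
    """
    return f"{first.strip()}{sep}{second.strip()}"


query = join_headlines(
    "Песков опроверг свой перевод на удаленку",
    "Дмитрий Песков перешел на удаленку",
)
print(query)
```

The resulting string has the same form as the example above and can be pasted into the hosted inference widget.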

	## Intended uses & limitations

	#### How to use

	```python
	from tqdm.notebook import tqdm
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline


	def get_batch(data, batch_size):
	    """Yield consecutive slices of `data` with at most `batch_size` items."""
	    start_index = 0
	    while start_index < len(data):
	        end_index = start_index + batch_size
	        batch = data[start_index:end_index]
	        yield batch
	        start_index = end_index


	def pipe_predict(data, pipe, batch_size=64):
	    """Run the pipeline batch by batch and collect the raw predictions."""
	    raw_preds = []
	    for batch in tqdm(get_batch(data, batch_size)):
	        raw_preds += pipe(batch)
	    return raw_preds


	MODEL_NAME = TOKENIZER_NAME = "IlyaGusev/xlm_roberta_large_headline_cause_full"
	tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, do_lower_case=False)
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
	model.eval()  # inference mode: disables dropout
	# return_all_scores=True returns a score for every label, not only the best one
	pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt", return_all_scores=True)

	# Each item is a (headline A, headline B) pair
	texts = [
	    (
	        "Judge issues order to allow indoor worship in NC churches",
	        "Some local churches resume indoor services after judge lifted NC governor’s restriction"
	    ),
	    (
	        "Gov. Kevin Stitt defends $2 million purchase of malaria drug touted by Trump",
	        "Oklahoma spent $2 million on malaria drug touted by Trump"
	    ),
	    (
	        "Песков опроверг свой перевод на удаленку",
	        "Дмитрий Песков перешел на удаленку"
	    )
	]
	pipe_predict(texts, pipe)
	```
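With `return_all_scores=True`, the pipeline returns one list of `{"label": ..., "score": ...}` dicts per headline pair, so `pipe_predict` yields a list of such lists. The sketch below shows one way to reduce that output to the single best label per pair. `top_labels` is a hypothetical helper, and the `LABEL_*` names and scores in the mock data are placeholders for illustration; the label names this model actually reports come from its configuration and are not listed here.

```python
from typing import Dict, List, Tuple, Union

ScoreDict = Dict[str, Union[str, float]]


def top_labels(raw_preds: List[List[ScoreDict]]) -> List[Tuple[str, float]]:
    """Return (label, score) for the highest-scoring label of each input."""
    results = []
    for scores in raw_preds:
        # Pick the dict with the largest "score" among all labels
        best = max(scores, key=lambda item: item["score"])
        results.append((best["label"], best["score"]))
    return results


# Mock output in the shape produced with return_all_scores=True.
# Label names and scores are placeholders, not real model predictions.
mock_preds = [
    [{"label": "LABEL_0", "score": 0.3}, {"label": "LABEL_1", "score": 0.7}],
    [{"label": "LABEL_0", "score": 0.6}, {"label": "LABEL_1", "score": 0.4}],
]
best = top_labels(mock_preds)
print(best)
```

In real use, `mock_preds` would be replaced by the value returned from `pipe_predict(texts, pipe)`.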

	#### Limitations and bias

	This model is intended to be used on news headlines. No other limitations are known.

	## Training data

	* HuggingFace dataset: [IlyaGusev/headline_cause](https://huggingface.co/datasets/IlyaGusev/headline_cause)
	* GitHub: [IlyaGusev/HeadlineCause](https://github.com/IlyaGusev/HeadlineCause)

	## Training procedure

	* Notebook: [HeadlineCause](https://colab.research.google.com/drive/1NAnD0OJ0TnYCJRsHpYUyYkjr_yi8ObcA)
	* Stand-alone script: [train.py](https://github.com/IlyaGusev/HeadlineCause/blob/main/headline_cause/train.py)

	## Eval results

	Evaluation results can be found in the [arXiv paper](https://arxiv.org/pdf/2108.12626.pdf).

	### BibTeX entry and citation info

	```bibtex
	@misc{gusev2021headlinecause,
	  title={HeadlineCause: A Dataset of News Headlines for Detecting Causalities},
	  author={Ilya Gusev and Alexey Tikhonov},
	  year={2021},
	  eprint={2108.12626},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL}
	}
	```