--- language: - ru - en - ru-RU tags: - xlm-roberta-large datasets: - IlyaGusev/headline_cause license: apache-2.0 --- # XLM-RoBERTa HeadlineCause Full ## Model description This model was trained to predict the presence of causal relations between two headlines. This model is for the Full task with 7 possible labels: titles are almost the same, A causes B, B causes A, A refutes B, B refutes A, A linked with B in another way, A is not linked to B. English and Russian languages are supported. You can use hosted inference API to infer a label for a headline pair. To do this, you shoud seperate headlines with `````` token. For example: ``` Песков опроверг свой перевод на удаленкуДмитрий Песков перешел на удаленку ``` ## Intended uses & limitations #### How to use ```python from tqdm.notebook import tqdm from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline def get_batch(data, batch_size): start_index = 0 while start_index < len(data): end_index = start_index + batch_size batch = data[start_index:end_index] yield batch start_index = end_index def pipe_predict(data, pipe, batch_size=64): raw_preds = [] for batch in tqdm(get_batch(data, batch_size)): raw_preds += pipe(batch) return raw_preds MODEL_NAME = TOKENIZER_NAME = "IlyaGusev/xlm_roberta_large_headline_cause_full" tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, do_lower_case=False) model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME) model.eval() pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt", return_all_scores=True) texts = [ ( "Judge issues order to allow indoor worship in NC churches", "Some local churches resume indoor services after judge lifted NC governor’s restriction" ), ( "Gov. Kevin Stitt defends $2 million purchase of malaria drug touted by Trump", "Oklahoma spent $2 million on malaria drug touted by Trump" ), ( "Песков опроверг свой перевод на удаленку", "Дмитрий Песков перешел на удаленку" ) ] pipe_predict(texts, pipe) ``` #### Limitations and bias The models are intended to be used on news headlines. No other limitations are known. ## Training data * HuggingFace dataset: [IlyaGusev/headline_cause](https://huggingface.co/datasets/IlyaGusev/headline_cause) * GitHub: [IlyaGusev/HeadlineCause](https://github.com/IlyaGusev/HeadlineCause) ## Training procedure * Notebook: [HeadlineCause](https://colab.research.google.com/drive/1NAnD0OJ0TnYCJRsHpYUyYkjr_yi8ObcA) * Stand-alone script: [train.py](https://github.com/IlyaGusev/HeadlineCause/blob/main/headline_cause/train.py) ## Eval results Evaluation results can be found in the [arxiv paper](https://arxiv.org/pdf/2108.12626.pdf). ### BibTeX entry and citation info ```bibtex @misc{gusev2021headlinecause, title={HeadlineCause: A Dataset of News Headlines for Detecting Causalities}, author={Ilya Gusev and Alexey Tikhonov}, year={2021}, eprint={2108.12626}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```