|
# Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020 |
|
|
|
Models and predictions for our submission to TRAC-2020, the Second Workshop on Trolling, Aggression and Cyberbullying.
|
|
|
Our trained models, as well as the evaluation metrics recorded during training, are available at: https://databank.illinois.edu/datasets/IDB-8882752#
|
We also make a few of our models available in HuggingFace's model repository at https://huggingface.co/socialmediaie/. These models can be further fine-tuned on your dataset of choice (a minimal fine-tuning sketch appears at the end of the Usage section).
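For example, a hub model can be loaded directly by its model id. The id below is one instance of the `socialmediaie/TRAC2020_{lang}_{task letter}_{base model}` naming pattern used in the Usage section; check https://huggingface.co/socialmediaie/ for the models that are actually available:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example id following the naming pattern used in the Usage section below;
# see https://huggingface.co/socialmediaie/ for the available models.
model_name = "socialmediaie/TRAC2020_ALL_C_bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```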
|
|
|
Our approach is described in the following paper:
|
|
|
> Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020). |
|
|
|
The source code for training these models, along with more details, can be found in our code repository: https://github.com/socialmediaie/TRAC2020
|
|
|
NOTE: These models were retrained for upload here after our submission, so the evaluation measures may differ slightly from those reported in the paper.
|
|
|
If you plan to use these models or the dataset, please cite the following resources:
|
|
|
* Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020). |
|
* Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Trained Models for Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8882752_V1.
|
|
|
|
|
```
@inproceedings{Mishra2020TRAC,
  author = {Mishra, Sudhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
  booktitle = {Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020)},
  title = {{Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
  year = {2020}
}

@data{illinoisdatabankIDB-8882752,
  author = {Mishra, Sudhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
  doi = {10.13012/B2IDB-8882752_V1},
  publisher = {University of Illinois at Urbana-Champaign},
  title = {{Trained models for Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
  url = {https://doi.org/10.13012/B2IDB-8882752_V1},
  year = {2020}
}
```
|
|
|
|
|
## Usage |
|
|
|
The models can be used via the following code: |
|
|
|
```python
from pathlib import Path

import numpy as np
import torch
from scipy.special import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label vocabulary for each TRAC 2020 sub-task
TASK_LABEL_IDS = {
    "Sub-task A": ["OAG", "NAG", "CAG"],
    "Sub-task B": ["GEN", "NGEN"],
    "Sub-task C": ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]
}

model_version = "databank"  # the other option is "huggingface"

if model_version == "databank":
    # Make sure you have downloaded the required model file from
    # https://databank.illinois.edu/datasets/IDB-8882752 and unzipped it
    # at some model_path (we are using: "databank_model").
    # Assuming you get the following type of structure inside "databank_model":
    # 'databank_model/ALL/Sub-task C/output/bert-base-multilingual-uncased/model'
    model_path = next(Path("databank_model").glob("*/*/output/*/model"))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = "ALL", "Sub-task C", "bert-base-multilingual-uncased"
    # e.g. socialmediaie/TRAC2020_ALL_C_bert-base-multilingual-uncased
    base_model = f"socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}"
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# For doing inference, set the model in eval mode.
# If you want to further fine-tune the model, you can reset it with model.train().
model.eval()

task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog."
# Prepend the classifier token before tokenizing.
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    # Index the first element so this works whether the model returns
    # a plain tuple or a ModelOutput object.
    logits = model(tokens_tensor)[0]

preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
"""You should get an output as follows:

({'CAG-GEN': 0.06762535,
  'CAG-NGEN': 0.03244293,
  'NAG-GEN': 0.6897794,
  'NAG-NGEN': 0.15498641,
  'OAG-GEN': 0.034373745,
  'OAG-NGEN': 0.020792078},
 array(['NAG-GEN'], dtype='<U8'))
"""
```
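If you want to further fine-tune one of these models on your own labeled data, the sketch below shows one possible training step using a plain PyTorch loop after switching the model back to `model.train()`. The toy `texts` and `labels`, the `AdamW` optimizer, and the learning rate are illustrative assumptions, not settings from our paper; the callable-tokenizer API also assumes a reasonably recent `transformers` version:

```python
import torch
from torch.optim import AdamW

# Illustrative toy data; replace with your own labeled examples.
# Label values are indices into task_labels for the chosen sub-task.
texts = ["You are awful!", "Have a nice day."]
labels = torch.tensor([0, 1])

# Switch the model (loaded as above) back to training mode.
model.train()
optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed learning rate

# Tokenize a mini-batch and run one optimization step.
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**encodings, labels=labels)
loss = outputs[0]  # the first element is the loss when labels are passed
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice you would wrap this in a loop over mini-batches and epochs, and evaluate on a held-out split of your data.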