# Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020

Models and predictions for submission to TRAC - 2020 Second Workshop on Trolling, Aggression and Cyberbullying.

Our trained models, as well as the evaluation metrics collected during training, are available at: https://databank.illinois.edu/datasets/IDB-8882752#

We also make a few of our models available in HuggingFace's models repository at https://huggingface.co/socialmediaie/; these models can be further fine-tuned on your dataset of choice (see the sketch at the end of the Usage section).

Our approach is described in our paper titled:

> Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).

The source code for training this model and more details can be found in our code repository: https://github.com/socialmediaie/TRAC2020

NOTE: These models were retrained for uploading here after our submission, so the evaluation measures may be slightly different from the ones reported in the paper.

If you plan to use the dataset please cite the following resources:

* Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
* Mishra, Shubhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Trained Models for Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8882752_V1.
```
@inproceedings{Mishra2020TRAC,
    author = {Mishra, Sudhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
    booktitle = {Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020)},
    title = {{Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
    year = {2020}
}

@data{illinoisdatabankIDB-8882752,
    author = {Mishra, Shubhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
    doi = {10.13012/B2IDB-8882752_V1},
    publisher = {University of Illinois at Urbana-Champaign},
    title = {{Trained models for Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
    url = {https://doi.org/10.13012/B2IDB-8882752{\_}V1},
    year = {2020}
}
```

## Usage

The models can be used via the following code:

```python
import numpy as np
import torch
from pathlib import Path
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TASK_LABEL_IDS = {
    "Sub-task A": ["OAG", "NAG", "CAG"],
    "Sub-task B": ["GEN", "NGEN"],
    "Sub-task C": ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]
}

model_version = "databank"  # the other option is "huggingface", which loads from the HuggingFace models repository

if model_version == "databank":
    # Make sure you have downloaded the required model file from
    # https://databank.illinois.edu/datasets/IDB-8882752
    # Unzip the file at some model_path (we are using: "databank_model")
    model_path = next(Path("databank_model").glob("./*/output/*/model"))
    # Assuming you get the following type of structure inside "databank_model":
    # 'databank_model/ALL/Sub-task C/output/bert-base-multilingual-uncased/model'
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = "ALL", "Sub-task C", "bert-base-multilingual-uncased"
    # HuggingFace repository names use the task letter, e.g. TRAC2020_ALL_C_bert-base-multilingual-uncased
    base_model = f"socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}"
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# For doing inference set the model in eval mode.
# If you want to further fine-tune the model, reset it with model.train().
model.eval()

task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog."
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    # Indexing with [0] extracts the logits whether the model returns a tuple
    # (older transformers versions) or a model output object (newer versions).
    logits = model(tokens_tensor)[0]

preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
"""You should get an output as follows:

({'CAG-GEN': 0.06762535, 'CAG-NGEN': 0.03244293, 'NAG-GEN': 0.6897794, 'NAG-NGEN': 0.15498641, 'OAG-GEN': 0.034373745, 'OAG-NGEN': 0.020792078}, array(['NAG-GEN'], dtype='<U8'))
"""
```
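Since these are standard `transformers` sequence-classification checkpoints, they can also be fine-tuned further on your own labelled data. Below is a minimal sketch using the `Trainer` API; the file `my_data.csv`, its `text`/`label` columns, and all hyperparameters are illustrative assumptions, not part of the original release.

```python
# Minimal fine-tuning sketch. Assumptions (not from the original release):
# a CSV file "my_data.csv" with "text" and "label" columns, where each label
# is one of the task's label strings (here Sub-task A: OAG/NAG/CAG).
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

TASK_LABELS = ["OAG", "NAG", "CAG"]  # swap in the label set for your sub-task

class TracDataset(Dataset):
    """Wraps tokenized texts and integer label ids for the Trainer."""
    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.label_ids = [TASK_LABELS.index(label) for label in labels]

    def __len__(self):
        return len(self.label_ids)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.label_ids[idx])
        return item

# Model name assumed to follow the naming pattern from the snippet above.
model_name = "socialmediaie/TRAC2020_ALL_A_bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

df = pd.read_csv("my_data.csv")  # hypothetical training data
train_dataset = TracDataset(df["text"].tolist(), df["label"].tolist(), tokenizer)

# Illustrative hyperparameters only; tune them for your dataset.
args = TrainingArguments(output_dir="finetuned_trac", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```

After training, `Trainer` saves checkpoints under `output_dir`, and the fine-tuned model can be reloaded with `AutoModelForSequenceClassification.from_pretrained` exactly as in the usage snippet above.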