ai-soco-c++-roberta-small

Model description

A RoBERTa model with 6 layers and 12 attention heads, pre-trained from scratch on the AI-SOCO dataset, which consists of C++ source code crawled from the CodeForces website.

Intended uses & limitations

The model can be used for code classification, authorship identification and other downstream tasks on C++ source code.

How to use

You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files.
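A minimal sketch of that workflow follows, assuming the Hugging Face transformers library and PyTorch are installed. MODEL_DIR is a placeholder for wherever the model files live, and the function name embed_source and the mean pooling step are illustrative choices, not part of the released model.

```python
# Placeholder: a local directory (or hub id) that holds the model and tokenizer files.
MODEL_DIR = "path/to/ai-soco-c++-roberta-small"
# Maximum sequence length used during pre-training (see Training procedure).
MAX_LENGTH = 512


def embed_source(code, model_dir=MODEL_DIR):
    """Return one mean-pooled hidden-state vector for a C++ source string.

    The input is expected to be normalized the same way as the training data
    (each run of 4 spaces converted to a tab, see Training procedure).
    """
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()

    # Truncate to the pre-training sequence length and return PyTorch tensors.
    inputs = tokenizer(code, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)

    # Average over the token dimension to get a single vector for the file.
    return hidden.mean(dim=1).squeeze(0)
```

Calling embed_source("int main() {\n\treturn 0;\n}\n") would return a single feature vector that a downstream classifier could consume. For fine-tuning, a task-specific head such as AutoModelForSequenceClassification is the more usual starting point.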

Limitations and bias

The model is limited to the C++ programming language.

Training data

The model was randomly initialized and trained on the AI-SOCO dataset, which contains 100K C++ source files.

Training procedure

The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs, with a batch size of 16*8, a maximum sequence length of 512 and the MLM objective. All other parameters were left at the default values of the run_language_modeling.py script. Every 4 consecutive spaces were converted to a single tab character (\t) before tokenization.
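The whitespace conversion described above can be sketched as follows. This is an illustrative reading of the rule, not the original preprocessing script: it assumes that runs of spaces are consumed left to right in non-overlapping groups of four, and the function name spaces_to_tabs is hypothetical.

```python
def spaces_to_tabs(source: str) -> str:
    """Replace each non-overlapping group of 4 spaces with one tab, left to right."""
    # str.replace scans left to right, so 8 spaces become 2 tabs and
    # 5 spaces become 1 tab followed by 1 leftover space.
    return source.replace(" " * 4, "\t")


sample = "int main() {\n    return 0;\n}\n"
normalized = spaces_to_tabs(sample)
```

Applying the same conversion to inputs at inference time keeps them consistent with what the tokenizer saw during pre-training.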

BibTeX entry and citation info

@inproceedings{ai-soco-2020-fire,
    title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}",
    author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo",
    booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)",
    year = "2020"
}