CuBERT: Learning and Evaluating Contextual Embedding of Source Code

Overview

This is an unofficial Hugging Face version of CuBERT. This particular version comes from gs://cubert/20210711_Java/pre_trained_model_epochs_2__length_2048. It was trained on 2021-07-11 for 2 epochs with a 2048-token context window on the Java BigQuery dataset. I manually converted the TensorFlow checkpoint to PyTorch and uploaded it here. The tokenizer has not been converted yet. All credit goes to Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi.

All converted versions, including this one, are listed here:

cubert-20210711-Python-512

cubert-20210711-Python-1024

cubert-20210711-Python-2048

cubert-20210711-Java-512

cubert-20210711-Java-1024

cubert-20210711-Java-2048
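
The names above follow a regular pattern (checkpoint date, language, context length), so a variant can be selected programmatically. The following is a minimal sketch, not an official loading recipe. The `namespace` argument is a hypothetical placeholder for the Hub account that hosts the models, and it assumes the converted weights load through the standard `transformers` `AutoModel.from_pretrained` call. Because the tokenizer has not been converted, the sketch only loads the model and does not tokenize any input.

```python
# Sketch only: the Hub namespace and the loading path are assumptions,
# not something stated in this model card.
LANGUAGES = ("Python", "Java")
CONTEXT_LENGTHS = (512, 1024, 2048)


def variant_name(language: str, context_length: int) -> str:
    """Build a variant name in the form used by the list above."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if context_length not in CONTEXT_LENGTHS:
        raise ValueError(f"unknown context length: {context_length!r}")
    return f"cubert-20210711-{language}-{context_length}"


def load_variant(namespace: str, language: str, context_length: int):
    """Load the converted PyTorch weights (requires transformers and torch).

    `namespace` is a hypothetical placeholder for the hosting Hub account.
    """
    # Imported lazily so the name helpers work without transformers installed.
    from transformers import AutoModel

    repo_id = f"{namespace}/{variant_name(language, context_length)}"
    return AutoModel.from_pretrained(repo_id)


# Enumerate all six variants in the same order as the list above.
ALL_VARIANTS = [variant_name(lang, n) for lang in LANGUAGES for n in CONTEXT_LENGTHS]
```

Calling `load_variant("<namespace>", "Java", 2048)` would fetch this version once the placeholder is replaced with the real account name. Inputs still have to be converted to token IDs with the original CuBERT tokenizer.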

Citation:

@inproceedings{cubert,
author    = {Aditya Kanade and
             Petros Maniatis and
             Gogul Balakrishnan and
             Kensen Shi},
title     = {Learning and evaluating contextual embedding of source code},
booktitle = {Proceedings of the 37th International Conference on Machine Learning,
               {ICML} 2020, 12-18 July 2020},
series    = {Proceedings of Machine Learning Research},
publisher = {{PMLR}},
year      = {2020},
}