
Model Architecture

This model follows the distilroberta-base architecture and was initialized from the distilroberta-base checkpoint.
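
Concretely (a minimal sketch; the exact initialization code is not included in this card), starting from that checkpoint with transformers looks like:

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the public distilroberta-base weights as the starting point instead of a random initialization.
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")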

Pre-training phase

This model was pre-trained with the MLM objective (mlm_probability=0.15).
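
The masking code itself is not part of this card; a minimal sketch of dynamic masking with that probability, assuming the standard transformers collator, would be:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
# At batch time, 15% of the tokens are selected for masking (replaced by <mask>, a random token, or kept).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)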

During this phase, the inputs had two formats. One is the following: $\left[[CLS], t_1, \dots, t_n, [SEP], w_1, \dots, w_m, [EOS]\right]$, where $t_1, \dots, t_n$ are the code tokens and $w_1, \dots, w_m$ are the natural language description tokens. More concretely, this is the snippet that tokenizes the input:

def tokenize_function_bimodal(examples, tokenizer, max_len):
    # Join the pre-tokenized code and docstring tokens back into whitespace-separated strings.
    codes = [' '.join(example) for example in examples['func_code_tokens']]
    nls = [' '.join(example) for example in examples['func_documentation_tokens']]
    # Encode each (code, natural language) pair as one sequence: code [SEP] description.
    pairs = [[c, nl] for c, nl in zip(codes, nls)]
    return tokenizer(pairs, max_length=max_len, padding="max_length", truncation=True)
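
On a CodeSearchNet-style dataset with the column names used above (the loading code below is an assumption, not taken from this card), the function would typically be applied with a batched datasets.map:

from datasets import load_dataset

dataset = load_dataset("code_search_net", "python", split="train")
# Batched map, so `examples` is a dict of columns as the function above expects.
tokenized_bimodal = dataset.map(
    lambda examples: tokenize_function_bimodal(examples, tokenizer, 512),
    batched=True,
    remove_columns=dataset.column_names,
)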

The other format is: $\left[[CLS], t_1, \dots, t_n, [EOS]\right]$, where $t_1, \dots, t_n$ are the code tokens. More concretely, this is the snippet that tokenizes the input:

def tokenize_function_unimodal(examples, tokenizer, max_len, tokens_column):
    # Join the pre-tokenized code tokens back into whitespace-separated strings and encode them on their own.
    codes = [' '.join(example) for example in examples[tokens_column]]
    return tokenizer(codes, max_length=max_len, padding="max_length", truncation=True)
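
The unimodal function only differs in taking the token column as a parameter, so the same corpus can be reused as a code-only view (same assumptions as above):

# Code-only view of the same corpus; only the column name changes.
tokenized_unimodal = dataset.map(
    lambda examples: tokenize_function_unimodal(examples, tokenizer, 512, 'func_code_tokens'),
    batched=True,
    remove_columns=dataset.column_names,
)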

Training details

  • Max length: 512
  • Effective batch size: 64
  • Total steps: 140000
  • Learning rate: 5e-4
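
As a sketch of how these hyperparameters might map onto transformers' TrainingArguments (the per-device batch size and gradient accumulation split below are assumptions; only the effective batch size of 64 is stated above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilroberta-base-csn-python-unimodal-bimodal",
    per_device_train_batch_size=16,   # assumption: 16 x 4 accumulation steps = effective batch size of 64
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    max_steps=140_000,
)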

Usage

from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained('antolin/distilroberta-base-csn-python-unimodal-bimodal')
tokenizer = AutoTokenizer.from_pretrained('antolin/distilroberta-base-csn-python-unimodal-bimodal')
mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Bimodal input: whitespace-joined code tokens, the separator token, then the description tokens.
code_tokens = ["def", "<mask>", "(", "a", ",", "b", ")", ":", "if", "a", ">", "b", ":", "return", "a", "else", "return", "b"]
nl_tokens = ["return", "the", "maximum", "value"]
input_text = ' '.join(code_tokens) + tokenizer.sep_token + ' '.join(nl_tokens)
pprint(mask_filler(input_text, top_k=5))
[{'score': 0.7177600860595703,
  'sequence': 'def maximum ( a, b ) : if a > b : return a else return breturn '
              'the maximum value',
  'token': 4532,
  'token_str': ' maximum'},
 {'score': 0.22075247764587402,
  'sequence': 'def max ( a, b ) : if a > b : return a else return breturn the '
              'maximum value',
  'token': 19220,
  'token_str': ' max'},
 {'score': 0.015111264772713184,
  'sequence': 'def minimum ( a, b ) : if a > b : return a else return breturn '
              'the maximum value',
  'token': 3527,
  'token_str': ' minimum'},
 {'score': 0.007394665852189064,
  'sequence': 'def min ( a, b ) : if a > b : return a else return breturn the '
              'maximum value',
  'token': 5251,
  'token_str': ' min'},
 {'score': 0.004020793363451958,
  'sequence': 'def length ( a, b ) : if a > b : return a else return breturn '
              'the maximum value',
  'token': 5933,
  'token_str': ' length'}]
