ncoop57 commited on
Commit
e1d95c5
1 Parent(s): 6fdf219

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -2,7 +2,7 @@
2
 
3
  ## Model Description
4
 
5
- GPT-Neo-125M-Code-Clippy is a [GPT-Neo-125M model](https://huggingface.co/EleutherAI/gpt-neo-125M) finetuned using causal language modeling on our version of the Code Clippy Data dataset that has duplicates, which was scraped from public Github repositories (more information in the provided link). This model is specialized to autocomplete methods in multiple programming languages. As discussed in OpenAI's [Codex paper](https://arxiv.org/abs/2107.03374), we modified the GPT-Neo model and tokenizer to accommodate for additional whitespace characters. Specifically, we add the following tokens `["\t\t", " ", " ", " "]` and since they are all related to indentation, we initialize the embedding layer of these tokens with the same weights as the `\t` token already present in the model in hopes the model will learn to associate these whitespace characters with indentation faster.
6
 
7
  ## Training data
8
 
 
2
 
3
  ## Model Description
4
 
5
+ GPT-Neo-125M-Code-Clippy is a [GPT-Neo-125M model](https://huggingface.co/EleutherAI/gpt-neo-125M) finetuned using causal language modeling on our version of the Code Clippy Data dataset that has duplicates, which was scraped from public Github repositories (more information in the provided link). This model is specialized to autocomplete methods in multiple programming languages. As discussed in OpenAI's [Codex paper](https://arxiv.org/abs/2107.03374), we modified the GPT-Neo model and tokenizer to accommodate for additional whitespace characters. Specifically, we add the following tokens `["\t\t", " ", " ", " "]` and since they are all related to indentation, we initialize the embedding layer of these tokens with the same weights as the `\t` token already present in the model in hopes the model will learn to associate these whitespace characters with indentation faster. A script to automatically do this can be found [here](https://github.com/ncoop57/gpt-code-clippy/blob/camera-ready/training/utilities/add_new_tokens.py).
6
 
7
  ## Training data
8