McGill-NLP/codellm_1b_rotary

This model is a 1B-scale decoder-only transformer trained with Rotary positional encoding. It was built to study how the choice of positional encoding affects length generalization.

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "McGill-NLP/codellm_1b_rotary"

# Important: `trust_remote_code=True` is required because the model uses
# a custom architecture that supports different positional encodings, so
# the model implementation must be downloaded from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(model.config.position_encoding_type)
# Outputs: `rotary`

prompt = "def print_hello_world():"
input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
input_ids = torch.cat([
  torch.tensor([[tokenizer.bos_token_id]], device=input_ids.device), input_ids
], dim=1)  # Prepend <bos> token, kept on the same device as input_ids

output = model.generate(input_ids, do_sample=True, temperature=0.2, max_length=16)
print(tokenizer.decode(output[0]))

Model Details

Model Description

Developed by: McGill NLP Group
Model type: Decoder-only transformer
Language(s) (NLP): Primarily English, plus the programming languages in its code training data (Python, Java, and JavaScript).
License: Apache 2.0
Finetuned from model: This model is pretrained from scratch.

Model Sources

Repository: McGill-NLP/Length-Generalization GitHub Repository
Paper: The Impact of Positional Encoding on Length Generalization in Transformers

Uses

Direct Use

The model can be used directly for text generation. Because it was trained on source code, it is best suited to code-related tasks such as code completion, bug fixing, and code generation.

Bias, Risks, and Limitations

Because the model was trained on source code, it may inherit biases present in the underlying dataset, including, but not limited to, biases towards more commonly used programming languages or coding styles. Users should be cautious when applying this model to underrepresented programming languages and contexts. This model has not undergone safety training and was produced for research purposes only. The user is solely responsible for the outputs of this model.

Recommendations

Users should consider the context and diversity of the application domain when employing this model, especially in critical systems. Further evaluation and fine-tuning might be necessary to mitigate any potential biases or limitations for specific use cases.

How to Get Started with the Model

Use the usage example above to get started with generating text or code. Make sure the required dependencies, including torch and transformers, are installed in your environment.

Training Details

Training Data

The model was pretrained on a dataset of 30M source code files from the StarCoder corpus, amounting to 30B tokens. The training data mix is:

40% Python
25% Java
25% JavaScript
5% GitHub issues
5% GitHub commits
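
As a rough illustration of what this mix implies, the sketch below converts the percentages into approximate token counts. It assumes the percentages are measured in tokens and apply to the full 30B-token total, which this card does not state explicitly; the names `TOTAL_TOKENS`, `MIX_PERCENT`, and `tokens_per_source` are hypothetical.

```python
# Assumption: the mix percentages are shares of the 30B-token total.
TOTAL_TOKENS = 30_000_000_000

MIX_PERCENT = {
    "Python": 40,
    "Java": 25,
    "JavaScript": 25,
    "GitHub issues": 5,
    "GitHub commits": 5,
}


def tokens_per_source(total, mix_percent):
    """Split a total token count according to integer percentage shares."""
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budget = tokens_per_source(TOTAL_TOKENS, MIX_PERCENT)
for name, count in budget.items():
    print(f"{name}: ~{count / 1e9:.1f}B tokens")
```

Under that assumption, Python accounts for roughly 12B tokens, Java and JavaScript for about 7.5B each, and GitHub issues and commits for about 1.5B each.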

Training Procedure

The model follows a decoder-only architecture with 1.3 billion parameters and was trained to predict the next token in the sequence. For more detailed information on the training procedure, refer to the paper linked above.
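
The next-token objective can be sketched in a few lines. The following is a minimal, dependency-free illustration of mean cross-entropy with shifted targets, not the training code used for this model; the function name `next_token_loss` and the toy inputs are hypothetical.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy where logits[t] is scored against token_ids[t + 1].

    logits: list of per-position score lists, one score per vocabulary entry.
    token_ids: the token sequence that produced those logits.
    """
    steps = len(token_ids) - 1  # the last token has no target to predict
    total = 0.0
    for t in range(steps):
        row = logits[t]
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[token_ids[t + 1]]
    return total / steps


# Toy example: vocabulary of 4 tokens, uniform (all-zero) logits.
toy_tokens = [0, 2, 1, 3]
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in toy_tokens]
uniform_loss = next_token_loss(toy_logits, toy_tokens)
print(uniform_loss)
```

With uniform logits, every target has probability 1/4, so the loss equals ln 4. A trained model should score well below that baseline on in-distribution text.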

Technical Specifications

Model Architecture and Objective

The model uses a decoder-only transformer architecture with Rotary positional encoding.
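
For readers unfamiliar with Rotary positional encoding, here is a minimal sketch of the general idea: each pair of dimensions in a query or key vector is rotated by an angle proportional to the token position, so attention scores depend only on relative position. This is an illustration under stated assumptions, not this model's implementation. The function name `rotary_embed`, the interleaved pairing of dimensions, and `base=10000.0` are assumptions; real implementations differ in how they pair dimensions.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i + 1]) by position * base ** (-i / d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rotary_embed(q, 5), rotary_embed(k, 2))
score_far = dot(rotary_embed(q, 105), rotary_embed(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The two scores match up to floating-point error because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n.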

Citation

Please cite the following paper if you use this model in your work:

@inproceedings{kazemnejad2023:ImpactOfPeOnLengthGen,
      title={The Impact of Positional Encoding on Length Generalization in Transformers},
      author={Amirhossein Kazemnejad and Inkit Padhi and Karthikeyan Natesan and Payel Das and Siva Reddy},
      booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
      year={2023},
      url={https://openreview.net/forum?id=Drrl2gcjzl}
}

More Information

For further details about the model's architecture, training, and applications, please refer to the paper and the GitHub repository linked above.
