metadata

license: cc-by-sa-4.0
datasets:
  - bigcode/the-stack-dedup
  - sadiqj/opam-source
tags:
  - code
language:
  - code
programming_language:
  - OCaml

camlcoder

Model Description

camlcoder is a 2.7B-parameter causal language model for OCaml code completion. It is a fine-tuned version of replit-code-v1-3b, trained on a subset of the Stack Dedup v1.2 dataset and on the most recent version of every Opam package that compiles on OCaml 5.0.

License

The model checkpoint and vocabulary file are licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0).

Contact

For questions and comments about the model, please post in the community section.

How to Use

First, install the latest versions of the following dependencies:

einops
sentencepiece
safetensors
torch
transformers
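
Before loading the model, it can help to confirm that these packages are visible to your Python environment. The following is a minimal sketch, not part of the model's own tooling: the helper name `missing_packages` is hypothetical, and it assumes each package's import name is the same as the name listed above.

```python
import importlib.util

# Import names for the dependencies listed above (assumed to match the package names).
REQUIRED = ["einops", "sentencepiece", "safetensors", "torch", "transformers"]


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Install them with: pip install --upgrade " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```

The check only looks the packages up; it does not import them, so it runs quickly even when torch is installed.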

You can then use the model as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
import torch

max_length = 256  # upper bound on prompt plus generated tokens

# trust_remote_code=True lets transformers run the custom model and tokenizer code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained('sadiqj/camlcoder', trust_remote_code=True, max_length=max_length, use_safetensors=True)
model = AutoModelForCausalLM.from_pretrained('sadiqj/camlcoder', trust_remote_code=True, use_safetensors=True).to(device='cuda:0', dtype=torch.bfloat16)

# The prompt is an OCaml comment describing the function, followed by the start of its definition.
input_ids = tokenizer.encode('(* Return the middle element of the list *)\nlet get_middle l =', return_tensors='pt').to(device='cuda:0')

# First token id the tokenizer produces for a blank line ('\n\n'); used as the stop signal.
newline_id = tokenizer.encode('\n\n', return_tensors='pt')[0][0].item()

class StopOnNewlines(StoppingCriteria):
    """Stop generation as soon as the blank-line token appears in the sequence."""
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return newline_id in input_ids

# Generate until the stop criterion fires or max_length tokens are reached.
output = model.generate(input_ids, max_length=max_length, stopping_criteria=StoppingCriteriaList([StopOnNewlines()]), use_cache=True)

print(tokenizer.decode(output[0], skip_special_tokens=True))
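
The stopping criterion above ends generation at the first blank line. If you would rather let generation run to max_length and trim afterwards, the same cut can be made on the decoded string. This is only a sketch of that alternative, not something the model provides: `truncate_at_blank_line` is a hypothetical helper, and it assumes a blank line ('\n\n') marks the end of the definition you want.

```python
def truncate_at_blank_line(text: str, prompt: str = "") -> str:
    """Keep the prompt plus the completion up to the first blank line.

    Only the part after the prompt is searched, so a blank line inside
    the prompt itself does not cut the result short.
    """
    has_prompt = text.startswith(prompt)
    head = prompt if has_prompt else ""
    completion = text[len(prompt):] if has_prompt else text
    cut = completion.find("\n\n")
    if cut != -1:
        completion = completion[:cut]
    return head + completion


# Example with a hand-written string standing in for decoded model output.
decoded = "let f x = x + 1\n\nlet g y = y * 2"
print(truncate_at_blank_line(decoded, prompt="let f x ="))
```

With real output, pass the decoded string from tokenizer.decode and the original prompt text.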