---
pipeline_tag: text-generation
inference: true
widget:
- text: 'def print_hello_world():'
  example_title: Hello world
  group: Python
license: bigscience-openrail-m
datasets:
- books
- arxiv
- c4
- falcon-refinedweb
- wiki
- github-issues
- stack_markdown
library_name: transformers
tags:
- code
language:
- en
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/643a9dd0c5f633a7fa7e804a/HkB0QYV0BbmB3ktMugbZy.png)

# Refact-1.6B-base

Finally, the model we started training with our [blog post](https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/) is ready 🎉

The model might contain some problems, especially with the FIM format.

# It Works As a Chat

The primary application of this model is code completion (infill) in multiple programming languages.
It also works quite well as a chat model.

# Example

Fill-in-the-middle uses special tokens to mark the prefix, middle, and suffix parts of the input and output:

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "smallcloudai/Refact-1_6B-fim"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Prompt layout: <fim_prefix>{code before the gap}<fim_suffix>{code after the gap}<fim_middle>
# The model then generates the missing middle part.
prompt = '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>'

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(inputs, max_length=100, do_sample=True, temperature=0.2)

print("-"*80)
print(tokenizer.decode(outputs[0]))
```

# Chat Format

The same model works as chat (experimental).
```python
prompt_template = "SYSTEM {system}\n" \
                  "USER {query}\n" \
                  "ASSISTANT"
prompt = prompt_template.format(system="You are a programming assistant",
                                query="How do I sort a list in Python?")
```

# Architecture

As described in more detail in the blog post, we used:

- [ALiBi](https://arxiv.org/abs/2108.12409) based attention
- [LayerNorm](https://arxiv.org/abs/1607.06450v1) instead of [RMSNorm](https://arxiv.org/pdf/1910.07467.pdf)
- [Multi Query Attention](https://arxiv.org/abs/1911.02150)

We also used the Lion optimizer, flash attention, and early dropout. None of this is so exotic that you can't run it: see the example above.

# Training

For the base model, we used our own dataset that contains code with permissive licenses only, and open text datasets.
Filtering is the key to the success of this model:

- We only used text in English
- Only topics related to computer science
- Applied heavy deduplication

The text-to-code proportion was 50:50, and the model was trained for 1.2T tokens.

We don't release the base model, because its Fill-in-the-Middle (FIM) capability tends to repeat itself too much, so its practical use is limited. But if you still want it, write us a message on Discord.

# Limitations and Bias

The Refact-1.6B model was trained on text in English, but it has seen many more languages in code comments. Its performance on non-English languages is certainly lower.

# Model Stats

- **Architecture:** LLaMA-like model with multi-query attention
- **Objectives:** Fill-in-the-Middle, Chat
- **Tokens context:** 4096
- **Pretraining tokens:** 1.2T
- **Finetuning tokens:** 40B
- **Precision:** bfloat16
- **GPUs:** 64 NVIDIA A5000
- **Training time:** 28 days

# License

The model is licensed under the BigScience OpenRAIL-M v1 license agreement.

# Citation

If you are using this model, please give a link to this page.
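
# Prompt Helper Sketch

The two prompt layouts described in the Example and Chat Format sections can be wrapped in small string helpers. This is only a minimal sketch, not part of the released code: the function names `build_fim_prompt` and `build_chat_prompt` are hypothetical, and the FIM token strings are assumed to be the StarCoder-style `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`. Check the tokenizer's special tokens before relying on them.

```python
# Hypothetical helpers for building prompts; nothing here loads the model.
# ASSUMPTION: the FIM special tokens are the StarCoder-style strings below.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

# Template taken from the Chat Format section.
CHAT_TEMPLATE = "SYSTEM {system}\nUSER {query}\nASSISTANT"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Place the code before and after the gap so the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def build_chat_prompt(query: str, system: str = "You are a programming assistant") -> str:
    """Fill the SYSTEM/USER/ASSISTANT template with a system message and a user query."""
    return CHAT_TEMPLATE.format(system=system, query=query)


if __name__ == "__main__":
    print(build_fim_prompt('def print_hello_world():\n    """', '\n    print("Hello world!")'))
    print(build_chat_prompt("How do I sort a list in Python?"))
```

Either string can then be passed to `tokenizer.encode(...)` and `model.generate(...)` in the same way as the prompt in the Example section.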