---
license: other
license_name: inf
license_link: https://huggingface.co/infly/OpenCoder-1.5B-Base/blob/main/LICENSE
language:
- en
- zh
base_model: infly/OpenCoder-1.5B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- code
---

## Description

This model is derived from [OpenCoder-1.5B-Base](https://huggingface.co/infly/OpenCoder-1.5B-Base) by additional context-extension fine-tuning. The repository context is composed with the _Half-memory `.py` irrelevant_ composer; this composer and the alternatives are described in the [On Pretraining for Project-Level Code Completion](https://openreview.net/forum?id=t9RN9WX4Ic) paper ([arXiv](https://arxiv.org/abs/2510.13697)). Specifically, Section A.1 of the Appendix describes the context composition method, and Table 3 compares it with the other composers from the same [collection](https://huggingface.co/collections/JetBrains-Research/repository-level-pre-trained-opencoder-68e938c003be1cfba9c3595e).

We publish this checkpoint to support the reproducibility and accessibility of our research results.

## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The fine-tuned checkpoint reuses the tokenizer of the base model.
model_name = "JetBrains-Research/OpenCoder-1.5B-Half-Memory-Py-Irrelevant"
tokenizer_name = "infly/OpenCoder-1.5B-Base"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

inputs = tokenizer("# write a quick sort algorithm", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
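
Since the checkpoint is fine-tuned for project-level completion with extended context, the prompt will typically contain other files from the repository in addition to the file being completed. The snippet below is a minimal sketch of such a prompt, assuming a naive concatenation of file contents prefixed with path comments; the file names and the composition format are illustrative only, while the actual _Half-memory `.py` irrelevant_ composer used during fine-tuning is specified in Section A.1 of the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same loading as in the Quickstart above.
model = AutoModelForCausalLM.from_pretrained(
    "JetBrains-Research/OpenCoder-1.5B-Half-Memory-Py-Irrelevant",
    torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("infly/OpenCoder-1.5B-Base", trust_remote_code=True)

# Hypothetical repository files used as extended context (illustrative only).
context_files = {
    "utils/math_ops.py": "def add(a, b):\n    return a + b\n",
    "utils/strings.py": "def shout(s):\n    return s.upper() + '!'\n",
}

# Naive composition: concatenate context files with path comments,
# then append the beginning of the file to be completed.
prompt = "".join(f"# {path}\n{code}\n" for path, code in context_files.items())
prompt += "# main.py\nfrom utils.math_ops import add\n\ndef add_three(a, b, c):\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=64)

# Print only the newly generated tokens (the completion).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```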