	---
	language:
	- en
	- code
	license: bigcode-openrail-m
	tags:
	- starcoder
	- code_synthesis
	- competition-level_code_generation
	datasets:
	- BAAI/TACO
	---
	# Starcoder-base-1B-TACO

	## Model Description

	Starcoder-base-1B-TACO is Starcoder-base-1B fine-tuned (full-parameter) on the TACO dataset. The model is specialized for solving competition-level programming tasks.

	## Training data

	The model is trained on the [Topics in Algorithmic Code Generation Dataset](https://github.com/FlagOpen/TACO) (TACO). The dataset focuses on algorithmic code generation and aims to provide a more challenging training dataset and evaluation benchmark for code generation models. It includes 25,443 problems in the training set and 1,000 problems in the test set, which its authors describe as the largest code generation dataset currently available. Each TACO problem is paired with a diverse set of solutions, with up to 1.55M solutions overall, so that models trained on the dataset are more robust and less prone to overfitting. The dataset also includes fine-grained labels such as task topics, algorithms, skills, and difficulty levels, which offer more precise guidance for both training and evaluating code generation models.
	This model is fine-tuned on the train split of TACO.
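To illustrate how the fine-grained labels can guide training or evaluation, here is a minimal sketch that selects problems by difficulty and counts skill labels. It runs on a few hand-written toy records rather than the real dataset. The field names `difficulty` and `skill_types`, the label values, and both helper functions are assumptions made for this illustration; check the dataset card for the actual schema.

```python
from collections import Counter

# Toy records standing in for TACO problems. Field names and values are
# assumed for illustration and are not taken from the real dataset.
problems = [
    {"question": "Sum two integers.", "difficulty": "EASY", "skill_types": ["Implementation"]},
    {"question": "Shortest path in a grid.", "difficulty": "MEDIUM", "skill_types": ["Graphs", "Search"]},
    {"question": "Count inversions.", "difficulty": "MEDIUM", "skill_types": ["Sorting", "Search"]},
]


def filter_by_difficulty(records, difficulty):
    """Return the records whose (assumed) `difficulty` label matches."""
    return [r for r in records if r.get("difficulty") == difficulty]


def count_skills(records):
    """Count how often each (assumed) `skill_types` label occurs."""
    counts = Counter()
    for r in records:
        counts.update(r.get("skill_types", []))
    return counts


medium = filter_by_difficulty(problems, "MEDIUM")
skill_counts = count_skills(medium)
print(len(medium), dict(skill_counts))
```

The same pattern could be applied to a per-difficulty evaluation subset, provided the real label names are substituted.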

	## Training procedure

	The training script used to train this model can be found [here](https://github.com/FlagOpen/TACO/blob/main/train.py).

	Training details are described in our [paper](https://arxiv.org/abs/2312.14852).


	## Intended Use and Limitations

	The model is finetuned to solve programming problems given a text description and optional starter code.
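As a rough sketch of how a text description and optional starter code might be combined into a prompt, the helper below mirrors the layout of the usage example in this card: description, then starter code, then an `ANSWER:` marker. The function name `build_prompt` is hypothetical, and the exact template used during fine-tuning is an assumption here; the training script in the TACO repository is the authoritative source.

```python
def build_prompt(description, starter_code=None):
    """Assemble a prompt: description, optional starter code, then 'ANSWER:'.

    The layout is assumed from the usage example in this card; it is not
    confirmed to be the exact template used during fine-tuning.
    """
    parts = ["", description.strip()]
    if starter_code:
        parts.append(starter_code.rstrip())
    parts.append("ANSWER:")
    # Leading and trailing newlines match the triple-quoted example prompt.
    return "\n".join(parts) + "\n"


prompt = build_prompt(
    "A function to greet user. Given a user name it should say hello",
    "def greet(name):",
)
print(prompt)
```

When no starter code is given, the prompt contains only the description followed by the `ANSWER:` marker.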

	### How to use

	You can load this model with `AutoModelForCausalLM` and use it directly for text generation. Because sampling is enabled (`do_sample=True`), this example generates a different sequence each time it's run:

	```py
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Use a GPU when one is available, otherwise fall back to the CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"

	model = AutoModelForCausalLM.from_pretrained("flagopen/starcoderbase-1b-taco").to(device)
	tokenizer = AutoTokenizer.from_pretrained("flagopen/starcoderbase-1b-taco")

	prompt = """
	A function to greet user. Given a user name it should say hello
	def greet(name):
	ANSWER:
	"""
	input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
	start = input_ids.size(1)  # number of prompt tokens
	# max_length counts the prompt tokens as well as the generated ones.
	out = model.generate(input_ids, do_sample=True, max_length=50, num_beams=2,
	                     early_stopping=True, eos_token_id=tokenizer.eos_token_id)
	# Decode only the newly generated tokens, skipping the prompt.
	print(tokenizer.decode(out[0][start:]))
	```
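The decoded output may end with an end-of-text marker, since the example decodes without removing special tokens. A minimal post-processing sketch follows; the helper name `clean_completion` is hypothetical, and the marker string `<|endoftext|>` is an assumption about this tokenizer (in practice, read it from `tokenizer.eos_token`).

```python
# Assumed end-of-text marker; prefer tokenizer.eos_token in real use.
EOS_TOKEN = "<|endoftext|>"


def clean_completion(text, eos_token=EOS_TOKEN):
    """Keep the text before the first end-of-text marker.

    Trailing whitespace is removed; leading indentation is preserved so
    that a generated function body stays valid Python.
    """
    head, _, _ = text.partition(eos_token)
    return head.rstrip()


raw = "    return 'Hello, ' + name\n<|endoftext|>unrelated text"
print(clean_completion(raw))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is an alternative that drops special tokens, but it does not cut off text generated after the marker.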

	### Limitations and Biases

	The model is intended for research use only and comes with no guarantee of the quality of the generated code.


	## Eval results

	Coming soon...