Do you have a training loss log that can be used for reference?

#51
by junliu44 - opened

I want to run a training test as a benchmark.

This is the training loss of StarCoderBase, so it represents the first 1T tokens. The extra 35B tokens of fine-tuning StarCoderBase on Python to get StarCoder aren't included:
[image: StarCoderBase training loss curve]

junliu44 changed discussion status to closed

@loubnabnl thank you very much!

BTW, I have another question about a detail:
For the training dataset, does StarCoder use document-level sampling, or sampling based on context-length segmentation?

e.g. | code file example 1 | code file example 2 | ... | vs. | ctx_length code snippet 1 | ctx_length code snippet 2 | ... |

junliu44 changed discussion status to open
BigCode org

We do sequence packing: tokenized documents are concatenated and separated by an eos_token, then the stream is split into 8192-token sequences that we sample from.
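
For anyone wanting to reproduce this, here is a minimal sketch of what that packing step looks like. The function name, the `seq_length` default, and the toy token IDs are illustrative assumptions, not the actual BigCode training code:

```python
def pack_sequences(tokenized_docs, eos_token_id, seq_length=8192):
    """Concatenate tokenized documents separated by eos_token,
    then split the stream into fixed-length training sequences."""
    # Build one long token stream: doc1 <eos> doc2 <eos> ...
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_token_id)
    # Chunk the stream into seq_length-sized examples (remainder dropped here)
    n_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]

# Example: three tiny "documents" packed into sequences of length 8
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_sequences(docs, eos_token_id=0, seq_length=8))
# [[1, 2, 3, 0, 4, 5, 0, 6]]
```

So training examples can contain parts of several files (or a slice of one long file), rather than one file per example.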

loubnabnl changed discussion status to closed
