Do you have a training loss log that can be used for reference?

#51
by junliu44 - opened

I want to run a training test as a benchmark.

This is the training loss of StarCoderBase, so it represents the first 1T tokens. The extra 35B tokens of fine-tuning StarCoderBase on Python to get StarCoder aren't included:
[image: StarCoderBase training loss curve]

junliu44 changed discussion status to closed

@loubnabnl thank you very much!

BTW, I have another question about a detail:
For the training dataset, does StarCoder use document-level sampling, or sampling based on context-length segmentation?

e.g. | code file example 1 | code file example 2 | ... | vs. | ctx_length code snippet 1 | ctx_length code snippet 2 | ... |

junliu44 changed discussion status to open
BigCode org

We do sequence packing: tokenized documents are concatenated and separated by an eos_token, then the stream is split into 8192-token sequences that we sample from.
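
For anyone wanting to reproduce this, here is a minimal sketch of what that packing step looks like. The function name, the `seq_length` default, and the toy token IDs are illustrative assumptions, not the actual BigCode training code:

```python
def pack_sequences(tokenized_docs, eos_token_id, seq_length=8192):
    """Concatenate tokenized documents separated by eos_token,
    then split the stream into fixed-length training sequences."""
    # Build one long token stream: doc1 <eos> doc2 <eos> ...
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_token_id)
    # Chunk the stream into seq_length-sized examples (remainder dropped here)
    n_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]

# Example: three tiny "documents" packed into sequences of length 8
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_sequences(docs, eos_token_id=0, seq_length=8))
# [[1, 2, 3, 0, 4, 5, 0, 6]]
```

So training examples can contain parts of several files (or a slice of one long file), rather than one file per example.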

loubnabnl changed discussion status to closed
