CodeCarbon wasn't ready until the training was over so we only did an additional 10h run to measure with and then we can extrapolate to the whole training.
This set of records captures the startup time and 2499 iterations in 2 records per gpu, since there was also an intermediary checkpoint saved half-way and we flush the CC records on each checkpoint saving.
The training had 168000 iterations. Therefore multiply the reported data by 67. This would be quite approximate since we were using 16 nodes when doing the ramp up, then 64 and only the last 3 weeks 128 nodes.
Caveat emptor: I'm not sure whether CC-reports overlap since each report is per gpu and I think they may be measuring the same thing, other than the gpu itself. So this requires research.
Each csv file contains a report for a single gpu.
- Downloads last month
Unable to determine this model’s pipeline type. Check the docs .