I've found the GRPO implementation in TRL to be very memory-hungry. There are already several alternative implementations out there that appear much faster and more lightweight; Unsloth is advertising a factor of 10 less memory usage, which is remarkable. Can we expect something similar for the TRL implementation in the near future?
I combined the RL gym library with GRPO here to see whether you can teach a small model to drive a taxi. This already took around 70 GB of GPU memory for a 1.5B model.
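For context on how env rewards feed into GRPO: the trainer doesn't use raw returns directly, but normalizes each completion's reward against its sampled group. A minimal sketch of that group-relative advantage step (a hypothetical helper, not TRL's actual API; the exact normalization details in TRL may differ):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of rollouts:
    A_i = (r_i - mean(r)) / std(r).

    Hypothetical illustration of the idea, not TRL's implementation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; choice of std is an assumption
    if std == 0:
        # all rollouts scored the same -> no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# e.g. total episode returns from a group of Taxi rollouts (made-up numbers)
returns = [-200.0, -50.0, 8.0, 20.0]
advs = group_relative_advantages(returns)
```

The memory cost comes less from this step than from holding the group of sampled completions (and their logits) in memory at once, which is where the alternative implementations save.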
BTW: the RL gym library could potentially be helpful for building new/better reasoning models (and new benchmarks)?