Can this model be improved through RLAIF?

#1
by FlameF0X - opened

First of all, congratulations on making this model. It sounds interesting to play with.

Second of all, couldn't it be used as a reward model when fine-tuning an LM via RL? Like, take this model and try to fine-tune it using RLAIF, where the feedback comes from the same model. Maybe that could keep the model sounding and feeling like Fable while making it smarter?

For now I'm just hypothesizing, since I have no idea how long it would take or how expensive it would be.

(Sorry for my bad English)

FlameF0X changed discussion title from Can this model be user for RLAIF? to Can this model be improved through RLAIF?
TeichAI org

No worries! I was actually planning on using the Fable 5 data to do SDFT (Self-Distillation Fine Tuning) which is a very similar concept. I don't really know what RLAIF is, but I'm currently doing a test run with the 9B to find some good parameters before committing to a more expensive run of SDFT.

More info on SDFT: https://arxiv.org/pdf/2601.19897

I do think certain aspects of it can be used for reward, maybe not for code quality, but for implementation planning and things like that definitely.

RLAIF is RL from AI Feedback (https://arxiv.org/abs/2309.00267). It's like RLHF, but an AI provides the feedback instead of human annotators.
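
To make the idea a bit more concrete, here is a minimal sketch of the AI-feedback labeling step: sample a few responses per prompt, let a judge model score them, and keep the best and worst as a preference pair. Everything here is a toy assumption for illustration: `toy_generate` and `toy_judge` are hypothetical stand-ins for the policy model and the AI judge (the judge just prefers longer answers), and `build_preference_pairs` is a name I made up, not something from the paper or any library.

```python
from typing import Callable, Dict, List


def toy_generate(prompt: str, n: int = 2) -> List[str]:
    """Hypothetical stand-in for sampling n responses from the policy model."""
    return [f"{prompt} -> draft {i}" + " with more detail" * i for i in range(n)]


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for the AI judge; this toy version prefers longer answers."""
    return float(len(response))


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], List[str]],
    judge: Callable[[str, str], float],
) -> List[Dict[str, str]]:
    """Label preferences with an AI judge instead of human annotators."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt)
        # Score each candidate once, then sort from best to worst.
        scored = sorted(
            ((judge(prompt, c), c) for c in candidates),
            key=lambda item: item[0],
            reverse=True,
        )
        # Skip prompts where the judge cannot separate best from worst.
        if len(scored) >= 2 and scored[0][0] > scored[-1][0]:
            pairs.append(
                {"prompt": prompt, "chosen": scored[0][1], "rejected": scored[-1][1]}
            )
    return pairs


pairs = build_preference_pairs(["Tell a short fable"], toy_generate, toy_judge)
print(pairs[0]["chosen"])
```

Pairs like these could then be used to train a reward model or fed to a preference-optimization method. In the self-feedback setup asked about above, the same model would play both the generator and the judge.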

Also, I had never heard of SDFT, but it sounds neat. I'll definitely check it out.

TeichAI org

And I will do the same for RLAIF! Cheers :)

Any type of RL is compute-expensive, far more expensive than SFT, but RL is how models can really improve themselves. Reinforcement learning's only limit is the amount of money and time you have.

I know that money is a constraint for RL. I was just suggesting it for the far future.
