Multimodal R1

#10
by salma-remyx - opened

"... we don’t want to stop at math datasets."

Right on, I'm experimenting with VLM fine-tunes using R1 distillations as the base LLM to see if its CoT reasoning can improve spatial reasoning.

This synthetic dataset uses a pipeline of models to infer distances and spatial relationships in a scene: https://huggingface.co/datasets/remyxai/OpenSpaces

Each image sample includes 5 QA pairs sampled from 40 templates.
Can the model learn to use the relationships between different objects in a scene to reason about the best answer to the question?

User: What is the distance between the lamp and the chair?

Assistant: Let me solve this step by step.
<think>
The height of the lamp is X.
The sofa is to the left of the painting.
...
</think>
<answer>5.3 meters</answer>

Thoughts from the community about restructuring the dataset samples to use the context of 4 QA pairs to reason about the last one?

Here, I generate R1-style reasoning with tags by first resolving the set of facts from the remaining QA pairs, then using another model to rephrase that information into a justification for the provided answer:

image.png

image.png
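
A minimal sketch of the restructuring idea above, in case it helps the discussion. It assumes each image sample is a plain list of (question, answer) string pairs, which is not necessarily how OpenSpaces stores them. The function name `build_r1_sample` and the example QA pairs are hypothetical, and the values are made up for illustration. The rephrasing step done by another model is replaced here with a simple placeholder that restates the context answers as facts.

```python
def build_r1_sample(qa_pairs, target_index=-1):
    """Hold out one QA pair as the target and use the rest as context facts.

    qa_pairs: list of (question, answer) string tuples (assumed format).
    Returns a dict with a prompt and an R1-style tagged completion.
    """
    if len(qa_pairs) < 2:
        raise ValueError("need at least one context pair and one target pair")

    pairs = list(qa_pairs)
    target_question, target_answer = pairs.pop(target_index)

    # Placeholder for the rephrasing model: list each context answer as a fact.
    facts = "\n".join(answer for _, answer in pairs)

    completion = (
        f"<think>\n{facts}\n</think>\n"
        f"<answer>{target_answer}</answer>"
    )
    return {"prompt": target_question, "completion": completion}


# Hypothetical QA pairs for one image; the values are invented.
example_pairs = [
    ("How tall is the lamp?", "The lamp is 1.5 meters tall."),
    ("Is the sofa left of the painting?", "The sofa is to the left of the painting."),
    ("How wide is the chair?", "The chair is 0.8 meters wide."),
    ("Is the lamp behind the sofa?", "The lamp is behind the sofa."),
    ("What is the distance between the lamp and the chair?", "5.3 meters"),
]

sample = build_r1_sample(example_pairs)
print(sample["prompt"])
print(sample["completion"])
```

In a real pipeline the placeholder `facts` string would be swapped for the rephrased justification, so the `<think>` block reads as reasoning toward the answer rather than a bare list of facts.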
