
Idefics2-pretraining

#54 opened by orrzohar

Hi,
There does not seem to be any support for pre-training.
When I try it myself, I see some instability in the connector. How did you initialize your weights?

HuggingFaceM4 org

Hi @orrzohar
Can you say more about the instability you are seeing?
Our initialization scheme for newly initialized parameters is rather standard; the code snippet below should give you a good idea:

        # Newly initialized connector MLP linears: std = sqrt(0.4 / (hidden_size * factor)),
        # with factor = 2.0 (i.e. a smaller std) for the down projection.
        if isinstance(module, MLP):
            for sub_module_name, sub_module in module.named_modules():
                if isinstance(sub_module, nn.Linear):
                    factor = 1.0
                    if "down_proj" in sub_module_name:
                        factor = 2.0
                    init_a_linear(sub_module, std=(0.4 / (self.config.hidden_size * factor)) ** 0.5)
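
init_a_linear is an internal helper that is not shown above; a minimal stand-in, assuming it boils down to a plain normal init with the given std and a zeroed bias, could look roughly like this:

    import torch.nn as nn

    def init_a_linear(module: nn.Linear, std: float) -> None:
        # Hypothetical stand-in for the internal helper: draw the weight from a
        # normal distribution with the given std and zero the bias.
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)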

Hi Victor,
Thank you for your response!

What I am seeing is that the loss initially decreases, but then NaNs are detected after the "connector" (MLP + Perceiver pooler). I have tried xavier_uniform_/kaiming_uniform_ for all the connector weights, but was unsuccessful.

I have tried the obvious: varying the batch size (2 to 1000) and the learning rate (1e-3 to 1e-6).

It is extremely regular: it seems to happen at the same iteration for a given batch size, no matter the learning rate. The only time this does not occur is with a batch size of 1.
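
For reference, a simplified, generic sketch of the kind of forward-hook check that can localize the first non-finite activation (plain PyTorch, not the Idefics2 training code; "model" is a placeholder):

    import torch
    import torch.nn as nn

    def register_nan_hooks(model: nn.Module):
        # Attach a forward hook to every submodule and print the name of any
        # module whose output contains a NaN or Inf; the first print shows
        # where the problem starts.
        def make_hook(name):
            def hook(module, inputs, output):
                outs = output if isinstance(output, (tuple, list)) else (output,)
                for out in outs:
                    if torch.is_tensor(out) and not torch.isfinite(out).all():
                        print(f"non-finite output in: {name}")
            return hook
        return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

    # handles = register_nan_hooks(model)   # run a forward pass, then:
    # for h in handles: h.remove()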

Have you ever experienced anything similar, and how did you debug it?
Best,
Orr

HuggingFaceM4 org

Indeed, NaNs are never a good sign...
Before I answer, a few questions:

  • are you fine-tuning or training from scratch?
  • what data?
  • mixed precision? what precision?
  • is it specifically after the connector? any details as to where in the connector?

Hi @VictorSanh ,

  • I am training from scratch
  • LLaVA 1.5
  • BF16
  • It is usually in the MLP of the Idefics2PerceiverLayer, most often after "gate_proj" and only very rarely after "down_proj".
    I tried your initialization code, increasing the batch size to 4096, and reducing the learning rate to 1e-06, but with no luck. When investigating the issue further, I noticed that the 'latents' remain all-ones even after training persists for a few hundred iterations. I am sure that the parameters are added to the optimizer. I tried randomly initializing them instead, but that did not solve the issue.
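
For completeness, a rough sketch of the check described above; the attribute path to the latents is an assumption about the Idefics2 module layout and may need adjusting for your version of the code:

    import torch

    # Assumed path to the Perceiver latents; the exact nesting may differ
    # between versions, so adjust it if the lookup fails.
    LATENTS_PATH = "model.connector.perceiver_resampler.latents"

    def snapshot_latents(model: torch.nn.Module) -> torch.Tensor:
        # Detached copy of the latents, for comparison after an optimizer step.
        return model.get_parameter(LATENTS_PATH).detach().clone()

    # Usage sketch around a single training step:
    # before = snapshot_latents(model)
    # loss = model(**batch).loss; loss.backward()
    # latents = model.get_parameter(LATENTS_PATH)
    # print("grad:", None if latents.grad is None else latents.grad.norm().item())
    # optimizer.step(); optimizer.zero_grad()
    # print("latents changed:", not torch.equal(before, snapshot_latents(model)))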

Best,
Orr
