
Idefics2-pretraining

#54 opened by orrzohar

Hi,
There does not seem to be any support for pre-training.
When I try it myself, I see some instability in the connector. How did you initialize your weights?

HuggingFaceM4 org

Hi @orrzohar
Can you say more about the instability you are seeing?
Our initialization scheme for newly initialized parameters is rather standard; the code snippet below should give you a good idea:

        # Newly initialized connector MLP linears: std = sqrt(0.4 / (hidden_size * factor)),
        # with factor = 2.0 (i.e. a smaller std) for the down projection.
        if isinstance(module, MLP):
            for sub_module_name, sub_module in module.named_modules():
                if isinstance(sub_module, nn.Linear):
                    factor = 1.0
                    if "down_proj" in sub_module_name:
                        factor = 2.0
                    init_a_linear(sub_module, std=(0.4 / (self.config.hidden_size * factor)) ** 0.5)
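
init_a_linear is an internal helper that is not shown above; a minimal stand-in, assuming it boils down to a plain normal init with the given std and a zeroed bias, could look roughly like this:

    import torch.nn as nn

    def init_a_linear(module: nn.Linear, std: float) -> None:
        # Hypothetical stand-in for the internal helper: draw the weight from a
        # normal distribution with the given std and zero the bias.
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)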

Hi Victor,
Thank you for your response!

What I am seeing is that the loss initially decreases, but then NaNs are detected after the "connector" (MLP + Perceiver pooler). I have tried xavier_uniform_/kaiming_uniform_ for all the connector weights, but was unsuccessful.

I have tried the obvious: varying the batch size (2 to 1000) and the learning rate (1e-3 to 1e-6).

It is extremely regular: it seems to happen at the same iteration for a given batch size, no matter the learning rate. The only time this does not occur is with a batch size of 1.
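
For reference, a simplified, generic sketch of the kind of forward-hook check that can localize the first non-finite activation (plain PyTorch, not the Idefics2 training code; "model" is a placeholder):

    import torch
    import torch.nn as nn

    def register_nan_hooks(model: nn.Module):
        # Attach a forward hook to every submodule and print the name of any
        # module whose output contains a NaN or Inf; the first print shows
        # where the problem starts.
        def make_hook(name):
            def hook(module, inputs, output):
                outs = output if isinstance(output, (tuple, list)) else (output,)
                for out in outs:
                    if torch.is_tensor(out) and not torch.isfinite(out).all():
                        print(f"non-finite output in: {name}")
            return hook
        return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

    # handles = register_nan_hooks(model)   # run a forward pass, then:
    # for h in handles: h.remove()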

Have you ever experienced anything similar, and how did you debug it?
Best,
Orr

HuggingFaceM4 org

Indeed, NaNs are never a good sign...
Before I answer, a few questions:

  • are you fine-tuning or training from scratch?
  • what data?
  • mixed precision? what precision?
  • is it specifically after the connector? any details as to where in the connector?

Hi @VictorSanh ,

  • I am training from scratch
  • LLaVA 1.5
  • BF16
  • It is usually in the MLP of the Idefics2PerceiverLayer, most often after "gate_proj" and only very rarely after "down_proj".
    I tried your initialization code, increasing the batch size to 4096, and reducing the learning rate to 1e-06, but with no luck. When investigating the issue further, I noticed that the 'latents' remain all-ones even after training persists for a few hundred iterations. I am sure that the parameters are added to the optimizer. I tried randomly initializing them instead, but that did not solve the issue.
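
For completeness, a rough sketch of the check described above; the attribute path to the latents is an assumption about the Idefics2 module layout and may need adjusting for your version of the code:

    import torch

    # Assumed path to the Perceiver latents; the exact nesting may differ
    # between versions, so adjust it if the lookup fails.
    LATENTS_PATH = "model.connector.perceiver_resampler.latents"

    def snapshot_latents(model: torch.nn.Module) -> torch.Tensor:
        # Detached copy of the latents, for comparison after an optimizer step.
        return model.get_parameter(LATENTS_PATH).detach().clone()

    # Usage sketch around a single training step:
    # before = snapshot_latents(model)
    # loss = model(**batch).loss; loss.backward()
    # latents = model.get_parameter(LATENTS_PATH)
    # print("grad:", None if latents.grad is None else latents.grad.norm().item())
    # optimizer.step(); optimizer.zero_grad()
    # print("latents changed:", not torch.equal(before, snapshot_latents(model)))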

Best,
Orr
