This is the Chameleon-7b checkpoint, converted using the script convert_chameleon_weights_to_hf.py from the Lumina-mGPT repository.
This release is intended to ease the initialization of Lumina-mGPT training. Before using this model, please ensure you have obtained permission to access the official Chameleon checkpoints available at Hugging Face. Usage of this model is at the user's own risk.
Differences from the official chameleon-7B release
This model is almost the same as the official chameleon-7B release, with one important difference in the qk-norm implementation:
For reasons that remain unknown, the qk-norm weights of the 34B Chameleon model, which was trained with 8-way model parallelism, differ across model-parallel ranks even though they are expected to be identical (see here for details). Put more intuitively, the attention heads fall into 1 group for the 7B model and 8 groups for the 34B model; the qk-norm parameters are identical within a group but differ between groups. To handle this, the transformers implementation copies the qk-norm parameters out to shape num_heads * head_dim. However, if the Chameleon model is then finetuned further, as in Lumina-mGPT, the copied parameters are free to drift apart until every attention head has its own values, which is not ideal. To solve this problem, we slightly change the implementation so that the qk-norm parameters instead have shape model_parallel_size x head_dim, where model_parallel_size is 1 for the 7B model and 8 for the 34B model, and are expanded to num_heads * head_dim at forward time through repeat_interleave. This modification ensures that the qk-norm parameters always stay consistent within the existing groups.
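The grouped parameterization described above can be sketched in PyTorch as follows. This is a minimal illustration, not the code shipped with this checkpoint: the class name GroupedQKNorm, its constructor arguments, and the use of a layer norm with both weight and bias over head_dim are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQKNorm(nn.Module):
    """Hypothetical sketch: layer norm over head_dim with one set of affine
    parameters per model-parallel group instead of one per attention head."""

    def __init__(self, num_heads, head_dim, model_parallel_size, eps=1e-5):
        super().__init__()
        assert num_heads % model_parallel_size == 0
        self.repeats = num_heads // model_parallel_size  # heads per group
        self.head_dim = head_dim
        self.eps = eps
        # Stored shape: (model_parallel_size, head_dim), e.g. 1 group for 7B, 8 for 34B.
        self.weight = nn.Parameter(torch.ones(model_parallel_size, head_dim))
        self.bias = nn.Parameter(torch.zeros(model_parallel_size, head_dim))

    def forward(self, x):
        # x: (..., num_heads, head_dim); normalize each head without affine params.
        x = F.layer_norm(x, (self.head_dim,), eps=self.eps)
        # Expand to (num_heads, head_dim): consecutive heads share a group's parameters.
        w = self.weight.repeat_interleave(self.repeats, dim=0)
        b = self.bias.repeat_interleave(self.repeats, dim=0)
        return x * w + b


# Toy sizes chosen only for illustration.
norm = GroupedQKNorm(num_heads=8, head_dim=4, model_parallel_size=2)
x = torch.randn(3, 5, 8, 4)  # (batch, seq_len, num_heads, head_dim)
y = norm(x)
y.sum().backward()
print(y.shape, norm.weight.grad.shape)
```

Because the expansion happens inside forward, gradients from all heads in a group accumulate into the same stored row, so heads within a group cannot diverge during finetuning.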