MACE-MH-1 fine-tuning fails during remove_pt_head with ScaleShiftMACE size mismatch for both omat_pbe and omol heads

#4
by jin309 - opened

Description

I am trying to fine-tune the mace-mh-1 foundation model on my own CP2K-labeled PA66 dataset using the recommended foundation_model + foundation_head workflow.

Following the MACE-MH-1 best-practice recommendation, I first tried:

--foundation_model="/data/home/zhongjin/data_source/model/mace-mh-1.model"
--foundation_head="omat_pbe"

I also tried foundation_head="omol" as a diagnostic test, but the same error occurs.

The training fails before entering the actual training loop, during remove_pt_head(...), when MACE tries to extract the selected head from the multi-head MH-1 model. The error is a state_dict size mismatch in ScaleShiftMACE.

This appears similar to the issue reported in the HuggingFace MACE-MH-1 discussion about fine-tuning size mismatch:
https://huggingface.co/mace-foundations/mace-mh-1/discussions/1

I updated MACE by cloning the latest main branch from GitHub and installing it in editable mode, but the same error persists.

Environment

Cluster GPU node:

GPU: Tesla V100-SXM2-32GB
NVIDIA driver: 545.23.06
CUDA version shown by nvidia-smi: 12.3

Python environment used for the latest test:

python: /data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python
MACE file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
MACE version: 0.3.16
ASE: 3.28.0
PyTorch: tested with CUDA-compatible PyTorch after fixing an earlier cu130 / driver incompatibility

The MACE main branch was installed from source:

git clone https://github.com/ACEsuit/mace.git /data/home/zhongjin/data_source/software/mace_main_mh1
cd /data/home/zhongjin/data_source/software/mace_main_mh1
pip install -e .

I verified that the SLURM job was loading the new source installation:

mace file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py

Foundation model

Local MH-1 model path:

/data/home/zhongjin/data_source/model/mace-mh-1.model

The model can be loaded and inspected. It appears to contain the expected heads:

model class: ScaleShiftMACE
heads: ['matpes_r2scan', 'mp_pbe_refit_add', 'spice_wB97M', 'oc20_usemppbe', 'omol', 'omat_pbe']
r_max: tensor(6., dtype=torch.float64)
num_interactions: 2

Dataset

My training data are CP2K-labeled PA66 configurations in extxyz format.

Dataset files:

/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/train.extxyz
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/valid.extxyz
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/test.extxyz

The first frame has:

symbols: ['C', 'H', 'N', 'O']
energy key in atoms.info: REF_energy
forces key in atoms.arrays: REF_forces

The dataset key sanity check passes:

has energy key REF_energy: True
has forces key REF_forces: True
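
A check like the one above can also be approximated without ASE or MACE by reading the comment line of the first frame. The sketch below is a simplified standard-library parser, not a substitute for ase.io.read: the helper name first_frame_keys and the two-atom sample frame are hypothetical, and it assumes the usual extxyz layout (atom count on the first line, key=value metadata with a Properties= field on the second).

```python
import re

def first_frame_keys(text):
    """Return (info_keys, array_keys) parsed from the first frame's comment line."""
    lines = text.splitlines()
    comment = lines[1]  # line 0 is the atom count, line 1 holds key=value metadata
    info_keys, array_keys = set(), set()
    for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', comment):
        if key == "Properties":
            # Properties is a flat sequence of name:type:ncols triples.
            array_keys.update(value.strip('"').split(":")[0::3])
        else:
            info_keys.add(key)
    return info_keys, array_keys

# Hypothetical two-atom frame, for illustration only (not taken from the dataset).
sample = "\n".join([
    "2",
    'Lattice="10 0 0 0 10 0 0 0 10" '
    "Properties=species:S:1:pos:R:3:REF_forces:R:3 REF_energy=-1.5",
    "C 0.0 0.0 0.0 0.1 0.0 0.0",
    "H 1.1 0.0 0.0 -0.1 0.0 0.0",
])

info_keys, array_keys = first_frame_keys(sample)
print("has energy key REF_energy:", "REF_energy" in info_keys)
print("has forces key REF_forces:", "REF_forces" in array_keys)
```

For a real file, passing the first few lines of train.extxyz as text would be enough, since only the second line is inspected.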

For debugging, I used a smoke-test subset:

smoke_train.extxyz: 40 frames
smoke_valid.extxyz: 10 frames

Command used

The main command is:

python -m mace.cli.run_train \
    --name="mh1_omat_pbe_pa66_smoke_mainenv" \
    --foundation_model="/data/home/zhongjin/data_source/model/mace-mh-1.model" \
    --foundation_head="omat_pbe" \
    --train_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_train.extxyz" \
    --valid_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz" \
    --test_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz" \
    --energy_key="REF_energy" \
    --forces_key="REF_forces" \
    --energy_weight=1.0 \
    --forces_weight=100.0 \
    --E0s="foundation" \
    --lr=0.0005 \
    --weight_decay=0.0 \
    --ema \
    --ema_decay=0.995 \
    --amsgrad \
    --clip_grad=10.0 \
    --batch_size=1 \
    --valid_batch_size=1 \
    --max_num_epochs=1 \
    --patience=8 \
    --default_dtype=float64 \
    --device=cuda \
    --results_dir="/data/home/zhongjin/data_source/mh1_finetune/results_mh1_omat_pbe_pa66_smoke_mainenv" \
    --seed=3 \
    --multiheads_finetuning=False

I also tried the same setup with:

--foundation_head="omol"

The same type of error occurs.
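
To make sure both head tests differ only in the selected head, the command can be generated from one template. This is a minimal sketch using only the flags shown in the command above; the helper name build_smoke_command is hypothetical, and the sketch only prints the commands rather than launching them.

```python
import shlex

BASE = "/data/home/zhongjin/data_source"

def build_smoke_command(head):
    """Assemble the run_train argument list from this post for one foundation head."""
    return [
        "python", "-m", "mace.cli.run_train",
        f"--name=mh1_{head}_pa66_smoke_mainenv",
        f"--foundation_model={BASE}/model/mace-mh-1.model",
        f"--foundation_head={head}",
        f"--train_file={BASE}/mh1_finetune/smoke_train.extxyz",
        f"--valid_file={BASE}/mh1_finetune/smoke_valid.extxyz",
        f"--test_file={BASE}/mh1_finetune/smoke_valid.extxyz",
        "--energy_key=REF_energy",
        "--forces_key=REF_forces",
        "--energy_weight=1.0",
        "--forces_weight=100.0",
        "--E0s=foundation",
        "--lr=0.0005",
        "--weight_decay=0.0",
        "--ema",
        "--ema_decay=0.995",
        "--amsgrad",
        "--clip_grad=10.0",
        "--batch_size=1",
        "--valid_batch_size=1",
        "--max_num_epochs=1",
        "--patience=8",
        "--default_dtype=float64",
        "--device=cuda",
        f"--results_dir={BASE}/mh1_finetune/results_mh1_{head}_pa66_smoke_mainenv",
        "--seed=3",
        "--multiheads_finetuning=False",
    ]

commands = {head: build_smoke_command(head) for head in ("omat_pbe", "omol")}
for head, argv in commands.items():
    # Print a copy-pasteable shell line; subprocess.run(argv) could launch it instead.
    print(shlex.join(argv))
```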

Actual output

The run reaches:

Using foundation model /data/home/zhongjin/data_source/model/mace-mh-1.model as initial checkpoint.
Selecting the head omat_pbe as foundation head.

Then it fails inside remove_pt_head(...):

Traceback (most recent call last):
  File "/data/home/zhongjin/data_source/software/mace_main_mh1/mace/cli/run_train.py", line 219, in run
    model_foundation = remove_pt_head(
  File "/data/home/zhongjin/data_source/software/mace_main_mh1/mace/tools/scripts_utils.py", line 494, in remove_pt_head
    new_model.load_state_dict(new_state_dict, strict=False)
  File "/data/home/zhongjin/.conda/envs/mace_mh1_main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
    size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
    size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
    size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
    size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
    size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
    size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).

This happens before training starts, so changing learning rate, batch size, E0 mode, or LoRA settings does not seem relevant.
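
The mismatch lines can be summarised numerically. The sketch below parses the error text copied from the traceback above and computes, for each parameter, the ratio between the element count in the current model and the element count in the tensor being loaded; the variable and function names are illustrative only.

```python
import math
import re

# Mismatch lines copied verbatim from the traceback above.
ERROR_TEXT = """
size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).
"""

PATTERN = re.compile(
    r"size mismatch for (\S+): copying a param with shape torch\.Size\(\[([\d, ]+)\]\) "
    r"from checkpoint, the shape in current model is torch\.Size\(\[([\d, ]+)\]\)"
)

def numel(shape_text):
    """Number of elements for a shape written as '512, 64'."""
    return math.prod(int(dim) for dim in shape_text.split(","))

ratios = {
    name: numel(model_shape) / numel(loaded_shape)
    for name, loaded_shape, model_shape in PATTERN.findall(ERROR_TEXT)
}
for name, ratio in ratios.items():
    print(f"{name}: current model is x{ratio:g} the loaded tensor")
```

All seven reported parameters belong to interactions.0, and in each case the current model is exactly four times larger than the tensor being loaded. A uniform factor like this would be consistent with a single width-like setting differing between the extracted weights and the model that remove_pt_head rebuilds, but that is an inference from the numbers, not a confirmed diagnosis.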

What I tried

  1. Confirmed that --foundation_head is supported:

python -m mace.cli.run_train --help | grep foundation_head

Output:

[--foundation_head FOUNDATION_HEAD]
--foundation_head FOUNDATION_HEAD

  2. Confirmed that the dataset keys are correct:

REF_energy exists in atoms.info
REF_forces exists in atoms.arrays

  3. Tried foundation_head="omat_pbe".
  4. Tried foundation_head="omol".
  5. Installed MACE from the GitHub main branch using:

git clone https://github.com/ACEsuit/mace.git
pip install -e .

  6. Verified that the SLURM job uses the new MACE source path:

mace file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py

  7. Initially I also encountered a PyTorch CUDA issue because PyTorch had been installed as cu130, while the cluster driver only supports CUDA 12.x. I fixed that separately by using a CUDA-12-compatible PyTorch. After that, the only remaining error is the remove_pt_head / ScaleShiftMACE size mismatch.

Expected behavior

I expected MACE to extract the requested MH-1 head, e.g. omat_pbe, and start a one-epoch smoke fine-tuning run on my small CP2K-labeled extxyz dataset.

Question

Could you please clarify the correct way to fine-tune mace-mh-1 with a selected head?

Specifically:

  1. Is remove_pt_head(...) expected to work for the public mace-mh-1.model file when using:

--foundation_model="/path/to/mace-mh-1.model"
--foundation_head="omat_pbe"

  2. Should I use the model alias instead?

--foundation_model="mh-1"
--foundation_head="omat_pbe"

  3. Is there a required model file format or checkpoint version for MH-1 fine-tuning?
  4. Does this size mismatch indicate that the local mace-mh-1.model is incompatible with the current remove_pt_head implementation?
  5. Is there a recommended command for a minimal one-epoch smoke fine-tuning test of MH-1 on a custom extxyz dataset with energy_key=REF_energy and forces_key=REF_forces?
  6. If head extraction is not currently robust for this model, is there a recommended workaround, such as:
     - using foundation_model="mh-1" instead of a local path,
     - using multi-head finetuning directly without removing the pre-training heads,
     - using a specific commit,
     - or using a provided script to extract the desired head?

Thank you.
MACE foundation models org

Hey @jin309,

When you use the alias "mh-1" directly, does it work? Maybe you have an older version of the model file.

jin309 changed discussion status to closed
jin309 changed discussion status to open

Thanks. I tested the direct alias path as you suggested.

I used the latest main-branch MACE source installation and confirmed that the SLURM job was loading MACE from:

/data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
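
One way to perform this kind of check from inside the job is to ask Python where it would import the package from. This is a small sketch using only the standard library; the helper name module_origin is hypothetical, and it reports None if the package is not importable in the active environment.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# In the SLURM job this should print the path of the source checkout's mace/__init__.py.
print("mace file:", module_origin("mace"))
```

Running this with the same interpreter that launches run_train guards against a stale site-packages copy shadowing the editable install.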

The environment was:

Python: /data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python
PyTorch: 2.5.1+cu121
CUDA available: True
GPU: Tesla V100-SXM2-32GB
MACE version: 0.3.16, loaded from the main-branch source path above
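
The earlier cu130 problem from the first post can be screened for by comparing the CUDA tag of the PyTorch build with the CUDA version reported by nvidia-smi. The sketch below applies a rough rule of thumb, assuming that a build is usable when its CUDA major version does not exceed the driver's; it is not a complete compatibility check. The function names are hypothetical, and "0.0.0+cu130" is a placeholder version string, since the post does not state which cu130 release was installed.

```python
import re

def wheel_cuda_version(torch_version):
    """Parse the '+cuXYZ' local tag of a PyTorch version string into (major, minor)."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None  # CPU-only or otherwise untagged build
    digits = match.group(1)
    return int(digits[:-1]), int(digits[-1])

def likely_runs_on(wheel, driver):
    """Rule of thumb only: the build's CUDA major must not exceed the driver's."""
    return wheel[0] <= driver[0]

driver = (12, 3)  # "CUDA version shown by nvidia-smi" in the first post
for version in ("2.5.1+cu121", "0.0.0+cu130"):  # second entry is a placeholder
    wheel = wheel_cuda_version(version)
    print(version, "->", wheel, "likely usable:", likely_runs_on(wheel, driver))
```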

I then tested the alias path:

--foundation_model="mh-1"
--foundation_head="omat_pbe"

The alias calculator loader works successfully:

Using Materials Project MACE for MACECalculator with /data/home/zhongjin/.cache/mace/macemh1model
SUCCESS: mace_mp alias calculator loaded

However, the actual run_train fine-tuning path still fails during remove_pt_head, with the same ScaleShiftMACE size mismatch. The smoke-test command was:

/data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python -m mace.cli.run_train \
  --name=mh1_alias_omat_pbe_smoke \
  --foundation_model=mh-1 \
  --foundation_head=omat_pbe \
  --train_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_train.extxyz \
  --valid_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz \
  --test_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz \
  --energy_key=REF_energy \
  --forces_key=REF_forces \
  --energy_weight=1.0 \
  --forces_weight=100.0 \
  --E0s=foundation \
  --lr=0.0005 \
  --weight_decay=0.0 \
  --ema \
  --ema_decay=0.995 \
  --amsgrad \
  --clip_grad=10.0 \
  --batch_size=1 \
  --valid_batch_size=1 \
  --max_num_epochs=1 \
  --patience=8 \
  --default_dtype=float64 \
  --device=cuda \
  --results_dir=/data/home/zhongjin/data_source/mh1_finetune/results_mh1_alias_omat_pbe_smoke \
  --seed=3 \
  --multiheads_finetuning=False

The run reaches:

Using foundation model mace mh-1 as initial checkpoint.
Using head omat_pbe out of ['matpes_r2scan', 'mp_pbe_refit_add', 'spice_wB97M', 'oc20_usemppbe', 'omol', 'omat_pbe']
Selecting the head omat_pbe as foundation head.

Then it fails at:

File ".../mace/cli/run_train.py", line 219, in run
    model_foundation = remove_pt_head(...)

File ".../mace/tools/scripts_utils.py", line 494, in remove_pt_head
    new_model.load_state_dict(new_state_dict, strict=False)

RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
    size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
    size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
    size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
    size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
    size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
    size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).

So the current status is:

  1. Local model path fails.
  2. Direct alias "mh-1" also fails in run_train.
  3. The alias calculator loads successfully.
  4. The failure only occurs when run_train calls remove_pt_head.
  5. The failure happens before training starts, so it does not seem related to the dataset, learning rate, E0s, LoRA, or batch size.

Could you please advise whether there is a different recommended command for fine-tuning MH-1 with a selected head, or whether remove_pt_head currently requires additional architecture arguments for this MH-1 checkpoint?

Should I be using a different fine-tuning mode, for example --multiheads_finetuning=True, instead of the single-head extraction path?
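
For anyone trying to narrow this down, comparing parameter shapes directly may be more convenient than reading the RuntimeError. The sketch below is a plain-Python helper, with the hypothetical name shape_mismatches, that works on dictionaries mapping parameter names to shapes. With real PyTorch modules such dictionaries could be built as {k: tuple(v.shape) for k, v in model.state_dict().items()}. The example data reuse two shapes from the error above plus one made-up matching entry.

```python
def shape_mismatches(source_shapes, target_shapes):
    """List (name, source_shape, target_shape) for shared parameters whose shapes differ."""
    return [
        (name, shape, target_shapes[name])
        for name, shape in source_shapes.items()
        if name in target_shapes and target_shapes[name] != shape
    ]

# Shapes of the tensors being loaded (two taken from the error, one hypothetical).
extracted = {
    "interactions.0.linear_up.weight": (65536,),
    "interactions.0.conv_tp_weights.net.9.weight": (512, 64),
    "hypothetical.matching.weight": (128,),
}
# Shapes expected by the rebuilt single-head model.
rebuilt = {
    "interactions.0.linear_up.weight": (262144,),
    "interactions.0.conv_tp_weights.net.9.weight": (2048, 64),
    "hypothetical.matching.weight": (128,),
}

mismatches = shape_mismatches(extracted, rebuilt)
for name, src, dst in mismatches:
    print(f"{name}: loaded {src} vs model {dst}")
```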

jin309 changed discussion status to closed
jin309 changed discussion status to open
