MACE-MH-1 fine-tuning fails during remove_pt_head with ScaleShiftMACE size mismatch for both omat_pbe and omol heads
Description
I am trying to fine-tune the mace-mh-1 foundation model on my own CP2K-labeled PA66 dataset using the recommended foundation_model + foundation_head workflow.
Following the MACE-MH-1 best-practice recommendation, I first tried:
--foundation_model="/data/home/zhongjin/data_source/model/mace-mh-1.model"
--foundation_head="omat_pbe"
I also tried foundation_head="omol" as a diagnostic test, but the same error occurs.
The training fails before entering the actual training loop, during remove_pt_head(...), when MACE tries to extract the selected head from the multi-head MH-1 model. The error is a state_dict size mismatch in ScaleShiftMACE.
This appears similar to the issue reported in the HuggingFace MACE-MH-1 discussion about fine-tuning size mismatch:
https://huggingface.co/mace-foundations/mace-mh-1/discussions/1
I updated MACE by cloning the latest main branch from GitHub and installing it in editable mode, but the same error persists.
Environment
Cluster GPU node:
GPU: Tesla V100-SXM2-32GB
NVIDIA driver: 545.23.06
CUDA version shown by nvidia-smi: 12.3
Python environment used for the latest test:
python: /data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python
MACE file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
MACE version: 0.3.16
ASE: 3.28.0
PyTorch: tested with CUDA-compatible PyTorch after fixing an earlier cu130 / driver incompatibility
The MACE main branch was installed from source:
git clone https://github.com/ACEsuit/mace.git /data/home/zhongjin/data_source/software/mace_main_mh1
cd /data/home/zhongjin/data_source/software/mace_main_mh1
pip install -e .
I verified that the SLURM job was loading the new source installation:
mace file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
Foundation model
Local MH-1 model path:
/data/home/zhongjin/data_source/model/mace-mh-1.model
The model can be loaded and inspected. It appears to contain the expected heads:
model class: ScaleShiftMACE
heads: ['matpes_r2scan', 'mp_pbe_refit_add', 'spice_wB97M', 'oc20_usemppbe', 'omol', 'omat_pbe']
r_max: tensor(6., dtype=torch.float64)
num_interactions: 2
Dataset
My training data are CP2K-labeled PA66 configurations in extxyz format.
Dataset files:
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/train.extxyz
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/valid.extxyz
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/test.extxyz
The first frame has:
symbols: ['C', 'H', 'N', 'O']
energy key in atoms.info: REF_energy
forces key in atoms.arrays: REF_forces
The dataset key sanity check passes:
has energy key REF_energy: True
has forces key REF_forces: True
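For reference, the key check above can be approximated without loading the whole file by reading only the per-frame comment line. The sketch below is a simplified illustration, not the script I actually ran: the helper name `frame_header_keys` and the example header are hypothetical, and it assumes an ASE-style extxyz header with `key=value` pairs and a `Properties=` field.

```python
import re


def frame_header_keys(comment_line):
    """Return (info_keys, array_keys) parsed from one extxyz comment line."""
    # Per-atom arrays are listed in Properties as name:type:ncols triples.
    props = re.search(r'Properties="?([^"\s]+)"?', comment_line)
    fields = props.group(1).split(":") if props else []
    array_keys = set(fields[0::3])
    # Every other key=value pair is a per-frame (info) entry.
    info_keys = {k for k in re.findall(r"(\w+)=", comment_line) if k != "Properties"}
    return info_keys, array_keys


# Hypothetical header in the same layout as my frames (values are made up).
example_header = (
    'Lattice="10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0" '
    "Properties=species:S:1:pos:R:3:REF_forces:R:3 "
    'REF_energy=-123.4 pbc="T T T"'
)
info_keys, array_keys = frame_header_keys(example_header)
print("has energy key REF_energy:", "REF_energy" in info_keys)
print("has forces key REF_forces:", "REF_forces" in array_keys)
```

This only inspects header text; it does not validate array shapes or values.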
For debugging, I used a smoke-test subset:
smoke_train.extxyz: 40 frames
smoke_valid.extxyz: 10 frames
Command used
The main command is:
python -m mace.cli.run_train \
--name="mh1_omat_pbe_pa66_smoke_mainenv" \
--foundation_model="/data/home/zhongjin/data_source/model/mace-mh-1.model" \
--foundation_head="omat_pbe" \
--train_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_train.extxyz" \
--valid_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz" \
--test_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz" \
--energy_key="REF_energy" \
--forces_key="REF_forces" \
--energy_weight=1.0 \
--forces_weight=100.0 \
--E0s="foundation" \
--lr=0.0005 \
--weight_decay=0.0 \
--ema \
--ema_decay=0.995 \
--amsgrad \
--clip_grad=10.0 \
--batch_size=1 \
--valid_batch_size=1 \
--max_num_epochs=1 \
--patience=8 \
--default_dtype=float64 \
--device=cuda \
--results_dir="/data/home/zhongjin/data_source/mh1_finetune/results_mh1_omat_pbe_pa66_smoke_mainenv" \
--seed=3 \
--multiheads_finetuning=False
I also tried the same setup with:
--foundation_head="omol"
The same type of error occurs.
Actual output
The run reaches:
Using foundation model /data/home/zhongjin/data_source/model/mace-mh-1.model as initial checkpoint.
Selecting the head omat_pbe as foundation head.
Then it fails inside remove_pt_head(...):
Traceback (most recent call last):
File "/data/home/zhongjin/data_source/software/mace_main_mh1/mace/cli/run_train.py", line 219, in run
model_foundation = remove_pt_head(
File "/data/home/zhongjin/data_source/software/mace_main_mh1/mace/tools/scripts_utils.py", line 494, in remove_pt_head
new_model.load_state_dict(new_state_dict, strict=False)
File "/data/home/zhongjin/.conda/envs/mace_mh1_main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).
This happens before training starts, so changing learning rate, batch size, E0 mode, or LoRA settings does not seem relevant.
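One pattern in the traceback may help narrow this down. The sketch below parses the mismatch lines shown above and compares the element count of each tensor in the rebuilt model with the one being copied from the checkpoint. The helper names `numel` and `mismatch_ratios` are my own, not part of MACE.

```python
import re

# Mismatch lines copied verbatim from the traceback above.
LOG = """
size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).
"""

PATTERN = re.compile(
    r"size mismatch for (\S+): copying a param with shape "
    r"torch\.Size\(\[([\d, ]+)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[([\d, ]+)\]\)"
)


def numel(shape_text):
    """Number of elements for a shape written like '512, 64'."""
    count = 1
    for dim in shape_text.split(","):
        count *= int(dim)
    return count


def mismatch_ratios(log):
    """Map parameter name to (current model size) / (checkpoint size)."""
    return {
        name: numel(current) / numel(checkpoint)
        for name, checkpoint, current in PATTERN.findall(log)
    }


ratios = mismatch_ratios(LOG)
for name, ratio in ratios.items():
    print(f"{name}: current/checkpoint = {ratio:g}")
```

For all seven lines shown, the tensor in the rebuilt model is exactly 4 times larger than the one being copied in. That consistency might point to a systematic width mismatch in how the new model is constructed rather than a corrupted model file, but I have not confirmed this.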
What I tried
Confirmed that --foundation_head is supported:
python -m mace.cli.run_train --help | grep foundation_head
Output:
[--foundation_head FOUNDATION_HEAD]
--foundation_head FOUNDATION_HEAD
Confirmed dataset keys are correct:
REF_energy exists in atoms.info
REF_forces exists in atoms.arrays
Tried foundation_head="omat_pbe".
Tried foundation_head="omol".
Installed MACE from GitHub main branch using:
git clone https://github.com/ACEsuit/mace.git
pip install -e .
Verified that the SLURM job uses the new MACE source path:
mace file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
Initially I also hit a PyTorch CUDA issue: PyTorch had been installed as a cu130 build, while the cluster driver (545.23.06) only supports CUDA up to 12.3. I fixed that separately by installing a CUDA-12-compatible PyTorch build. The remaining error is the remove_pt_head / ScaleShiftMACE size mismatch.
Expected behavior
I expected MACE to extract the requested MH-1 head, e.g. omat_pbe, and start a one-epoch smoke fine-tuning run on my small CP2K-labeled extxyz dataset.
Question
Could you please clarify the correct way to fine-tune mace-mh-1 with a selected head?
Specifically:
1. Is remove_pt_head(...) expected to work for the public mace-mh-1.model file when using:
   --foundation_model="/path/to/mace-mh-1.model"
   --foundation_head="omat_pbe"
2. Should I use the model alias instead?
   --foundation_model="mh-1"
   --foundation_head="omat_pbe"
3. Is there a required model file format or checkpoint version for MH-1 fine-tuning?
4. Does this size mismatch indicate that the local mace-mh-1.model is incompatible with the current remove_pt_head implementation?
5. Is there a recommended command for a minimal one-epoch smoke fine-tuning test of MH-1 on a custom extxyz dataset with energy_key=REF_energy and forces_key=REF_forces?
6. If head extraction is not currently robust for this model, is there a recommended workaround, such as:
   - using foundation_model="mh-1" instead of a local path,
   - using multi-head finetuning directly without removing the pre-training heads,
   - using a specific commit,
   - or using a provided script to extract the desired head?
Thank you.
Thanks. I tested the direct alias path as you suggested.
I used the latest main-branch MACE source installation and confirmed that the SLURM job was loading MACE from:
/data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
The environment was:
Python: /data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python
PyTorch: 2.5.1+cu121
CUDA available: True
GPU: Tesla V100-SXM2-32GB
MACE version: 0.3.16, loaded from the main-branch source path above
I then tested the alias path:
--foundation_model="mh-1"
--foundation_head="omat_pbe"
The alias calculator loader works successfully:
Using Materials Project MACE for MACECalculator with /data/home/zhongjin/.cache/mace/macemh1model
SUCCESS: mace_mp alias calculator loaded
However, the actual run_train fine-tuning path still fails during remove_pt_head, with the same ScaleShiftMACE size mismatch. The smoke-test command was:
/data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python -m mace.cli.run_train \
--name=mh1_alias_omat_pbe_smoke \
--foundation_model=mh-1 \
--foundation_head=omat_pbe \
--train_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_train.extxyz \
--valid_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz \
--test_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz \
--energy_key=REF_energy \
--forces_key=REF_forces \
--energy_weight=1.0 \
--forces_weight=100.0 \
--E0s=foundation \
--lr=0.0005 \
--weight_decay=0.0 \
--ema \
--ema_decay=0.995 \
--amsgrad \
--clip_grad=10.0 \
--batch_size=1 \
--valid_batch_size=1 \
--max_num_epochs=1 \
--patience=8 \
--default_dtype=float64 \
--device=cuda \
--results_dir=/data/home/zhongjin/data_source/mh1_finetune/results_mh1_alias_omat_pbe_smoke \
--seed=3 \
--multiheads_finetuning=False
The run reaches:
Using foundation model mace mh-1 as initial checkpoint.
Using head omat_pbe out of ['matpes_r2scan', 'mp_pbe_refit_add', 'spice_wB97M', 'oc20_usemppbe', 'omol', 'omat_pbe']
Selecting the head omat_pbe as foundation head.
Then it fails at:
File ".../mace/cli/run_train.py", line 219, in run
model_foundation = remove_pt_head(...)
File ".../mace/tools/scripts_utils.py", line 494, in remove_pt_head
new_model.load_state_dict(new_state_dict, strict=False)
RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).
So the current status is:
- Local model path fails.
- Direct alias "mh-1" also fails in run_train.
- The alias calculator loads successfully.
- The failure only occurs when run_train calls remove_pt_head.
- The failure happens before training starts, so it does not seem related to the dataset, learning rate, E0s, LoRA, or batch size.
Could you please advise whether there is a different recommended command for fine-tuning MH-1 with a selected head, or whether remove_pt_head currently requires additional architecture arguments for this MH-1 checkpoint?
Should I be using a different fine-tuning mode, for example --multiheads_finetuning=True, instead of the single-head extraction path?