MACE-MH-1 fine-tuning fails during remove_pt_head with ScaleShiftMACE size mismatch for both omat_pbe and omol heads

by jin309 - opened 20 days ago

Description

I am trying to fine-tune the mace-mh-1 foundation model on my own CP2K-labeled PA66 dataset using the recommended foundation_model + foundation_head workflow.

Following the MACE-MH-1 best-practice recommendation, I first tried:

--foundation_model="/data/home/zhongjin/data_source/model/mace-mh-1.model"
--foundation_head="omat_pbe"

I also tried foundation_head="omol" as a diagnostic test, but the same error occurs.

The training fails before entering the actual training loop, during remove_pt_head(...), when MACE tries to extract the selected head from the multi-head MH-1 model. The error is a state_dict size mismatch in ScaleShiftMACE.

This appears similar to the issue reported in the HuggingFace MACE-MH-1 discussion about fine-tuning size mismatch:
https://huggingface.co/mace-foundations/mace-mh-1/discussions/1

I updated MACE by cloning the latest main branch from GitHub and installing it in editable mode, but the same error persists.

Environment

Cluster GPU node:

GPU: Tesla V100-SXM2-32GB
NVIDIA driver: 545.23.06
CUDA version shown by nvidia-smi: 12.3

Python environment used for the latest test:

python: /data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python
MACE file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py
MACE version: 0.3.16
ASE: 3.28.0
PyTorch: tested with CUDA-compatible PyTorch after fixing an earlier cu130 / driver incompatibility

The MACE main branch was installed from source:

git clone https://github.com/ACEsuit/mace.git /data/home/zhongjin/data_source/software/mace_main_mh1
cd /data/home/zhongjin/data_source/software/mace_main_mh1
pip install -e .

I verified that the SLURM job was loading the new source installation:

mace file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py

Foundation model

Local MH-1 model path:

/data/home/zhongjin/data_source/model/mace-mh-1.model

The model can be loaded and inspected. It appears to contain the expected heads:

model class: ScaleShiftMACE
heads: ['matpes_r2scan', 'mp_pbe_refit_add', 'spice_wB97M', 'oc20_usemppbe', 'omol', 'omat_pbe']
r_max: tensor(6., dtype=torch.float64)
num_interactions: 2

Dataset

My training data are CP2K-labeled PA66 configurations in extxyz format.

Dataset files:

/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/train.extxyz
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/valid.extxyz
/data/home/zhongjin/data_source/omol_data_source/mace_labeled_dataset_from_message/test.extxyz

The first frame has:

symbols: ['C', 'H', 'N', 'O']
energy key in atoms.info: REF_energy
forces key in atoms.arrays: REF_forces

The dataset key sanity check passes:

has energy key REF_energy: True
has forces key REF_forces: True
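The key check above can also be reproduced without third-party packages by parsing the extxyz comment line directly. The sketch below is only an illustration and is not the check used in this thread (which was presumably done through ASE). The function name `extxyz_header_keys` and the sample header are hypothetical, and the parser assumes plain `key=value` tokens with a `Properties` entry made of `name:type:ncols` triples.

```python
import shlex


def extxyz_header_keys(comment_line):
    """Return (info_keys, array_names) parsed from one extxyz comment line."""
    info_keys, array_names = set(), []
    # shlex keeps quoted values such as Lattice="..." together as one token.
    for token in shlex.split(comment_line):
        key, sep, value = token.partition("=")
        if not sep:
            continue  # tokens without '=' are ignored in this sketch
        if key == "Properties":
            # Properties is a flat list of name:type:ncols triples.
            array_names = value.split(":")[0::3]
        else:
            info_keys.add(key)
    return info_keys, array_names


# Synthetic header for illustration only, not taken from the real dataset.
header = (
    'Lattice="10 0 0 0 10 0 0 0 10" '
    "Properties=species:S:1:pos:R:3:REF_forces:R:3 "
    'REF_energy=-1.0 pbc="T T T"'
)
info_keys, array_names = extxyz_header_keys(header)
print("has energy key REF_energy:", "REF_energy" in info_keys)
print("has forces key REF_forces:", "REF_forces" in array_names)
```

In a real file the comment line is the second line of each frame, so the same function can be applied to every frame rather than only the first.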

For debugging, I used a smoke-test subset:

smoke_train.extxyz: 40 frames
smoke_valid.extxyz: 10 frames
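How the smoke-test subset was created is not stated in this post. One possible way, sketched below with only the standard library, relies on the extxyz layout: each frame is an atom-count line, a comment line, and then that many atom lines. The helper names `split_extxyz_frames` and `take_frames` and the tiny sample text are hypothetical.

```python
def split_extxyz_frames(text):
    """Split extxyz text into frames, each a list of lines."""
    lines = text.splitlines()
    frames, i = [], 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # skip blank lines between frames
            continue
        n_atoms = int(lines[i])
        # One count line, one comment line, then n_atoms atom lines.
        frames.append(lines[i : i + n_atoms + 2])
        i += n_atoms + 2
    return frames


def take_frames(text, n):
    """Return extxyz text containing only the first n frames."""
    frames = split_extxyz_frames(text)[:n]
    return "".join(line + "\n" for frame in frames for line in frame)


# Synthetic three-frame file for illustration only.
sample = "".join(
    f"2\nProperties=species:S:1:pos:R:3 REF_energy={-k}.0\n"
    f"C 0.0 0.0 0.0\nH 1.1 0.0 {k}.0\n"
    for k in range(3)
)
subset = take_frames(sample, 2)
print(len(split_extxyz_frames(sample)), "->", len(split_extxyz_frames(subset)))
```

Taking the first frames keeps the example short. A random or strided selection would give a more representative subset for anything beyond a smoke test.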
Command used

The main command is:

python -m mace.cli.run_train \
    --name="mh1_omat_pbe_pa66_smoke_mainenv" \
    --foundation_model="/data/home/zhongjin/data_source/model/mace-mh-1.model" \
    --foundation_head="omat_pbe" \
    --train_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_train.extxyz" \
    --valid_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz" \
    --test_file="/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz" \
    --energy_key="REF_energy" \
    --forces_key="REF_forces" \
    --energy_weight=1.0 \
    --forces_weight=100.0 \
    --E0s="foundation" \
    --lr=0.0005 \
    --weight_decay=0.0 \
    --ema \
    --ema_decay=0.995 \
    --amsgrad \
    --clip_grad=10.0 \
    --batch_size=1 \
    --valid_batch_size=1 \
    --max_num_epochs=1 \
    --patience=8 \
    --default_dtype=float64 \
    --device=cuda \
    --results_dir="/data/home/zhongjin/data_source/mh1_finetune/results_mh1_omat_pbe_pa66_smoke_mainenv" \
    --seed=3 \
    --multiheads_finetuning=False

I also tried the same setup with:

--foundation_head="omol"

The same type of error occurs.
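Since the two runs differ only in the head, the command can be generated from a single option table so that nothing else changes between tests. The sketch below is an abbreviated illustration: it uses only flags that appear in the command above, omits several of them (for example `--lr` and `--results_dir`), and the run names it builds are hypothetical.

```python
import shlex

BASE = "/data/home/zhongjin/data_source"

# Options copied from the command in this post; only the head varies per run.
COMMON = {
    "foundation_model": f"{BASE}/model/mace-mh-1.model",
    "train_file": f"{BASE}/mh1_finetune/smoke_train.extxyz",
    "valid_file": f"{BASE}/mh1_finetune/smoke_valid.extxyz",
    "test_file": f"{BASE}/mh1_finetune/smoke_valid.extxyz",
    "energy_key": "REF_energy",
    "forces_key": "REF_forces",
    "E0s": "foundation",
    "batch_size": 1,
    "max_num_epochs": 1,
    "default_dtype": "float64",
    "device": "cuda",
    "multiheads_finetuning": False,
}


def build_argv(head):
    """Build the argument list for one smoke run with the given head."""
    argv = ["python", "-m", "mace.cli.run_train"]
    argv.append(f"--name=mh1_{head}_pa66_smoke")  # hypothetical run name
    argv.append(f"--foundation_head={head}")
    argv += [f"--{key}={value}" for key, value in COMMON.items()]
    return argv


for head in ("omat_pbe", "omol"):
    print(shlex.join(build_argv(head)))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` to launch the run.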

Actual output

The run reaches:

Using foundation model /data/home/zhongjin/data_source/model/mace-mh-1.model as initial checkpoint.
Selecting the head omat_pbe as foundation head.

Then it fails inside remove_pt_head(...):

Traceback (most recent call last):
  File "/data/home/zhongjin/data_source/software/mace_main_mh1/mace/cli/run_train.py", line 219, in run
    model_foundation = remove_pt_head(
  File "/data/home/zhongjin/data_source/software/mace_main_mh1/mace/tools/scripts_utils.py", line 494, in remove_pt_head
    new_model.load_state_dict(new_state_dict, strict=False)
  File "/data/home/zhongjin/.conda/envs/mace_mh1_main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
    size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
    size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
    size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
    size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
    size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
    size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).

This happens before training starts, so changing learning rate, batch size, E0 mode, or LoRA settings does not seem relevant.
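To see whether the mismatches follow a pattern, the shapes in the error message can be compared numerically. The sketch below parses the seven `size mismatch` lines quoted above and computes, for each parameter, the ratio between the element count in the rebuilt model and in the checkpoint. The helper names are hypothetical, and the regular expression assumes the exact message wording shown in this traceback.

```python
import re
from math import prod

MISMATCH = re.compile(
    r"size mismatch for (?P<name>\S+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]+)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]+)\]\)"
)


def numel(shape_text):
    """Number of elements for a shape written as '512, 64'."""
    return prod(int(dim) for dim in shape_text.split(","))


def mismatch_ratios(log_text):
    """Map parameter name -> (elements in current model) / (elements in checkpoint)."""
    return {
        m["name"]: numel(m["model"]) / numel(m["ckpt"])
        for m in MISMATCH.finditer(log_text)
    }


LOG = """
size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).
"""

ratios = mismatch_ratios(LOG)
for name, ratio in ratios.items():
    print(f"{name}: x{ratio:g}")
```

Every one of the seven parameters comes out exactly 4 times larger in the rebuilt model than in the checkpoint, and all of them belong to interactions.0. A uniform factor like this looks more like a systematic difference in how the first interaction block is reconstructed than like a damaged model file, although nothing in this thread confirms the cause.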

What I tried

1. Confirmed that --foundation_head is supported:

   python -m mace.cli.run_train --help | grep foundation_head

   Output:

   [--foundation_head FOUNDATION_HEAD]
   --foundation_head FOUNDATION_HEAD

2. Confirmed that the dataset keys are correct:

   REF_energy exists in atoms.info
   REF_forces exists in atoms.arrays

3. Tried foundation_head="omat_pbe".

4. Tried foundation_head="omol".

5. Installed MACE from the GitHub main branch using:

   git clone https://github.com/ACEsuit/mace.git
   pip install -e .

6. Verified that the SLURM job uses the new MACE source path:

   mace file: /data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py

7. Initially I also encountered a PyTorch CUDA issue because PyTorch had been installed as cu130, while the cluster driver only supports CUDA 12.x. I fixed that separately by using a CUDA-12-compatible PyTorch. After that, the only remaining error is the remove_pt_head / ScaleShiftMACE size mismatch.

Expected behavior

I expected MACE to extract the requested MH-1 head, e.g. omat_pbe, and start a one-epoch smoke fine-tuning run on my small CP2K-labeled extxyz dataset.

Question

Could you please clarify the correct way to fine-tune mace-mh-1 with a selected head?

Specifically:

1. Is remove_pt_head(...) expected to work for the public mace-mh-1.model file when using:
   --foundation_model="/path/to/mace-mh-1.model"
   --foundation_head="omat_pbe"
2. Should I use the model alias instead?
   --foundation_model="mh-1"
   --foundation_head="omat_pbe"
3. Is there a required model file format or checkpoint version for MH-1 fine-tuning?
4. Does this size mismatch indicate that the local mace-mh-1.model is incompatible with the current remove_pt_head implementation?
5. Is there a recommended command for a minimal one-epoch smoke fine-tuning test of MH-1 on a custom extxyz dataset with energy_key=REF_energy and forces_key=REF_forces?
6. If head extraction is not currently robust for this model, is there a recommended workaround, such as:
   - using foundation_model="mh-1" instead of a local path,
   - using multi-head fine-tuning directly without removing the pre-training heads,
   - using a specific commit, or
   - using a provided script to extract the desired head?

Thank you.

ilyesb

MACE foundation models org 20 days ago

Hey @jin309,

Does it work when you use the alias "mh-1" directly? You may have an older version of the model file.


jin309

18 days ago (edited 18 days ago)

Thanks. I tested the direct alias path as you suggested.

I used the latest main-branch MACE source installation and confirmed that the SLURM job was loading MACE from:

/data/home/zhongjin/data_source/software/mace_main_mh1/mace/__init__.py

The environment was:

Python: /data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python
PyTorch: 2.5.1+cu121
CUDA available: True
GPU: Tesla V100-SXM2-32GB
MACE version: 0.3.16, loaded from the main-branch source path above
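The earlier cu130 problem can be caught before submitting a job by comparing the CUDA tag in the PyTorch version string with the CUDA version reported by nvidia-smi (12.3 on this node). The sketch below is a simplified check under one assumption: a wheel is treated as usable when its CUDA major version does not exceed the driver's. The function names are hypothetical, and the `2.9.0+cu130` string is only an illustrative example of a cu130 build, not the version that was actually installed.

```python
import re


def wheel_cuda_version(torch_version):
    """Extract (major, minor) from a tag like '2.5.1+cu121'; None if no CUDA tag."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version, e.g. '121' -> (12, 1), '118' -> (11, 8).
    return int(digits[:-1]), int(digits[-1])


def driver_supports(torch_version, driver_cuda):
    """Simplified rule: the wheel's CUDA major must not exceed the driver's."""
    wheel = wheel_cuda_version(torch_version)
    if wheel is None:
        return True  # CPU-only build, nothing to check
    return wheel[0] <= driver_cuda[0]


DRIVER_CUDA = (12, 3)  # "CUDA version shown by nvidia-smi: 12.3"
for version in ("2.5.1+cu121", "2.9.0+cu130"):
    print(version, "usable:", driver_supports(version, DRIVER_CUDA))
```

With PyTorch installed, the string to check is available as `torch.__version__`.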

I then tested the alias path:

--foundation_model="mh-1"
--foundation_head="omat_pbe"

The alias calculator loader works successfully:

Using Materials Project MACE for MACECalculator with /data/home/zhongjin/.cache/mace/macemh1model
SUCCESS: mace_mp alias calculator loaded

However, the actual run_train fine-tuning path still fails during remove_pt_head, with the same ScaleShiftMACE size mismatch. The smoke-test command was:

/data/home/zhongjin/.conda/envs/mace_mh1_main/bin/python -m mace.cli.run_train \
  --name=mh1_alias_omat_pbe_smoke \
  --foundation_model=mh-1 \
  --foundation_head=omat_pbe \
  --train_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_train.extxyz \
  --valid_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz \
  --test_file=/data/home/zhongjin/data_source/mh1_finetune/smoke_valid.extxyz \
  --energy_key=REF_energy \
  --forces_key=REF_forces \
  --energy_weight=1.0 \
  --forces_weight=100.0 \
  --E0s=foundation \
  --lr=0.0005 \
  --weight_decay=0.0 \
  --ema \
  --ema_decay=0.995 \
  --amsgrad \
  --clip_grad=10.0 \
  --batch_size=1 \
  --valid_batch_size=1 \
  --max_num_epochs=1 \
  --patience=8 \
  --default_dtype=float64 \
  --device=cuda \
  --results_dir=/data/home/zhongjin/data_source/mh1_finetune/results_mh1_alias_omat_pbe_smoke \
  --seed=3 \
  --multiheads_finetuning=False

The run reaches:

Using foundation model mace mh-1 as initial checkpoint.
Using head omat_pbe out of ['matpes_r2scan', 'mp_pbe_refit_add', 'spice_wB97M', 'oc20_usemppbe', 'omol', 'omat_pbe']
Selecting the head omat_pbe as foundation head.

Then it fails at:

File ".../mace/cli/run_train.py", line 219, in run
    model_foundation = remove_pt_head(...)

File ".../mace/tools/scripts_utils.py", line 494, in remove_pt_head
    new_model.load_state_dict(new_state_dict, strict=False)

RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
    size mismatch for interactions.0.linear_up.weight: copying a param with shape torch.Size([65536]) from checkpoint, the shape in current model is torch.Size([262144]).
    size mismatch for interactions.0.linear_up.output_mask: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for interactions.0.conv_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([8192]).
    size mismatch for interactions.0.conv_tp_weights.net.9.weight: copying a param with shape torch.Size([512, 64]) from checkpoint, the shape in current model is torch.Size([2048, 64]).
    size mismatch for interactions.0.conv_tp_weights.net.9.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([2048]).
    size mismatch for interactions.0.linear_res.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([1048576]).
    size mismatch for interactions.0.linear_1.weight: copying a param with shape torch.Size([458752]) from checkpoint, the shape in current model is torch.Size([1835008]).

So the current status is:

Local model path fails.
Direct alias "mh-1" also fails in run_train.
The alias calculator loads successfully.
The failure only occurs when run_train calls remove_pt_head.
The failure happens before training starts, so it does not seem related to the dataset, learning rate, E0s, LoRA, or batch size.

Could you please advise whether there is a different recommended command for fine-tuning MH-1 with a selected head, or whether remove_pt_head currently requires additional architecture arguments for this MH-1 checkpoint?

Should I be using a different fine-tuning mode, for example --multiheads_finetuning=True, instead of the single-head extraction path?

