Error in using gene prediction

#31
by iLOVE2D - opened

Hi, I get an error when running the gene prediction example Jupyter notebook.

ValueError: --load_best_model_at_end requires the save and eval strategy to match, but found

  • Evaluation strategy: no
  • Save strategy: steps

It comes from:
Cell In[20], line 110, in cross_validate(data, targets, labels, nsplits, subsample_size, training_args, freeze_layers, output_dir, num_proc)
108 # add output directory to training args and initiate
109 training_args["output_dir"] = ksplit_output_dir
--> 110 training_args_init = TrainingArguments(**training_args)
112 # create the trainer
113 trainer = Trainer(
114 model=model,
115 args=training_args_init,
(...)
118 eval_dataset=evalset_train_labeled
119 )

It seems that the training arguments conflict with each other. Could you please look into it? Thanks a lot.

Can you please tell me what TrainingArguments you are using? Are they the same as the ones in the current example notebook? The save strategy and evaluation strategy need to match when you set it to load the best model in the end. The current example notebook arguments should not have this error, but please confirm which ones you are using. Please also see the relevant discussion here: https://discuss.huggingface.co/t/save-only-best-model-in-trainer/8442/4
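The rule described above can be checked on the plain dictionary before it is handed to TrainingArguments. The following is only a minimal sketch: check_best_model_args is a hypothetical helper, not part of transformers or Geneformer, and the fallback values ("no" for evaluation, "steps" for saving) are assumptions based on the error message quoted in this thread. It covers only the strategy-matching rule, not any further checks the library may perform.

```python
def check_best_model_args(training_args):
    """Raise ValueError if load_best_model_at_end is set with mismatched strategies.

    Hypothetical pre-flight check on a plain dict of training arguments.
    """
    # Assumed defaults, taken from the error message in this thread:
    # evaluation falls back to "no" and saving falls back to "steps".
    eval_strategy = training_args.get("evaluation_strategy", "no")
    save_strategy = training_args.get("save_strategy", "steps")
    wants_best = training_args.get("load_best_model_at_end", False)
    if wants_best and eval_strategy != save_strategy:
        raise ValueError(
            "load_best_model_at_end needs matching strategies, got "
            f"evaluation={eval_strategy!r} and save={save_strategy!r}"
        )
    return training_args


# A consistent combination: evaluate and save on the same schedule.
consistent_args = {
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
}
check_best_model_args(consistent_args)  # passes without raising
```

Either dropping "load_best_model_at_end" or setting both strategies to the same value satisfies this check.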

Hi, I use the same parameters as in: https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/gene_classification.ipynb

# set model parameters
# max input size
max_input_size = 2 ** 11  # 2048



# set training parameters
# max learning rate
max_lr = 5e-5
# how many pretrained layers to freeze
freeze_layers = 4
# number gpus
num_gpus = 1
# number cpu cores
num_proc = 24
# batch size for training and eval
geneformer_batch_size = 12
# learning schedule
lr_schedule_fn = "linear"
# warmup steps
warmup_steps = 500
# number of epochs
epochs = 1
# optimizer
optimizer = "adamw"

# set training arguments
subsample_size = 10_000
training_args = {
    "learning_rate": max_lr,
    "do_train": True,
    "evaluation_strategy": "no",
    "logging_steps": 100,
    "group_by_length": True,
    "length_column_name": "length",
    "disable_tqdm": False,
    "lr_scheduler_type": lr_schedule_fn,
    "warmup_steps": warmup_steps,
    "weight_decay": 0.001,
    "per_device_train_batch_size": geneformer_batch_size,
    "per_device_eval_batch_size": geneformer_batch_size,
    "num_train_epochs": epochs,
    "load_best_model_at_end": True,
}

But I still receive this error. Moreover, I do not believe that every notebook runs successfully. For example, this notebook uses a variable that is never defined beforehand:

max_sequence_length

Thank you for providing the arguments you are using. The current notebook has different training arguments than what you are using. (See below and at the link in your comment). Specifically, the current notebook does not have it set to load the best model at the end, which is what is causing the conflict with the earlier arguments regarding evaluation and save strategy. You may be using an outdated version. That is why I had suggested you check whether your arguments are the same as the current notebook. We suggest you pull the current notebook and try again. Please open a new issue if there is any different error in the updated notebook, but this error with training arguments should be resolved.

training_args = {
    "learning_rate": max_lr,
    "do_train": True,
    "evaluation_strategy": "no",
    "save_strategy": "epoch",
    "logging_steps": 100,
    "group_by_length": True,
    "length_column_name": "length",
    "disable_tqdm": False,
    "lr_scheduler_type": lr_schedule_fn,
    "warmup_steps": warmup_steps,
    "weight_decay": 0.001,
    "per_device_train_batch_size": geneformer_batch_size,
    "per_device_eval_batch_size": geneformer_batch_size,
    "num_train_epochs": epochs,
}

ctheodoris changed discussion status to closed

Hi, I hit a new error after switching to the new version. Before this error, I ran pip install --upgrade accelerate.

110 training_args_init = TrainingArguments(**training_args)
112 # create the trainer
113 trainer = Trainer(
114 model=model,
115 args=training_args_init,
--> 116 data_collator=DataCollatorForGeneClassification(),
117 train_dataset=trainset_labeled,
118 eval_dataset=evalset_train_labeled
119 )
121 # train the gene classifier
122 trainer.train()

TypeError: __init__() missing 1 required positional argument: 'tokenizer'

Are there any problems? Thanks a lot.

Thank you for following up. Did you git pull the Huggingface Geneformer repository or just replace the gene classification notebook? If you did not pull the current repository, please do so as you may be using an outdated version.

This code was developed over 2 years ago and the manuscript was submitted over 1 year ago so there were some changes in Huggingface transformers since then that caused this error. However, we already updated the collator to resolve this issue - the current version was tested for transformers 4.28.0. Please open a new issue if you pulled the current repository and are encountering a different error though.

Hi, thanks for your quick follow-up. I pulled the new version and ran pip install . again. Here is the log:

git clone https://huggingface.co/ctheodoris/Geneformer
Cloning into 'Geneformer'...
remote: Enumerating objects: 204, done.
remote: Counting objects: 100% (189/189), done.
remote: Compressing objects: 100% (165/165), done.
remote: Total 204 (delta 97), reused 55 (delta 23), pack-reused 15
Receiving objects: 100% (204/204), 1.63 MiB | 10.44 MiB/s, done.
Resolving deltas: 100% (100/100), done.

pip install .
Processing /gpfs/gibbs/pi/zhao/tl688/Geneformer/Geneformer
Preparing metadata (setup.py) ... done
Requirement already satisfied: datasets in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from geneformer==0.0.1) (2.12.0)
Requirement already satisfied: loompy in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from geneformer==0.0.1) (3.0.7)
Requirement already satisfied: numpy in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from geneformer==0.0.1) (1.24.3)
Requirement already satisfied: transformers in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from geneformer==0.0.1) (4.29.2)
Requirement already satisfied: pyarrow>=8.0.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (12.0.0)
Requirement already satisfied: dill<0.3.7,>=0.3.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (0.3.6)
Requirement already satisfied: pandas in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (2.0.2)
Requirement already satisfied: requests>=2.19.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (4.65.0)
Requirement already satisfied: xxhash in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (3.2.0)
Requirement already satisfied: multiprocess in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (0.70.14)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (2023.5.0)
Requirement already satisfied: aiohttp in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (3.8.4)
Requirement already satisfied: huggingface-hub<1.0.0,>=0.11.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (0.15.1)
Requirement already satisfied: packaging in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (23.1)
Requirement already satisfied: responses<0.19 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (0.18.0)
Requirement already satisfied: pyyaml>=5.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from datasets->geneformer==0.0.1) (6.0)
Requirement already satisfied: h5py in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from loompy->geneformer==0.0.1) (3.8.0)
Requirement already satisfied: scipy in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from loompy->geneformer==0.0.1) (1.10.1)
Requirement already satisfied: setuptools in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from loompy->geneformer==0.0.1) (67.7.2)
Requirement already satisfied: numba in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from loompy->geneformer==0.0.1) (0.57.0)
Requirement already satisfied: click in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from loompy->geneformer==0.0.1) (8.1.3)
Requirement already satisfied: numpy-groupies in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from loompy->geneformer==0.0.1) (0.9.22)
Requirement already satisfied: filelock in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from transformers->geneformer==0.0.1) (3.12.0)
Requirement already satisfied: regex!=2019.12.17 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from transformers->geneformer==0.0.1) (2023.6.3)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from transformers->geneformer==0.0.1) (0.13.3)
Requirement already satisfied: attrs>=17.3.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (23.1.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (3.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from aiohttp->datasets->geneformer==0.0.1) (1.3.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from huggingface-hub<1.0.0,>=0.11.0->datasets->geneformer==0.0.1) (4.6.3)
Requirement already satisfied: idna<4,>=2.5 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from requests>=2.19.0->datasets->geneformer==0.0.1) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from requests>=2.19.0->datasets->geneformer==0.0.1) (2.0.2)
Requirement already satisfied: certifi>=2017.4.17 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from requests>=2.19.0->datasets->geneformer==0.0.1) (2023.5.7)
Requirement already satisfied: llvmlite<0.41,>=0.40.0dev0 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from numba->loompy->geneformer==0.0.1) (0.40.0)
Requirement already satisfied: importlib-metadata in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from numba->loompy->geneformer==0.0.1) (6.6.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from pandas->datasets->geneformer==0.0.1) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from pandas->datasets->geneformer==0.0.1) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from pandas->datasets->geneformer==0.0.1) (2023.3)
Requirement already satisfied: six>=1.5 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas->datasets->geneformer==0.0.1) (1.16.0)
Requirement already satisfied: zipp>=0.5 in /gpfs/gibbs/project/zhao/tl688/conda_envs/geneformer/lib/python3.8/site-packages (from importlib-metadata->numba->loompy->geneformer==0.0.1) (3.15.0)
Building wheels for collected packages: geneformer
Building wheel for geneformer (setup.py) ... done
Created wheel for geneformer: filename=geneformer-0.0.1-py3-none-any.whl size=788358 sha256=93f6c8c7a7cea8fc7e896bf156a98966d15d38ddcfc9e89d31bb8c31cd30ef1e
Stored in directory: /tmp/pip-ephem-wheel-cache-11tw161c/wheels/dc/cd/84/cbcf18cccec91328987bffbd0de23ad637bf97827956af428c
Successfully built geneformer
Installing collected packages: geneformer
Attempting uninstall: geneformer
Found existing installation: geneformer 0.0.1
Uninstalling geneformer-0.0.1:
Successfully uninstalled geneformer-0.0.1
Successfully installed geneformer-0.0.1

iLOVE2D changed discussion status to open

Thank you for confirming that. Unfortunately, I am not able to reproduce your error. When I run the gene classification fine-tuning, there is no error encountered. Could you compare a diff of your local version with the current repository geneformer/collator_for_classification.py to ensure they are the same? If they are, please let me know which version of Huggingface transformers you are using. We have tested the updated code with 4.28.0.
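One way to run the comparison suggested above is Python's standard difflib module. This is a minimal sketch: unified_diff_text is a hypothetical helper, and the two short strings are toy stand-ins for the local and upstream file contents, not the real collator code.

```python
import difflib


def unified_diff_text(local_text, upstream_text, name="collator_for_classification.py"):
    """Return a unified diff between two versions of a file as one string."""
    diff = difflib.unified_diff(
        local_text.splitlines(keepends=True),
        upstream_text.splitlines(keepends=True),
        fromfile=f"local/{name}",
        tofile=f"upstream/{name}",
    )
    return "".join(diff)


# Toy contents standing in for the two files (not the real collator code).
local = "a = 1\nb = 2\n"
upstream = "a = 1\nb = 3\n"
print(unified_diff_text(local, upstream) or "files are identical")
```

In practice the two strings would be read from the local copy and from a fresh clone of the repository; an empty result means the files match.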

Hi, many thanks for your help again. The problem was on my side: my working directory also contains a geneformer folder, and Python loads that folder in preference to the installed package.
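A quick way to see which copy of a package Python would import is to ask the import system where it resolves the name. The sketch below uses the standard importlib.util.find_spec; resolve_module_origin is a hypothetical helper, and "json" is used only so the example runs anywhere. In a notebook the working directory usually comes first on the import path, so a local folder with the same name as an installed package can shadow it.

```python
import importlib.util


def resolve_module_origin(module_name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    # origin is the path of the module file (the package's __init__.py for packages).
    return spec.origin


# "json" is a stand-in; replace it with "geneformer" to check that package.
print(resolve_module_origin("json"))
```

If the printed path points into the working directory rather than site-packages, the local folder is the one being loaded.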

I have run into another problem:

    126 trainer.save_model(ksplit_model_dir)
    128 # evaluate model
--> 129 fpr, tpr, interp_tpr, conf_mat = classifier_predict(trainer.model, evalset_oos_labeled, geneformer_batch_size, mean_fpr)
    131 # append to tpr and roc lists
    132 confusion = confusion + conf_mat

Cell In[10], line 38, in classifier_predict(model, evalset, forward_batch_size, mean_fpr)
     35         predict_logits += [torch.squeeze(outputs.logits.to("cpu"))]
     36         predict_labels += [torch.squeeze(label_batch.to("cpu"))]
---> 38 logits_by_cell = torch.cat(predict_logits)
     39 all_logits = logits_by_cell.reshape(-1, logits_by_cell.shape[2])
     40 labels_by_cell = torch.cat(predict_labels)

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 797 but got size 636 for tensor number 1 in the list.

iLOVE2D changed discussion status to closed

I think it is caused by a dimension mismatch in the gene prediction step. These are the shapes of the per-batch logits:

torch.Size([12, 797, 2])
torch.Size([12, 636, 2])
torch.Size([12, 631, 2])
torch.Size([12, 953, 2])
torch.Size([12, 536, 2])
torch.Size([12, 972, 2])
torch.Size([12, 560, 2])
torch.Size([12, 710, 2])
torch.Size([12, 486, 2])
torch.Size([12, 843, 2])
torch.Size([12, 570, 2])
torch.Size([12, 634, 2])
torch.Size([12, 712, 2])
torch.Size([12, 754, 2])
torch.Size([12, 740, 2])

Thanks a lot.

iLOVE2D changed discussion status to open

Thank you for the information. torch.cat() requires that the tensors have the same size in every dimension except the one being concatenated (dimension 0 here). The first tensor has size 797 in dimension 1, so it expects the remaining tensors to have that same size in that dimension. However, these are prediction outputs, so they should all have the same size there. Unfortunately, I am not able to reproduce your error: when I run the gene classification fine-tuning, no error is encountered. Could you provide more information on your input data and how many classes there are so that I can try to reproduce the error?
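The concatenation rule can be checked on the shapes alone, without building any tensors. This is a minimal sketch in plain Python: mismatched_for_cat is a hypothetical helper, and the shapes are the first three reported earlier in this thread.

```python
def mismatched_for_cat(shapes, dim=0):
    """Return indices of shapes that cannot be concatenated with the first along `dim`."""
    reference = shapes[0]
    bad = []
    for index, shape in enumerate(shapes):
        same_rank = len(shape) == len(reference)
        # Every axis other than `dim` must agree with the first shape.
        same_other_dims = same_rank and all(
            size == ref_size
            for axis, (size, ref_size) in enumerate(zip(shape, reference))
            if axis != dim
        )
        if not same_other_dims:
            bad.append(index)
    return bad


shapes = [(12, 797, 2), (12, 636, 2), (12, 631, 2)]
print(mismatched_for_cat(shapes))  # [1, 2]
```

Index 1 is the first offender, which matches "tensor number 1 in the list" in the error message.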

I pushed a fix to the notebook to resolve this issue by padding the evaluation data so that the prediction outputs are of equal size in the 2nd dimension. Please pull the new one and retry.
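The idea behind padding can be illustrated with plain Python lists standing in for tokenized cells. This is only a sketch of the general technique, not the code from the updated notebook: pad_to_common_length is a hypothetical helper and the pad value 0 is an assumption. With PyTorch tensors, torch.nn.functional.pad serves the same purpose.

```python
def pad_to_common_length(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_length = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_length - len(seq)) for seq in sequences]


# Three toy "cells" of different lengths.
cells = [[5, 9, 2], [7, 1], [4]]
padded = pad_to_common_length(cells)
print(padded)  # [[5, 9, 2], [7, 1, 0], [4, 0, 0]]
```

Once every example has the same length, each batch yields outputs with the same size in dimension 1, so they can be concatenated along dimension 0. Padded positions still need to be excluded when metrics are computed.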

ctheodoris changed discussion status to closed
