How to do pre-training with my own 500,000-cell dataset

#309
by yehuicheng - opened

Thanks for Geneformer, it's a great tool.
I'm trying to pre-train with a dataset related to my current project, using the code under https://huggingface.co/ctheodoris/Geneformer/tree/main/examples/pretraining_new_model.

First I followed https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/pretraining_new_model/obtain_nonzero_median_digests.ipynb

My data was filtered before processing to retain cells with total read counts within three standard deviations of the dataset mean, and with mitochondrial reads within three standard deviations of the dataset mean. Cells with fewer than seven Ensembl-annotated protein-coding or miRNA genes detected were excluded. So I changed the filter_pass part of the code a bit.

[screenshot: modified filter_pass code]
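For context, the upstream filtering I describe looks roughly like the sketch below. This is a scanpy/numpy-based illustration with hypothetical file and column names (e.g. `pct_counts_mt`), not the notebook's filter_pass code:

```python
import numpy as np
import scanpy as sc

# Hypothetical sketch of the upstream quality filtering described above.
adata = sc.read_h5ad("my_pbmc_data.h5ad")

# Keep cells with total read counts within 3 SD of the dataset mean.
totals = np.asarray(adata.X.sum(axis=1)).ravel()
total_ok = np.abs(totals - totals.mean()) <= 3 * totals.std()

# Keep cells with mitochondrial read fraction within 3 SD of the mean
# (assumes a precomputed QC column, e.g. from sc.pp.calculate_qc_metrics).
mito = adata.obs["pct_counts_mt"].to_numpy()
mito_ok = np.abs(mito - mito.mean()) <= 3 * mito.std()

# Exclude cells with fewer than 7 detected genes; restricting .var to
# Ensembl-annotated protein-coding/miRNA genes is assumed done upstream.
genes_ok = np.asarray((adata.X > 0).sum(axis=1)).ravel() >= 7

adata = adata[total_ok & mito_ok & genes_ok].copy()
```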

And then this result appeared
[screenshot: nonzero median digest plots]

That's the first thing I'd like to know: is there a problem with the way I'm handling this?

After that, I tried https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/pretraining_new_model/pretrain_geneformer_w_deepspeed.py, and in this step:
[screenshot: step from the pretraining script]

I converted my dataset from AnnData (.h5ad) or .loom format to the rank value encoded .dataset format, following examples/tokenizing_scRNAseq_data.ipynb.

[screenshot: tokenization code]
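For reference, the tokenization step looks roughly like the sketch below, based on the example notebook's API. The custom median and token dictionary paths are assumptions for a from-scratch corpus (they should be the files produced for your own dataset, not the 30M defaults), and exact keyword names may differ across Geneformer versions:

```python
from geneformer import TranscriptomeTokenizer

# Sketch only: paths and attribute names here are hypothetical.
tk = TranscriptomeTokenizer(
    custom_attr_name_dict={"cell_type": "cell_type"},
    nproc=4,
    gene_median_file="my_gene_median_dictionary.pkl",    # from YOUR corpus
    token_dictionary_file="my_token_dictionary.pkl",     # from YOUR corpus
)
tk.tokenize_data(
    "loom_data_directory",   # directory of .loom files for the 500k cells
    "output_directory",
    "my_pbmc_500k",          # output prefix
    file_format="loom",
)
```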

Bug:
[screenshot: error traceback]

I changed int32 to int64 or long, but it failed:

[screenshot: error after the change]

I think your Geneformer-30M corpus is much larger than my 500,000-cell dataset, so I don't understand why mine goes out of integer range.
By the way, I wonder if I should have used the final normalized pkl file to generate the dataset.

Thank you for your interest in Geneformer.

First of all, it is not clear why you are re-pretraining the model from scratch with only 500,000 cells. If you are trying to use Geneformer for predictions in a particular setting, we recommend fine-tuning the already-pretrained model, not pretraining it from scratch.

If you are pretraining it from scratch for some other reason, I would be concerned that 500,000 cells will not yield optimal results. As you can see from the scaling laws we report in our manuscript, larger and more diverse corpora consistently boost predictive potential, so severely reducing the number of cells will be problematic, especially if these cells are not diverse in tissue type etc.

Finally, if you do intend to pretrain the model from scratch with only 500,000 cells, the 6-layer model configuration will likely be significantly undertrained. There is an optimal model size for each amount of pretraining data for these types of models; we would recommend reducing the model depth, maintaining the width-depth aspect ratio, if you intend to pretrain with only 500,000 cells.
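As an illustration only, a shallower configuration that roughly preserves the published 6-layer model's width-depth aspect ratio might look like the sketch below. All numbers here are placeholders to be tuned for a 500,000-cell corpus, not values validated by the authors:

```python
from transformers import BertConfig, BertForMaskedLM

# Illustrative sketch: scale depth down and keep width roughly
# proportional (the published 6-layer model uses 256 hidden / 6 layers).
config = BertConfig(
    vocab_size=25_426,            # size of YOUR token dictionary (placeholder)
    hidden_size=128,              # scaled down with depth
    num_hidden_layers=3,          # 6 -> 3 layers
    num_attention_heads=2,
    intermediate_size=256,        # kept at 2x hidden size
    max_position_embeddings=2_048,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```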

Assuming you do intend to pretrain from scratch with the 500,000 cells:

  • The result you showed for obtaining the nonzero median digests does not look right. Please see the example notebook for the expected results. The code you included before the plots from the nonzero median digest calculation relates to the tokenization, not the nonzero median digest calculation, so it is unclear what, if anything, you changed in the nonzero median digest code.
  • You mentioned you changed the tokenization code in some way. You are welcome to provide a diff so we can see more clearly what changed, but we would not recommend changing the code, and we are unlikely to be able to help troubleshoot alterations to the code we provide.
  • The example lengths file should be for your pretraining dataset, not the 30M one.
  • The C integer type problem is a known issue with Datasets - please see the prior closed discussions on this topic. You can try setting use_generator to True (see the sketch below). It will be slower but will likely circumvent this error.
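For example, assuming a Geneformer version whose tokenize_data accepts use_generator, the call would look roughly like:

```python
# Sketch: same tokenization call as above, but streaming through a
# generator to avoid the Datasets/Arrow C integer overflow. Slower,
# but it sidesteps the error shown in the traceback.
tk.tokenize_data(
    "loom_data_directory",
    "output_directory",
    "my_pbmc_500k",
    file_format="loom",
    use_generator=True,
)
```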
ctheodoris changed discussion status to closed

Thank you for your reply. Sorry, I couldn't reply the same day because of restrictions on newly registered Hugging Face accounts. We are using PBMC cells from the same tissue to do the DA. I previously used the already-trained Geneformer-30M for fine-tuning; now I would like to pre-train on PBMCs from the same tissue, then fine-tune, and compare the results with the previous Geneformer-30M. I think the odds are that the results will not be as good as Geneformer-30M, but I just wanted to try and see whether pre-training on the same tissue would make a difference.

To be precise, in the nonzero median digest step: because my pre-training data was directly filtered during quality control rather than labeled, I commented out the filter_pass labeling part, so there is no need for a subview and I use the view directly.

[screenshot: modified nonzero median digest code]
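In other words, the scan loop without the filter_pass subview looks roughly like this. It is a rough sketch that accumulates nonzero values in memory instead of the notebook's digest structure, and the file name is a placeholder:

```python
import loompy
import numpy as np

# Rough sketch only: the loom file is assumed to be already
# quality-filtered upstream, so no filter_pass subview is needed.
with loompy.connect("my_filtered_data.loom") as data:
    gene_nonzero = [[] for _ in range(data.shape[0])]
    # Scan over cells; the view over all cells is used directly.
    for _, _, view in data.scan(axis=1):
        counts = view[:, :].astype(np.float64)
        # Normalize each cell by its total counts, scaled to 10,000,
        # matching the normalization in the example notebook.
        totals = counts.sum(axis=0)
        norm = counts / np.where(totals == 0, 1, totals) * 10_000
        for g, row in enumerate(norm):
            nz = row[row > 0]
            if nz.size:
                gene_nonzero[g].extend(nz.tolist())

# Per-gene nonzero medians across all cells (memory-heavy at 500k cells;
# the notebook's streaming digest approach is preferable at scale).
nonzero_medians = {g: float(np.median(v)) for g, v in enumerate(gene_nonzero) if v}
```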
Thanks again for your reply! Good luck with your research, and good health!
