anndata tokenizer
Hi, I added a method to enable the tokenizer to also accept AnnData files as input. I tested it with the following:
# take any anndata file
# ...
# save it as loom and as h5ad
adata.write_loom("/path/to/folder/temp_files/temp.loom")
adata.write("/path/to/folder/temp_files/temp.h5ad")
# tokenize data
from geneformer import TranscriptomeTokenizer
tk = TranscriptomeTokenizer({"tissue": "tissue"}, nproc=4)  # map the "tissue" annotation to a "tissue" column in the output
tk.tokenize_data("/path/to/folder/temp_files/temp.h5ad", "/path/to/folder/temp_files", "dataset_h5ad", "h5ad")
tk = TranscriptomeTokenizer({"tissue": "tissue"}, nproc=4)
tk.tokenize_data("/path/to/folder/temp_files/temp.loom", "/path/to/folder/temp_files", "dataset_loom", "loom")
Then the two outputs can be compared, e.g. with diff -r -q dataset_h5ad.dataset dataset_loom.dataset, or by loading the Arrow files directly (see the sketch below).
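Since a byte-level diff can be brittle (the Arrow files may embed metadata, such as fingerprints, that differs between runs), a content-level comparison may be more robust. A minimal sketch, assuming the outputs are Hugging Face datasets saved to the paths above; the input_ids and tissue column names are what I would expect in the tokenized output and should be adjusted if the schema differs:

from datasets import load_from_disk

ds_h5ad = load_from_disk("/path/to/folder/temp_files/dataset_h5ad.dataset")
ds_loom = load_from_disk("/path/to/folder/temp_files/dataset_loom.dataset")

# same number of cells and same schema
assert ds_h5ad.num_rows == ds_loom.num_rows
assert ds_h5ad.column_names == ds_loom.column_names

# compare the tokenized cells themselves, row by row
for row_h5ad, row_loom in zip(ds_h5ad, ds_loom):
    assert row_h5ad["input_ids"] == row_loom["input_ids"]
    assert row_h5ad["tissue"] == row_loom["tissue"]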
I'm not sure this is the best way to submit contributions via the HF Hub; the diffs don't render well. Any guidance on how to do this properly would be appreciated, in case there is interest in including this contribution.
Thank you for the pull request! The tokenization of loom files scans through the file in chunks without loading it entirely into memory, which allows memory-efficient processing of large files. Have you tested the AnnData version on files on the order of a million cells?
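For reference, one way the AnnData path could stay memory-efficient as well is to open the .h5ad file in backed mode and process it in slices instead of loading the full matrix at once. A rough sketch of that pattern, not the code in this PR; chunk_size and the processing step are placeholders:

import anndata as ad

# open the file without loading the full expression matrix into memory
adata = ad.read_h5ad("/path/to/folder/temp_files/temp.h5ad", backed="r")

chunk_size = 10_000  # placeholder; tune to available memory
for start in range(0, adata.n_obs, chunk_size):
    end = min(start + chunk_size, adata.n_obs)
    chunk = adata[start:end].to_memory()  # materialize only this slice
    # ... tokenize the cells in `chunk` here ...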
@giovp I wanted to follow up on whether you have tested this for large input files yet. Please let me know! Thank you so much!
Thank you so much for initiating this contribution. It has now been modified and merged under PR 170.