anndata tokenizer

#102
by giovp - opened

hi, I added a method that enables the tokenizer to also take AnnData (.h5ad) files as input. I tested it with the following:


# take any anndata file
# ...

# save it as loom and as h5ad
adata.write_loom("/path/to/folder/temp_files/temp.loom")
adata.write("/path/to/folder/temp_files/temp.h5ad")

# tokenize data
from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer({"tissue":"tissue"}, nproc=4) # some annotation
tk.tokenize_data("/path/to/folder/temp_files/temp.h5ad", "/path/to/folder/temp_files", "dataset_h5ad", "h5ad")

tk = TranscriptomeTokenizer({"tissue":"tissue"}, nproc=4) # some annotation
tk.tokenize_data("/path/to/folder/temp_files/temp.loom", "/path/to/folder/temp_files", "dataset_loom", "loom")

and then compared the two outputs, e.g. with diff -r -q dataset_h5ad.dataset dataset_loom.dataset, or by reading the Arrow files directly, etc.
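The diff -r -q check above can also be done in Python; here is a small stdlib sketch (function name is my own, not from the PR) that recursively compares two dataset directories the same way:

```python
import filecmp


def trees_identical(dir_a: str, dir_b: str) -> bool:
    """Recursively compare two directories, like `diff -r -q`.

    Returns True only if both trees contain the same entries and
    every common file compares equal.
    """
    cmp = filecmp.dircmp(dir_a, dir_b)
    if cmp.left_only or cmp.right_only or cmp.diff_files or cmp.funny_files:
        return False
    # Recurse into subdirectories present in both trees.
    return all(
        trees_identical(f"{dir_a}/{sub}", f"{dir_b}/{sub}")
        for sub in cmp.common_dirs
    )
```

Calling trees_identical("dataset_h5ad.dataset", "dataset_loom.dataset") should then report whether the two tokenized outputs match byte-for-byte.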

I'm not sure this is the best way to submit contributions via the HF Hub; the diffs don't render well. Any guidance on how to do this properly is appreciated, in case there is interest in having this contribution.

giovp changed pull request status to closed
giovp changed pull request status to open

Thank you for the pull request! The tokenization of loom files scans through the file in chunks, without loading the entire file into memory, to allow memory-efficient processing of large files. Have you tested the anndata version on anndata files on the order of a million cells?
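For context, the chunk-wise scanning pattern being asked about can be sketched as follows; this is a minimal illustration (helper name mine, not Geneformer's actual implementation), assuming cells are processed in fixed-size row ranges:

```python
def iter_cell_chunks(n_cells: int, chunk_size: int = 1000):
    """Yield (start, end) half-open row ranges covering n_cells cells."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


# Each range can then be sliced out of a file-backed matrix, e.g. with
# anndata's backed mode (adata = anndata.read_h5ad(path, backed="r");
# chunk = adata.X[start:end]), so only chunk_size rows are resident in
# memory at a time rather than the full million-cell matrix.
```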

@giovp I wanted to follow up on whether you have tested this for large input files yet. Please let me know! Thank you so much!

Thank you so much for initiating this contribution. It has now been modified and merged under PR 170.

ctheodoris changed pull request status to closed
