anndata tokenizer

#102
by giovp - opened

hi, I added a method that enables the tokenizer to also take AnnData (.h5ad) files as input. I tested it with the following:


# take any anndata file
# ...

# save it as loom and as h5ad
adata.write_loom("/path/to/folder/temp_files/temp.loom")
adata.write("/path/to/folder/temp_files/temp.h5ad")

# tokenize data
from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer({"tissue":"tissue"}, nproc=4) # some annotation
tk.tokenize_data("/path/to/folder/temp_files/temp.h5ad", "/path/to/folder/temp_files", "dataset_h5ad", "h5ad")

tk = TranscriptomeTokenizer({"tissue":"tissue"}, nproc=4) # some annotation
tk.tokenize_data("/path/to/folder/temp_files/temp.loom", "/path/to/folder/temp_files", "dataset_loom", "loom")

and then compared the two outputs, e.g. with diff -r -q dataset_h5ad.dataset dataset_loom.dataset, or by reading the Arrow files directly, etc.
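The diff -r -q check above can also be done in Python; here is a small stdlib sketch (function name is my own, not from the PR) that recursively compares two dataset directories the same way:

```python
import filecmp


def trees_identical(dir_a: str, dir_b: str) -> bool:
    """Recursively compare two directories, like `diff -r -q`.

    Returns True only if both trees contain the same entries and
    every common file compares equal.
    """
    cmp = filecmp.dircmp(dir_a, dir_b)
    if cmp.left_only or cmp.right_only or cmp.diff_files or cmp.funny_files:
        return False
    # Recurse into subdirectories present in both trees.
    return all(
        trees_identical(f"{dir_a}/{sub}", f"{dir_b}/{sub}")
        for sub in cmp.common_dirs
    )
```

Calling trees_identical("dataset_h5ad.dataset", "dataset_loom.dataset") should then report whether the two tokenized outputs match byte-for-byte.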

I'm not sure this is the best way to submit contributions via the HF Hub; the diffs don't render well. Any guidance on how to do this properly is appreciated, in case there is interest in having this contribution.

giovp changed pull request status to closed
giovp changed pull request status to open

Thank you for the pull request! The tokenization of loom files scans through the file in chunks, without loading the entire file into memory, to allow memory-efficient processing of large files. Have you tested the anndata version on anndata files on the order of a million cells?
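For context, the chunk-wise scanning pattern being asked about can be sketched as follows; this is a minimal illustration (helper name mine, not Geneformer's actual implementation), assuming cells are processed in fixed-size row ranges:

```python
def iter_cell_chunks(n_cells: int, chunk_size: int = 1000):
    """Yield (start, end) half-open row ranges covering n_cells cells."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


# Each range can then be sliced out of a file-backed matrix, e.g. with
# anndata's backed mode (adata = anndata.read_h5ad(path, backed="r");
# chunk = adata.X[start:end]), so only chunk_size rows are resident in
# memory at a time rather than the full million-cell matrix.
```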

@giovp I wanted to follow up on whether you have tested this for large input files yet. Please let me know! Thank you so much!

Thank you so much for initiating this contribution. It has now been modified and merged under PR 170.

ctheodoris changed pull request status to closed
