ctheodoris/Geneformer · Upload tokenizer.py

giovp

Jul 6, 2023

Enable tokenizer to work on AnnData files

Upload tokenizer.py 411fc8e0

giovp

Jul 6, 2023

Hi, I added a method that lets the tokenizer take AnnData (.h5ad) files as input in addition to loom files. I tested it with the following:


# take any anndata file
# ...

# save it as loom and as h5ad
adata.write_loom("/path/to/folder/temp_files/temp.loom")
adata.write("/path/to/folder/temp_files/temp.h5ad")

# tokenize data
from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer({"tissue":"tissue"}, nproc=4) # some annotation
tk.tokenize_data("/path/to/folder/temp_files/temp.h5ad", "/path/to/folder/temp_files", "dataset_h5ad", "h5ad")

tk = TranscriptomeTokenizer({"tissue":"tissue"}, nproc=4) # some annotation
tk.tokenize_data("/path/to/folder/temp_files/temp.loom", "/path/to/folder/temp_files", "dataset_loom", "loom")

I then compared the two outputs, e.g. with diff -r -q dataset_h5ad.dataset dataset_loom.dataset, or by reading the Arrow files directly.
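For a scripted version of the directory comparison, here is a minimal sketch using only the Python standard library. The helper name dirs_identical is hypothetical and not part of Geneformer, and it assumes the two output folders are expected to match byte for byte, which is what diff -r -q checks:

```python
import filecmp
from pathlib import Path


def dirs_identical(left, right):
    """Return True if both directory trees hold the same files with identical bytes.

    Hypothetical helper, roughly mirroring `diff -r -q left right`.
    """
    left, right = Path(left), Path(right)
    cmp = filecmp.dircmp(left, right)
    # An entry present on only one side, or with mismatched type, means the trees differ.
    if cmp.left_only or cmp.right_only or cmp.common_funny:
        return False
    # Compare common files by content (shallow=False) rather than by stat signature.
    _, mismatch, errors = filecmp.cmpfiles(
        left, right, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    # Recurse into subdirectories present on both sides.
    return all(dirs_identical(left / d, right / d) for d in cmp.common_dirs)
```

Usage would look like dirs_identical("dataset_h5ad.dataset", "dataset_loom.dataset"). If the saved datasets contain metadata that legitimately differs between runs, a content-level comparison of the Arrow tables may be more appropriate than a byte-level one.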

I'm not sure this is the best way to submit contributions via the HF Hub, since the diffs don't render well. Any help on how to do this properly is appreciated, if there is interest in this contribution.

giovp

Jul 6, 2023

Found some bugs; closing for now and will reopen later.

giovp changed pull request status to closed Jul 6, 2023