Upload tokenizer.py
#99 opened by giovp
Enable tokenizer to work on anndata files
Hi, I added a method that lets the tokenizer take anndata (.h5ad) files as input, in addition to loom files. I tested it as follows:
# take any anndata file
# ...
# save it as loom and as h5ad
adata.write_loom("/path/to/folder/temp_files/temp.loom")
adata.write("/path/to/folder/temp_files/temp.h5ad")
# tokenize both copies of the same data, once per input format
from geneformer import TranscriptomeTokenizer
# the dict maps the "tissue" annotation in the input file to a "tissue" column in the output
tk = TranscriptomeTokenizer({"tissue": "tissue"}, nproc=4)
tk.tokenize_data("/path/to/folder/temp_files/temp.h5ad", "/path/to/folder/temp_files", "dataset_h5ad", "h5ad")
tk = TranscriptomeTokenizer({"tissue": "tissue"}, nproc=4)
tk.tokenize_data("/path/to/folder/temp_files/temp.loom", "/path/to/folder/temp_files", "dataset_loom", "loom")
Then check that the two outputs match, e.g. with diff -r -q dataset_h5ad.dataset dataset_loom.dataset, or by reading the Arrow files and comparing their contents.
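For the "reading the Arrow files" route, a row-level comparison could look like the sketch below. It is only an illustration, not part of the PR: the helper names datasets_equal and compare_on_disk are made up here, and it assumes the tokenizer output can be opened with datasets.load_from_disk and that both runs keep the cells in the same order. A byte-level diff may also flag differences in metadata files even when the rows are identical, which is one reason to compare rows instead.

```python
def datasets_equal(ds_a, ds_b):
    """Return True when both datasets hold the same rows in the same order.

    Works on anything that has a length and yields dict-like rows when
    iterated, e.g. a Hugging Face Dataset or a plain list of dicts.
    """
    if len(ds_a) != len(ds_b):
        return False
    # Compare row by row; each row maps column names to values
    # (for example input_ids, length, and any custom attributes).
    return all(dict(row_a) == dict(row_b) for row_a, row_b in zip(ds_a, ds_b))


def compare_on_disk(path_a, path_b):
    """Load two saved datasets and compare them row by row.

    Hypothetical helper: assumes both paths were written in the format
    that datasets.load_from_disk can read.
    """
    from datasets import load_from_disk  # imported lazily; needs the datasets package

    return datasets_equal(load_from_disk(path_a), load_from_disk(path_b))
```

Usage would be along the lines of compare_on_disk("/path/to/folder/temp_files/dataset_h5ad.dataset", "/path/to/folder/temp_files/dataset_loom.dataset"), which returns True only if every row matches.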
I'm not sure this is the best way to submit contributions via the HF Hub, since the diffs don't render well. If there is interest in this contribution, any help on how to submit it properly would be appreciated.
I found some bugs, so I'm closing this for now and will reopen it later.
giovp changed pull request status to closed