Upload tokenizer.py

#99
by giovp - opened

Enable tokenizer to work on AnnData files

Hi, I added a method that lets the tokenizer also take AnnData (.h5ad) files as input. I tested it with the following:


# take any anndata file
# ...

# save it as loom and as h5ad
adata.write_loom("/path/to/folder/temp_files/temp.loom")
adata.write("/path/to/folder/temp_files/temp.h5ad")

# tokenize the same data from both file formats
from geneformer import TranscriptomeTokenizer

# carry the "tissue" cell annotation over to the tokenized dataset
tk = TranscriptomeTokenizer({"tissue": "tissue"}, nproc=4)

# tokenize the h5ad copy (new code path added in this PR)
tk.tokenize_data("/path/to/folder/temp_files/temp.h5ad", "/path/to/folder/temp_files", "dataset_h5ad", "h5ad")

# tokenize the loom copy (existing code path) with a fresh tokenizer
tk = TranscriptomeTokenizer({"tissue": "tissue"}, nproc=4)
tk.tokenize_data("/path/to/folder/temp_files/temp.loom", "/path/to/folder/temp_files", "dataset_loom", "loom")

The two outputs can then be compared, e.g. with diff -r -q dataset_h5ad.dataset dataset_loom.dataset, or by reading the Arrow files.
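For anyone who prefers to run the directory comparison from Python, here is a minimal sketch of roughly what diff -r -q reports, using only the standard library. The function name differing_files is hypothetical and not part of geneformer. It lists entries that exist on one side only or whose bytes differ.

```python
import filecmp
import os


def differing_files(dir_a, dir_b):
    """Return sorted relative paths that differ between two directory trees.

    A path is reported if it exists on one side only, if its type differs,
    or if the file contents differ byte for byte.
    """
    diffs = []

    def walk(rel):
        a = os.path.join(dir_a, rel)
        b = os.path.join(dir_b, rel)
        cmp = filecmp.dircmp(a, b)
        # Entries present on one side only, or with mismatched types.
        for name in cmp.left_only + cmp.right_only + cmp.common_funny:
            diffs.append(os.path.join(rel, name))
        # shallow=False forces a content comparison instead of an os.stat one.
        _, mismatch, errors = filecmp.cmpfiles(
            a, b, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            diffs.append(os.path.join(rel, name))
        # Recurse into subdirectories that exist on both sides.
        for name in cmp.common_dirs:
            walk(os.path.join(rel, name))

    walk("")
    return sorted(diffs)
```

Calling differing_files("dataset_h5ad.dataset", "dataset_loom.dataset") from the output folder would then return an empty list if the two saved datasets are byte-identical. Metadata files may differ even when the tokens match, so a non-empty result is a prompt to look at the Arrow contents rather than proof of a bug.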

Not sure if this is the best way to submit contributions via the HF Hub; the diffs don't look good. Any help on how to do this properly is appreciated, if there is interest in this contribution.

Found some bugs; closing for now and will reopen later.

giovp changed pull request status to closed
