tokenizer-uncropped-input_ids

#275

Added an option to the create_dataset function in tokenizer.py that keeps all gene tokens for a cell in a new feature, 'input_ids_uncropped', and stores the total number of genes in the cell before truncation/cropping in 'length_uncropped'. The existing truncation code is kept and still produces the cropped features as before.

This allows analysis of uncropped gene token ranks, which can be useful for understanding gene coverage across the cells in a dataset, for example in how many cells a given gene is ranked <= 2048 compared to > 2048.
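As a rough illustration of that kind of analysis, the sketch below counts, for each gene token, the number of cells in which it falls within the first 2048 ranks versus beyond them. It is not part of the PR: the function name `count_gene_coverage` and the toy token lists are hypothetical, and it assumes each cell's 'input_ids_uncropped' is a rank-ordered list of integer token ids.

```python
from collections import Counter


def count_gene_coverage(uncropped_cells, max_len=2048):
    """Count, per gene token, the cells where its rank is <= max_len vs > max_len.

    `uncropped_cells` is assumed to be an iterable of rank-ordered token lists,
    e.g. the 'input_ids_uncropped' feature of each cell (hypothetical usage).
    """
    within = Counter()  # cells where the gene survives cropping
    beyond = Counter()  # cells where the gene would be cropped away
    for tokens in uncropped_cells:
        for rank, token in enumerate(tokens, start=1):
            if rank <= max_len:
                within[token] += 1
            else:
                beyond[token] += 1
    return within, beyond


# Toy data with a tiny cutoff so the effect is visible.
cells = [[10, 11, 12], [11, 12, 10], [12, 10]]
within, beyond = count_gene_coverage(cells, max_len=2)
```

With real data the cutoff would stay at 2048, and the two counters can be compared per gene to see how often cropping removes it.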

The changes also move the cropping and length calculations into a single function, which avoids iterating over the dataset twice and can save time on large datasets.
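A minimal sketch of the single-pass idea, under stated assumptions: the function name `crop_and_measure`, the `keep_uncropped` flag, and the default of 2048 are illustrative and are not taken from the merged code. Each example is a dict with a rank-ordered 'input_ids' list; cropping and both length calculations happen in one call per cell.

```python
from typing import Any, Dict, List


def crop_and_measure(
    example: Dict[str, List[int]],
    max_len: int = 2048,          # assumed cutoff, matching the threshold above
    keep_uncropped: bool = True,  # hypothetical flag name for the new option
) -> Dict[str, Any]:
    """Crop one cell's tokens and record lengths in a single pass."""
    tokens = example["input_ids"]
    out: Dict[str, Any] = {}
    if keep_uncropped:
        # Keep the full ranked token list and its length before cropping.
        out["input_ids_uncropped"] = list(tokens)
        out["length_uncropped"] = len(tokens)
    # Crop and measure the cropped length in the same call.
    out["input_ids"] = tokens[:max_len]
    out["length"] = len(out["input_ids"])
    return out


# Toy cells: one longer than the cutoff, one shorter.
cells = [{"input_ids": list(range(3000))}, {"input_ids": list(range(500))}]
processed = [crop_and_measure(cell) for cell in cells]
```

A per-example function of this shape could also be passed to `Dataset.map` from the Hugging Face datasets library, so the dataset is traversed once rather than once for cropping and once for lengths.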

jamieb-nvs changed pull request status to open
ctheodoris changed pull request status to merged

Thank you for your valuable contribution to the codebase!
