Yak-hbdx committed
Commit b1643b8
1 parent: 0b11a42

Push model using huggingface_hub.

Files changed (3)
  1. README.md +9 -145
  2. config.json +0 -0
  3. model.safetensors +3 -0
README.md CHANGED
@@ -1,145 +1,9 @@
- # TransfoRNA
- TransfoRNA is a **bioinformatics** and **machine learning** tool based on **Transformers** that provides annotations for 11 major classes (miRNA, rRNA, tRNA, snoRNA, protein-coding/mRNA, lncRNA, YRNA, piRNA, snRNA, snoRNA and vtRNA) and 1923 sub-classes of **human small RNAs and RNA fragments**. These are typically detected in RNA-seq NGS (next-generation sequencing) data.
-
- TransfoRNA can be trained on just the RNA sequences and, optionally, on additional information such as secondary structure. The result is a major- and sub-class assignment combined with a novelty score (normalized Levenshtein distance, NLD) that quantifies the difference between the query sequence and the closest match found in the training set; based on this score, the query sequence is classified as novel or familiar. TransfoRNA uses a small curated set of ground-truth labels obtained from common knowledge-based bioinformatics tools that map the sequences to transcriptome databases and a reference genome. Using TransfoRNA's framework, the number of high-confidence annotations in the TCGA dataset can be increased three-fold.
-
-
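- The novelty score can be illustrated with a minimal sketch (a hypothetical helper, not the package API; the exact normalization TransfoRNA uses may differ):
-
- ```
- def levenshtein(a: str, b: str) -> int:
-     # classic dynamic-programming edit distance
-     prev = list(range(len(b) + 1))
-     for i, ca in enumerate(a, 1):
-         curr = [i]
-         for j, cb in enumerate(b, 1):
-             curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
-         prev = curr
-     return prev[-1]
-
- def novelty_score(query: str, train_seqs: list[str]) -> float:
-     # NLD in [0, 1]: 0 = exact match (familiar), 1 = maximally distant (novel);
-     # normalizing by the longer sequence's length is an assumption here
-     best = min(train_seqs, key=lambda s: levenshtein(query, s))
-     return levenshtein(query, best) / max(len(query), len(best))
- ```
-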
- ## Dataset (Objective):
- - **The Cancer Genome Atlas, [TCGA](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga)**, offers sequencing data of small RNAs and is used to evaluate TransfoRNA's classification performance.
- - Sequences are annotated using a knowledge-based annotation approach that provides annotations for ~2k different sub-classes belonging to 11 major classes.
- - Knowledge-based annotations are divided into three sets of varying confidence levels: a **high-confidence (HICO)** set, a **low-confidence (LOCO)** set, and a **non-annotated (NA)** set for sequences that could not be annotated at all. Only HICO annotations are used for training.
- - HICO RNAs cover ~2k sub-classes and constitute 19.6% of all RNAs found in TCGA. The LOCO and NA sets comprise 66.9% and 13.6% of RNAs, respectively.
- - HICO RNAs are further divided into **in-distribution (ID)** (374 sub-classes) and **out-of-distribution (OOD)** (1549 sub-classes) sets.
- - Criterion for ID vs. OOD: sub-classes containing more than 8 sequences are considered ID; the rest are OOD (see the sketch after this list).
- - An additional **putative 5' adapter affixes set** contains 294 sequences known to be technical artefacts: their 5' end perfectly matches the last five or more nucleotides of the 5' adapter sequence commonly used in small RNA sequencing.
- - The knowledge-based annotation (KBA) pipeline, including an installation guide, is located under `kba_pipeline`.
-
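- A minimal sketch of the ID/OOD split (toy data; the real HICO set covers ~2k sub-classes):
-
- ```
- import pandas as pd
-
- # toy HICO table: one row per annotated sequence
- hico = pd.DataFrame({"subclass_name": ["miR-141-5p"] * 10 + ["miR-9-3p"] * 3})
- counts = hico["subclass_name"].value_counts()
- id_classes = counts[counts > 8].index.tolist()    # ID: more than 8 sequences
- ood_classes = counts[counts <= 8].index.tolist()  # OOD: the rest
- print(id_classes, ood_classes)  # ['miR-141-5p'] ['miR-9-3p']
- ```
-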
- ## Models
- There are currently 5 classifier models available, each with a different input representation:
- - Baseline:
-   - Input: (single input) Sequence
-   - Model: An embedding layer that converts sequences into vectors, followed by a feed-forward classification layer.
- - Seq:
-   - Input: (single input) Sequence
-   - Model: A transformer-based encoder model.
- - Seq-Seq:
-   - Input: (dual inputs) Sequence divided into even and odd tokens (see the sketch after this list).
-   - Model: One transformer encoder for the odd tokens and another for the even tokens.
- - Seq-Struct:
-   - Input: (dual inputs) Sequence + secondary structure
-   - Model: One transformer encoder for the sequence and another for the secondary structure.
- - Seq-Rev (best performing):
-   - Input: (dual inputs) Sequence
-   - Model: One transformer encoder for the sequence and another for the reversed sequence.
-
-
- *Note: These Transformer-based models show overlapping as well as distinct capabilities. Consequently, an ensemble model is created to leverage those capabilities.*
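-
- A sketch of the Seq-Seq dual input (single-nucleotide tokens are assumed here; the package's actual tokenizer may differ):
-
- ```
- seq = "TCGAGGTCA"
- even = seq[0::2]  # tokens at even positions -> "TGGTA"
- odd = seq[1::2]   # tokens at odd positions  -> "CAGC"
- ```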
-
-
- <img width="948" alt="Screenshot 2023-08-16 at 16 39 20" src="https://github.com/gitHBDX/TransfoRNA-Framework/assets/82571392/d7d092d8-8cbd-492a-9ccc-994ffdd5aa5f">
-
- ## Data Availability
- The data and the models can be downloaded from [here](https://www.dropbox.com/sh/y7u8cofmg41qs0y/AADvj5lw91bx7fcDxghMbMtsa?dl=0).
-
- This will download three subfolders that should be kept at the same folder level as `src`:
- - `data`: contains three files:
-   - `TCGA` anndata with ~75k sequences and `var` columns containing the knowledge-based annotations.
-   - `HBDXBase.csv`, containing a list of RNA precursors which are then used for data augmentation.
-   - `subclass_to_annotation.json`, holding the mapping from every sub-class to its major class.
-
- - `models`:
-   - `benchmark`: contains benchmark models trained on sncRNA and premiRNA data (see the additional datasets at the bottom).
-   - `tcga`: all models trained on the TCGA data; `TransfoRNA_ID` (for testing and validation) and `TransfoRNA_FULL` (the production version) with higher RNA major- and sub-class coverage. Each of the two folders contains all the models trained separately on major-class and sub-class targets.
- - `kba_pipeline`: contains the mapping reference data required to run the knowledge-based pipeline manually.
- ## Repo Structure
- - `configs`: contains the configurations of each model, the training and the inference settings.
-
- The `configs/main_config.yaml` file offers options to change the task, the training settings and the logging. The following shows all the options and the permitted values for each option.
-
- <img width="835" alt="Screenshot 2024-05-22 at 10 19 15" src="https://github.com/gitHBDX/TransfoRNA/assets/82571392/225d2c98-ed45-4ca7-9e86-557a73af702d">
-
- - `transforna` contains two folders:
-   - the `src` folder, which contains the transforna package. View transforna's architecture [here](https://github.com/gitHBDX/TransfoRNA/blob/master/transforna/src/readme.md).
-   - the `bin` folder, which contains all the scripts necessary for reproducing the manuscript figures.
-
- ## Installation
-
- The `install.sh` script creates a transforna environment in which all the packages required by TransfoRNA are installed. Simply navigate to the root directory and run from a terminal:
-
- ```
- # make the install script executable
- chmod +x install.sh
-
- # run the script
- ./install.sh
- ```
-
- ## TransfoRNA API
- In `transforna/src/inference/inference_api.py`, all the functionalities of TransfoRNA are offered as APIs. There are two functions of interest:
- - `predict_transforna`: computes, for a set of sequences and a given model, one of several outputs: the embeddings, logits, explanatory (similar) sequences, attention masks or UMAP coordinates.
- - `predict_transforna_all_models`: same as `predict_transforna`, but computes the desired output for all the models and also aggregates the output of the ensemble model.
- Both return a pandas dataframe containing the sequence along with the desired computation.
-
- Check the script at `src/test_inference_api.py` for a basic demo of how to call either of the APIs.
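-
- As a minimal sketch (the function names come from `inference_api.py`; the exact import path and keyword names are assumptions and may differ):
-
- ```
- from transforna.src.inference.inference_api import (
-     predict_transforna,
-     predict_transforna_all_models,
- )
-
- seqs = ["TCGATTCGAGGTCA", "GTTCAGAGAACACAGGCT"]  # illustrative RNA fragments
-
- # single model (keyword names are assumptions)
- df = predict_transforna(sequences=seqs, model="seq-rev")
-
- # all models, plus the aggregated ensemble output
- df_all = predict_transforna_all_models(sequences=seqs)
- print(df_all.head())
- ```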
-
- ## Inference from terminal
- For inference, two paths in `configs/inference_settings/default.yaml` have to be edited:
- - `sequences_path`: the full path to a csv file containing the sequences for which annotations are to be inferred (see the sketch after this list).
- - `model_path`: the full path of the model (currently this points to the Seq model).
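-
- A hypothetical input file for `sequences_path` (a single `sequence` column, one RNA sequence per row, is assumed here):
-
- ```
- import pandas as pd
-
- pd.DataFrame({"sequence": ["TCGATTCGAGGTCA", "GTTCAGAGAACACAGGCT"]}).to_csv(
-     "my_sequences.csv", index=False
- )
- ```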
-
- Also in the `main_config.yaml`, make sure to edit the `model_name` to match the input expected by the loaded model:
- - `model_name`: add the name of the model; one of `"seq"`, `"seq-seq"`, `"seq-struct"`, `"baseline"` or `"seq-rev"` (see above).
-
-
- Then, navigate to the repository's root directory and run the following command:
-
- ```
- python transforna/__main__.py inference=True
- ```
-
- After inference, an `inference_output` folder will be created under `outputs/`, which will include two files:
- - `(model_name)_embedds.csv`: contains a vector embedding per sequence in the inference set (which can be used for downstream tasks).
- *Note: The embedding of each sequence will only be logged if `log_embedds` in the `main_config` is `True`.*
- - `(model_name)_inference_results.csv`: contains the columns `Net-Label`, holding the predicted label, and the boolean column `Is Familiar?`, holding the model's novelty-predictor output (True: familiar / False: novel).
- *Note: The output will also contain the logits of the model if `log_logits` in the `main_config` is `True`.*
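-
- For example, to keep only the sequences the model deems familiar (the file name below assumes the `seq-rev` model was used):
-
- ```
- import pandas as pd
-
- res = pd.read_csv("outputs/inference_output/seq-rev_inference_results.csv")
- familiar = res[res["Is Familiar?"]]  # drop sequences flagged as novel
- print(familiar["Net-Label"].value_counts())
- ```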
-
-
- ## Train on custom data
- TransfoRNA can be trained using input data as AnnData, csv or fasta. If the input is AnnData, then `anndata.var` should contain all the sequences. Some changes have to be made (follow `configs/train_model_configs/tcga`):
-
- In `configs/train_model_configs/custom`:
- - `dataset_path_train` has to point to the input data, which should contain: a `sequence` column; a `small_RNA_class_annotation` column indicating the major class if available (otherwise NaN); a `five_prime_adapter_filter` column specifying whether the sequence is considered a real sequence or an artifact (`True` for real, `False` for artifact); a `subclass_name` column containing the sub-class name if available (otherwise NaN); and a boolean column `hico` indicating whether a sequence is high confidence or not. (See the sketch after this list.)
- - If sampling from the precursor is required in order to augment the sub-classes, `precursor_file_path` should include the precursors. Follow the scheme of `HBDXBase.csv` and have a look at the `PrecursorAugmenter` class in `transforna/src/processing/augmentation.py`.
- - `mapping_dict_path` should contain the mapping from sub-class to major class, e.g. 'miR-141-5p' to 'miRNA'.
- - `clf_target` sets the classification target of the model and should be either `sub_class_hico` for training on targets in `subclass_name` or `major_class_hico` for training on targets in `small_RNA_class_annotation`. For both, only high-confidence sequences are selected for training (based on the `hico` column).
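-
- A minimal custom training table under the column scheme described above (the values are purely illustrative):
-
- ```
- import numpy as np
- import pandas as pd
-
- pd.DataFrame(
-     {
-         "sequence": ["TCGATTCGAGGTCA", "GTTCAGAGAACACAGGCT"],
-         "small_RNA_class_annotation": ["miRNA", np.nan],
-         "five_prime_adapter_filter": [True, True],
-         "subclass_name": ["miR-141-5p", np.nan],
-         "hico": [True, False],
-     }
- ).to_csv("custom_train.csv", index=False)
- ```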
-
- In `configs/main_config`, some changes should be made:
- - change `task` to `custom`, or to whatever name `custom.py` has been renamed to.
- - set the `model_name` as desired.
-
- For training TransfoRNA from the root directory:
- ```
- python transforna/__main__.py
- ```
- Using [Hydra](https://hydra.cc/), any option in the main config can be changed. For instance, to train a `Seq-Struct` TransfoRNA model without using a validation split:
- ```
- python transforna/__main__.py train_split=False model_name='seq-struct'
- ```
- After training, an output folder is automatically created in the root directory where training is logged.
- The structure of the output folder is chosen by Hydra to be `/day/time/results folders`. The results folders are a set of folders created during training:
- - `ckpt`: contains the latest checkpoint of the model.
- - `embedds`:
-   - contains a file per split (train/valid/test/ood/na).
-   - each file is a `csv` containing the sequences plus their embeddings (obtained by the model as a numeric representation of a given RNA sequence) as well as the logits. The logits are the values the model produces for each sequence, reflecting its confidence that a sequence belongs to a certain class.
- - `meta`: a folder containing a `yaml` file with all the hyperparameters used for the current run.
- - `analysis`: contains the learned novelty threshold separating the in-distribution (familiar) set from the out-of-distribution (novel) set.
- - `figures`: some figures are saved, showing the normalized Levenshtein distance (NLD) distribution per split.
-
-
- ## Additional Datasets (Objective):
- - sncRNA, collected from [RFam](https://rfam.org/) (classification of RNA precursors into 13 classes)
- - premiRNA, [human miRNAs](http://www.mirbase.org) (classification of true vs. pseudo precursors)
-
 
+ ---
+ tags:
+ - pytorch_model_hub_mixin
+ - model_hub_mixin
+ ---
+
+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: [More Information Needed]
+ - Docs: [More Information Needed]
config.json ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:edbf105ab477ca595e2a03ca92eaa23b93488720cd4743cb8d4fac183899238c
+ size 7089068