I would like to confirm some information in the paper about genome annotation in the embeddings

#5
by stavisav - opened

Hello. I am going to explore GROVER in my graduation thesis and I just wanted to make sure I understand the part where the genome annotation was added to the embeddings. I would like to explain this part in the document.

From what I gathered, after pretraining, the tokens are annotated with features such as GC content, strand info, repeated elements and gene coordinates.
I'm not sure I understood exactly how the annotations are added/appended to the embeddings. Are they also transformed into numerical vectors? And then appended after each word's embedding or after the whole sequence?

Is that about it? I found this most interesting, thank you so much.

Biomedical Genomics lab of Anna Poetsch org

Hello,
The annotations are not part of the model's input. The model is agnostic to any genomic element during training and evaluation.
We extract the embeddings for the whole genome, and the annotations are used only for descriptive purposes. For example, Fig. 5d shows the UMAP of all the embeddings, and in each panel we color only the tokens that carry the respective annotation.
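
A minimal sketch of that idea, using toy data and hypothetical names (this is not the pipeline used in the paper): the embedding vectors are never modified, and an annotation is only a per-token label, looked up by genomic coordinate, that decides which points get highlighted in a plot. The random vectors, the 2D "projection", and the interval coordinates are all placeholders.

```python
import random

random.seed(0)
EMBED_DIM = 8  # toy size, not the real model dimension

# Toy token spans (chromosome, start, end); coordinates are made up.
token_spans = [("chr1", 0, 6), ("chr1", 6, 10), ("chr1", 10, 16),
               ("chr1", 16, 22), ("chr1", 22, 30)]

# Stand-in for the model's output: one vector per token, produced
# without access to any annotation.
embeddings = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
              for _ in token_spans]

# Stand-in for a 2D projection such as UMAP: here simply the first two
# coordinates of each vector, purely for illustration.
points_2d = [(vec[0], vec[1]) for vec in embeddings]

# Hypothetical annotation track: half-open intervals labelled "gene".
gene_intervals = [("chr1", 8, 20)]

def overlaps(span, interval):
    """True if a token span and an annotation interval share any base."""
    chrom_a, start_a, end_a = span
    chrom_b, start_b, end_b = interval
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a

# The annotation becomes a per-token boolean label; nothing is appended
# to the embedding vectors themselves.
is_gene = [any(overlaps(span, iv) for iv in gene_intervals)
           for span in token_spans]

# For one panel, highlight only the annotated tokens; the rest stay grey.
highlighted = [p for p, flag in zip(points_2d, is_gene) if flag]
background = [p for p, flag in zip(points_2d, is_gene) if not flag]
```

Repeating the last step with a different annotation track would give another panel over the same set of points.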
I hope the explanation is clear. If not, do not hesitate to let us know.
