AIRI-Institute/gena-lm-bert-base-fly

Yura Kuratov committed on Apr 17

Commit 6c97194 • 1 Parent(s): 32b0375

add fly data info to readme

Files changed (1)

README.md +3 -2

@@ -17,7 +17,8 @@ GENA-LM (`gena-lm-bert-base-fly`) model is trained with a masked language model
 - 768 Hidden size
 - 32k Vocabulary size
-We pre-trained `gena-lm-bert-base-fly` using TODO(data). Pre-training was performed for 1,900,000 iterations with batch size 256 and sequence length was equal to 512 tokens. We modified Transformer to use [Pre-Layer normalization](https://arxiv.org/abs/2002.04745). We upload checkpoint with the best MLM accuracy on validation set.
+We pre-trained `gena-lm-bert-base-fly` on data obtained from Progressive Cactus alignment of 298 drosophilid species generated by [Kim et al.](https://www.biorxiv.org/content/10.1101/2023.10.02.560517v1), dataset source: [link](https://doi.org/10.5061/dryad.x0k6djhrd).
+Pre-training was performed for 1,925,000 iterations with batch size 256 and sequence length was equal to 512 tokens. We modified Transformer to use [Pre-Layer normalization](https://arxiv.org/abs/2002.04745). We upload the checkpoint with the best loss on validation set.
 Source code and data: https://github.com/AIRI-Institute/GENA_LM
@@ -84,4 +85,4 @@ For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2
 	journal = {bioRxiv}
 }
-```
+```
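
For a sense of scale, the pre-training figures in the updated README line can be multiplied out. This is only a rough sketch: it assumes one iteration processes one batch of 256 sequences and that every sequence is filled to the full 512 tokens, neither of which the README states, so the result is an upper bound of about 252 billion tokens. The names `tokens_processed`, `ITERATIONS`, `BATCH_SIZE`, and `SEQ_LEN_TOKENS` are illustrative and do not come from the GENA_LM code.

```python
# Figures quoted from the updated README line in this commit.
ITERATIONS = 1_925_000
BATCH_SIZE = 256        # sequences per iteration (assumed: one batch per iteration)
SEQ_LEN_TOKENS = 512    # tokens per sequence (assumed: every sequence is full length)


def tokens_processed(iterations: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens seen during pre-training under the assumptions above."""
    return iterations * batch_size * seq_len


total = tokens_processed(ITERATIONS, BATCH_SIZE, SEQ_LEN_TOKENS)
print(f"{total:,} tokens (about {total / 1e9:.0f}B)")
```

Shorter or padded sequences would lower the real count, so the number is best read as a ceiling rather than a measurement.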