Yura Kuratov committed on
Commit 6c97194
1 Parent(s): 32b0375

add fly data info to readme

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -17,7 +17,8 @@ GENA-LM (`gena-lm-bert-base-fly`) model is trained with a masked language model
  - 768 Hidden size
  - 32k Vocabulary size
 
- We pre-trained `gena-lm-bert-base-fly` using TODO(data). Pre-training was performed for 1,900,000 iterations with batch size 256 and sequence length was equal to 512 tokens. We modified Transformer to use [Pre-Layer normalization](https://arxiv.org/abs/2002.04745). We upload checkpoint with the best MLM accuracy on validation set.
+ We pre-trained `gena-lm-bert-base-fly` on data obtained from Progressive Cactus alignment of 298 drosophilid species generated by [Kim et al.](https://www.biorxiv.org/content/10.1101/2023.10.02.560517v1), dataset source: [link](https://doi.org/10.5061/dryad.x0k6djhrd).
+ Pre-training was performed for 1,925,000 iterations with batch size 256 and sequence length was equal to 512 tokens. We modified Transformer to use [Pre-Layer normalization](https://arxiv.org/abs/2002.04745). We upload the checkpoint with the best loss on validation set.
 
  Source code and data: https://github.com/AIRI-Institute/GENA_LM
 
@@ -84,4 +85,4 @@ For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2
  journal = {bioRxiv}
  }
 
- ```
+ ```
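
For a sense of scale, the pre-training settings in the added README line (1,925,000 iterations, batch size 256, sequence length 512) put an upper bound on the number of tokens processed. The sketch below is only an illustration: it assumes every batch is full and every sequence uses all 512 positions, so padding and special tokens are counted too. The function and variable names are hypothetical and do not come from the GENA_LM code.

```python
def tokens_processed(iterations: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on token positions seen during pre-training.

    Assumes full batches and full-length sequences (an assumption,
    not something stated in the README).
    """
    return iterations * batch_size * seq_len


# Values from the updated README line.
new_total = tokens_processed(1_925_000, 256, 512)
# Values from the README line this commit replaced.
old_total = tokens_processed(1_900_000, 256, 512)

print(f"updated README: up to {new_total:,} token positions")
print(f"previous README: up to {old_total:,} token positions")
print(f"difference: {new_total - old_total:,} token positions")
```

Under these assumptions the updated figure is roughly 252 billion token positions, about 3.3 billion more than the iteration count in the previous README text implied.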