Commit 9edb641 by hdallatorre
Parent: ddcbf81

feat: Add model card

Files changed (1):
  1. README.md +3 -0
README.md CHANGED
@@ -92,6 +92,9 @@ The masking procedure used is the standard one for Bert-style training:

The model was trained with 8 A100 80GB GPUs on 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During a first warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, before decreasing following a square-root decay until the end of training.

+ ### Architecture
+
+ The model belongs to the second generation of nucleotide transformers, with the architectural changes consisting of the use of rotary positional embeddings instead of learned ones, as well as the introduction of Gated Linear Units.

### BibTeX entry and citation info
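
The learning-rate schedule described in the card (linear warmup from 5e-5 to 1e-4 over 16k steps, then a square-root decay) can be illustrated with a minimal sketch; the function name and the choice to anchor the decay at the end of warmup are assumptions, since the card does not spell out the exact implementation.

```python
# Illustrative sketch, not the actual training code: linear warmup from 5e-5 to
# 1e-4 over 16k steps, then square-root (inverse-sqrt) decay anchored at the end
# of warmup. The anchoring choice is an assumption.
def learning_rate(step: int,
                  lr_start: float = 5e-5,
                  lr_peak: float = 1e-4,
                  warmup_steps: int = 16_000) -> float:
    if step < warmup_steps:
        # Linear warmup between lr_start and lr_peak.
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    # Square-root decay: equals lr_peak at step == warmup_steps and decays
    # proportionally to 1/sqrt(step) afterwards.
    return lr_peak * (warmup_steps / step) ** 0.5

print(learning_rate(0))        # 5e-05 (start of warmup)
print(learning_rate(16_000))   # 0.0001 (peak, end of warmup)
print(learning_rate(64_000))   # 5e-05 (decayed by a factor of 2)
```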
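
The Gated Linear Units mentioned in the new Architecture section can be sketched as a feed-forward block like the one below; the class name, hidden width, and the plain sigmoid gate are illustrative assumptions rather than the model's actual configuration, and the rotary positional embeddings are not shown.

```python
# Minimal sketch of a Gated Linear Unit (GLU) feed-forward block.
# Names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden)   # gating branch
        self.value_proj = nn.Linear(d_model, d_hidden)  # value branch
        self.out_proj = nn.Linear(d_hidden, d_model)    # back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU: element-wise product of a sigmoid-gated branch and a linear branch.
        return self.out_proj(torch.sigmoid(self.gate_proj(x)) * self.value_proj(x))

# Example: a batch of 2 sequences of 1000 token embeddings of width 512.
x = torch.randn(2, 1000, 512)
y = GLUFeedForward(d_model=512, d_hidden=2048)(x)  # shape (2, 1000, 512)
```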