yurakuratov committed
Commit 9525e19
1 Parent(s): 86f25d8

update readme

Files changed (1): README.md (+16 -9)
README.md CHANGED
@@ -4,13 +4,17 @@ tags:
  - human_genome
  ---

- # GENA-LM (BigBird-base T2T)
+ # GENA-LM (gena-lm-bigbird-base-t2t)

- GENA-LM (BigBird-base T2T) is a transformer masked language model trained on human DNA sequence. GENA-LM (BigBird-base T2T) follows BigBird architecture.
+ GENA-LM is a Family of Open-Source Foundational Models for Long DNA Sequences.

- Differences between GENA-LM (BigBird-base T2T) and DNABERT:
+ GENA-LM models are transformer masked language models trained on human DNA sequence.
+
+ `gena-lm-bigbird-base-t2t` follows the BigBird architecture and its HuggingFace implementation.
+
+ Differences between GENA-LM (`gena-lm-bigbird-base-t2t`) and DNABERT:
  - BPE tokenization instead of k-mers;
- - input sequence size is about 24000 nucleotides (4096 BPE tokens) compared to 510 nucleotides of DNABERT;
+ - input sequence size is about 36000 nucleotides (4096 BPE tokens) compared to 512 nucleotides of DNABERT;
  - pre-training on T2T vs. GRCh38.p13 human genome assembly.

  Source code and data: https://github.com/AIRI-Institute/GENA_LM
@@ -37,7 +41,7 @@ model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm
  ```

  ## Model description
- GENA-LM (BigBird-base T2T) model is trained in a masked language model (MLM) fashion, following the methods proposed in the BigBird paper by masking 15% of tokens. Model config for `gena-lm-bigbird-base-t2t` is similar to the `google/bigbird-roberta-base`:
+ GENA-LM (`gena-lm-bigbird-base-t2t`) model is trained in a masked language model (MLM) fashion, following the methods proposed in the BigBird paper by masking 15% of tokens. Model config for `gena-lm-bigbird-base-t2t` is similar to the `google/bigbird-roberta-base`:

  - 4096 Maximum sequence length
  - 12 Layers, 12 Attention heads
@@ -49,11 +53,14 @@ GENA-LM (BigBird-base T2T) model is trained in a masked language model (MLM) fas
  - sliding window blocks: 3
  - 32k Vocabulary size, tokenizer trained on DNA data.

- We pre-trained `gena-lm-bigbird-base-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling SNPs human mutations. Pre-training was performed for 1,070,000 iterations with batch size 256.
+ We pre-trained `gena-lm-bigbird-base-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 1,070,000 iterations with batch size 256.
+
+ ## Evaluation
+ For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1

  ## Citation
- ```
- @article {GENA_LM,
+ ```bibtex
+ @article{GENA_LM,
  author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
  title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
  elocation-id = {2023.06.12.544594},
@@ -64,4 +71,4 @@ We pre-trained `gena-lm-bigbird-base-t2t` using the latest T2T human genome asse
  eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
  journal = {bioRxiv}
  }
- ```
+ ```
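The "How to use" hunk above only shows the tail of the snippet that loads the model with `BigBirdForSequenceClassification.from_pretrained(...)`. A minimal sketch of what that usage and the BPE tokenization might look like, assuming the Hub id `AIRI-Institute/gena-lm-bigbird-base-t2t` and that the tokenizer loads via `AutoTokenizer` (neither is shown in full in this diff):

```python
# Sketch only: the Hub id and AutoTokenizer usage are assumptions, not shown in this diff.
from transformers import AutoTokenizer, BigBirdForSequenceClassification

model_id = 'AIRI-Institute/gena-lm-bigbird-base-t2t'  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The classification head is randomly initialized until fine-tuned on a downstream task.
model = BigBirdForSequenceClassification.from_pretrained(model_id)

# BPE tokenization: one token covers several nucleotides on average,
# which is how 4096 tokens span roughly 36000 nt.
dna = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTG'
inputs = tokenizer(dna, return_tensors='pt')
print(len(dna), 'nucleotides ->', inputs['input_ids'].shape[1], 'BPE tokens')

outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, num_labels)
```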
 
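The model description states only that pre-training masks 15% of tokens. As a rough illustration of that MLM objective (assuming the checkpoint also loads into `BigBirdForMaskedLM`, which the card does not state explicitly):

```python
# Rough MLM sketch; loading this checkpoint with BigBirdForMaskedLM is an assumption.
import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM

model_id = 'AIRI-Institute/gena-lm-bigbird-base-t2t'  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BigBirdForMaskedLM.from_pretrained(model_id)

inputs = tokenizer('ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTG', return_tensors='pt')
input_ids = inputs['input_ids'].clone()

pos = 3  # mask one interior BPE token (position chosen arbitrarily)
original_id = input_ids[0, pos].item()
input_ids[0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=inputs['attention_mask']).logits

# The model should reconstruct the masked token, mirroring the pre-training objective.
predicted_id = logits[0, pos].argmax(-1).item()
print(tokenizer.decode([original_id]), '->', tokenizer.decode([predicted_id]))
```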