# GenerRNA
GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.

Here you can find all the scripts needed to run GenerRNA on your machine. GenerRNA enables you to generate RNA sequences in a zero-shot manner to explore the RNA space, or to fine-tune the model on a specific dataset to generate RNAs that belong to a particular family or possess specific characteristics.

# Requirements
A CUDA environment with at least 8 GB of VRAM is required.
### Dependencies
```
torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
```
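
The hardware requirement above can be checked before running any script. The following is a minimal sketch, assuming PyTorch is the installed CUDA frontend; the helper names `meets_vram_requirement` and `check_cuda` are illustrative and not part of this repository.

```python
MIN_VRAM_GB = 8  # minimum stated in this README


def meets_vram_requirement(total_bytes: int, min_gb: float = MIN_VRAM_GB) -> bool:
    """Return True if a device with `total_bytes` of memory meets the minimum.

    Decimal gigabytes are used on purpose (an assumption of this sketch):
    cards sold as "8 GB" may report slightly less than 8 GiB to PyTorch.
    """
    return total_bytes >= min_gb * 10**9


def check_cuda() -> str:
    """Describe whether the first CUDA device looks usable for GenerRNA."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    if meets_vram_requirement(props.total_memory):
        return f"{props.name}: VRAM requirement met"
    return f"{props.name}: less than {MIN_VRAM_GB} GB of VRAM"


if __name__ == "__main__":
    print(check_cuda())
```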

# Usage
First, recombine the split model file with the command `cat model.pt.part-* > model.pt.recombined`.
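
On systems without `cat`, the same concatenation can be done in Python. This is a sketch of an equivalent step, not a script shipped with the repository; the function name `recombine` is hypothetical. It joins the `model.pt.part-*` files in lexicographic order (`aa`, `ab`, ...), which matches the order in which the shell glob expands for these names.

```python
import shutil
from pathlib import Path


def recombine(parts_dir: str = ".", out_name: str = "model.pt.recombined") -> Path:
    """Concatenate model.pt.part-* files from `parts_dir` into one file."""
    parts = sorted(Path(parts_dir).glob("model.pt.part-*"))
    if not parts:
        raise FileNotFoundError(f"no model.pt.part-* files in {parts_dir}")
    out_path = Path(parts_dir) / out_name
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so it never has to fit in memory at once.
                shutil.copyfileobj(src, out)
    return out_path


if __name__ == "__main__":
    print(recombine("."))
```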
#### Directory tree
```
.
├── LICENSE
├── README.md
├── configs
│   ├── example_finetuning.py
│   └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa # split binary data of the *HISTORICAL* model (shorter context window, lower VRAM consumption)
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model_updated.pt # *NEWER* model, with a longer context window, trained on a deduplicated dataset
├── model.py         # defines the architecture
├── sampling.py      # script to generate sequences
├── tokenization.py  # prepares data
├── tokenizer_bpe_1024
│   ├── tokenizer.json
│   ├── ....
├── train.py # script for training/fine-tuning
```

### De novo Generation in a zero-shot fashion
Usage example:
```
python sampling.py \
    --out_path {output_file_path} \
    --max_new_tokens 256 \
    --ckpt_path {model.pt} \
    --tokenizer_path {path_to_tokenizer_directory, e.g. ./tokenizer_bpe_1024}
```
### Pre-training or Fine-tuning on your own sequences
First, tokenize your sequence data. The input file must contain one sequence per line, with no header line.
```
python tokenization.py \
    --data_dir {path_to_the_directory_containing_sequence_data} \
    --file_name {file_name_of_sequence_data} \
    --tokenizer_path {path_to_tokenizer_directory}  \
    --out_dir {directory_to_save_tokenized_data} \
    --block_size 256
```
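
If your sequences are stored in FASTA format (an assumption; this README only specifies the one-sequence-per-line layout), they need to be flattened before tokenization. The sketch below drops the `>` header lines and joins wrapped records into single lines. The function name `fasta_to_lines` is illustrative and not part of the repository, and the sequences are passed through unchanged (no case or alphabet conversion).

```python
from typing import List


def fasta_to_lines(fasta_text: str) -> List[str]:
    """Return one string per FASTA record, with headers removed."""
    sequences: List[str] = []
    current: List[str] = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


if __name__ == "__main__":
    example = ">seq1\nACGU\nGG\n>seq2\nUUU\n"
    print("\n".join(fasta_to_lines(example)))
```

Writing the joined output to a text file gives an input in the layout that `tokenization.py` is described as expecting.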

Next, create a config file for the GPT model, using `./configs/example_finetuning.py` or `./configs/example_pretraining.py` as a template.

Lastly, execute the following command:
```
python train.py \
    --config {path_to_your_config_file}
```

### Train your own tokenizer
Usage example:
```
python train_BPE.py \
    --txt_file_path {path_to_training_file(txt,each sequence is on a separate line)} \
    --vocab_size 50256 \
    --new_tokenizer_path {directory_to_save_trained_tokenizer}
```
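
For intuition about what BPE training does to nucleotide sequences, the core merge loop can be sketched in plain Python. This is a toy illustration of byte-pair encoding in general, not the implementation in `train_BPE.py`; all function names here are hypothetical. Starting from single nucleotides, it repeatedly merges the most frequent adjacent token pair into a new token.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[str, str]


def most_frequent_pair(corpus: List[List[str]]) -> Optional[Pair]:
    """Return the most common adjacent token pair, or None if there is none."""
    counts: Counter = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(tokens: List[str], pair: Pair) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences: List[str], num_merges: int):
    """Learn up to `num_merges` merges; return the merges and the re-tokenized corpus."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        merges.append(pair)
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    print(learn_merges(["GCGCGC", "GCAU"], num_merges=1))
```

In a real tokenizer the number of merges is determined by the target vocabulary size, which is what the `--vocab_size` option above controls.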

# License
The source code is licensed under the MIT License. See `LICENSE`.