yangheng committed on
Commit
bbf9a61
1 Parent(s): e598751

Update README.md

Files changed (1)
  1. README.md +33 -9
README.md CHANGED
@@ -1,16 +1,44 @@
1
  ---
2
  license: mit
3
  ---
4
-
5
  # PlantRNA-FM: An Interpretable RNA Foundation Model for Exploration Functional RNA Motifs in Plants
6
 
7
  ## Introduction
8
  In the dynamic field of life sciences, the exploration of RNA as a fundamental element in biological processes has led to significant scientific advancements. RNA molecules, characterized by complex sequences and structures, play critical roles in plant growth, development, and adaptation to environmental changes. Recent developments in artificial intelligence, specifically foundation models (FMs), have opened new frontiers for understanding and harnessing this complexity. Building on this momentum, we introduce PlantRNA-FM, a state-of-the-art RNA foundation model tailored for plants. This model integrates both RNA sequence and structural data from an extensive compilation of plant species, enabling unprecedented accuracy in predicting RNA functions and understanding translation dynamics. By combining robust pre-training on diverse RNA data with sophisticated interpretative frameworks, PlantRNA-FM sets a new standard in RNA bioinformatics, providing deep insights into the functional significance of RNA motifs within the plant transcriptome.
9
 
10
  ## Model Overview
11
 
12
  ![model.png](model.png)
13
- Figure 1. Schematic overview of the Pre-training Phase of PlantRNA-FM. The pre-training dataset comprises transcriptomic sequences from 1,124 plant species, consisting of approximately 25.0M RNA sequences and 54.2B RNA bases. The green dots on the global mean temperature map represent the geographical distribution of these plant species across the world.
14
 
15
  ## Requirements
16
  - Python 3.9+
@@ -20,6 +48,8 @@ Figure 1. Schematic overview of the Pre-training Phase of PlantRNA-FM. The pre-t
20
 
21
  ## Usage
22
  Please install the requirements and follow the instructions below to run the PlantRNA-FM model.
 
 
23
  ### Model loading
24
  ```python
25
  from transformers import AutoModel, AutoTokenizer
@@ -53,10 +83,4 @@ PlantRNA-FM is licensed under the MIT License. Many thanks to all the authors of
53
  The model is co-developed by ColaLAB@UniversityofExeter and JIC@NorwichResearchPark (alphabetically ordered).
54
 
55
  ## Citation
56
- BioArchive Link: [PlantRNA-FM: An Interpretable RNA Foundation Model for Exploration Functional RNA Motifs in Plants](TBC)
57
-
58
- ## Funding
59
- This work was supported by National Key Research and Development Program of China [2023YFA0913500] (HZ); National Key Research and Development Program of China [2021YFF1000900] (HZ); National Natural Science Foundation of China [32170229] (HZ); Fundamental Research Funds for the Central Universities [2412023YQ005] (HZ); the China Scholarship Council [No.202206620047] (WS); the United Kingdom Biotechnology and Biological Sciences Research Council (BBSRC) [BB/X01102X/1] (HY, YD); European Research Council (ERC) [selected by the ERC, funded by BBSRC Horizon Europe Guarantee [EP/Y009886/1] (YD); Human Frontier Science Program Fellowship [LT001077/2021-L] (HY); UKRI Future Leaders Fellowship [MR/S017062/1, MR/X011135/1] (KL); Kan Tong Po International Fellowship [KTP\R1\231017] (KL); Amazon Research Award (KL) and National Natural Science Foundation of China [62376056, 62076056] (KL).
60
-
61
-
62
-
 
1
  ---
2
  license: mit
3
+ language:
4
+ - rna
5
+ - dna
6
+ datasets:
7
+ - OneKP
8
+ library_name: OmniGenome
9
+ tags:
10
+ - biology
11
  ---
 
12
  # PlantRNA-FM: An Interpretable RNA Foundation Model for Exploration Functional RNA Motifs in Plants
13
 
14
  ## Introduction
15
  In the dynamic field of life sciences, the exploration of RNA as a fundamental element in biological processes has led to significant scientific advancements. RNA molecules, characterized by complex sequences and structures, play critical roles in plant growth, development, and adaptation to environmental changes. Recent developments in artificial intelligence, specifically foundation models (FMs), have opened new frontiers for understanding and harnessing this complexity. Building on this momentum, we introduce PlantRNA-FM, a state-of-the-art RNA foundation model tailored for plants. This model integrates both RNA sequence and structural data from an extensive compilation of plant species, enabling unprecedented accuracy in predicting RNA functions and understanding translation dynamics. By combining robust pre-training on diverse RNA data with sophisticated interpretative frameworks, PlantRNA-FM sets a new standard in RNA bioinformatics, providing deep insights into the functional significance of RNA motifs within the plant transcriptome.
16
 
17
  ## Model Overview
18
+ The plant transcriptome data used for pre-training PlantRNA-FM were obtained from the One Thousand Plant Transcriptomes project (1KP).
19
+ Note that modeling genomic sequences differs significantly from natural language modeling.
20
+ For instance, while RNA sequences are one-dimensional,
21
+ they strictly follow biological genomic patterns and depend heavily on certain structural characteristics.
22
+ In contrast, natural language models are more resilient and can tolerate linguistic errors such as typos and grammar mistakes.
23
+ Thus, effective RNA sequence curation is crucial to minimize the impact of noisy data and enhance modeling performance.
24
+ Specifically, our data curation protocol is as follows.
25
+ - Sequence truncation and filtering: We truncated RNA sequences exceeding 512 nucleotides to comply with the model's maximum length capacity and
26
+ filtered out sequences shorter than 20 nucleotides to eliminate noise, such as RNA fragment sequences.
27
+ - RNA secondary structure annotation: Given the significant impact of RNA secondary structures on sequence function,
28
+ we annotated the local RNA structures of all RNA sequences using ViennaRNA (with the parameter "maxBPspan" set to 30) [25].
29
+ - Annotation of CDS and UTR sequences: After obtaining the assembled transcripts and translated RNA regions from the dataset,
30
+ we retrieved the CDS (translated RNA), 5’ UTR, and 3’ UTR sequences (upstream and downstream of the translated RNA, respectively).
31
 
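The length filtering and CDS/UTR splitting steps of the curation protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `curate` and `split_transcript` are hypothetical, truncation is assumed to keep the first 512 nucleotides (the README does not say which end is kept), and the CDS coordinates are assumed to be 0-based and end-exclusive. The ViennaRNA annotation step is left out.

```python
from typing import Iterable, List, Tuple

MAX_LEN = 512  # maximum sequence length stated in the README
MIN_LEN = 20   # shorter sequences are treated as noise and dropped


def curate(sequences: Iterable[str]) -> List[str]:
    """Drop sequences shorter than MIN_LEN and truncate the rest to MAX_LEN.

    Assumption: truncation keeps the leading MAX_LEN nucleotides.
    """
    curated = []
    for seq in sequences:
        seq = seq.strip()
        if len(seq) < MIN_LEN:
            continue  # filtered out, e.g. RNA fragment sequences
        curated.append(seq[:MAX_LEN])
    return curated


def split_transcript(transcript: str, cds_start: int, cds_end: int) -> Tuple[str, str, str]:
    """Split a transcript into (5' UTR, CDS, 3' UTR).

    Assumption: cds_start and cds_end are 0-based, end-exclusive coordinates
    of the translated region within the transcript.
    """
    if not 0 <= cds_start <= cds_end <= len(transcript):
        raise ValueError("CDS coordinates fall outside the transcript")
    return transcript[:cds_start], transcript[cds_start:cds_end], transcript[cds_end:]
```

For example, `split_transcript("GGAUGAAAUAACC", 2, 11)` separates the two leading and two trailing nucleotides from the translated region in between.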
32
  ![model.png](model.png)
33
+ Fig. 1. Schematic overview of the Pre-training Phase of PlantRNA-FM. The pre-training dataset comprises transcriptomic sequences from 1,124 plant species, consisting of approximately 25.0M RNA sequences and 54.2B RNA bases. The green dots on the global mean temperature map represent the geographical distribution of these plant species across the world.
34
+
35
+ In this study, we developed PlantRNA-FM, a specialised language model based on the transformer architecture (Fig. 1).
36
+ PlantRNA-FM has 35 million parameters, including 12 transformer network layers, 24 attention heads,
37
+ and an embedding dimension of 480. We applied layer normalisation and residual connections both before and after the encoder block.
38
+ As our focus is on RNA understanding rather than generation, we only utilised the encoder component of the transformer architecture.
39
+ PlantRNA-FM is capable of processing sequences up to 512 nucleotides in length, making it compatible with consumer-grade GPUs,
40
+ such as the Nvidia RTX 4090, with a batch size of 16. The model was trained on four A100 GPUs over a period of three weeks,
41
+ completing 3 epochs.
42
 
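As a rough sanity check on the stated model size, the encoder parameter count can be estimated from the figures above (12 layers, 24 heads, embedding dimension 480). The sketch below assumes a standard transformer encoder layer with a feed-forward width of four times the embedding dimension; that width is an assumption, since the README does not state it.

```python
# Values stated in the README.
num_layers = 12
num_heads = 24
embed_dim = 480

# Assumption (not stated in the README): feed-forward width of 4 x embed_dim.
ffn_dim = 4 * embed_dim

# Each attention head works on an equal slice of the embedding.
head_dim = embed_dim // num_heads

# Query, key, value and output projections, each with weights and a bias.
attention = 4 * (embed_dim * embed_dim + embed_dim)
# Two linear layers: embed_dim -> ffn_dim and ffn_dim -> embed_dim.
feed_forward = (embed_dim * ffn_dim + ffn_dim) + (ffn_dim * embed_dim + embed_dim)
# Two layer norms per layer, each with a scale and a shift vector.
layer_norms = 2 * (2 * embed_dim)

per_layer = attention + feed_forward + layer_norms
encoder_total = num_layers * per_layer

print(f"head dimension: {head_dim}")
print(f"parameters per layer: {per_layer:,}")
print(f"encoder layers in total: {encoder_total / 1e6:.1f}M")
```

Under these assumptions the encoder layers account for roughly 33M parameters, which is in line with the stated 35 million once embeddings and any prediction heads (not detailed in the README) are included.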
43
  ## Requirements
44
  - Python 3.9+
 
48
 
49
  ## Usage
50
  Please install the requirements and follow the instructions below to run the PlantRNA-FM model.
51
+
52
+
53
  ### Model loading
54
  ```python
55
  from transformers import AutoModel, AutoTokenizer
 
83
  The model is co-developed by ColaLAB@UniversityofExeter and JIC@NorwichResearchPark (alphabetically ordered).
84
 
85
  ## Citation
86
+ BioArchive Link: [PlantRNA-FM: An Interpretable RNA Foundation Model for Exploration Functional RNA Motifs in Plants](TBC)