---
license: apache-2.0
language:
- en
tags:
- chemistry
- biology
- medical
---
### Pre-trained T5-base model on the PseudoMD-1M dataset

The PseudoMD-1M dataset is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Each molecule is represented by its Canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. We provide five examples in Appendix A of the [paper](https://arxiv.org/abs/2309.05203).

### Pre-training details

| Parameter | Value |
| ---- | ---- |
| Corpus Size | 1,020,139 |
| Training Steps | 100,000 |
| Learning Rate | 1e-3 |
| Batch Size | 128 |
| Warm-up Steps | 1,000 |
| Weight Decay | 0.1 |

### Example Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)
```
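Once loaded, the model can be queried like any other T5 seq2seq model. The description prompt and decoding settings below are illustrative assumptions (this card does not document the expected input format):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)

# Hypothetical molecule description; the model card does not fix a prompt format.
description = "The molecule is a monocarboxylic acid that is soluble in water."
inputs = tokenizer(description, return_tensors="pt")

# Beam search is one reasonable decoding choice, not the authors' documented setting.
outputs = model.generate(**inputs, num_beams=5, max_length=128)
smiles = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(smiles)
```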

### [Citation](https://arxiv.org/abs/2309.05203)

```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Jianyu, Chen and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```