---
license: apache-2.0
language:
- en
tags:
- chemistry
- biology
- medical
---
### Pre-trained T5-small model on the PseudoMD-1M dataset
PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Each molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. Five examples are provided in Appendix A of the [paper](https://arxiv.org/abs/2309.05203).
### Pre-training details
| Parameter | Value |
| ---- | ---- |
| Corpus Size | 1,020,139 |
| Training Steps | 100,000 |
| Learning Rate | 1e-3 |
| Batch Size | 128 |
| Warm-up Steps | 1,000 |
| Weight Decay | 0.1 |
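For readers reproducing a similar pre-training run, the hyperparameters above could be mapped onto Hugging Face `TrainingArguments` as below. This is an illustrative sketch, not the authors' published training script; the `output_dir` name is an assumption.

```python
# Hypothetical mapping of the pre-training table onto TrainingArguments.
# The argument names are real transformers API; the configuration itself
# is an illustration assembled from the table above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ada-t5-small-pretrain",  # assumed path, not from the paper
    max_steps=100_000,                   # Training Steps
    learning_rate=1e-3,                  # Learning Rate
    per_device_train_batch_size=128,     # Batch Size (assumes a single device)
    warmup_steps=1_000,                  # Warm-up Steps
    weight_decay=0.1,                    # Weight Decay
)
```

Note that `per_device_train_batch_size=128` assumes the reported batch size is the global batch size on one device; with multiple devices or gradient accumulation the per-device value would be smaller.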
### Example Usage
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-small")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-small", model_max_length=512)
```
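Once loaded, the checkpoint can be exercised with a standard T5 generation call. The sketch below feeds a SMILES string and decodes the output; the input format is an assumption for illustration, and since this is a pre-trained (not fine-tuned) checkpoint, the generated text may not be meaningful without downstream fine-tuning.

```python
# Minimal generation sketch for the pre-trained checkpoint.
# The raw-SMILES prompt format is an assumption; consult the paper for the
# exact input scheme used in downstream tasks.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-small")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-small", model_max_length=512)

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, canonical SMILES
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```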
### [Citation](https://arxiv.org/abs/2309.05203)
```bibtex
@article{chen2023artificially,
title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Jianyu, Chen and Zhao, Sendong and Qin, Bing},
journal={arXiv preprint arXiv:2309.05203},
year={2023}
}
```