---
license: apache-2.0
language:
- en
tags:
- chemistry
- biology
- medical
---
|
### Pre-trained T5-small model on the PseudoMD-1M dataset
|
|
|
PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Each molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. Five examples are provided in Appendix A of the [paper](https://arxiv.org/abs/2309.05203).
|
|
|
|
|
### Pre-training details |
|
| Parameter | Value |
| ---- | ---- |
| Corpus Size | 1,020,139 |
| Training Steps | 100,000 |
| Learning Rate | 1e-3 |
| Batch Size | 128 |
| Warm-up Steps | 1,000 |
| Weight Decay | 0.1 |
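
For readers who want to approximate this setup with the Hugging Face `Trainer`, a minimal sketch is shown below. Only the numbers come from the table above; the output directory is a placeholder, the optimizer and scheduler are left at library defaults, and whether the batch size is per-device or global is not specified in the card.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of the table above onto Hugging Face training arguments.
# Optimizer and scheduler are library defaults, which may differ from the
# original pre-training setup.
training_args = Seq2SeqTrainingArguments(
    output_dir="ada-t5-small-pretrain",  # placeholder, not from the card
    max_steps=100_000,                   # Training Steps
    learning_rate=1e-3,                  # Learning Rate
    per_device_train_batch_size=128,     # Batch Size (per-device vs. global assumed)
    warmup_steps=1_000,                  # Warm-up Steps
    weight_decay=0.1,                    # Weight Decay
)
```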
|
|
|
### Example Usage |
|
|
|
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the pre-trained checkpoint and its tokenizer from the Hugging Face Hub.
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-small")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-small", model_max_length=512)
```
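
A short end-to-end sketch of calling the model follows. The SMILES input and generation settings are illustrative assumptions, not examples from the paper; since this checkpoint is only pre-trained, fine-tuning (or a task-specific prefix) may be needed to obtain meaningful descriptions.

```python
# Hypothetical usage: feed a SMILES string and decode the model's output.
# The input molecule and generation settings are illustrative only.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, an example input
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```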
|
|
|
### [Citation](https://arxiv.org/abs/2309.05203) |
|
|
|
```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Chen, Jianyu and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```