---
license: apache-2.0
language:
- en
tags:
- chemistry
- biology
- medical
---
|
### Pre-trained T5-small model on the PseudoMD-1M dataset
|
|
|
PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Each molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. Five examples are provided in Appendix A of the [paper](https://arxiv.org/abs/2309.05203).
|
|
|
|
|
### Pre-training details |
|
| Parameter | Value |
| ---- | ---- |
| Corpus Size | 1,020,139 |
| Training Steps | 100,000 |
| Learning Rate | 1e-3 |
| Batch Size | 128 |
| Warm-up Steps | 1,000 |
| Weight Decay | 0.1 |
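
For readers who want to approximate this setup with the Hugging Face `Trainer`, a minimal sketch is shown below. Only the numbers come from the table above; the output directory is a placeholder, the optimizer and scheduler are left at library defaults, and whether the batch size is per-device or global is not specified in the card.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of the table above onto Hugging Face training arguments.
# Optimizer and scheduler are library defaults, which may differ from the
# original pre-training setup.
training_args = Seq2SeqTrainingArguments(
    output_dir="ada-t5-small-pretrain",  # placeholder, not from the card
    max_steps=100_000,                   # Training Steps
    learning_rate=1e-3,                  # Learning Rate
    per_device_train_batch_size=128,     # Batch Size (per-device vs. global assumed)
    warmup_steps=1_000,                  # Warm-up Steps
    weight_decay=0.1,                    # Weight Decay
)
```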
|
|
|
### Example Usage |
|
|
|
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the pre-trained checkpoint and its tokenizer from the Hugging Face Hub.
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-small")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-small", model_max_length=512)
```
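
A short end-to-end sketch of calling the model follows. The SMILES input and generation settings are illustrative assumptions, not examples from the paper; since this checkpoint is only pre-trained, fine-tuning (or a task-specific prefix) may be needed to obtain meaningful descriptions.

```python
# Hypothetical usage: feed a SMILES string and decode the model's output.
# The input molecule and generation settings are illustrative only.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, an example input
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```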
|
|
|
### [Citation](https://arxiv.org/abs/2309.05203) |
|
|
|
```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Chen, Jianyu and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```