---
license: apache-2.0
language:
- en
tags:
- chemistry
- biology
- medical
---
### Pre-trained T5-base model on the PseudoMD-1M dataset

The PseudoMD-1M dataset is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Each molecule is represented by its Canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. We provide five examples in Appendix A of the [paper](https://arxiv.org/abs/2309.05203).

### Pre-training details

| Parameter | Value |
| ---- | ---- |
| Corpus Size | 1,020,139 |
| Training Steps | 100,000 |
| Learning Rate | 1e-3 |
| Batch Size | 128 |
| Warm-up Steps | 1,000 |
| Weight Decay | 0.1 |

### Example Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)
```
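Once loaded, the model can be queried like any other T5 seq2seq model. The description prompt and decoding settings below are illustrative assumptions (this card does not document the expected input format):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)

# Hypothetical molecule description; the model card does not fix a prompt format.
description = "The molecule is a monocarboxylic acid that is soluble in water."
inputs = tokenizer(description, return_tensors="pt")

# Beam search is one reasonable decoding choice, not the authors' documented setting.
outputs = model.generate(**inputs, num_beams=5, max_length=128)
smiles = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(smiles)
```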

### [Citation](https://arxiv.org/abs/2309.05203)

```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Jianyu, Chen and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```