---
license: mit
tags:
- generated_from_trainer
datasets:
- sagawa/pubchem-10m-canonicalized
metrics:
- accuracy
model-index:
- name: PubChem-10m-deberta
  results:
  - task:
      name: Masked Language Modeling
      type: fill-mask
    dataset:
      name: sagawa/pubchem-10m-canonicalized
      type: sagawa/pubchem-10m-canonicalized
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9741235263046233
---

# PubChem10m-deberta-base-output

This model is a fine-tuned version of [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) on the sagawa/pubchem-10m-canonicalized dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0698
- Accuracy: 0.9741

## Model description

We trained deberta-base on SMILES strings from PubChem with a masked-language-modeling (MLM) objective. The tokenizer is a character-level tokenizer trained on the same PubChem SMILES.

## Intended uses & limitations

With task-specific fine-tuning, this model can be adapted to predict molecular properties, reactions, or interactions with proteins. A minimal fill-mask example is given in the "How to use" section at the end of this card.

## Training and evaluation data

We downloaded the [PubChem data](https://drive.google.com/file/d/1ygYs8dy1-vxD1Vx6Ux7ftrXwZctFjpV3/view), canonicalized the SMILES with RDKit, and dropped duplicates. The resulting 9,999,960 molecules were randomly split into training and validation sets at a 10:1 ratio. An illustrative preprocessing sketch is included under the training procedure below.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 48
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10.0

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Accuracy |
|:-------------:|:-----:|:------:|:---------------:|:--------:|
| 0.0855        | 3.68  | 100000 | 0.0801          | 0.9708   |
| 0.0733        | 7.37  | 200000 | 0.0702          | 0.9740   |

### Framework versions

- Transformers 4.22.0.dev0
- PyTorch 1.12.0
- Datasets 2.4.1.dev0
- Tokenizers 0.11.6
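
### Data preparation (illustrative sketch)

The preprocessing described under "Training and evaluation data" (RDKit canonicalization, duplicate removal, and a random 10:1 split) could be reproduced roughly as below. The file names, column name, and split code are illustrative assumptions, not the exact scripts used for this model.

```python
from rdkit import Chem
import pandas as pd

def canonicalize(smiles: str):
    """Return the RDKit canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Hypothetical input file with one raw SMILES string per line.
df = pd.read_csv("pubchem_10m.txt", names=["smiles"])
df["smiles"] = df["smiles"].map(canonicalize)
df = df.dropna().drop_duplicates(subset="smiles")

# Random 10:1 train/validation split (validation = 1/11 of the data).
valid = df.sample(frac=1 / 11, random_state=42)
train = df.drop(valid.index)
train.to_csv("train.csv", index=False)
valid.to_csv("validation.csv", index=False)
```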
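
### Training setup (illustrative sketch)

The hyperparameters listed above correspond to a standard 🤗 Transformers masked-language-modeling setup. The sketch below shows one way to express them; the dataset column name, split names, tokenizer Hub ID, and the 15% masking probability (the library default) are assumptions that are not stated in this card.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed Hub ID for the character-level tokenizer described above.
tokenizer = AutoTokenizer.from_pretrained("sagawa/PubChem-10m-deberta")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base")
# The character-level vocabulary differs from deberta-base's, so the
# embedding matrix has to be resized before fine-tuning.
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset("sagawa/pubchem-10m-canonicalized")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["smiles"], truncation=True),  # "smiles" column name is assumed
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# 15% masking probability is the Transformers default, not stated in this card.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="PubChem10m-deberta-base-output",
    learning_rate=5e-5,
    per_device_train_batch_size=30,
    per_device_eval_batch_size=48,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10.0,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
```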
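
## How to use

Because the model was trained with a fill-mask objective, it can be queried directly for masked-token prediction on SMILES strings. The sketch below assumes the Hub model ID `sagawa/PubChem-10m-deberta` (taken from the model-index name above); adjust it if the repository name differs.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Assumed Hub ID; replace with the actual repository name if it differs.
model_id = "sagawa/PubChem-10m-deberta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Mask one atom of a canonical SMILES string (aspirin) and predict it.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
masked = smiles.replace("O", tokenizer.mask_token, 1)
for prediction in fill(masked, top_k=3):
    print(prediction["token_str"], prediction["score"])
```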