anrilombard committed on
Commit
39ec458
·
verified ·
1 Parent(s): e9e5562

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +72 -45
README.md CHANGED
@@ -1,31 +1,38 @@
1
  ---
2
- datasets:
3
- - MOSES
4
  tags:
5
- - safe
6
- - datamol-io
7
- - molecule-design
8
- - smiles
9
- - generated_from_trainer
 
 
10
  model-index:
11
- - name: SAFE_20M
12
- results: []
13
  ---
14
 
15
  # SAFE_20M
16
 
17
- This model was trained from scratch on the MOSES dataset converted to SAFE format for molecule generation tasks.
18
- It achieves the following results on the evaluation set:
 
19
 
20
- - Loss: 0.4024
21
 
22
- ## Model description
23
 
24
- SAFE_20M is a transformer-based model designed for molecular generation tasks. It was trained on the MOSES dataset, which has been converted to the SAFE (SMILES Augmented For Encoding) format. This format is specifically tailored for improved molecular representation in machine learning tasks.
25
 
26
- The model is intended to generate valid and diverse molecular structures, which can be useful in various applications such as drug discovery, materials science, and chemical engineering.
27
 
28
- This model utilizes the SAFE framework, which was introduced in the following paper:
 
 
 
 
 
 
29
 
30
  ```bibtex
31
  @article{noutahi2024gotta,
@@ -42,42 +49,43 @@ This model utilizes the SAFE framework, which was introduced in the following pa
42
 
43
  We acknowledge and thank the authors for their valuable contribution to the field of molecular design.
44
 
45
- ## Intended uses & limitations
 
 
46
 
47
- This model is primarily intended for:
48
 
49
- - Generating molecular structures
50
- - Exploring chemical space for drug discovery
51
- - Assisting in the design of new materials
52
 
53
- Limitations:
54
 
55
- - The model's output should be validated by domain experts before practical application
56
- - Generated molecules may not always be synthetically feasible
57
- - The model's knowledge is limited to the chemical space represented in the MOSES dataset
58
 
59
- ## Training and evaluation data
60
 
61
- The model was trained on the MOSES (MOlecular SEtS) dataset, a benchmark dataset for molecular generation. The MOSES dataset was converted to the SAFE format.
62
 
63
- ## Training procedure
64
 
65
- ### Training hyperparameters
66
 
67
  The following hyperparameters were used during training:
68
 
69
- - learning_rate: 0.0005
70
- - train_batch_size: 32
71
- - eval_batch_size: 32
72
- - seed: 42
73
- - gradient_accumulation_steps: 2
74
- - total_train_batch_size: 64
75
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
76
- - lr_scheduler_type: linear
77
- - lr_scheduler_warmup_steps: 20000
78
- - num_epochs: 10.0
79
 
80
- ### Training results
81
 
82
  | Training Loss | Epoch | Step | Validation Loss |
83
  | :-----------: | :----: | :----: | :-------------: |
@@ -327,9 +335,28 @@ The following hyperparameters were used during training:
327
  | 0.3983 | 9.9213 | 244000 | 0.4026 |
328
  | 0.3997 | 9.9620 | 245000 | 0.4025 |
329
 
330
- ### Framework versions
331
 
332
- - Transformers 4.43.3
333
- - Pytorch 2.4.0+cu121
334
- - Datasets 2.20.0
335
- - Tokenizers 0.19.1
 
1
  ---
 
 
2
  tags:
3
+ - safe
4
+ - datamol-io
5
+ - molecule-design
6
+ - smiles
7
+ - generated_from_trainer
8
+ datasets:
9
+ - katielink/moses
10
  model-index:
11
+ - name: SAFE_20M
12
+ results: []
13
  ---
14
 
15
  # SAFE_20M
16
 
17
+ SAFE_20M is a transformer-based model designed for molecular generation tasks. This model was trained from scratch on the [MOSES](https://huggingface.co/datasets/katielink/moses) dataset, which has been converted from SMILES to the SAFE (SMILES Augmented For Encoding) format to enhance molecular representation for machine learning applications.
18
+
19
+ ## Evaluation Results
20
 
21
+ On the evaluation set, SAFE_20M achieved the following result:
22
 
23
+ - **Loss:** 0.4024
24
 
25
+ ## Model Description
26
 
27
+ SAFE_20M leverages the SAFE framework to generate valid and diverse molecular structures. By converting the MOSES dataset from SMILES to SAFE format, the model benefits from improved molecular encoding, facilitating better performance in various applications such as:
28
 
29
+ - **Drug Discovery:** Identifying potential drug candidates with desirable properties.
30
+ - **Materials Science:** Designing new materials with specific characteristics.
31
+ - **Chemical Engineering:** Innovating chemical processes and compounds.
32
+
33
+ ### SAFE Framework
34
+
35
+ The SAFE framework, integral to SAFE_20M, was introduced in the following paper:
36
 
37
  ```bibtex
38
  @article{noutahi2024gotta,
 
49
 
50
  We acknowledge and thank the authors for their valuable contribution to the field of molecular design.
51
 
52
+ ## Intended Uses & Limitations
53
+
54
+ ### Intended Uses
55
 
56
+ SAFE_20M is primarily intended for:
57
 
58
+ - **Generating Molecular Structures:** Creating novel molecules with desired properties.
59
+ - **Exploring Chemical Space:** Navigating the vast landscape of possible chemical compounds for research and development.
60
+ - **Assisting in Material Design:** Facilitating the creation of new materials with specific functionalities.
61
 
62
+ ### Limitations
63
 
64
+ - **Validation Required:** Outputs should be validated by domain experts before practical application.
65
+ - **Synthetic Feasibility:** Generated molecules may not always be synthetically feasible.
66
+ - **Dataset Scope:** The model's knowledge is limited to the chemical space represented in the MOSES dataset.
67
 
68
+ ## Training and Evaluation Data
69
 
70
+ The model was trained on the [MOSES (MOlecular SEtS)](https://huggingface.co/datasets/katielink/moses) dataset, a benchmark dataset for molecular generation. The dataset was converted from SMILES to the SAFE format to enhance molecular representation for machine learning tasks.
71
 
72
+ ## Training Procedure
73
 
74
+ ### Training Hyperparameters
75
 
76
  The following hyperparameters were used during training:
77
 
78
+ - **Learning Rate:** 0.0005
79
+ - **Training Batch Size:** 32
80
+ - **Evaluation Batch Size:** 32
81
+ - **Seed:** 42
82
+ - **Gradient Accumulation Steps:** 2
83
+ - **Total Training Batch Size:** 64
84
+ - **Optimizer:** Adam (betas=(0.9, 0.999), epsilon=1e-08)
85
+ - **Learning Rate Scheduler:** Linear with 20,000 warmup steps
86
+ - **Number of Epochs:** 10
 
87
 
88
+ ### Training Results
89
 
90
  | Training Loss | Epoch | Step | Validation Loss |
91
  | :-----------: | :----: | :----: | :-------------: |
 
335
  | 0.3983 | 9.9213 | 244000 | 0.4026 |
336
  | 0.3997 | 9.9620 | 245000 | 0.4025 |
337
 
338
+ ### Framework Versions
339
 
340
+ - **Transformers:** 4.43.3
341
+ - **PyTorch:** 2.4.0+cu121
342
+ - **Datasets:** 2.20.0
343
+ - **Tokenizers:** 0.19.1
344
+
345
+ ## Acknowledgements
346
+
347
+ We acknowledge and thank the authors of the SAFE framework for their valuable contribution to the field of molecular design.
348
+
349
+ ## References
350
+
351
+ ```bibtex
352
+ @article{noutahi2024gotta,
353
+ title={Gotta be SAFE: a new framework for molecular design},
354
+ author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
355
+ journal={Digital Discovery},
356
+ volume={3},
357
+ number={4},
358
+ pages={796--804},
359
+ year={2024},
360
+ publisher={Royal Society of Chemistry}
361
+ }
362
+ ```
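
The training hyperparameters in the README can be checked with a little arithmetic. The following is a minimal sketch, not the author's training script: it assumes a single training device (which is consistent with 32 x 2 = 64), assumes the linear scheduler ramps up from zero and then decays linearly to zero, and estimates the total step count from the last logged table row (step 245000 at epoch 9.9620). The function names are hypothetical and chosen only for illustration.

```python
"""Sketch of the batch-size arithmetic and the linear warmup/decay
learning-rate schedule implied by the hyperparameters in the README.
Illustrative only; not taken from the original training code."""

BASE_LR = 5e-4          # learning_rate
PER_DEVICE_BATCH = 32   # train_batch_size
GRAD_ACCUM_STEPS = 2    # gradient_accumulation_steps
WARMUP_STEPS = 20_000   # lr_scheduler_warmup_steps
NUM_EPOCHS = 10         # num_epochs

# Assumption: steps per epoch are estimated from the last logged row
# of the results table (step 245000 at epoch 9.9620).
STEPS_PER_EPOCH = 245_000 / 9.9620
TOTAL_STEPS = round(STEPS_PER_EPOCH * NUM_EPOCHS)


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Samples contributing to one optimizer update.

    num_devices=1 is an assumption; the README does not state it.
    """
    return per_device * accum_steps * num_devices


def linear_lr(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over the warmup steps, then
    decays linearly to 0 at the final step.
    """
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))
    print("estimated total steps:", TOTAL_STEPS)
    for step in (0, 10_000, 20_000, 100_000, 245_000):
        print(f"step {step:>7}: lr = {linear_lr(step):.6e}")
```

Under these assumptions the learning rate peaks at 0.0005 at step 20,000 and is close to zero by the last logged step, which fits the flattening validation loss in the final table rows.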