mikemayuare committed on
Commit 2b35f73 · verified · 1 Parent(s): 7dbc548

Update README.md

Completing model card.

Files changed (1)
  1. README.md +82 -61
README.md CHANGED
@@ -1,13 +1,21 @@
 ---
 library_name: transformers
- tags: []
 ---
 
 # Model Card for Model ID
 
 <!-- Provide a quick summary of what the model is/does. -->
-
-
 
 ## Model Details
 
@@ -15,63 +23,67 @@ tags: []
 
 <!-- Provide a longer summary of what this model is. -->
 
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
 
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
 
- ### Model Sources [optional]
 
 <!-- Provide the basic links for the model. -->
 
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
 
 ## Uses
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
 ### Direct Use
 
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
- [More Information Needed]
 
- ### Downstream Use [optional]
 
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
 
- [More Information Needed]
 
 ### Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
- [More Information Needed]
 
 ## Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
- [More Information Needed]
 
 ### Recommendations
 
 <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 ## How to Get Started with the Model
 
 Use the code below to get started with the model.
 
- [More Information Needed]
 
 ## Training Details
 
@@ -79,123 +91,132 @@ Use the code below to get started with the model.
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
- [More Information Needed]
 
- ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
- #### Preprocessing [optional]
 
- [More Information Needed]
 
 #### Training Hyperparameters
 
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
- #### Speeds, Sizes, Times [optional]
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
- [More Information Needed]
 
 ## Evaluation
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
- ### Testing Data, Factors & Metrics
-
 #### Testing Data
 
 <!-- This should link to a Dataset Card if possible. -->
 
- [More Information Needed]
 
 #### Factors
 
 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
- [More Information Needed]
 
 #### Metrics
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
- [More Information Needed]
 
 ### Results
 
- [More Information Needed]
 
 #### Summary
 
-
- ## Model Examination [optional]
 
 <!-- Relevant interpretability work for the model goes here -->
 
- [More Information Needed]
 
 ## Environmental Impact
 
 <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
 
- ## Technical Specifications [optional]
 
 ### Model Architecture and Objective
 
- [More Information Needed]
 
 ### Compute Infrastructure
 
- [More Information Needed]
-
 #### Hardware
 
- [More Information Needed]
 
 #### Software
 
- [More Information Needed]
 
- ## Citation [optional]
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 **BibTeX:**
 
- [More Information Needed]
 
 **APA:**
 
- [More Information Needed]
 
- ## Glossary [optional]
 
 <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
- [More Information Needed]
 
- ## More Information [optional]
 
- [More Information Needed]
 
- ## Model Card Authors [optional]
 
- [More Information Needed]
 
 ## Model Card Contact
 
- [More Information Needed]
-
 
 ---
 library_name: transformers
+ tags:
+ - chemistry
+ - biology
+ - SELFIES
+ - life-sciences
+ license: mit
+ datasets:
+ - mikemayuare/PubChem10M_SMILES_SELFIES
 ---
 
 # Model Card for Model ID
 
 <!-- Provide a quick summary of what the model is/does. -->
+ A RoBERTa-based model pretrained with masked language modeling (MLM) on SELFIES molecular representations, ready to fine-tune on downstream tasks.
 
 
 ## Model Details
 
 ### Model Description
 
 <!-- Provide a longer summary of what this model is. -->
 
+ A RoBERTa-based model pretrained with masked language modeling (MLM) on 2 million Self-Referencing Embedded Strings (SELFIES), tokenized with byte-pair encoding (BPE).
 
+ - **Developed by:** Miguelangel Leon Mayuare
+ - **Funded by:** This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia) under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
+ - **Shared by:** Miguelangel Leon Mayuare
+ - **Model type:** RoBERTa-based
+ - **Language(s) (NLP):** SELFIES
+ - **License:** MIT
 
 
+ ### Model Sources
 
 <!-- Provide the basic links for the model. -->
 
+ - **Paper:** Under review
 
 
 
 ## Uses
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ The model's intended use is fine-tuning on downstream tasks where SELFIES is the main input.
 
 ### Direct Use
 
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
+ The model can be directly used for the classification of chemical compounds and prediction of molecular properties using SELFIES representations.
 
+ ### Downstream Use
 
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
 
+ The model can be fine-tuned for specific tasks such as drug discovery, toxicity prediction, and other cheminformatics applications on task-specific datasets.
 
 ### Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
+ The model should not be used for tasks outside of cheminformatics or without proper validation for the specific task. Misuse includes using the model to generate invalid chemical compounds or to make predictions outside the domain of the training data.
+ This model only works with SELFIES; for SMILES models, see the mikemayuare repository.
 
 ## Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
+ The model may inherit biases from the training data. Limitations include potential overfitting to the pre-training tasks and the resource intensity of training and fine-tuning.
 
 ### Recommendations
 
 <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
+ The model was pretrained on 2 million SELFIES to mitigate misrepresentation (over- and under-representation) of any type of molecule. Validating it on known downstream-task datasets is the best way to assess its limitations.
 
 ## How to Get Started with the Model
 
 Use the code below to get started with the model.
 
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load the pretrained BPE tokenizer and RoBERTa encoder from the Hub
+ tokenizer = AutoTokenizer.from_pretrained("mikemayuare/SELFYBPE")
+ model = AutoModel.from_pretrained("mikemayuare/SELFYBPE")
+ ```
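+ Since the checkpoint was pretrained with MLM, a quick sanity check is to predict a masked SELFIES token with the `fill-mask` pipeline (a minimal sketch; the example SELFIES string is illustrative, and the BPE merges may split tokens differently):
+
+ ```python
+ from transformers import pipeline
+
+ # fill-mask loads the checkpoint together with its MLM head
+ fill = pipeline("fill-mask", model="mikemayuare/SELFYBPE")
+
+ # Benzene in SELFIES with the final token masked (illustrative input)
+ masked = "[C][=C][C][=C][C][=C][Ring1]" + fill.tokenizer.mask_token
+ for prediction in fill(masked):
+     print(prediction["token_str"], prediction["score"])
+ ```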
 
 ## Training Details
 
 ### Training Data
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
+ The training data comprised 2 million molecules from the PubChem dataset. SMILES strings were converted to SELFIES using the selfies library.
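+ The SMILES-to-SELFIES conversion is a one-line call in the selfies library (a minimal sketch; the example molecule is illustrative):
+
+ ```python
+ import selfies as sf
+
+ # Convert a SMILES string (benzene) to its SELFIES representation
+ smiles = "c1ccccc1"
+ selfies_string = sf.encoder(smiles)
+ print(selfies_string)  # e.g. [C][=C][C][=C][C][=C][Ring1][=Branch1]
+
+ # The conversion is reversible
+ print(sf.decoder(selfies_string))
+ ```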
 
+ ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
+ The models were pre-trained for 20 epochs using the AdamW optimizer on an NVIDIA 3060 GPU with 12 GiB of VRAM.
 
+ #### Preprocessing
 
+ SMILES strings were converted to SELFIES using the selfies library, and tokenizers were trained on a subset of 1 million molecules from the PubChem dataset.
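+ Training a BPE tokenizer on a SELFIES corpus can be done with the Hugging Face tokenizers library (a hedged sketch; the corpus path, vocabulary size, and output directory are illustrative assumptions, not taken from the source):
+
+ ```python
+ from tokenizers import ByteLevelBPETokenizer
+
+ # RoBERTa-style byte-level BPE with the usual special tokens
+ tokenizer = ByteLevelBPETokenizer()
+ tokenizer.train(
+     files=["selfies_corpus.txt"],  # one SELFIES string per line (illustrative path)
+     vocab_size=1000,               # illustrative value
+     special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
+ )
+ tokenizer.save_model(".")  # writes vocab.json and merges.txt
+ ```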
 
 #### Training Hyperparameters
 
+ - **Training regime:** fp32
+ - **Batch size:** 32
+ - **Number of epochs:** 20
+ - **Optimizer:** AdamW
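+ As a rough mapping of these settings onto the transformers Trainer API (a hedged sketch; the output directory is an illustrative assumption, and fp32 is simply the default when no mixed-precision flag is set):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="selfies-roberta-mlm",  # illustrative path
+     per_device_train_batch_size=32,    # batch size 32
+     num_train_epochs=20,               # 20 epochs
+     optim="adamw_torch",               # AdamW optimizer
+     # fp16/bf16 left unset -> fp32 training regime
+ )
+ ```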
 
+ #### Speeds, Sizes, Times
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
+ Training took approximately 72 hours on the specified hardware. Checkpoints are approximately 500 MB each.
 
 ## Evaluation
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
 #### Testing Data
 
 <!-- This should link to a Dataset Card if possible. -->
 
+ Testing was conducted on MoleculeNet datasets, specifically BBBP, HIV, and Tox21.
 
 #### Factors
 
 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
+ Evaluation metrics were disaggregated by dataset and task type (e.g., binary classification for BBBP).
 
 #### Metrics
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
+ The primary evaluation metric was the ROC-AUC score, computed on the fine-tuned models; it is commonly used for binary classification tasks in cheminformatics.
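+ ROC-AUC can be computed from a fine-tuned model's predicted probabilities with scikit-learn (a minimal sketch; the labels and scores below are illustrative placeholders):
+
+ ```python
+ from sklearn.metrics import roc_auc_score
+
+ # Illustrative ground-truth labels and predicted positive-class probabilities
+ y_true = [0, 1, 1, 0, 1]
+ y_score = [0.2, 0.8, 0.6, 0.3, 0.9]
+
+ print(roc_auc_score(y_true, y_score))  # 1.0 for this toy example
+ ```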
 
 ### Results
 
+ The models tokenized with APE generally outperformed those tokenized with BPE, and SMILES models outperformed SELFIES models in most cases.
 
 #### Summary
 
+ The model achieved competitive performance on standard benchmarks, outperforming several baseline models in specific tasks.
 
+ ## Model Examination
 
 
 <!-- Relevant interpretability work for the model goes here -->
 
+ Interpretability analyses showed that models tokenized with APE preserved the chemical context better than those with BPE, leading to higher classification accuracy.
 
 ## Environmental Impact
 
 <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
+ Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
 
+ - **Hardware Type:** NVIDIA 3060 GPU
+ - **Hours used:** 72
+ - **Cloud Provider:** Not applicable
+ - **Compute Region:** Local
+ - **Carbon Emitted:** Approximately 50 kg CO2eq
 
+ ## Technical Specifications
 
 ### Model Architecture and Objective
 
+ The architecture is based on RoBERTa, with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. The pretraining objective is masked language modeling.
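+ In transformers terms, this corresponds roughly to the following configuration (a hedged sketch; the vocabulary size and maximum sequence length are illustrative assumptions, not stated in the card):
+
+ ```python
+ from transformers import RobertaConfig, RobertaForMaskedLM
+
+ config = RobertaConfig(
+     num_hidden_layers=6,          # 6 hidden layers
+     hidden_size=768,              # hidden size 768
+     intermediate_size=1536,       # intermediate size 1536
+     num_attention_heads=12,       # 12 attention heads
+     vocab_size=1000,              # illustrative value
+     max_position_embeddings=514,  # RoBERTa default, assumed
+ )
+ model = RobertaForMaskedLM(config)
+ ```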
 
 ### Compute Infrastructure
 
 #### Hardware
 
+ - **Type:** NVIDIA 3060 GPU
+ - **VRAM:** 12 GiB
 
 #### Software
 
+ - **Framework:** PyTorch
+ - **Libraries:** transformers, selfies, DeepChem, Optuna
 
+ ## Citation
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 **BibTeX:**
 
+ ```bibtex
+ @mastersthesis{leon2024chemical,
+   title={Chemical Language Modeling},
+   author={Miguelangel Augusto Leon Mayuare},
+   year={2024},
+   school={NOVA Information Management School}
+ }
+ ```
 
 **APA:**
 
+ Mayuare, M. A. L. (2024). *Chemical Language Modeling* (Master's thesis). NOVA Information Management School.
 
+ ## Glossary
 
 <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
+ **SELFIES:** Self-Referencing Embedded Strings, a robust string-based representation of molecules in which every string corresponds to a valid molecule.
+ **SMILES:** Simplified Molecular Input Line Entry System, a notation for describing the structure of chemical species.
 
+ ## More Information
 
+ For more details, refer to the associated publication (pending).
 
+ ## Model Card Authors
 
+ - Miguelangel Augusto Leon Mayuare
 
 ## Model Card Contact
 
+ For inquiries, please contact migueleonm@gmail.com.