mikemayuare committed on
Commit 2b35f73 · verified · 1 Parent(s): 7dbc548

Update README.md

Completing model card.

Files changed (1)
  1. README.md +82 -61
README.md CHANGED
@@ -1,13 +1,21 @@
 ---
 library_name: transformers
- tags: []
 ---
 
 # Model Card for Model ID
 
 <!-- Provide a quick summary of what the model is/does. -->
-
-
 
 ## Model Details
 
@@ -15,63 +23,67 @@ tags: []
 
 <!-- Provide a longer summary of what this model is. -->
 
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
 
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
 
- ### Model Sources [optional]
 
 <!-- Provide the basic links for the model. -->
 
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
 
 ## Uses
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
 ### Direct Use
 
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
- [More Information Needed]
 
- ### Downstream Use [optional]
 
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
 
- [More Information Needed]
 
 ### Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
- [More Information Needed]
 
 ## Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
- [More Information Needed]
 
 ### Recommendations
 
 <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 ## How to Get Started with the Model
 
 Use the code below to get started with the model.
 
- [More Information Needed]
 
 ## Training Details
 
@@ -79,123 +91,132 @@ Use the code below to get started with the model.
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
- [More Information Needed]
 
- ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
- #### Preprocessing [optional]
 
- [More Information Needed]
 
 #### Training Hyperparameters
 
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
- #### Speeds, Sizes, Times [optional]
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
- [More Information Needed]
 
 ## Evaluation
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
- ### Testing Data, Factors & Metrics
-
 #### Testing Data
 
 <!-- This should link to a Dataset Card if possible. -->
 
- [More Information Needed]
 
 #### Factors
 
 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
- [More Information Needed]
 
 #### Metrics
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
- [More Information Needed]
 
 ### Results
 
- [More Information Needed]
 
 #### Summary
 
-
- ## Model Examination [optional]
 
 <!-- Relevant interpretability work for the model goes here -->
 
- [More Information Needed]
 
 ## Environmental Impact
 
 <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
 
- ## Technical Specifications [optional]
 
 ### Model Architecture and Objective
 
- [More Information Needed]
 
 ### Compute Infrastructure
 
- [More Information Needed]
-
 #### Hardware
 
- [More Information Needed]
 
 #### Software
 
- [More Information Needed]
 
- ## Citation [optional]
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 **BibTeX:**
 
- [More Information Needed]
 
 **APA:**
 
- [More Information Needed]
 
- ## Glossary [optional]
 
 <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
- [More Information Needed]
 
- ## More Information [optional]
 
- [More Information Needed]
 
- ## Model Card Authors [optional]
 
- [More Information Needed]
 
 ## Model Card Contact
 
- [More Information Needed]
-
 
 ---
 library_name: transformers
+ tags:
+ - chemistry
+ - biology
+ - SELFIES
+ - life-sciences
+ license: mit
+ datasets:
+ - mikemayuare/PubChem10M_SMILES_SELFIES
 ---
 
 # Model Card for Model ID
 
 <!-- Provide a quick summary of what the model is/does. -->
+ A RoBERTa-based model pretrained with masked language modeling (MLM) on SELFIES molecular representations, ready to fine-tune on downstream tasks.
 
 
 ## Model Details
 
 ### Model Description
 
 <!-- Provide a longer summary of what this model is. -->
 
+ A RoBERTa-based model pretrained with masked language modeling (MLM) on 2 million Self-Referencing Embedded Strings (SELFIES), tokenized with byte-pair encoding (BPE).
 
+ - **Developed by:** Miguelangel Leon Mayuare
+ - **Funded by:** This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia) under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
+ - **Shared by:** Miguelangel Leon Mayuare
+ - **Model type:** RoBERTa-based
+ - **Language(s) (NLP):** SELFIES
+ - **License:** MIT
 
 
+ ### Model Sources
 
 <!-- Provide the basic links for the model. -->
 
+ - **Paper:** Under review
 
 
 
 ## Uses
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ The model's intended use is fine-tuning on downstream tasks where SELFIES is the main input.
 
 ### Direct Use
 
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
+ The model can be directly used for the classification of chemical compounds and prediction of molecular properties using SELFIES representations.
 
+ ### Downstream Use
 
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
 
+ The model can be fine-tuned for specific tasks such as drug discovery, toxicity prediction, and other cheminformatics applications on task-specific datasets.
 
 ### Out-of-Scope Use
 
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
+ The model should not be used for tasks outside of cheminformatics or without proper validation for the specific task. Misuse includes using the model to generate invalid chemical compounds or to make predictions outside the domain of the training data.
+ This model only works with SELFIES; for SMILES models, see the mikemayuare repository.
 
 ## Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
+ The model may inherit biases from the training data. Limitations include potential overfitting to the pre-training tasks and the resource intensity of training and fine-tuning.
 
 ### Recommendations
 
 <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
+ The model was pretrained on 2 million SELFIES to mitigate misrepresentation (over- and under-representation) of any type of molecule. Validating it on known downstream-task datasets is the best way to assess its limitations.
 
 ## How to Get Started with the Model
 
 Use the code below to get started with the model.
 
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load the pretrained BPE tokenizer and RoBERTa encoder from the Hub
+ tokenizer = AutoTokenizer.from_pretrained("mikemayuare/SELFYBPE")
+ model = AutoModel.from_pretrained("mikemayuare/SELFYBPE")
+ ```
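+ Since the checkpoint was pretrained with MLM, a quick sanity check is to predict a masked SELFIES token with the `fill-mask` pipeline (a minimal sketch; the example SELFIES string is illustrative, and the BPE merges may split tokens differently):
+
+ ```python
+ from transformers import pipeline
+
+ # fill-mask loads the checkpoint together with its MLM head
+ fill = pipeline("fill-mask", model="mikemayuare/SELFYBPE")
+
+ # Benzene in SELFIES with the final token masked (illustrative input)
+ masked = "[C][=C][C][=C][C][=C][Ring1]" + fill.tokenizer.mask_token
+ for prediction in fill(masked):
+     print(prediction["token_str"], prediction["score"])
+ ```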
 
 ## Training Details
 
 ### Training Data
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
+ The training data comprised 2 million molecules from the PubChem dataset. SMILES strings were converted to SELFIES using the selfies library.
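+ The SMILES-to-SELFIES conversion is a one-line call in the selfies library (a minimal sketch; the example molecule is illustrative):
+
+ ```python
+ import selfies as sf
+
+ # Convert a SMILES string (benzene) to its SELFIES representation
+ smiles = "c1ccccc1"
+ selfies_string = sf.encoder(smiles)
+ print(selfies_string)  # e.g. [C][=C][C][=C][C][=C][Ring1][=Branch1]
+
+ # The conversion is reversible
+ print(sf.decoder(selfies_string))
+ ```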
 
+ ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
+ The models were pre-trained for 20 epochs using the AdamW optimizer on an NVIDIA 3060 GPU with 12 GiB of VRAM.
 
+ #### Preprocessing
 
+ SMILES strings were converted to SELFIES using the selfies library, and tokenizers were trained on a subset of 1 million molecules from the PubChem dataset.
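+ Training a BPE tokenizer on a SELFIES corpus can be done with the Hugging Face tokenizers library (a hedged sketch; the corpus path, vocabulary size, and output directory are illustrative assumptions, not taken from the source):
+
+ ```python
+ from tokenizers import ByteLevelBPETokenizer
+
+ # RoBERTa-style byte-level BPE with the usual special tokens
+ tokenizer = ByteLevelBPETokenizer()
+ tokenizer.train(
+     files=["selfies_corpus.txt"],  # one SELFIES string per line (illustrative path)
+     vocab_size=1000,               # illustrative value
+     special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
+ )
+ tokenizer.save_model(".")  # writes vocab.json and merges.txt
+ ```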
 
 #### Training Hyperparameters
 
+ - **Training regime:** fp32
+ - **Batch size:** 32
+ - **Number of epochs:** 20
+ - **Optimizer:** AdamW
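+ As a rough mapping of these settings onto the transformers Trainer API (a hedged sketch; the output directory is an illustrative assumption, and fp32 is simply the default when no mixed-precision flag is set):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="selfies-roberta-mlm",  # illustrative path
+     per_device_train_batch_size=32,    # batch size 32
+     num_train_epochs=20,               # 20 epochs
+     optim="adamw_torch",               # AdamW optimizer
+     # fp16/bf16 left unset -> fp32 training regime
+ )
+ ```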
 
+ #### Speeds, Sizes, Times
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
+ Training took approximately 72 hours on the specified hardware. Checkpoints are approximately 500 MB each.
 
 ## Evaluation
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
 #### Testing Data
 
 <!-- This should link to a Dataset Card if possible. -->
 
+ Testing was conducted on MoleculeNet datasets, specifically BBBP, HIV, and Tox21.
 
 #### Factors
 
 <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
+ Evaluation metrics were disaggregated by dataset and task type (e.g., binary classification for BBBP).
 
 #### Metrics
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
+ The primary evaluation metric was the ROC-AUC score, computed on the fine-tuned models; it is commonly used for binary classification tasks in cheminformatics.
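+ ROC-AUC can be computed from a fine-tuned model's predicted probabilities with scikit-learn (a minimal sketch; the labels and scores below are illustrative placeholders):
+
+ ```python
+ from sklearn.metrics import roc_auc_score
+
+ # Illustrative ground-truth labels and predicted positive-class probabilities
+ y_true = [0, 1, 1, 0, 1]
+ y_score = [0.2, 0.8, 0.6, 0.3, 0.9]
+
+ print(roc_auc_score(y_true, y_score))  # 1.0 for this toy example
+ ```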
 
 ### Results
 
+ The models tokenized with APE generally outperformed those tokenized with BPE, and SMILES models outperformed SELFIES models in most cases.
 
 #### Summary
 
+ The model achieved competitive performance on standard benchmarks, outperforming several baseline models in specific tasks.
 
+ ## Model Examination
 
 
 <!-- Relevant interpretability work for the model goes here -->
 
+ Interpretability analyses showed that models tokenized with APE preserved the chemical context better than those with BPE, leading to higher classification accuracy.
 
 ## Environmental Impact
 
 <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
+ Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
 
+ - **Hardware Type:** NVIDIA 3060 GPU
+ - **Hours used:** 72
+ - **Cloud Provider:** Not applicable
+ - **Compute Region:** Local
+ - **Carbon Emitted:** Approximately 50 kg CO2eq
 
+ ## Technical Specifications
 
 ### Model Architecture and Objective
 
+ The architecture is based on RoBERTa, with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. The pretraining objective is masked language modeling.
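+ In transformers terms, this corresponds roughly to the following configuration (a hedged sketch; the vocabulary size and maximum sequence length are illustrative assumptions, not stated in the card):
+
+ ```python
+ from transformers import RobertaConfig, RobertaForMaskedLM
+
+ config = RobertaConfig(
+     num_hidden_layers=6,          # 6 hidden layers
+     hidden_size=768,              # hidden size 768
+     intermediate_size=1536,       # intermediate size 1536
+     num_attention_heads=12,       # 12 attention heads
+     vocab_size=1000,              # illustrative value
+     max_position_embeddings=514,  # RoBERTa default, assumed
+ )
+ model = RobertaForMaskedLM(config)
+ ```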
 
 ### Compute Infrastructure
 
 #### Hardware
 
+ - **Type:** NVIDIA 3060 GPU
+ - **VRAM:** 12 GiB
 
 #### Software
 
+ - **Framework:** PyTorch
+ - **Libraries:** transformers, selfies, DeepChem, Optuna
 
+ ## Citation
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 **BibTeX:**
 
+ ```bibtex
+ @mastersthesis{leon2024chemical,
+   title={Chemical Language Modeling},
+   author={Miguelangel Augusto Leon Mayuare},
+   year={2024},
+   school={NOVA Information Management School}
+ }
+ ```
 
 **APA:**
 
+ Mayuare, M. A. L. (2024). *Chemical Language Modeling* (Master's thesis). NOVA Information Management School.
 
+ ## Glossary
 
 <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
+ **SELFIES:** Self-Referencing Embedded Strings, a robust string-based representation of molecules in which every string corresponds to a valid molecule.
+ **SMILES:** Simplified Molecular Input Line Entry System, a notation for describing the structure of chemical species.
 
+ ## More Information
 
+ For more details, refer to the associated publication (pending).
 
+ ## Model Card Authors
 
+ - Miguelangel Augusto Leon Mayuare
 
 ## Model Card Contact
 
+ For inquiries, please contact migueleonm@gmail.com.