Update README.md
README.md
CHANGED
@@ -140,7 +140,7 @@ My initial attempt focused on training a sentence transformer based on SELFIES,
The next challenges were how to build molecule pairs that are diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from [COCONUTDB](https://coconut.naturalproducts.net/). After some initial training, I pushed [the model to Hugging Face](https://huggingface.co/gbyuvd/ChemEmbed-v01) to get some feedback. Thankfully, [Tom Aarsen](https://huggingface.co/tomaarsen) provided [valuable suggestions](https://huggingface.co/gbyuvd/ChemEmbed-v01/discussions/1), including training a custom tokenizer, exploring [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), and considering training from scratch. Implementing Tom's suggestions, specifically training from scratch, is the main goal of this project, and it is also a first for me.
-
Lastly, before going into the details, it's important to note that this is the result of a hands-on learning project, and as such, and given my limited knowledge, it may not meet rigorous scientific standards. Like any learning journey, it's messy, and I am constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I
## Training Details
The next challenges were how to build molecule pairs that are diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from [COCONUTDB](https://coconut.naturalproducts.net/). After some initial training, I pushed [the model to Hugging Face](https://huggingface.co/gbyuvd/ChemEmbed-v01) to get some feedback. Thankfully, [Tom Aarsen](https://huggingface.co/tomaarsen) provided [valuable suggestions](https://huggingface.co/gbyuvd/ChemEmbed-v01/discussions/1), including training a custom tokenizer, exploring [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), and considering training from scratch. Implementing Tom's suggestions, specifically training from scratch, is the main goal of this project, and it is also a first for me.
+
Lastly, before going into the details, it's important to note that this is the result of a hands-on learning project, and as such, and given my limited knowledge, it may not meet rigorous scientific standards. Like any learning journey, it's messy, and I am constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I am more than happy to receive any feedback so that I can improve both myself and future models and projects. A more detailed article discussing this project is coming soon.
## Training Details