gbyuvd committed on
Commit 5223613 · verified · 1 Parent(s): 3773332

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -140,7 +140,7 @@ My initial attempt focused on training a sentence transformer based on SELFIES,

  The next challenges were how to properly make molecule pairs that is diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from [COCONUTDB](https://coconut.naturalproducts.net/). After some initial training, I pushed [the model to Hugging Face](https://huggingface.co/gbyuvd/ChemEmbed-v01) to get some feedback. Gladly, [Tom Aarsen](https://huggingface.co/tomaarsen) provided [valuable suggestions](https://huggingface.co/gbyuvd/ChemEmbed-v01/discussions/1), including training a custom tokenizer, exploring [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), and considering training from scratch. The attempt to implement Tom's suggestions, specifically in training from scratch is the main goal of this project as well as a first experience for me.

- Lastly before going into the details, it's important to note that this is the result of a hands-on learning project, and as such - beside my insufficient knowledge - it may not meet rigorous scientific standards. Like any learning journey, it's messy and I myself constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I'm eager to receive any feedback, so that I can improve both myself and future models/projects. A more detailed article discussing this project in details is coming soon.
+ Lastly before going into the details, it's important to note that this is the result of a hands-on learning project, and as such - beside my insufficient knowledge - it may not meet rigorous scientific standards. Like any learning journey, it's messy and I myself constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I am more than happy to receive any feedback, so that I can improve both myself and future models/projects. A more detailed article discussing this project in details is coming soon.

  ## Training Details