How can I train Bloom on a specific set of texts?

#162
by boomer22 - opened

Hi,

I have a set of domain-specific texts I'd like to train Bloom on for querying purposes. This is purely out of curiosity about Bloom.

I am a developer with decades of experience, so I'm not afraid to get my hands dirty. I just want to know whether and how I can do this, and where to find more information. If possible, are there alternative systems you can recommend?

Thanks in advance.

I have been trying to understand this as well. I'd like to add knowledge to the corpus, but I can't see a way to do it.

@yongzx can you offer any advice? Or can the corpus only be expanded through the original process?

BigScience Workshop org

I think it really depends on the type of knowledge you want to add to the training corpus.

What we've tried is adding support for a new language (https://arxiv.org/abs/2212.09535), where the new corpus is text in that language. We find that, given enough text, we can simply train on the new corpus with the next-word-prediction objective (as in BLOOM pretraining). However, for bigger models exceeding 1.7B parameters, we recommend training only adapters instead of finetuning the entire model.
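To make the next-word-prediction objective mentioned above concrete, here is a minimal sketch of how a raw text corpus can be turned into training examples. It is not the BLOOM training code: `toy_tokenize`, `next_token_examples`, the `<eos>` marker, and the block size are all hypothetical names and values chosen for illustration, and a real run would use the model's own tokenizer.

```python
from typing import Iterable, List, Tuple


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace."""
    return text.split()


def next_token_examples(
    texts: Iterable[str], block_size: int
) -> List[Tuple[List[str], List[str]]]:
    """Concatenate tokenized texts and cut them into (inputs, targets) pairs.

    Each target sequence is the input sequence shifted left by one token,
    so position i of the input is trained to predict token i + 1.
    """
    stream: List[str] = []
    for text in texts:
        stream.extend(toy_tokenize(text))
        stream.append("<eos>")  # hypothetical end-of-document marker

    examples = []
    # Each block needs block_size + 1 tokens: block_size inputs plus one extra target.
    for start in range(0, len(stream) - block_size, block_size):
        chunk = stream[start : start + block_size + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


if __name__ == "__main__":
    corpus = ["the valve opens at high pressure", "close the valve before servicing"]
    for inputs, targets in next_token_examples(corpus, block_size=4):
        print(inputs, "->", targets)
```

The model is then trained to assign high probability to each target token given the inputs up to that position. Whether all weights or only adapter weights are updated is a separate choice and does not change how the examples are built.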

Currently, we are still exploring how best to combine the new corpus and the original ROOTS corpus.

Hi, I have been doing exactly what you are talking about with my open source mobile text editor, Maker+ Ci. I have built an AI chat interface on top of Maker+ Ci called chatLink, which uses files stored in M+ to set the context/pre-prompt for the bloom model, the LLM that chatLink uses. In chatLink the user can easily select the file they want to set as context, which tells bloom how to behave when a prompt is submitted.

I have created lots of cool prompts for bloom, including using it as a terminal, a mental health chatbot, an AI personal assistant, chatbots with different personalities such as the INTJ and INFP personalities, and, most recently, a Q&A model built from prompts with no fine-tuning. You can jump between all these contexts easily in chatLink.

I'm happy to share all these prompts and to teach how to use Maker+ Ci and chatLink, which are both open source. If you are interested, please join my Discord server at https://discord.gg/47pXk7CY and feel free to ask any questions about how to work with bloom in your own projects or in Maker+ Ci and chatLink.
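As a rough illustration of the file-as-context idea described above, the sketch below reads a context file and places it in front of the user's message before the text is sent to a model. This is not chatLink's actual implementation: `load_context`, `build_prompt`, the file name, and the `User:`/`Assistant:` layout are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def load_context(path: Path) -> str:
    """Read a context (pre-prompt) file and trim surrounding whitespace."""
    return path.read_text(encoding="utf-8").strip()


def build_prompt(context: str, user_message: str) -> str:
    """Place the context before the user's message, separated by a blank line."""
    return f"{context}\n\nUser: {user_message}\nAssistant:"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        context_file = Path(tmp) / "qna_context.txt"  # hypothetical file name
        context_file.write_text(
            "You answer questions about pump maintenance.\n", encoding="utf-8"
        )
        prompt = build_prompt(
            load_context(context_file), "How often should seals be replaced?"
        )
        print(prompt)
```

Switching behavior then amounts to loading a different context file; the model weights stay unchanged, which is what separates this approach from fine-tuning.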
