Could one potentially train a mini-model based on this concept on synthetic structural data?

#11
by Mr8BitHK - opened

We have a large relational database describing biomedical relationships, which could be translated to synthetic data programmatically. Could this concept be used to train on such a dataset for Q&A?
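
To make "translated to synthetic data programmatically" concrete, here is a minimal sketch of one way it might look. The `interactions` table, its columns, the placeholder gene names, and the question template are all assumptions for illustration, not the actual schema.

```python
import sqlite3

# Hypothetical schema and placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (gene TEXT, partner TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [
        ("GENE_A", "GENE_X", "binds"),
        ("GENE_A", "GENE_Y", "inhibits"),
        ("GENE_B", "GENE_X", "activates"),
    ],
)


def make_qa_pairs(conn):
    """Turn each gene's interaction rows into one templated Q&A pair."""
    genes = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT gene FROM interactions ORDER BY gene"
        )
    ]
    pairs = []
    for gene in genes:
        rows = conn.execute(
            "SELECT partner, kind FROM interactions "
            "WHERE gene = ? ORDER BY partner",
            (gene,),
        ).fetchall()
        # Join all facts for this gene into a single answer sentence.
        facts = "; ".join(f"{kind} {partner}" for partner, kind in rows)
        pairs.append(
            {
                "question": f"Which interactions does {gene} have?",
                "answer": f"{gene} {facts}.",
            }
        )
    return pairs


qa_pairs = make_qa_pairs(conn)
```

Each dictionary in `qa_pairs` would be one training example. A real pipeline would presumably need many templates per relation type so the model does not just memorise one phrasing.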

I think it depends. If you want your model to be able to generalise, it could be interesting to try. If you only want to perform information retrieval, I suggest running retrieval-augmented generation directly on your database, using a small LLM like phi to process the information in the context and answer the question.

Thanks for the quick response. Our problem with RAG is finding a balance between K (number of results returned) and context length, not even taking into account ontology. For example: genes could have hundreds of interactions, tens of thousands of citations, a few locations. Simple questions like which interactions are unique between two genes can fail. General embedding + vector databases have performed poorly on our use case as a result. This is why we are thinking of building a model instead.

I see. One potential idea could be to try a code LLM that is trained to translate questions into SQL queries against your DB.
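
As a rough sketch of why SQL suits the "unique interactions between two genes" example: the question is a set difference, which a query can answer exactly regardless of how many rows each gene has. The table, columns, and gene names below are hypothetical, and the query is written by hand here to stand in for what a text-to-SQL model would be expected to produce.

```python
import sqlite3

# Hypothetical schema and placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (gene TEXT, partner TEXT)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?)",
    [
        ("GENE_A", "GENE_X"),
        ("GENE_A", "GENE_Y"),
        ("GENE_B", "GENE_X"),
        ("GENE_B", "GENE_Z"),
    ],
)

# Partners of :first that are not partners of :second (a set difference).
UNIQUE_PARTNERS_SQL = """
SELECT partner FROM interactions WHERE gene = :first
EXCEPT
SELECT partner FROM interactions WHERE gene = :second
ORDER BY partner
"""


def unique_partners(conn, first, second):
    """Return partners that interact with `first` but not with `second`."""
    cursor = conn.execute(
        UNIQUE_PARTNERS_SQL, {"first": first, "second": second}
    )
    return [row[0] for row in cursor]


only_a = unique_partners(conn, "GENE_A", "GENE_B")
only_b = unique_partners(conn, "GENE_B", "GENE_A")
```

Because the database does the filtering, only the final result needs to fit in the LLM's context, which sidesteps the K versus context length trade-off.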

We have a large relational database describing biomedical relationships, which could be translated to synthetic data programmatically. Could this concept be used to train on such a dataset for Q&A?

Yes

gugarosa changed discussion status to closed
