Could one potentially train a mini-model based on this concept on synthetic structural data?
We have a large relational database describing biomedical relationships, which could be translated into synthetic data programmatically. Could this concept be used to train on such a dataset for Q&A?
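One way to generate that synthetic data programmatically is to template relational rows into question/answer pairs. A minimal sketch, assuming a hypothetical `interactions` relation of gene pairs (the table, gene names, and question template are placeholders, not anything from your schema):

```python
# Sketch: turning relational rows into synthetic Q&A pairs for fine-tuning.
# The relation, gene names, and question template are hypothetical.

# Hypothetical rows from an interactions table: (gene_a, gene_b)
interactions = [
    ("TP53", "MDM2"),
    ("TP53", "EP300"),
    ("BRCA1", "BARD1"),
]

def make_qa_pairs(rows):
    """Template each gene's interaction partners into a Q&A pair."""
    by_gene = {}
    for a, b in rows:
        # Interactions are symmetric: record both directions.
        by_gene.setdefault(a, set()).add(b)
        by_gene.setdefault(b, set()).add(a)
    pairs = []
    for gene, partners in sorted(by_gene.items()):
        pairs.append({
            "question": f"Which genes does {gene} interact with?",
            "answer": ", ".join(sorted(partners)),
        })
    return pairs

qa = make_qa_pairs(interactions)
```

In practice you would emit many such templates per relation type (interactions, locations, citations) so the model sees each fact phrased several ways.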
I think it depends. If you want your model to be able to generalise, it could be interesting to try. If you only want to perform information retrieval, I would suggest retrieval-augmented generation performed directly on your database, using a small LLM like Phi to elaborate on the information in the context and provide the answer to the question.
Thanks for the quick response. Our problem with RAG is finding a balance between K (the number of results returned) and context length, without even taking ontology into account. For example: genes can have hundreds of interactions, tens of thousands of citations, and a few locations. Simple questions like "which interactions are unique between two genes?" can fail. General-purpose embeddings + vector databases have performed poorly on our use case as a result. This is why we are thinking of building a model instead.
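The "unique interactions" question is a good illustration of why retrieval struggles here: it is really a set operation over complete interaction lists, which top-K retrieval rarely returns in full, but which is trivial against the database itself. A minimal sketch with hypothetical gene names and interaction sets:

```python
# Sketch: "which interactions are unique between two genes?" is a plain
# set difference over complete interaction lists. Top-K retrieval can miss
# partners, but the database answers it exactly. Data here is hypothetical.
interactions = {
    "TP53": {"MDM2", "EP300", "ATM"},
    "BRCA1": {"BARD1", "ATM", "RAD51"},
}

def unique_interactions(gene_a, gene_b, table):
    """Partners of each gene that the other gene lacks."""
    a, b = table[gene_a], table[gene_b]
    return {gene_a: a - b, gene_b: b - a}

result = unique_interactions("TP53", "BRCA1", interactions)
# result["TP53"] holds TP53-only partners; the shared partner ATM is excluded.
```

The failure mode is that an embedding retriever returns K chunks that cover only a fraction of the hundreds of partners, so the generated answer is built from incomplete sets.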
I see. One potential idea could be to try a code LLM trained to translate questions into SQL queries against your db.
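The pipeline that suggestion implies could be sketched as: question → generated SQL → execute → answer. Here the model call is stubbed with a canned query, and the schema and data are hypothetical; in practice the stub would be replaced by a code LLM fine-tuned on (question, SQL) pairs generated from your schema:

```python
# Sketch of a question -> SQL -> answer pipeline against the database.
# fake_text_to_sql stands in for the code LLM; schema/data are hypothetical.
import sqlite3

def fake_text_to_sql(question: str) -> str:
    # Stand-in for the model; returns a canned query for the demo.
    return (
        "SELECT partner FROM interactions "
        "WHERE gene = 'TP53' ORDER BY partner"
    )

# Toy in-memory database with a hypothetical interactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (gene TEXT, partner TEXT)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?)",
    [("TP53", "MDM2"), ("TP53", "EP300"), ("BRCA1", "BARD1")],
)

sql = fake_text_to_sql("Which genes does TP53 interact with?")
answer = [row[0] for row in conn.execute(sql)]
```

A nice property of this approach is that the database, not the model, holds the facts, so the hundreds-of-interactions problem above disappears: the query returns the complete set regardless of context length.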
> We have a large relational database describing biomedical relationships, which could be translated into synthetic data programmatically. Could this concept be used to train on such a dataset for Q&A?
Yes