Could one potentially train a mini-model based on this concept on synthetic structural data?
We have a large relational database describing biomedical relationships, which could be translated into synthetic data programmatically. Could this concept be used to train on such a dataset for Q&A?
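One way to generate that synthetic data programmatically is to template relational rows into question/answer pairs. A minimal sketch, assuming a hypothetical `interactions` relation of gene pairs (the table, gene names, and question template are placeholders, not anything from your schema):

```python
# Sketch: turning relational rows into synthetic Q&A pairs for fine-tuning.
# The relation, gene names, and question template are hypothetical.

# Hypothetical rows from an interactions table: (gene_a, gene_b)
interactions = [
    ("TP53", "MDM2"),
    ("TP53", "EP300"),
    ("BRCA1", "BARD1"),
]

def make_qa_pairs(rows):
    """Template each gene's interaction partners into a Q&A pair."""
    by_gene = {}
    for a, b in rows:
        # Interactions are symmetric: record both directions.
        by_gene.setdefault(a, set()).add(b)
        by_gene.setdefault(b, set()).add(a)
    pairs = []
    for gene, partners in sorted(by_gene.items()):
        pairs.append({
            "question": f"Which genes does {gene} interact with?",
            "answer": ", ".join(sorted(partners)),
        })
    return pairs

qa = make_qa_pairs(interactions)
```

In practice you would emit many such templates per relation type (interactions, locations, citations) so the model sees each fact phrased several ways.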
I think it depends. If you want your model to be able to generalise, it could be interesting to try. If you only want to perform information retrieval, I would suggest retrieval-augmented generation performed directly on your database, using a small LLM like Phi to elaborate on the information in the context and provide the answer to the question.
Thanks for the quick response. Our problem with RAG is finding a balance between K (the number of results returned) and context length, without even taking ontology into account. For example: genes can have hundreds of interactions, tens of thousands of citations, and a few locations. Simple questions like "which interactions are unique between two genes?" can fail. General-purpose embeddings + vector databases have performed poorly on our use case as a result. This is why we are thinking of building a model instead.
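The "unique interactions" question is a good illustration of why retrieval struggles here: it is really a set operation over complete interaction lists, which top-K retrieval rarely returns in full, but which is trivial against the database itself. A minimal sketch with hypothetical gene names and interaction sets:

```python
# Sketch: "which interactions are unique between two genes?" is a plain
# set difference over complete interaction lists. Top-K retrieval can miss
# partners, but the database answers it exactly. Data here is hypothetical.
interactions = {
    "TP53": {"MDM2", "EP300", "ATM"},
    "BRCA1": {"BARD1", "ATM", "RAD51"},
}

def unique_interactions(gene_a, gene_b, table):
    """Partners of each gene that the other gene lacks."""
    a, b = table[gene_a], table[gene_b]
    return {gene_a: a - b, gene_b: b - a}

result = unique_interactions("TP53", "BRCA1", interactions)
# result["TP53"] holds TP53-only partners; the shared partner ATM is excluded.
```

The failure mode is that an embedding retriever returns K chunks that cover only a fraction of the hundreds of partners, so the generated answer is built from incomplete sets.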
I see. One potential idea could be to try a code LLM trained to translate questions into SQL queries against your db.
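The pipeline that suggestion implies could be sketched as: question → generated SQL → execute → answer. Here the model call is stubbed with a canned query, and the schema and data are hypothetical; in practice the stub would be replaced by a code LLM fine-tuned on (question, SQL) pairs generated from your schema:

```python
# Sketch of a question -> SQL -> answer pipeline against the database.
# fake_text_to_sql stands in for the code LLM; schema/data are hypothetical.
import sqlite3

def fake_text_to_sql(question: str) -> str:
    # Stand-in for the model; returns a canned query for the demo.
    return (
        "SELECT partner FROM interactions "
        "WHERE gene = 'TP53' ORDER BY partner"
    )

# Toy in-memory database with a hypothetical interactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (gene TEXT, partner TEXT)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?)",
    [("TP53", "MDM2"), ("TP53", "EP300"), ("BRCA1", "BARD1")],
)

sql = fake_text_to_sql("Which genes does TP53 interact with?")
answer = [row[0] for row in conn.execute(sql)]
```

A nice property of this approach is that the database, not the model, holds the facts, so the hundreds-of-interactions problem above disappears: the query returns the complete set regardless of context length.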
> We have a large relational database describing biomedical relationships, which could be translated into synthetic data programmatically. Could this concept be used to train on such a dataset for Q&A?
Yes