---
title: README
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
---

**💪 Feel free to join the organization if you want to add a dataset with a similar purpose :) Please [tell me](https://tillwenke.github.io/about/) about your dataset before asking to join the org.**

To test your **RAG** and other **semantic information retrieval solutions**, it helps to have a dataset that consists of a text corpus, correct responses to queries (e.g. question-answer pairs) for testing the solution end-to-end, and ideally a set of relevant passages from the corpus for each query, so that the retrieval component can be tested separately as well. We call this a question-answer-passages dataset.

There are plenty of large-scale datasets of this kind, such as [Google's Natural Questions](https://ai.google.com/research/NaturalQuestions/). Still, we lack datasets that are **small-scale** and **narrow-domain**, which would let us test a RAG solution quickly or see how it performs in a specific domain.

We created this space to build a collection of such datasets to boost the development of RAG solutions, and we welcome any feedback on what your ideal RAG dataset would look like. :)

Datasets consist of:

* A **text corpus** already split into passages, referencing passages by id.
* A dataset for testing consisting of:
  * A **question**, and one or ideally both of the following:
    * A correct **short answer**.
    * A **list of the passage ids** that are relevant to answer the question.
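
The layout described above could be represented, for example, as follows. This is a minimal sketch in Python, not a schema defined by this organization: the field names (`question`, `short_answer`, `relevant_passage_ids`), the toy passages, and the two helper metrics `recall_at_k` and `exact_match` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QAExample:
    """One test record: a question plus a short answer and/or relevant passage ids."""
    question: str
    short_answer: Optional[str] = None
    relevant_passage_ids: List[int] = field(default_factory=list)


# Hypothetical toy corpus, already split into passages: passage id -> text.
corpus: Dict[int, str] = {
    0: "Koalas are marsupials native to Australia.",
    1: "Koalas feed mostly on eucalyptus leaves.",
    2: "Kangaroos can hop at high speed.",
}

# Hypothetical test set referencing the corpus passages by id.
examples: List[QAExample] = [
    QAExample("What do koalas eat?", "eucalyptus leaves", [1]),
    QAExample("Where are koalas native to?", "Australia", [0]),
]


def recall_at_k(retrieved_ids: List[int], relevant_ids: List[int], k: int) -> float:
    """Fraction of the relevant passages found among the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("example has no relevant passage ids")
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def exact_match(predicted: str, expected: str) -> bool:
    """Case- and whitespace-insensitive comparison of two short answers."""
    return predicted.strip().lower() == expected.strip().lower()
```

With such a layout, the passage ids allow the retrieval component to be scored on its own (here via recall over the top-k retrieved passages), while the short answer allows an end-to-end check of the generated response. Exact match is only one possible answer metric; a looser comparison may suit longer answers better.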