title: README
emoji: π¨
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
**Feel free to join the organization if you want to add a dataset with a similar purpose :) Please [tell me](https://tillwenke.github.io/about/) about your dataset before asking to join the org.**
To test your **RAG** and other **semantic information retrieval solutions**, it helps to have a dataset that consists of a text corpus,
correct responses to queries (e.g. question-answer pairs) to test the solution end-to-end, and ideally a set of relevant passages
from the text corpus for each query to test the retrieval component separately as well.
We call this a question-answer-passages dataset.
There are plenty of large-scale datasets of this kind such as [Google's Natural Questions](https://ai.google.com/research/NaturalQuestions/).
Still, we lack such datasets that are **small-scale** and **narrow-domain**, which would let us test a RAG solution quickly or see how it performs
in a specific domain.
We created this space to build a collection of such datasets to boost the development of RAG solutions, and we welcome any feedback on what your ideal RAG dataset would look like. :)
Datasets consist of:
* A **text corpus** already split into passages, referencing passages by id.
* A dataset for testing consisting of:
  * A **question**, and one or ideally both of the following:
    * A correct **short answer**.
    * A **list of the passage ids** that are relevant to answer the question.
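
To make the structure above concrete, here is a minimal Python sketch of how such a dataset could be represented and used. The class names, field names (`Passage`, `QAExample`, `relevant_passage_ids`), the toy data, and the two scoring helpers are illustrative assumptions, not a schema that the datasets in this organization are required to follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Passage:
    """One entry of the text corpus (hypothetical field names)."""
    id: int
    text: str


@dataclass
class QAExample:
    """One test entry: a question plus an answer and/or relevant passage ids."""
    question: str
    answer: Optional[str] = None
    relevant_passage_ids: List[int] = field(default_factory=list)


def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant passages that the retriever returned.

    Returns None when the example has no relevant passage ids,
    because the retrieval component cannot be scored in that case.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    return len(relevant & set(retrieved_ids)) / len(relevant)


def exact_match(predicted, reference):
    """Case- and whitespace-insensitive comparison of two short answers."""
    return predicted.strip().lower() == reference.strip().lower()


# Toy data, made up for illustration only.
corpus = [
    Passage(0, "The Eiffel Tower is located in Paris."),
    Passage(1, "Paris is the capital of France."),
    Passage(2, "Berlin is the capital of Germany."),
]
example = QAExample(
    question="What is the capital of France?",
    answer="Paris",
    relevant_passage_ids=[1],
)

# Stand-ins for the output of a real RAG pipeline.
retrieved_ids = [1, 2]
generated_answer = "paris"

print(retrieval_recall(retrieved_ids, example.relevant_passage_ids))
print(exact_match(generated_answer, example.answer))
```

The passage ids score the retrieval component on its own, while the short answer scores the solution end-to-end, which is why having both per question is the ideal case.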