shuttie committed
Commit 51e9929
1 Parent(s): 6335399

add link to doc2query paper

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -32,6 +32,8 @@ A [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) fine-tuned
  * synthetic query generation for downstream embedding fine-tuning tasks - when you have only documents and no queries/labels. Such a task can be done with the [nixietune](https://github.com/nixiesearch/nixietune) toolkit; see the `nixietune.qgen.generate` recipe.
  * synthetic dataset expansion for further embedding training - when you DO have query-document pairs, but only a few. You can fine-tune `nixie-querygen-v2` on existing pairs and then expand your document corpus with synthetic queries (which are still based on your few real ones). See the `nixietune.qgen.train` recipe.
 
+ The idea behind this approach is taken from the [docT5query](https://github.com/castorini/docTTTTTquery) model. See the original paper: [Rodrigo Nogueira and Jimmy Lin. From doc2query to docTTTTTquery.](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf)
+
  ## Training data
 
  We used [200k query-document pairs](https://huggingface.co/datasets/nixiesearch/query-positive-pairs-small) sampled randomly from a diverse set of IR datasets:
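
Both use cases in the diff above reduce to prompting the model with a document and sampling a query from it. A minimal sketch of that generation step with the `transformers` library follows; the `short query:` prompt suffix and the sampling parameters are assumptions for illustration, and the nixietune `qgen` recipes define the actual prompt format used in training.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nixiesearch/nixie-querygen-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

document = "Hugging Face is a platform for sharing machine learning models and datasets."
# NOTE: the "short query:" suffix is an assumed prompt format for illustration;
# consult the model card / nixietune qgen recipes for the real one.
prompt = f"{document} short query:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, i.e. the synthetic query.
query = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(query)
```

Sampling (rather than greedy decoding) mirrors the doc2query idea of producing several plausible queries per document, which is what makes the expanded corpus useful for embedding fine-tuning.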