Request: Dataset access or script to recreate it?

#1
by smbow - opened

Hi there, thank you for hosting this more cross-domain model for the topical change task, which builds on the legal documents dataset work.

Would you be willing to share the dataset or the script to recreate it? I am aiming to achieve similar results with a different model architecture and would be grateful if you could open-source the dataset.

Hi Sam,
thanks for your interest in our work! I'm happy to share the script that we ultimately used for training.
As for the data, I am not entirely sure whether we still have it; however, I should be able to provide you with the scripts that we used to extract it from a common Wikipedia dump.

Since I currently have a deadline, I will likely only be able to follow up on this within the next week (I think I should have time by Wednesday).
Best,
Dennis

Hi Sam,
a quick update from my side: I had to do some digging, but finally found the relevant data sources we used. They are unfortunately quite large (21 GB for training alone), so it might take some time for me to process and upload the dataset. I'll keep you posted on the progress!
Best,
Dennis

Thank you Dennis, and thanks for the updates!

Hi Sam,
final update for now: I've published (and linked) the dataset we used for training; you can find it here.
I'm currently struggling with some limitations due to the file size, but will keep everything documented there and should be able to upload the missing training portion shortly.
I have also included a few more details in the model card, which should help clarify some of the limitations of our model :)
Best of luck with your own training!
Dennis

For anyone finding this thread in the future: the publicly available version of our training dataset is at https://huggingface.co/datasets/dennlinger/wiki-paragraphs

dennlinger changed discussion status to closed
