Request: Dataset access or script to recreate it?

by smbow - opened Oct 8, 2022

Oct 8, 2022

Hi there, thank you for hosting this more cross-domain model of the topical change task, building upon the legal documents dataset work.

Would you be willing to share the dataset or the script used to create it? I am aiming to use a different model architecture to achieve similar results and would be grateful if you could open-source the dataset.

dennlinger

Owner Oct 8, 2022

Hi Sam,
thanks for your interest in our work! I'll happily share the script that we ultimately used for training.
As for the data, I am currently not entirely sure whether we still have that around; however, I should be able to provide you with the scripts that we used to extract it from a common Wikipedia dump.

Given that I currently have a deadline, I will likely only be able to follow up on this within the next week (I think I should have time by Wednesday).
Best,
Dennis

dennlinger

Owner Oct 13, 2022

Hi Sam,
a quick update from my side: I had to do some digging, but finally found the relevant data sources we used. They are unfortunately quite large (21 GB for training alone), so it might take some time for me to process and upload the dataset. I'll keep you posted on the progress!
Best,
Dennis

smbow

Oct 13, 2022

Thank you Dennis, and thanks for the updates!

dennlinger

Owner Oct 13, 2022

Hi Sam,
final update for now: I've published (and linked) the dataset we used for training; you can find it here.
I'm currently struggling with some limitations due to the file size, but will keep everything documented there and should be able to upload the missing training portion shortly.
I have also included a few more details in the model card, which should help clarify some of the limitations of our model :)
Best of luck for your own training!
Dennis

dennlinger

Owner Jan 31, 2023

For others in the future, I again point to our publicly available version of the training dataset at this URL: https://huggingface.co/datasets/dennlinger/wiki-paragraphs
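
For anyone who wants to try it, here is a minimal sketch of how the dataset could be loaded. It assumes the Hugging Face `datasets` library is installed and that the repository exposes a `train` split; the split name and the helper `load_wiki_paragraphs` are illustrative assumptions, not something confirmed in this thread, so please check the dataset card first. Streaming is used here because the training portion is large (21 GB, as mentioned above).

```python
# Dataset repository named in this thread.
DATASET_ID = "dennlinger/wiki-paragraphs"


def load_wiki_paragraphs(split="train", streaming=True):
    """Return one split of the dataset (hypothetical helper).

    Assumptions: the `datasets` library is installed and the repository
    exposes a split with the given name (check the dataset card).
    """
    # Imported lazily so that defining this helper needs no extra packages.
    from datasets import load_dataset

    # streaming=True reads records on the fly instead of downloading
    # the full dataset before the first record is available.
    return load_dataset(DATASET_ID, split=split, streaming=streaming)
```

Calling `next(iter(load_wiki_paragraphs()))` should then return the first record, which is a quick way to inspect the available fields before committing to a full download.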

dennlinger changed discussion status to closed Jan 31, 2023
