Patrick von Platen
LibriSpeech is a corpus of approximately 1000 hours of read English speech sampled at 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from audiobooks read for the LibriVox project and has been carefully segmented and aligned. Note that in order to limit the required storage for preparin...
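To give a rough sense of scale for the figures quoted above, the following sketch converts a duration at 16 kHz into a sample count and an uncompressed size. The 2 bytes per sample (16-bit mono PCM) is an assumption made only for this estimate, not a statement about how the corpus is stored, and the helper name `num_samples` is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the description
CORPUS_HOURS = 1000       # approximate corpus length stated in the description
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 16-bit mono PCM, used only for this estimate


def num_samples(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Return the number of audio samples in `duration_s` seconds."""
    return int(round(duration_s * sample_rate_hz))


# Total samples and uncompressed size for roughly 1000 hours of audio.
total_samples = num_samples(CORPUS_HOURS * 3600)
total_bytes = total_samples * BYTES_PER_SAMPLE

print(f"samples: {total_samples:,}")
print(f"uncompressed size: {total_bytes / 1e9:.1f} GB (under the 16-bit assumption)")
```

Under these assumptions the raw audio alone is on the order of a hundred gigabytes, so storage is worth planning for when preparing the data.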
The scientific papers dataset contains two sets of long, structured documents obtained from the ArXiv and PubMed OpenAccess repositories. Both "arxiv" and "pubmed" have two features: - article: the body of the document, paragraphs separated by "\n". - abstract: the abstract of the document, paragraphs separated by "\n". - ...
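As a minimal sketch of how the two fields described above might be consumed, the snippet below splits `article` and `abstract` back into paragraphs. It assumes the separator is the newline character; the record is a hypothetical one invented for illustration, and `split_paragraphs` is a helper name introduced here, not part of any library.

```python
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split a field into paragraphs on newlines, dropping empty pieces."""
    return [part.strip() for part in text.split("\n") if part.strip()]


# HYPOTHETICAL record, made up for illustration; real records are much longer.
record: Dict[str, str] = {
    "article": "First paragraph of the body.\nSecond paragraph of the body.",
    "abstract": "A one-paragraph abstract.",
}

article_paragraphs = split_paragraphs(record["article"])
abstract_paragraphs = split_paragraphs(record["abstract"])

print(len(article_paragraphs), "article paragraphs")
print(len(abstract_paragraphs), "abstract paragraphs")
```

Dropping empty pieces is a design choice for this sketch: it keeps stray blank lines from turning into empty paragraphs.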