---
title: 🔍Wikipedia Twitter ChatGPT Memory Chat🏊
emoji: 🌟GPT🔍
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 3.24.1
app_file: app.py
pinned: false
license: mit
---
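
This Space runs a Gradio app (`app.py`, not shown here). For orientation, the following is a minimal, purely illustrative skeleton of a Gradio text-in/text-out app compatible with the 3.x SDK declared above; the `respond` function and its placeholder logic are hypothetical stand-ins for the real app's Wikipedia/Twitter/ChatGPT wiring.

```python
# Minimal illustrative Gradio skeleton; NOT the actual app.py of this Space.
import gradio as gr

def respond(message: str) -> str:
    # Placeholder logic; the real Space would route this through
    # Wikipedia search, Twitter, and ChatGPT with memory.
    return f"You said: {message}"

demo = gr.Interface(
    fn=respond,
    inputs="text",
    outputs="text",
    title="Wikipedia Twitter ChatGPT Memory Chat",
)

if __name__ == "__main__":
    demo.launch()
```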
|
## ChatGPT Datasets 📚
|
- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText
|
## ChatGPT Datasets - Details 📚
|
- **WebText:** A dataset of web pages scraped from outbound links in Reddit posts with at least 3 karma. This dataset was used to pretrain GPT-2.
  - [Language Models are Unsupervised Multitask Learners](https://paperswithcode.com/dataset/webtext) by Radford et al.
|
- **Common Crawl:** A regularly updated crawl of web pages from a wide variety of domains; a filtered version of this dataset was used to pretrain GPT-3.
  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
|
- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
  - [Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
|
- **English Wikipedia:** A dump of English-language Wikipedia taken in 2018, covering articles written from 2001 through 2017.
  - [WikipediaUltimateAISearch](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch): a Hugging Face Space for Wikipedia search.
|
- **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by researchers at the University of Toronto.
  - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://paperswithcode.com/dataset/bookcorpus) by Artetxe and Schwenk.
|
- **OpenWebText:** An open-source recreation of the WebText corpus, built from web pages filtered to remove low-quality and spammy content; it is commonly used to train open replications of GPT-2.
  - [OpenWebText Corpus](https://paperswithcode.com/dataset/openwebtext) by Gokaslan and Cohen.
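
Several of these corpora, or open recreations of them, can be explored with the Hugging Face `datasets` library. Below is a minimal sketch, assuming the `openwebtext` and `wikipedia` datasets remain hosted under those Hub IDs with the configs shown; exact names and configs may change over time.

```python
# Minimal sketch: stream samples from two of the corpora above via the
# Hugging Face `datasets` library. Hub IDs and config names are assumptions;
# streaming avoids downloading the full corpora up front.
from datasets import load_dataset

# OpenWebText: an open recreation of OpenAI's WebText corpus.
openwebtext = load_dataset("openwebtext", split="train", streaming=True)

# English Wikipedia: a preprocessed dump (the config name is the dump date).
wikipedia = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

# Peek at the first record from each stream.
print(next(iter(openwebtext))["text"][:200])
print(next(iter(wikipedia))["title"])
```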
|
|