---
title: 🔍Wikipedia Twitter ChatGPT Memory Chat🏊
emoji: 🌟GPT🔍
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 3.24.1
app_file: app.py
pinned: false
license: mit
---
## ChatGPT Datasets 📚
- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText
## ChatGPT Datasets - Details 📚
- WebText: A dataset of text scraped from web pages linked in Reddit posts that received at least three karma. This dataset was used to pretrain GPT-2.
- Common Crawl: A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
  - Reference: Language Models are Few-Shot Learners by Brown et al.
- BooksCorpus: A dataset of over 11,000 books from a variety of genres.
  - Reference: Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books by Zhu et al.
- English Wikipedia: A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
  - Reference: Improving Language Understanding by Generative Pre-Training by Radford et al.
- Toronto Books Corpus: A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
- OpenWebText: An open-source recreation of the WebText dataset, built from web pages filtered to remove content likely to be low-quality or spammy.
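As a quick reference, the corpora listed above can be kept in a small lookup table inside the app. This is a minimal sketch: the `DATASET_NOTES` name and `describe` helper are illustrative choices for this README, not part of any library API.

```python
# One-line notes for the pretraining corpora described in this README.
# The dictionary name and structure are illustrative, not an official API.
DATASET_NOTES = {
    "WebText": "Reddit-linked web pages; used to pretrain GPT-2",
    "Common Crawl": "Broad, regularly updated web crawl; used to pretrain GPT-3",
    "BooksCorpus": "Over 11,000 books from a variety of genres",
    "English Wikipedia": "2018 dump of English-language articles (2001-2017)",
    "Toronto Books Corpus": "Over 7,000 books collected by the University of Toronto",
    "OpenWebText": "Open recreation of WebText with quality filtering",
}

def describe(name: str) -> str:
    """Return the note for a dataset, or a fallback string for unknown names."""
    return DATASET_NOTES.get(name, f"{name}: no notes available")

if __name__ == "__main__":
    for name, note in DATASET_NOTES.items():
        print(f"- {name}: {note}")
```

A table like this could back a dropdown or info panel in the Gradio app so users can see which corpus a search result is drawn from.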