awacke1 committed
Commit 112aad3
• 1 Parent(s): 2434827

Update app.py

Files changed (1)
  1. app.py +154 -2
app.py CHANGED
@@ -151,6 +151,154 @@ def text_generate_old(prompt, generated_txt):
 
  display_output = display_output[:-1]
  return display_output, new_prompt
 
+
+ Markdown = """
+
+
+ # 2023 Bloom Spaces
+
+ 1. Model: https://huggingface.co/bigscience/bloom
+ 2. Bloom Theme Generator: https://huggingface.co/spaces/awacke1/Write-Stories-Using-Bloom
+ 3. Bloom Ghostwriter: https://huggingface.co/spaces/awacke1/Bloom.Generative.Writer
+ 4. https://huggingface.co/spaces/awacke1/Bloom.Human.Feedback.File.Ops
+ 5. https://huggingface.co/spaces/awacke1/04-AW-StorywriterwMem
+
+ 🌸 🔎 Bloom Searcher 🔍 🌸
+
+ Tool design for Roots: [URL](https://huggingface.co/spaces/bigscience-data/scisearch/blob/main/roots_search_tool_specs.pdf).
+
+ Bloom on Wikipedia: [URL](https://en.wikipedia.org/wiki/BLOOM_(language_model)).
+
+ Bloom Video Playlist: [URL](https://www.youtube.com/playlist?list=PLHgX2IExbFouqnsIqziThlPCX_miiDq14).
+
+ To request access to the full corpus, see [URL](https://forms.gle/qyYswbEL5kA23Wu99).
+
+ Big Science - How to get started
+
+ Big Science (BLOOM) is a new 176B-parameter ML model trained on a large set of datasets for natural language processing, with many other tasks not yet explored. Below is a set of papers, models, links, and datasets around Big Science, one of the largest and most recent open models of its kind, benefiting a wide range of scientific pursuits.
+
+ Model: https://huggingface.co/bigscience/bloom
+
+ Papers:
+ BLOOM: A 176B-Parameter Open-Access Multilingual Language Model https://arxiv.org/abs/2211.05100
+ Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism https://arxiv.org/abs/1909.08053
+ 8-bit Optimizers via Block-wise Quantization https://arxiv.org/abs/2110.02861
+ Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation https://arxiv.org/abs/2108.12409
+ https://huggingface.co/models?other=doi:10.57967/hf/0003
+ 217 other models that specialize Bloom for particular uses: https://huggingface.co/models?other=bloom
+
+ Datasets:
+ Universal Dependencies: https://paperswithcode.com/dataset/universal-dependencies
+ WMT 2014: https://paperswithcode.com/dataset/wmt-2014
+ The Pile: https://paperswithcode.com/dataset/the-pile
+ HumanEval: https://paperswithcode.com/dataset/humaneval
+ FLORES-101: https://paperswithcode.com/dataset/flores-101
+ CrowS-Pairs: https://paperswithcode.com/dataset/crows-pairs
+ WikiLingua: https://paperswithcode.com/dataset/wikilingua
+ MTEB: https://paperswithcode.com/dataset/mteb
+ xP3: https://paperswithcode.com/dataset/xp3
+ DiaBLa: https://paperswithcode.com/dataset/diabla
+
+ Evals:
+ https://github.com/AaronCWacker/evals
+
+ ## Language Models 🗣️
+ 🏆 Bloom is one of the largest and most capable open-access multilingual language models available to science! 🌸
+ ### Comparison of Large Language Models
+ | Model Name | Model Size (in Parameters) |
+ | ----------------- | -------------------------- |
+ | BigScience-tr11-176B | 176 billion |
+ | GPT-3 | 175 billion |
+ | OpenAI's DALL-E 2.0 | 500 million |
+ | NVIDIA's Megatron | 8.3 billion |
+ | Transformer-XL | 250 million |
+ | XLNet | 210 million |
+
+ ## ChatGPT Datasets 📚
+ - WebText
+ - Common Crawl
+ - BooksCorpus
+ - English Wikipedia
+ - Toronto Books Corpus
+ - OpenWebText
+
+ ## ChatGPT Datasets - Details 📚
+ - **WebText:** A dataset of web pages scraped from outbound Reddit links with at least three karma. This dataset was used to pretrain GPT-2.
+   - [Language Models are Unsupervised Multitask Learners](https://paperswithcode.com/dataset/webtext) by Radford et al.
+ - **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
+   - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
+ - **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
+   - [Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
+ - **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
+   - [Wikipedia Ultimate AI Search](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build), a Space for Wikipedia search
+ - **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
+   - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://paperswithcode.com/dataset/bookcorpus) by Schwenk and Douze.
+ - **OpenWebText:** An open-source recreation of the WebText corpus, filtered to remove content that was likely to be low-quality or spammy.
+   - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/openwebtext) by Brown et al.
+
+ ## Big Science Model 🚀
+ - 📜 Papers:
+ 1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](https://arxiv.org/abs/2211.05100)
+ 2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](https://arxiv.org/abs/1909.08053)
+ 3. 8-bit Optimizers via Block-wise Quantization [Paper](https://arxiv.org/abs/2110.02861)
+ 4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](https://arxiv.org/abs/2108.12409)
+ 5. [Other papers related to Big Science](https://huggingface.co/models?other=doi:10.57967/hf/0003)
+ 6. [217 other models optimized for use with Bloom](https://huggingface.co/models?other=bloom)
+
+ - 📚 Datasets:
+ 1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
+   - [Universal Dependencies official website.](https://universaldependencies.org/)
+ 2. **WMT 2014:** The 2014 edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
+   - [WMT14 website.](http://www.statmt.org/wmt14/)
+ 3. **The Pile:** An English language corpus of diverse text, sourced from various places on the internet.
+   - [The Pile official website.](https://pile.eleuther.ai/)
+ 4. **HumanEval:** A benchmark of 164 hand-written Python programming problems, used to evaluate the functional correctness of code generated by language models.
+   - [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374) by Chen et al.
+ 5. **FLORES-101:** A dataset of parallel sentences in 101 languages, designed for multilingual machine translation.
+   - [FLORES-101 repository](https://github.com/facebookresearch/flores) by Goyal et al.
+ 6. **CrowS-Pairs:** A dataset of sentence pairs designed for measuring social biases in masked language models.
+   - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models](https://github.com/nyu-mll/crows-pairs) by Nangia et al.
+ 7. **WikiLingua:** A dataset of article/summary pairs in 18 languages, drawn from WikiHow and designed for cross-lingual abstractive summarization.
+   - [WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization](https://paperswithcode.com/dataset/wikilingua) by Ladhak et al.
+ 8. **MTEB:** The Massive Text Embedding Benchmark, covering a broad range of embedding tasks and datasets across many languages.
+   - [MTEB: Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb) by Muennighoff et al.
+ 9. **xP3:** A multilingual mixture of prompts and datasets spanning dozens of languages, used to train the BLOOMZ and mT0 models.
+   - [Crosslingual Generalization through Multitask Finetuning](https://huggingface.co/datasets/bigscience/xP3) by Muennighoff et al.
+ 10. **DiaBLa:** A parallel English-French dataset of spontaneous written dialogues, designed for evaluating machine translation in context.
+   - [DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation](https://paperswithcode.com/dataset/diabla) by Bawden et al.
+
+ - 📚 Dataset Papers with Code
+ 1. [Universal Dependencies](https://paperswithcode.com/dataset/universal-dependencies)
+ 2. [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
+ 3. [The Pile](https://paperswithcode.com/dataset/the-pile)
+ 4. [HumanEval](https://paperswithcode.com/dataset/humaneval)
+ 5. [FLORES-101](https://paperswithcode.com/dataset/flores-101)
+ 6. [CrowS-Pairs](https://paperswithcode.com/dataset/crows-pairs)
+ 7. [WikiLingua](https://paperswithcode.com/dataset/wikilingua)
+ 8. [MTEB](https://paperswithcode.com/dataset/mteb)
+ 9. [xP3](https://paperswithcode.com/dataset/xp3)
+ 10. [DiaBLa](https://paperswithcode.com/dataset/diabla)
+
+ # Deep RL ML Strategy 🧠
+ The AI strategies are:
+ - Language model preparation using human-augmented data and supervised fine-tuning 🤖
+ - Reward model training on a prompts dataset, with multiple models generating data to rank 🎁
+ - Fine-tuning with a reinforcement reward and a distance-distribution regret score 🎯
+ - Proximal Policy Optimization (PPO) fine-tuning 🤝
+ - Variations - preference model pretraining 🤔
+ - Use of ranking datasets with sentiment signals - thumbs up/down and distributions 📊
+ - Online versions that gather live feedback 💬
+ - OpenAI - InstructGPT - humans generate LM training text 🔍
+ - DeepMind - Advantage Actor Critic, Sparrow, GopherCite 🦜
+ - Reward model trained on human preference feedback 🏆
+ For more information on specific techniques and implementations, check out the following resources:
+ - OpenAI's paper on [GPT-3](https://arxiv.org/abs/2005.14165), which details their language model preparation approach
+ - The [Soft Actor-Critic](https://arxiv.org/abs/1801.01290) paper by Haarnoja et al., which describes an off-policy actor-critic algorithm
+ - OpenAI's paper on [reward learning](https://arxiv.org/abs/1810.06580), which explains their approach to training reward models
+ - OpenAI's blog post on [GPT-3's fine-tuning process](https://openai.com/blog/fine-tuning-gpt-3/)
+ """
+
  # An insightful and engaging self-care health care demo
  demo = gr.Blocks()
 
 
@@ -163,13 +311,17 @@ with demo:
  )
 
  with gr.Row():
- generated_txt = gr.Textbox(lines=5, visible=True)
+ generated_txt = gr.Textbox(lines=2, visible=True)
 
  with gr.Row():
- Thoughts = gr.Textbox(lines=10, visible=True)
+ Thoughts = gr.Textbox(lines=4, visible=True)
 
  gen = gr.Button("Discover Health Insights")
 
+ with gr.Row():
+ gr.Markdown(Markdown)
+
+
  gen.click(
  text_generate,
  inputs=[input_prompt, generated_txt],
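The final hunk wires the new `Markdown` string into the Gradio Blocks layout via `gr.Markdown`. Below is a minimal, self-contained sketch of how those pieces fit together. `text_generate` here is a hypothetical stub (the real function is defined earlier in app.py and calls the Bloom backend), and the `input_prompt` label and the `outputs=` list are assumptions, since the diff cuts off before them.

```python
# Minimal sketch of the layout this commit produces; not the full app.py.
import gradio as gr

Markdown = """# 2023 Bloom Spaces
Notes rendered below the generator controls."""

def text_generate(prompt, generated_txt):
    # Stub only: the real app appends Bloom's output to the running text.
    new_text = (generated_txt or "") + prompt
    return new_text, ""

demo = gr.Blocks()
with demo:
    input_prompt = gr.Textbox(label="Health prompt", lines=2)  # label assumed

    with gr.Row():
        generated_txt = gr.Textbox(lines=2, visible=True)

    with gr.Row():
        Thoughts = gr.Textbox(lines=4, visible=True)

    gen = gr.Button("Discover Health Insights")

    with gr.Row():
        gr.Markdown(Markdown)  # renders the long Markdown block added in this commit

    gen.click(
        text_generate,
        inputs=[input_prompt, generated_txt],
        outputs=[generated_txt, Thoughts],  # assumed; the diff ends before outputs
    )

demo.launch()
```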
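The "Big Science - How to get started" notes point to https://huggingface.co/bigscience/bloom. A minimal sketch of generating text with the `transformers` pipeline is shown below; it uses the smaller `bigscience/bloom-560m` checkpoint purely so the example can run locally (the full 176B model needs a hosted endpoint or a multi-GPU cluster), and the prompt is illustrative.

```python
# Sketch: text generation with a small BLOOM variant via the transformers pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = "Three simple self-care habits for better sleep are"
result = generator(prompt, max_new_tokens=50, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])
```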
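The "Deep RL ML Strategy" section lists reward-model training on ranked generations as one step of the RLHF recipe. The sketch below illustrates only the standard pairwise preference loss behind that step, using a tiny made-up bag-of-words scorer in place of a transformer reward head; all names, shapes, and data are illustrative, not the app's or any library's actual implementation.

```python
# Pairwise ranking loss for a reward model: the human-preferred ("chosen") answer
# should score higher than the "rejected" one, via -log(sigmoid(r_chosen - r_rejected)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):
        # Mean-pool token embeddings, then project to a scalar reward.
        pooled = self.embed(token_ids).mean(dim=1)
        return self.score(pooled).squeeze(-1)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: token ids for (prompt + chosen answer) and (prompt + rejected answer).
chosen = torch.randint(0, 1000, (4, 16))
rejected = torch.randint(0, 1000, (4, 16))

for step in range(3):
    r_chosen, r_rejected = model(chosen), model(rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()  # pairwise preference loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss={loss.item():.4f}")

# The trained reward model then scores rollouts during PPO fine-tuning.
```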