dkleczek committed on Jul 21, 2021

Commit

7848bdf

1 Parent(s): 6b1fd70

praying now


This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

.gitattributes +22 -0
README.md +223 -0
added_tokens.json +3 -0
allegro_reviews/config.json +3 -0
allegro_reviews/create_config_allegro.py +6 -0
allegro_reviews/events.out.tfevents.1625481245.t1v-n-5d840006-w-0.20165.3.v2 +3 -0
allegro_reviews/events.out.tfevents.1625482183.t1v-n-5d840006-w-0.22476.3.v2 +3 -0
allegro_reviews/events.out.tfevents.1625482418.t1v-n-5d840006-w-0.24291.3.v2 +3 -0
allegro_reviews/tokenizer.json +3 -0
allegro_reviews/train_tokenizer_allegro.py +26 -0
ckpt-7000/config.json +3 -0
ckpt-7000/flax_model.msgpack +3 -0
ckpt-7000/opt_state.msgpack +3 -0
ckpt-7000/training_state.json +3 -0
config.json +3 -0
convert_to_pytorch.py +5 -0
create_config.py +6 -0
events.out.tfevents.1625408122.t1v-n-5d840006-w-0.4909.3.v2 +3 -0
events.out.tfevents.1625465634.t1v-n-5d840006-w-0.10317.3.v2 +3 -0
events.out.tfevents.1625468593.t1v-n-5d840006-w-0.12620.3.v2 +3 -0
events.out.tfevents.1625474538.t1v-n-5d840006-w-0.15018.3.v2 +3 -0
events.out.tfevents.1625488422.t1v-n-5d840006-w-0.26135.3.v2 +3 -0
events.out.tfevents.1625560105.t1v-n-5d840006-w-0.32054.3.v2 +3 -0
events.out.tfevents.1625561792.t1v-n-5d840006-w-0.33847.3.v2 +3 -0
events.out.tfevents.1625563613.t1v-n-5d840006-w-0.39089.3.v2 +3 -0
events.out.tfevents.1625645925.t1v-n-5d840006-w-0.21118.3.v2 +3 -0
events.out.tfevents.1625646523.t1v-n-5d840006-w-0.24030.3.v2 +3 -0
events.out.tfevents.1625648517.t1v-n-5d840006-w-0.3756.3.v2 +3 -0
events.out.tfevents.1625652835.t1v-n-5d840006-w-0.5744.3.v2 +3 -0
events.out.tfevents.1625653275.t1v-n-5d840006-w-0.7412.3.v2 +3 -0
events.out.tfevents.1625829811.t1v-n-5d840006-w-0.18706.3.v2 +3 -0
events.out.tfevents.1625845134.t1v-n-5d840006-w-0.23366.3.v2 +3 -0
events.out.tfevents.1625848627.t1v-n-5d840006-w-0.26741.3.v2 +3 -0
events.out.tfevents.1625850120.t1v-n-5d840006-w-0.28732.3.v2 +3 -0
events.out.tfevents.1625850884.t1v-n-5d840006-w-0.30623.3.v2 +3 -0
events.out.tfevents.1625862814.t1v-n-5d840006-w-0.33177.3.v2 +3 -0
events.out.tfevents.1625886911.t1v-n-5d840006-w-0.22644.3.v2 +3 -0
events.out.tfevents.1626080463.t1v-n-5d840006-w-0.102926.3.v2 +3 -0
events.out.tfevents.1626087582.t1v-n-5d840006-w-0.107030.3.v2 +3 -0
events.out.tfevents.1626100637.t1v-n-5d840006-w-0.124085.3.v2 +3 -0
events.out.tfevents.1626269397.t1v-n-5d840006-w-0.280196.3.v2 +3 -0
events.out.tfevents.1626412410.t1v-n-5d840006-w-0.404523.3.v2 +3 -0
flax_model.msgpack +3 -0
gender_bias.jpeg +0 -0
hate_by_ethnicity.png +0 -0
hate_by_gender.png +0 -0
merges.txt +3 -0
papuGaPT2_bias_analysis.ipynb +0 -0
papuGaPT2_text_generation.ipynb +1051 -0
pretrain_model.sh +21 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1,22 @@

+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.log filter=lfs diff=lfs merge=lfs -text
+*.wandb filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+*.txt filter=lfs diff=lfs merge=lfs -text
+*.yaml filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
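
These patterns explain why some files later in this diff appear as three-line Git LFS pointers while others show their full contents. A minimal sketch of the idea, assuming plain basename globbing is a close enough approximation of gitattributes matching for these simple patterns; `LFS_PATTERNS` (a subset of the list above) and `is_lfs_tracked` are hypothetical names used only for illustration:

```python
from fnmatch import fnmatchcase
from posixpath import basename

# A subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = ["*.json", "*.txt", "*.msgpack", "*tfevents*"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate gitattributes matching: test the file name against each glob."""
    name = basename(path)
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


for path in ["config.json", "merges.txt", "create_config.py", "README.md",
             "allegro_reviews/tokenizer.json"]:
    kind = "LFS pointer" if is_lfs_tracked(path) else "regular file"
    print(f"{path}: {kind}")
```

Real gitattributes matching has additional rules (for example for patterns containing slashes), so this only approximates the behavior for the slash-free globs used here.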

README.md ADDED Viewed

	@@ -0,0 +1,223 @@

+---
+language: pl
+tags:
+- text-generation
+widget:
+- text: "Najsmaczniejszy polski owoc to"
+---
+# papuGaPT2 - Polish GPT2 language model
+[GPT2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) was released in 2019 and surprised many with its text generation capability. However, until very recently there was no strong text generation model for the Polish language, which limited the research opportunities for Polish NLP practitioners. With the release of this model, we hope to enable such research.
+Our model follows the standard GPT2 architecture and training approach. We are using a causal language modeling (CLM) objective, which means that the model is trained to predict the next word (token) in a sequence of words (tokens).
+## Datasets
+We used the Polish subset of the [multilingual Oscar corpus](https://www.aclweb.org/anthology/2020.acl-main.156) to train the model in a self-supervised fashion.
+```
+from datasets import load_dataset
+dataset = load_dataset('oscar', 'unshuffled_deduplicated_pl')
+```
+## Intended uses & limitations
+The raw model can be used for text generation or fine-tuned for a downstream task. The model has been trained on data scraped from the web, and can generate text containing intense violence, sexual situations, coarse language and drug use. It also reflects the biases from the dataset (see below for more details). These limitations are likely to transfer to the fine-tuned models as well. At this stage, we do not recommend using the model beyond research.
+## Bias Analysis
+There are many sources of bias embedded in the model, and we urge you to keep this in mind while exploring its capabilities. We have started a very basic analysis of bias that you can see in [this notebook](https://huggingface.co/flax-community/papuGaPT2/blob/main/papuGaPT2_bias_analysis.ipynb).
+### Gender Bias
+As an example, we generated 50 texts starting with prompts "She/He works as". The image below presents the resulting word clouds of female/male professions. The most salient terms for male professions are: teacher, sales representative, programmer. The most salient terms for female professions are: model, caregiver, receptionist, waitress.
+![gender bias](https://huggingface.co/flax-community/papuGaPT2/raw/main/gender_bias.jpeg)
+### Ethnicity/Nationality/Gender Bias
+We generated 1000 texts to assess bias across ethnicity, nationality and gender vectors. We created prompts with the following scheme:
+* Person - in Polish this is a single word that differentiates both nationality/ethnicity and gender. We assessed the following 5 nationalities/ethnicities: German, Romani, Jewish, Ukrainian, Neutral. The neutral group used generic pronouns ("He/She").
+* Topic - we used 5 different topics:
+  * random act: *entered home*
+  * said: *said*
+  * works as: *works as*
+  * intent: Polish *niech* which combined with *he* would roughly translate to *let him ...*
+  * define: *is*
+Each combination of 5 nationalities x 2 genders x 5 topics had 20 generated texts.
+We used a model trained on [Polish Hate Speech corpus](https://huggingface.co/datasets/hate_speech_pl) to obtain the probability that each generated text contains hate speech. To avoid leakage, we removed the first word identifying the nationality/ethnicity and gender from the generated text before running the hate speech detector.
+The following tables and charts demonstrate the intensity of hate speech associated with the generated texts. There is a very clear effect where each of the ethnicities/nationalities scores higher than the neutral baseline.
+![hate score by ethnicity](https://huggingface.co/flax-community/papuGaPT2/raw/main/hate_by_ethnicity.png)
+Looking at the gender dimension, we see a higher hate score associated with males than with females.
+![hate score by gender](https://huggingface.co/flax-community/papuGaPT2/raw/main/hate_by_gender.png)
+We don't recommend using the GPT2 model beyond research unless a clear mitigation for the biases is provided.
+## Training procedure
+### Training scripts
+We used the [causal language modeling script for Flax](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_clm_flax.py). We would like to thank the authors of that script as it allowed us to complete this training in a very short time!
+### Preprocessing and Training Details
+The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50,257. The inputs are sequences of 512 consecutive tokens.
+We have trained the model on a single TPUv3 VM, and due to unforeseen events the training run was split in 3 parts, each time resetting from the final checkpoint with a new optimizer state:
+1. LR 1e-3, bs 64, linear schedule with warmup for 1000 steps, 10 epochs, stopped after 70,000 steps at eval loss 3.206 and perplexity 24.68
+2. LR 3e-4, bs 64, linear schedule with warmup for 5000 steps, 7 epochs, stopped after 77,000 steps at eval loss 3.116 and perplexity 22.55
+3. LR 2e-4, bs 64, linear schedule with warmup for 5000 steps, 3 epochs, stopped after 91,000 steps at eval loss 3.082 and perplexity 21.79
+## Evaluation results
+We trained the model on 95% of the dataset and evaluated both loss and perplexity on 5% of the dataset. The final checkpoint evaluation resulted in:
+* Evaluation loss: 3.082
+* Perplexity: 21.79
+## How to use
+You can use the model either directly for text generation (see example below), by extracting features, or for further fine-tuning. We have prepared a notebook with text generation examples [here](https://huggingface.co/flax-community/papuGaPT2/blob/main/papuGaPT2_text_generation.ipynb) including different decoding methods, bad words suppression, few- and zero-shot learning demonstrations.
+### Text generation
+Let's first start with the text-generation pipeline. When prompting for the best Polish poet, it comes up with a pretty reasonable text, highlighting one of the most famous Polish poets, Adam Mickiewicz.
+```python
+from transformers import pipeline, set_seed
+generator = pipeline('text-generation', model='flax-community/papuGaPT2')
+set_seed(42)
+generator('Największym polskim poetą był')
+>>> [{'generated_text': 'Największym polskim poetą był Adam Mickiewicz - uważany za jednego z dwóch geniuszów języka polskiego. "Pan Tadeusz" był jednym z najpopularniejszych dzieł w historii Polski. W 1801 został wystawiony publicznie w Teatrze Wilama Horzycy. Pod jego'}]
+```
+The pipeline uses the `model.generate()` method in the background. In [our notebook](https://huggingface.co/flax-community/papuGaPT2/blob/main/papuGaPT2_text_generation.ipynb) we demonstrate different decoding methods we can use with this method, including greedy search, beam search, sampling, temperature scaling, top-k and top-p sampling. As an example, the snippet below samples among the 50 most probable tokens at each step (top-k) and among the tokens that jointly represent 95% of the probability distribution (top-p). It also returns 3 output sequences.
+```python
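+from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
+# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for causal language models
+model = AutoModelForCausalLM.from_pretrained('flax-community/papuGaPT2')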
+tokenizer = AutoTokenizer.from_pretrained('flax-community/papuGaPT2')
+set_seed(42) # reproducibility
+input_ids = tokenizer.encode('Największym polskim poetą był', return_tensors='pt')
+sample_outputs = model.generate(
+    input_ids,
+    do_sample=True,
+    max_length=50,
+    top_k=50,
+    top_p=0.95,
+    num_return_sequences=3
+)
+print("Output:\n" + 100 * '-')
+for i, sample_output in enumerate(sample_outputs):
+  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
+>>> Output:
+>>> ----------------------------------------------------------------------------------------------------
+>>> 0: Największym polskim poetą był Roman Ingarden. Na jego wiersze i piosenki oddziaływały jego zamiłowanie do przyrody i przyrody. Dlatego też jako poeta w czasie pracy nad utworami i wierszami z tych wierszy, a następnie z poezji własnej - pisał
+>>> 1: Największym polskim poetą był Julian Przyboś, którego poematem „Wierszyki dla dzieci”.
+>>> W okresie międzywojennym, pod hasłem „Papież i nie tylko” Polska, jak większość krajów europejskich, była państwem faszystowskim.
+>>> Prócz
+>>> 2: Największym polskim poetą był Bolesław Leśmian, który był jego tłumaczem, a jego poezja tłumaczyła na kilkanaście języków.
+>>> W 1895 roku nakładem krakowskiego wydania "Scientio" ukazała się w języku polskim powieść W krainie kangurów
+```
+### Avoiding Bad Words
+You may want to prevent certain words from occurring in the generated text. To avoid displaying really bad words in the notebook, let's pretend that we don't like certain types of music to be advertised by our model. The prompt says: *my favorite type of music is*.
+```python
+input_ids = tokenizer.encode('Mój ulubiony gatunek muzyki to', return_tensors='pt')
+bad_words = [' disco', ' rock', ' pop', ' soul', ' reggae', ' hip-hop']
+bad_word_ids = []
+for bad_word in bad_words:
+  ids = tokenizer(bad_word).input_ids
+  bad_word_ids.append(ids)
+sample_outputs = model.generate(
+    input_ids,
+    do_sample=True,
+    max_length=20,
+    top_k=50,
+    top_p=0.95,
+    num_return_sequences=5,
+    bad_words_ids=bad_word_ids
+)
+print("Output:\n" + 100 * '-')
+for i, sample_output in enumerate(sample_outputs):
+  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
+>>> Output:
+>>> ----------------------------------------------------------------------------------------------------
+>>> 0: Mój ulubiony gatunek muzyki to muzyka klasyczna. Nie wiem, czy to kwestia sposobu, w jaki gramy,
+>>> 1: Mój ulubiony gatunek muzyki to reggea. Zachwycają mnie piosenki i piosenki muzyczne o ducho
+>>> 2: Mój ulubiony gatunek muzyki to rockabilly, ale nie lubię też punka. Moim ulubionym gatunkiem
+>>> 3: Mój ulubiony gatunek muzyki to rap, ale to raczej się nie zdarza w miejscach, gdzie nie chodzi
+>>> 4: Mój ulubiony gatunek muzyki to metal aranżeje nie mam pojęcia co mam robić. Co roku,
+```
+Ok, it seems this worked: we can see *classical music, rap, metal* among the outputs. Interestingly, *reggae* found a way through via a misspelling *reggea*. Take it as a caution to be careful with curating your bad word lists!
+### Few Shot Learning
+Let's now see whether our model is able to pick up a training signal directly from a prompt, without any fine-tuning. This approach was made really popular with GPT3, and while our model is definitely less powerful, maybe it can still show some skills! If you'd like to explore this topic in more depth, check out [the following article](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api) which we used as reference.
+```python
+prompt = """Tekst: "Nienawidzę smerfów!"
+Sentyment: Negatywny
+###
+Tekst: "Jaki piękny dzień 👍"
+Sentyment: Pozytywny
+###
+Tekst: "Jutro idę do kina"
+Sentyment: Neutralny
+###
+Tekst: "Ten przepis jest świetny!"
+Sentyment:"""
+res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5)
+# print the predicted label from each returned sequence
+for x in res:
+  print(x['generated_text'].split(' ')[1])
+>>> Pozytywny
+>>> Pozytywny
+>>> Pozytywny
+>>> Pozytywny
+>>> Pozytywny
+```
+It looks like our model is able to pick up some signal from the prompt. Be careful though, this capability is definitely not mature and may result in spurious or biased responses.
+### Zero-Shot Inference
+Large language models are known to store a lot of knowledge in their parameters. In the example below, we can see that our model has learned the date of an important event in Polish history, the battle of Grunwald.
+```python
+prompt = "Bitwa pod Grunwaldem miała miejsce w roku"
+input_ids = tokenizer.encode(prompt, return_tensors='pt')
+# activate beam search and early_stopping
+beam_outputs = model.generate(
+    input_ids,
+    max_length=20,
+    num_beams=5,
+    early_stopping=True,
+    num_return_sequences=3
+)
+print("Output:\n" + 100 * '-')
+for i, sample_output in enumerate(beam_outputs):
+  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
+>>> Output:
+>>> ----------------------------------------------------------------------------------------------------
+>>> 0: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie pod
+>>> 1: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie pokona
+>>> 2: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie,
+```
+## BibTeX entry and citation info
+```bibtex
+@misc{papuGaPT2,
+  title={papuGaPT2 - Polish GPT2 language model},
+  url={https://huggingface.co/flax-community/papuGaPT2},
+  author={Wojczulis, Michał and Kłeczek, Dariusz},
+  year={2021}
+}
+```
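
The loss and perplexity pairs reported in the README's training and evaluation sections are consistent with perplexity being the exponential of the evaluation loss. The sketch below recomputes that relationship, assuming the loss is a mean per-token cross-entropy in nats (the README does not state this explicitly); the `reported` list simply copies the three figures quoted above:

```python
import math

# (eval loss, perplexity) pairs as reported for the three training stages
reported = [(3.206, 24.68), (3.116, 22.55), (3.082, 21.79)]

for loss, perplexity in reported:
    recomputed = math.exp(loss)
    # Small differences are expected because the reported loss is rounded.
    print(f"loss={loss:.3f}  reported={perplexity:.2f}  exp(loss)={recomputed:.2f}")
```

A perplexity of roughly 22 can be read as the model being, on average, about as uncertain as a uniform choice among 22 tokens at each step.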

added_tokens.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f73effd45f282fdecbce3d5bda192b346d1e2e5dc024d4493ff276656001a5b6
+size 24
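
The three added lines above are a Git LFS pointer, not the contents of `added_tokens.json`: the repository stores only a spec version, a SHA-256 object id, and the size in bytes. A minimal sketch of reading such a pointer, where `parse_lfs_pointer` is a hypothetical helper written for this illustration and not part of any Git LFS tooling:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:f73effd45f282fdecbce3d5bda192b346d1e2e5dc024d4493ff276656001a5b6
size 24
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size_bytes"])
```

The same reading applies to every other pointer in this commit; for example, the `size` field is how the model weights can be seen to be about 498 MB without downloading them.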

allegro_reviews/config.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1ace5aef92f7880ccb5fd0e7c5f65556d6914dbd134fa1672b46a0533225c036
+size 811

allegro_reviews/create_config_allegro.py ADDED Viewed

	@@ -0,0 +1,6 @@

+from transformers import GPT2Config
+model_dir = "."  # ${MODEL_DIR}
+config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
+config.save_pretrained(model_dir)

allegro_reviews/events.out.tfevents.1625481245.t1v-n-5d840006-w-0.20165.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:25a5b7d6e069647cf953e1684211cf4b87049ae4e05610e37b1047966bd36fcc
+size 40

allegro_reviews/events.out.tfevents.1625482183.t1v-n-5d840006-w-0.22476.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ee76cbdc38f6bec33ee28c5225264d95b8d46c0a2941ce59fbe8893f798a3de8
+size 40

allegro_reviews/events.out.tfevents.1625482418.t1v-n-5d840006-w-0.24291.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d20520f97baa97ebd08bbf9f66afb294613261a1661dbd9bf18ca39b4258e03d
+size 40

allegro_reviews/tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c1735fd67aa6471a45e6baf09a106fdd7545046f3a805b0820a5d5fcb34ccf76
+size 1515050

allegro_reviews/train_tokenizer_allegro.py ADDED Viewed

	@@ -0,0 +1,26 @@

+from datasets import load_dataset
+from tokenizers import trainers, Tokenizer, normalizers, ByteLevelBPETokenizer
+model_dir = "."  # ${MODEL_DIR}
+# load dataset
+dataset = load_dataset("allegro_reviews", split="train")
+# Instantiate tokenizer
+tokenizer = ByteLevelBPETokenizer()
+def batch_iterator(batch_size=1000):
+    for i in range(0, len(dataset), batch_size):
+        yield dataset[i: i + batch_size]["text"]
+# Customized training
+tokenizer.train_from_iterator(batch_iterator(), vocab_size=50265, min_frequency=2, special_tokens=[
+    "<s>",
+    "<pad>",
+    "</s>",
+    "<unk>",
+    "<mask>",
+])
+# Save files to disk
+tokenizer.save(f"{model_dir}/tokenizer.json")
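
The `batch_iterator` generator in the script above feeds the tokenizer trainer in chunks of 1,000 texts rather than one giant list. The chunking itself can be sketched with the standard library alone; here `toy_records` is a hypothetical stand-in for the dataset (a plain list of dicts, whereas slicing a `datasets` object returns a dict of columns), so this illustrates the slicing pattern and is not a drop-in replacement:

```python
def batch_iterator(records, batch_size=1000, column="text"):
    """Yield lists of `column` values, `batch_size` records at a time."""
    for start in range(0, len(records), batch_size):
        yield [row[column] for row in records[start:start + batch_size]]


# Hypothetical stand-in for the review dataset: 2,500 tiny records.
toy_records = [{"text": f"opinia {i}"} for i in range(2500)]

batch_sizes = [len(batch) for batch in batch_iterator(toy_records)]
print(batch_sizes)  # [1000, 1000, 500]
```

The last batch is shorter because Python slicing stops at the end of the sequence, which is why the original loop needs no special case for the remainder.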

ckpt-7000/config.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2639ebf1ac7da23195fad0d3961b5051a0d21058e49211160e5ef0aaac020621
+size 864

ckpt-7000/flax_model.msgpack ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d426922657592daf71b1b3b88dc9099cde4696dd4bc9b73556888b869decb784
+size 497764120

ckpt-7000/opt_state.msgpack ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:186303788c88a7a93fdbcd9f97729a9041ebc27bcae5d66f5a60efd41c249912
+size 995528480

ckpt-7000/training_state.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:72047b995289dd00fe7fd487482e84c2640772ccda4a8dd248fa4dcb041f71eb
+size 14

config.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2639ebf1ac7da23195fad0d3961b5051a0d21058e49211160e5ef0aaac020621
+size 864

convert_to_pytorch.py ADDED Viewed

	@@ -0,0 +1,5 @@

+#!/usr/bin/env python3
+from transformers import GPT2LMHeadModel
+model = GPT2LMHeadModel.from_pretrained("./", from_flax=True)
+model.save_pretrained("./")

create_config.py ADDED Viewed

	@@ -0,0 +1,6 @@

+from transformers import GPT2Config
+model_dir = "."  # ${MODEL_DIR}
+config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
+config.save_pretrained(model_dir)

events.out.tfevents.1625408122.t1v-n-5d840006-w-0.4909.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4f3d64a34ca00c3be72105da0664557fff01b50fc812802428144cebca87b35
+size 40

events.out.tfevents.1625465634.t1v-n-5d840006-w-0.10317.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4f8ebd5f1ae292f7e94936111697f725be49810a334c1913a7d4fa8520b588dc
+size 61182

events.out.tfevents.1625468593.t1v-n-5d840006-w-0.12620.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1d4bb7621dd88a65736f55b305b26ebe509542fe9d277208ecf7b196c30b9a38
+size 281684

events.out.tfevents.1625474538.t1v-n-5d840006-w-0.15018.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:973ce04e1a3c163e06174b81a01067ea2564aae7d7d23128f83236e096dcde6b
+size 447251

events.out.tfevents.1625488422.t1v-n-5d840006-w-0.26135.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f1e4f47367e373d6e85d822a0489901f7914fdb74f55226fdf9660e27d7dbb70
+size 40

events.out.tfevents.1625560105.t1v-n-5d840006-w-0.32054.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:145110e582bd6ffa469bd70a6994e8fb7607eef00b32bac499277125e0c76f08
+size 147065

events.out.tfevents.1625561792.t1v-n-5d840006-w-0.33847.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:33764cac9e2b30a832ce9801b7a442440556a8ffd4944e94b65c8499dda6b5c9
+size 147065

events.out.tfevents.1625563613.t1v-n-5d840006-w-0.39089.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b63c776e86a45848fc976c6ac2978493911c7582801e84fb8741d7d54b54c789
+size 9512225

events.out.tfevents.1625645925.t1v-n-5d840006-w-0.21118.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4a9d6cc813f0ab93d9a607b4959ad92e69e211feae5dd0ea6541ae546e5fe99
+size 40

events.out.tfevents.1625646523.t1v-n-5d840006-w-0.24030.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d018c6ab315bf844d970f72e79fc335650adcdaf67093c5079fa6f802ccb2198
+size 40

events.out.tfevents.1625648517.t1v-n-5d840006-w-0.3756.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0b27d199598f0cda4401d0990d1ea9ce3aef0865c8ce08b57a9f2f3c4ed4c780
+size 40

events.out.tfevents.1625652835.t1v-n-5d840006-w-0.5744.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:51252162ec50163993bc7c712a4c9f79bb20e036bbca188cbec4181d2a33b0ee
+size 40

events.out.tfevents.1625653275.t1v-n-5d840006-w-0.7412.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8d5b8a36445ca8ac2e698b10725deca9add93bc732622d141b9a4ed5c2a8d945
+size 17423021

events.out.tfevents.1625829811.t1v-n-5d840006-w-0.18706.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9553b7cf078fa9afe1364d9edd4c482ae47089c72d79438382efd71e1c7e1d80
+size 220906

events.out.tfevents.1625845134.t1v-n-5d840006-w-0.23366.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d05f73015c7d3fcef29fa5a3783fa71061e8f9058326d44181aab1e9499818f5
+size 180

events.out.tfevents.1625848627.t1v-n-5d840006-w-0.26741.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7348cd7908eedb0f28fad1858fca9100d72f314ffdab2df7d5ddb14612d54910
+size 180

events.out.tfevents.1625850120.t1v-n-5d840006-w-0.28732.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c4da6dbdd6b6875786a92d3c57d533a99ffb94a070dde23c30df16140b8bcab8
+size 40

events.out.tfevents.1625850884.t1v-n-5d840006-w-0.30623.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:706b0e4a11361ad090a5255c0cbdb33fcb9acadfac53218442717c938279aefa
+size 1029349

events.out.tfevents.1625862814.t1v-n-5d840006-w-0.33177.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7492648e3b1447fcf7d888343ff46e01fd3e13bd509d7bc9edc3ae9e8d12ced3
+size 514496

events.out.tfevents.1625886911.t1v-n-5d840006-w-0.22644.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:295d632404865620140afe6b59ae69790e38090faddf0b8f823322037d68814f
+size 8313281

events.out.tfevents.1626080463.t1v-n-5d840006-w-0.102926.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:18a9e63d81d2a4da3bbf9ce3622d6024691dc0ffe3e427bb28f64fe070157d69
+size 40

events.out.tfevents.1626087582.t1v-n-5d840006-w-0.107030.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:76f4ca950c81e17eba462c306a30d8375b137702fdff33d20af833fbf2cd9842
+size 1029207

events.out.tfevents.1626100637.t1v-n-5d840006-w-0.124085.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8382ca0e5eb6ced66cf9c2aa3c00157ef0f8bd8c199e15bbddde539a14789a71
+size 11443277

events.out.tfevents.1626269397.t1v-n-5d840006-w-0.280196.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ed5103f11503f393b5bb5609f2052a4e4cd95a06b500a2f1e7eaa5d86235a741
+size 13529845

events.out.tfevents.1626412410.t1v-n-5d840006-w-0.404523.3.v2 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ce17d9c1158c87ad9958e3c38db67cfecef07098f86568962a1456c33417bba3
+size 13529845
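
The event file names appear to encode when each logging run started, which gives a rough timeline of the restarts mentioned in the README. The sketch below assumes the common TensorBoard layout `events.out.tfevents.<epoch seconds>.<host>.<pid>...`; that layout is an assumption about these files rather than something the commit states, and `event_file_start` is a hypothetical helper:

```python
from datetime import datetime, timezone


def event_file_start(filename: str) -> datetime:
    """Return the UTC time encoded in a TensorBoard event file name."""
    parts = filename.split(".")
    # Assumed layout: events.out.tfevents.<epoch seconds>.<host>.<pid>...
    epoch_seconds = int(parts[3])
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


first = event_file_start("events.out.tfevents.1625408122.t1v-n-5d840006-w-0.4909.3.v2")
last = event_file_start("events.out.tfevents.1626412410.t1v-n-5d840006-w-0.404523.3.v2")

print(first.isoformat(), last.isoformat())
print((last - first).days, "full days between the first and last log file")
```

Under that assumption, the first and last root-level event files in this commit were created in early and mid July 2021 respectively, shortly before the commit date.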

flax_model.msgpack ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8bdc00b2ca54a7c2a6d99e950fcb45f81ccdfc20652a6d5020643a9bc37ff77d
+size 497764120

gender_bias.jpeg ADDED Viewed

hate_by_ethnicity.png ADDED Viewed

hate_by_gender.png ADDED Viewed

merges.txt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:20832466756a988386123195ca6a4d1ecf92f0c1ff346872412fa54a8a2cb179
+size 546522

papuGaPT2_bias_analysis.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff

papuGaPT2_text_generation.ipynb ADDED Viewed

	@@ -0,0 +1,1051 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "papuGaPT2_text_generation.ipynb",
+      "provenance": [],
+      "collapsed_sections": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "-jlP8InZ6FuU"
+      },
+      "source": [
+        "# Examples of generating text with papuGaPT2 - Polish GPT2 language model\n",
+        "\n",
+        "This notebook intends to show some examples of generating text with the Polish GPT2 model, [papuGaPT2](https://huggingface.co/flax-community/papuGaPT2)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "zNXhY6w7oAY7",
+        "outputId": "229305ac-1892-4603-9698-0dcdfada1ce2"
+      },
+      "source": [
+        "!pip install transformers -qq"
+      ],
+      "execution_count": 1,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "\u001b[K     |████████████████████████████████| 2.5MB 5.0MB/s \n",
+            "\u001b[K     |████████████████████████████████| 901kB 35.2MB/s \n",
+            "\u001b[K     |████████████████████████████████| 3.3MB 38.3MB/s \n",
+            "\u001b[?25h"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "d_XIbTMDoLeN"
+      },
+      "source": [
+        "from transformers import pipeline, set_seed\n",
+        "from transformers import AutoTokenizer, AutoModelWithLMHead"
+      ],
+      "execution_count": 20,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "o47RrqSU-hnS",
+        "outputId": "081a2675-2b8d-4832-c9fb-6becc1e52c13"
+      },
+      "source": [
+        "model = AutoModelWithLMHead.from_pretrained('flax-community/papuGaPT2')\n",
+        "tokenizer = AutoTokenizer.from_pretrained('flax-community/papuGaPT2')\n",
+        "set_seed(42) # reproducibility"
+      ],
+      "execution_count": 21,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.\n",
+            "  FutureWarning,\n",
+            "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
+          ],
+          "name": "stderr"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9DjG3LKELhAz"
+      },
+      "source": [
+        "## Text Generation\n",
+        "\n",
+        "Let's first start with the text-generation pipeline. When prompting for the best Polish poet, it comes up with a pretty reasonable text, highlighting one of the most famous Polish poets, Adam Mickiewicz. \n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "s3mDGuxGoOA2",
+        "outputId": "0b58cd6d-2cac-44f8-81d6-bf9a5790b217"
+      },
+      "source": [
+        "generator = pipeline('text-generation', model='flax-community/papuGaPT2')"
+      ],
+      "execution_count": 22,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
+          ],
+          "name": "stderr"
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "iTPH2S-rL_xn",
+        "outputId": "3a2165ee-348f-4c6e-eb5c-2cd92435357d"
+      },
+      "source": [
+        "generator('Największym polskim poetą był')"
+      ],
+      "execution_count": 40,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "[{'generated_text': 'Największym polskim poetą był Adam Mickiewicz - uważany za jednego z dwóch geniuszów języka polskiego. \"Pan Tadeusz\" był jednym z najpopularniejszych dzieł w historii Polski. W 1801 został wystawiony publicznie w Teatrze Wilama Horzycy. Pod jego'}]"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "execution_count": 40
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xTZtviLSLsYf"
+      },
+      "source": [
+        "Let's now explore the text generation/decoding method in more detail. The following code and examples were adapted from Patrick von Platen's [excellent article](https://huggingface.co/blog/how-to-generate).\n",
+        "\n",
+        "\n",
+        "#### Greedy Search\n",
+        "\n",
+        "In this approach, we pick the most probable token at each step during the generation. As we can see, this results in a lot of repetitions. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "A8sspEnO-X6W",
+        "outputId": "68f3ba22-491f-4776-f384-f98886876352"
+      },
+      "source": [
+        "# encode context the generation is conditioned on\n",
+        "input_ids = tokenizer.encode('Największym polskim poetą był', return_tensors='pt')\n",
+        "\n",
+        "# generate text until the output length (which includes the context length) reaches 50\n",
+        "greedy_output = model.generate(input_ids, max_length=50)\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 25,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był Julian Tuwim, który w latach 60. i 70. był jednym z najbardziej znanych poetów. W latach 70. i 80. był jednym z najbardziej znanych poetów w Polsce.\n",
+            "W latach 70. i 80. Tuwi\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ADNi9ehHOIJy"
+      },
+      "source": [
+        "#### Beam Search\n",
+        "\n",
+        "Beam search keeps the `num_beams` most probable partial sequences at each step and finally returns the one with the highest overall probability. It can therefore find sequences that greedy search misses, although it is not guaranteed to find the most probable one. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "hUmnyzJU-fXR",
+        "outputId": "63bf0414-8854-49bc-e137-c8fed8746c81"
+      },
+      "source": [
+        "# activate beam search and early_stopping\n",
+        "beam_output = model.generate(\n",
+        "    input_ids, \n",
+        "    max_length=50, \n",
+        "    num_beams=5, \n",
+        "    early_stopping=True\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(beam_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n",
+            "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.\n",
+            "To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)\n",
+            "  return torch.floor_divide(self, other)\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i dorosłych, a także dla dzieci i młodzieży, m.in. dla Jana Brzechwy, Juliana Tuwima, Jana Brzechwy, Jana Brzechwy i wielu innych.\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jSVLNwCWOjuC"
+      },
+      "source": [
+        "#### N-gram repetitions\n",
+        "\n",
+        "We can prevent the generated text from repeating n-grams by setting `no_repeat_ngram_size`. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "2QeDJh5R_5bo",
+        "outputId": "a0c530ef-adcc-4b78-b91f-a051742e0f10"
+      },
+      "source": [
+        "# set no_repeat_ngram_size to 2\n",
+        "beam_output = model.generate(\n",
+        "    input_ids, \n",
+        "    max_length=50, \n",
+        "    num_beams=5, \n",
+        "    no_repeat_ngram_size=2, \n",
+        "    early_stopping=True\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(beam_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 27,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Bolesława Leśmiana,\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "C1QtiC5HOsOn"
+      },
+      "source": [
+        "#### Multiple Output Sentences\n",
+        "\n",
+        "We can ask the model to generate several output sentences. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "ELSiU-nEAHY6",
+        "outputId": "aa1416b4-2cdd-4c6e-c5bb-775c194e811b"
+      },
+      "source": [
+        "# set num_return_sequences > 1\n",
+        "beam_outputs = model.generate(\n",
+        "    input_ids, \n",
+        "    max_length=50, \n",
+        "    num_beams=5, \n",
+        "    no_repeat_ngram_size=2, \n",
+        "    num_return_sequences=5, \n",
+        "    early_stopping=True\n",
+        ")\n",
+        "\n",
+        "# now we have 5 output sequences\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "for i, beam_output in enumerate(beam_outputs):\n",
+        "  print(\"{}: {}\".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))"
+      ],
+      "execution_count": 28,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "0: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Bolesława Leśmiana,\n",
+            "1: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Jana Lechonia\n",
+            "2: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Czesława Janczarskiego\n",
+            "3: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Czesława Miłosza,\n",
+            "4: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej i wielu innych.\n",
+            "\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "SkAV930BO3Zz"
+      },
+      "source": [
+        "#### Sampling\n",
+        "\n",
+        "To produce more interesting text, instead of always picking the most likely token, we can sample the next token from the probability distribution predicted by our model. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "4Yw7ZJi0AOa0",
+        "outputId": "b249b80a-8108-4e06-dbfe-f1749862c6fd"
+      },
+      "source": [
+        "# activate sampling and deactivate top_k by setting top_k sampling to 0\n",
+        "sample_output = model.generate(\n",
+        "    input_ids, \n",
+        "    do_sample=True, \n",
+        "    max_length=50, \n",
+        "    top_k=0\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 29,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był Paweł Jasienica, postać barwna, pełna temperamentów, jakże zacna kobieta, Brat naszego serca dziś utarte cyruliki, kulon, Kościuszko Juliusz Polski Prowuaja Kozacyczcyca\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "h7IlhqK1PGyr"
+      },
+      "source": [
+        "#### Temperature scaling\n",
+        "\n",
+        "If the model samples a very low-probability token, the result can be gibberish. We can reduce this risk by sharpening the distribution with a temperature below 1. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "E-_lundzAfSc",
+        "outputId": "8ef81b22-caa4-40a1-e935-aec0146d7ea5"
+      },
+      "source": [
+        "# use temperature to decrease the sensitivity to low probability candidates\n",
+        "sample_output = model.generate(\n",
+        "    input_ids, \n",
+        "    do_sample=True, \n",
+        "    max_length=50, \n",
+        "    top_k=0, \n",
+        "    temperature=0.8\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 31,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był Adam Zagajewski. Zdjęcie poniżej pochodzi z 2010 roku.\n",
+            "W „Gazecie Wyborczej” ukazał się nowy tekst Adama Zagajewskiego. Piszemy w nim o… Bolku i Lolku z „Niedzieli”.\n",
+            "ZW\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Gbe5_Z1kPUlH"
+      },
+      "source": [
+        "#### Top-k Sampling\n",
+        "\n",
+        "We can also ask the model to only pick tokens from the list of k most probable tokens. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "6eMOD-VeAvlR",
+        "outputId": "dd3257ac-713d-471d-e793-3e8dd11b47f3"
+      },
+      "source": [
+        "# set top_k to 50\n",
+        "sample_output = model.generate(\n",
+        "    input_ids, \n",
+        "    do_sample=True, \n",
+        "    max_length=50, \n",
+        "    top_k=50\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 32,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był Stanisław Lem, który zasłynął z antyutopii, a także wielkim poczuciem humoru, wykazując się niezwykłą inteligencją. Poeci o jego twórczości mówią, że jest „żywym malarzem języka polskiego, a jednocześnie\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "UrzIElatPkqW"
+      },
+      "source": [
+        "#### Top-p Sampling\n",
+        "\n",
+        "Rather than picking among the k most probable tokens, we can sample from the smallest set of most probable tokens whose cumulative probability reaches p. This gives text generation more freedom when many tokens are plausible and narrows its focus when only a few options make sense. We can also combine top-k and top-p sampling. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Sk_tAsLcA94W",
+        "outputId": "22b86f18-c43d-4bf0-9ae1-24a970e3ed1a"
+      },
+      "source": [
+        "# deactivate top_k sampling and sample only from 93% most likely words\n",
+        "sample_output = model.generate(\n",
+        "    input_ids, \n",
+        "    do_sample=True, \n",
+        "    max_length=50, \n",
+        "    top_p=0.93, \n",
+        "    top_k=0\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
+      ],
+      "execution_count": 37,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "Największym polskim poetą był sobie Andrzej Poniedzielski, do którego wroc. to jako autor: Adrian Waksmundzki. Powstało 13 utworów poetyckich, przedstawionych w formie prozatorskiej, poetyckiej i scenicznej, jak\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "zo0irbRWBIOH",
+        "outputId": "5d30d98c-5f7e-4392-d9d1-e5dcae91ae57"
+      },
+      "source": [
+        "# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3\n",
+        "sample_outputs = model.generate(\n",
+        "    input_ids,\n",
+        "    do_sample=True, \n",
+        "    max_length=50, \n",
+        "    top_k=50, \n",
+        "    top_p=0.95, \n",
+        "    num_return_sequences=3\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "for i, sample_output in enumerate(sample_outputs):\n",
+        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
+      ],
+      "execution_count": 38,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "0: Największym polskim poetą był Roman Ingarden. Na jego wiersze i piosenki oddziaływały jego zamiłowanie do przyrody i przyrody. Dlatego też jako poeta w czasie pracy nad utworami i wierszami z tych wierszy, a następnie z poezji własnej - pisał\n",
+            "1: Największym polskim poetą był Julian Przyboś, którego poematem „Wierszyki dla dzieci”.\n",
+            "W okresie międzywojennym, pod hasłem „Papież i nie tylko” Polska, jak większość krajów europejskich, była państwem faszystowskim.\n",
+            "Prócz\n",
+            "2: Największym polskim poetą był Bolesław Leśmian, który był jego tłumaczem, a jego poezja tłumaczyła na kilkanaście języków.\n",
+            "W 1895 roku nakładem krakowskiego wydania \"Scientio\" ukazała się w języku polskim powieść W krainie kangurów\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "cO2sDlX0QZ4N"
+      },
+      "source": [
+        "## Avoiding Bad Words\n",
+        "\n",
+        "You may want to prevent certain words from occurring in the generated text. To avoid displaying really bad words in the notebook, let's pretend that we don't want our model to advertise certain types of music. The prompt says: *my favorite type of music is*. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Da2O9jNmQvie",
+        "outputId": "a686c703-377e-4a3d-d557-59e061050ecb"
+      },
+      "source": [
+        "# encode context the generation is conditioned on\n",
+        "input_ids = tokenizer.encode('Mój ulubiony gatunek muzyki to', return_tensors='pt')\n",
+        "\n",
+        "sample_outputs = model.generate(\n",
+        "    input_ids,\n",
+        "    do_sample=True, \n",
+        "    max_length=20, \n",
+        "    top_k=50, \n",
+        "    top_p=0.95, \n",
+        "    num_return_sequences=5\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "for i, sample_output in enumerate(sample_outputs):\n",
+        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
+      ],
+      "execution_count": 49,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "0: Mój ulubiony gatunek muzyki to rock i pop. U nas bardzo, bardzo często króluje rock i pop.\n",
+            "1: Mój ulubiony gatunek muzyki to disco, czyli tango, a od 10.05 także fokstro\n",
+            "2: Mój ulubiony gatunek muzyki to soul i reggae. Kocham hiphop i ska, to są moi\n",
+            "3: Mój ulubiony gatunek muzyki to hip hop i wszelkiego rodzaju metal, głównie industrialne brzmienia (metal,\n",
+            "4: Mój ulubiony gatunek muzyki to oczywiście soul, do dzisiaj pamiętam swój zachwyt nad głosem Damiena Per\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "hFnNWFkSYzOx"
+      },
+      "source": [
+        "Now let's prevent the model from generating text containing these words: *disco, rock, pop, soul, reggae, hip-hop*. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "fcnODcEeBkGr"
+      },
+      "source": [
+        "bad_words = [' disco', ' rock', ' pop', ' soul', ' reggae', ' hip-hop']\n",
+        "bad_word_ids = []\n",
+        "for bad_word in bad_words: \n",
+        "  ids = tokenizer(bad_word).input_ids\n",
+        "  bad_word_ids.append(ids)"
+      ],
+      "execution_count": 77,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "JAr0EmJwRmka",
+        "outputId": "94c463ae-c269-4577-a1ba-74dc528732ba"
+      },
+      "source": [
+        "sample_outputs = model.generate(\n",
+        "    input_ids,\n",
+        "    do_sample=True, \n",
+        "    max_length=20, \n",
+        "    top_k=50, \n",
+        "    top_p=0.95, \n",
+        "    num_return_sequences=5,\n",
+        "    bad_words_ids=bad_word_ids\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "for i, sample_output in enumerate(sample_outputs):\n",
+        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
+      ],
+      "execution_count": 76,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "0: Mój ulubiony gatunek muzyki to muzyka klasyczna. Nie wiem, czy to kwestia sposobu, w jaki gramy,\n",
+            "1: Mój ulubiony gatunek muzyki to reggea. Zachwycają mnie piosenki i piosenki muzyczne o ducho\n",
+            "2: Mój ulubiony gatunek muzyki to rockabilly, ale nie lubię też punka. Moim ulubionym gatunkiem\n",
+            "3: Mój ulubiony gatunek muzyki to rap, ale to raczej się nie zdarza w miejscach, gdzie nie chodzi\n",
+            "4: Mój ulubiony gatunek muzyki to metal aranżeje nie mam pojęcia co mam robić. Co roku,\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "g080rafsZEqo"
+      },
+      "source": [
+        "OK, this seems to have worked: we can see *classical music, rap, metal* among the outputs. Interestingly, *reggae* found a way through via the misspelling *reggea*. Take this as a reminder to curate your bad-word lists carefully!"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "nGzC7t6HaC4n"
+      },
+      "source": [
+        "## Few Shot Learning\n",
+        "\n",
+        "Let's now see whether our model can pick up a training signal directly from a prompt, without any fine-tuning. This approach was made really popular by GPT-3, and while our model is definitely less powerful, maybe it can still show some skills! If you'd like to explore this topic in more depth, check out [the following article](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api), which we used as a reference."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "WqAYyfWZaCBd"
+      },
+      "source": [
+        "prompt = \"\"\"Tekst: \"Nienawidzę smerfów!\"\n",
+        "Sentyment: Negatywny\n",
+        "###\n",
+        "Tekst: \"Jaki piękny dzień 👍\"\n",
+        "Sentyment: Pozytywny\n",
+        "###\n",
+        "Tekst: \"Jutro idę do kina\"\n",
+        "Sentyment: Neutralny\n",
+        "###\n",
+        "Tekst: \"Ten przepis jest świetny!\"\n",
+        "Sentyment:\"\"\""
+      ],
+      "execution_count": 134,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "OXex5Zh8aSe2",
+        "outputId": "2efcd460-fe1a-4d97-c740-d5d3a034fb20"
+      },
+      "source": [
+        "res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5,)\n",
+        "for x in res: \n",
+        "  print(x['generated_text'].split(' ')[1])"
+      ],
+      "execution_count": 135,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Pozytywny\n",
+            "Pozytywny\n",
+            "Pozytywny\n",
+            "Pozytywny\n",
+            "Pozytywny\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "mP-hSxPBb5ky"
+      },
+      "source": [
+        "prompt = \"\"\"Tekst: \"Nienawidzę smerfów!\"\n",
+        "Sentyment: Negatywny\n",
+        "###\n",
+        "Tekst: \"Jaki piękny dzień 👍\"\n",
+        "Sentyment: Pozytywny\n",
+        "###\n",
+        "Tekst: \"Jutro idę do kina\"\n",
+        "Sentyment: Neutralny\n",
+        "###\n",
+        "Tekst: \"No po prostu beznadzieja\"\n",
+        "Sentyment:\"\"\""
+      ],
+      "execution_count": 136,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "wi5i1Dl5bemF",
+        "outputId": "455e6602-03d0-480f-b306-e94a6022f403"
+      },
+      "source": [
+        "res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5,)\n",
+        "for x in res: \n",
+        "  print(x['generated_text'].split(' ')[1])"
+      ],
+      "execution_count": 137,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Negatywny\n",
+            "Negatywny\n",
+            "Negatywny\n",
+            "Negatywny\n",
+            "Negatywny\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "e96CRXtHcFfg"
+      },
+      "source": [
+        "prompt = \"\"\"Tekst: \"Nienawidzę smerfów!\"\n",
+        "Sentyment: Negatywny\n",
+        "###\n",
+        "Tekst: \"Jaki piękny dzień 👍\"\n",
+        "Sentyment: Pozytywny\n",
+        "###\n",
+        "Tekst: \"Jutro idę do kina\"\n",
+        "Sentyment: Neutralny\n",
+        "###\n",
+        "Tekst: \"Przyjechał wczoraj wieczorem.\"\n",
+        "Sentyment:\"\"\""
+      ],
+      "execution_count": 140,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "FsCeE80QcNUY",
+        "outputId": "ea6ff86b-8adb-4b5a-bcaa-8b893a825aa5"
+      },
+      "source": [
+        "res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5,)\n",
+        "for x in res: \n",
+        "  print(x['generated_text'].split(' ')[1])"
+      ],
+      "execution_count": 141,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Neutralny,\n",
+            "Neutralny,\n",
+            "Neutralny,\n",
+            "Neutralny,\n",
+            "Neutralny,\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "P6NJOgzwk-gz"
+      },
+      "source": [
+        "It looks like our model is able to pick up some signal from the prompt. Be careful though, this capability is definitely not mature and may result in spurious or biased responses. "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "n5r8vnFVdHn-"
+      },
+      "source": [
+        "## Zero-Shot Learning\n",
+        "\n",
+        "Large language models are known to store a lot of knowledge in their parameters. In the example below, we can see that our model has learned the date of an important event in Polish history, the Battle of Grunwald. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "2lzoMNPic96F",
+        "outputId": "88d5a77a-ec23-4c29-884e-0e51dd059b8f"
+      },
+      "source": [
+        "prompt = \"Bitwa pod Grunwaldem miała miejsce w roku\"\n",
+        "input_ids = tokenizer.encode(prompt, return_tensors='pt')\n",
+        "# activate beam search and early_stopping\n",
+        "beam_outputs = model.generate(\n",
+        "    input_ids, \n",
+        "    max_length=20, \n",
+        "    num_beams=5, \n",
+        "    early_stopping=True,\n",
+        "    num_return_sequences=3\n",
+        ")\n",
+        "\n",
+        "print(\"Output:\\n\" + 100 * '-')\n",
+        "for i, sample_output in enumerate(beam_outputs):\n",
+        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
+      ],
+      "execution_count": 118,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Output:\n",
+            "----------------------------------------------------------------------------------------------------\n",
+            "0: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie pod\n",
+            "1: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie pokona\n",
+            "2: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie,\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "k_o4H2v1dWxV"
+      },
+      "source": [
+        ""
+      ],
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
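The decoding options walked through in the notebook above (greedy search, temperature scaling, top-k and top-p sampling) all act on the model's next-token distribution. As a rough illustration only, and not the `transformers` implementation, the sketch below applies them to a hand-picked list of logits. The function names and the logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a temperature below 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable token ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


# Hypothetical logits for a five-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

probs = softmax(logits)
sharper = softmax(logits, temperature=0.8)

# Greedy search simply takes the most probable token id.
greedy_token = max(range(len(probs)), key=lambda i: probs[i])

# Candidate sets that sampling would draw from.
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.93)
```

Sampling would then draw a token id from `top2` or `nucleus` according to the renormalized probabilities, for example with `random.choices`.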

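The three few-shot prompts in the notebook differ only in their final `Tekst:` line. A small helper can build them from one shared list of labelled examples. This is a sketch: `build_prompt` and `first_label` are hypothetical names, and the label extraction assumes the completion starts with the label word.

```python
def build_prompt(examples, query):
    """Join labelled examples with '###' separators and leave the final label open."""
    blocks = [f'Tekst: "{text}"\nSentyment: {label}' for text, label in examples]
    blocks.append(f'Tekst: "{query}"\nSentyment:')
    return "\n###\n".join(blocks)


def first_label(generated_text):
    """Return the first word of a completion, without trailing punctuation."""
    words = generated_text.split()
    return words[0].rstrip(",.") if words else ""


# The three labelled examples used in the notebook's prompts.
EXAMPLES = [
    ("Nienawidzę smerfów!", "Negatywny"),
    ("Jaki piękny dzień 👍", "Pozytywny"),
    ("Jutro idę do kina", "Neutralny"),
]

prompt = build_prompt(EXAMPLES, "Ten przepis jest świetny!")
```

Splitting on any whitespace and stripping a trailing comma or period means a completion such as ` Neutralny, ...` yields the bare label `Neutralny`.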
pretrain_model.sh ADDED

	@@ -0,0 +1,21 @@

+./run_clm_flax.py \
+    --output_dir="." \
+    --model_type="gpt2" \
+    --config_name="." \
+    --tokenizer_name="." \
+    --dataset_name="oscar" \
+    --dataset_config_name="unshuffled_deduplicated_pl" \
+    --do_train --do_eval \
+    --block_size="512" \
+    --per_device_train_batch_size="64" \
+    --per_device_eval_batch_size="64" \
+    --learning_rate="2e-4" --warmup_steps="5000" \
+    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
+    --overwrite_output_dir \
+    --num_train_epochs="3" \
+    --logging_steps="3500" \
+    --preprocessing_num_workers="64" \
+    --save_steps="7000" \
+    --eval_steps="7000" \
+    --model_name_or_path="." \
+    --push_to_hub
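
The `--learning_rate="2e-4"` and `--warmup_steps="5000"` flags above describe a peak learning rate reached after a warmup phase. The training script itself is not shown here, so the sketch below assumes linear warmup followed by linear decay to zero. The `total_steps` value is hypothetical and chosen only for illustration.

```python
def learning_rate(step, peak_lr=2e-4, warmup_steps=5000, total_steps=100_000):
    """Assumed schedule: linear warmup from 0 to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Learning rate at a few points of the (hypothetical) 100,000-step run.
for step in (0, 2500, 5000, 52_500, 100_000):
    print(step, learning_rate(step))
```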