birgermoell committed on
Commit
7db48e1
1 Parent(s): 41cd5ea

Update README.md

Files changed (1)
  1. README.md +17 -6
README.md CHANGED
@@ -6,10 +6,13 @@ widget:
 
 # GPT2-svenska-wikipedia
 A Swedish GPT2-style model trained using the Flax CLM pipeline on the Swedish
-part of the wiki40b dataset.
-
+part of the wiki40b dataset and the Oscar dataset.
 https://huggingface.co/datasets/wiki40b
 
+The model was trained for around 22600 steps (42 hours) as part of the Huggingface Jax/Flax challenge, with the following final loss and learning rate:
+Loss: 3.1715331077575684, Learning Rate: 0.0024816440418362617
+
+The model could likely be trained for longer.
 
 ## Data cleaning and preprocessing
 The data was cleaned and preprocessed using the following script. Make sure to install the dependencies for beam_runner to make the dataset work.
@@ -26,10 +29,18 @@ def load_and_clean_wiki():
     return filtered_dataset
 
 def filter_wikipedia(batch):
-    batch["text"] = " ".join(batch["text"].split("\n_START_SECTION_\n"))
-    batch["text"] = " ".join(batch["text"].split("\n_START_ARTICLE_\n"))
-    batch["text"] = " ".join(batch["text"].split("\n_START_ARTICLE_\n"))
-    batch["text"] = " ".join(batch["text"].split("\n_START_PARAGRAPH_\n"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_SECTION_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_ARTICLE_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_ARTICLE_\
+"))
+    batch["text"] = " ".join(batch["text"].split("\
+_START_PARAGRAPH_\
+"))
     batch["text"] = " ".join(batch["text"].split("_NEWLINE_"))
     batch["text"] = " ".join(batch["text"].split("\xa0"))
     return batch