fraserlove committed
Commit 3d38a23 · verified · 1 Parent(s): eec1971

Update README.md

Files changed (1)
  1. README.md +33 -14
README.md CHANGED
@@ -4,6 +4,7 @@ datasets:
 - HuggingFaceFW/fineweb-edu
 language:
 - en
+pipeline_tag: text-generation
 ---
 # GPT 124M
 A pretrained GPT model with 124M parameters trained on 40B tokens of educational content. The full implementation of the model can be found on GitHub [here](https://github.com/fraserlove/gpt). The model was trained for 4 epochs on the 10B-token subset of [fineweb-edu](https://arxiv.org/pdf/2406.17557), a large-scale dataset of educational content. The model surpassed GPT-3 124M on [HellaSwag](https://arxiv.org/pdf/1905.07830) after just 38B tokens, a 7.8x improvement in token efficiency over GPT-3, which was trained on 300B tokens. The final model at 40B tokens achieved a HellaSwag score of 0.339.
@@ -20,23 +21,41 @@ India’s story begins with a very ancient Vedic religion. They were the ancient
 Once upon a time, the King of Italy, who was to govern what would become the world, thought that it would be a great and noble undertaking to introduce the Roman Senate into the country in order to defend Rome — to defend her own capital in a very civilized manner, to promote the arts and promote the Roman religion. Accordingly, Rome,
 ```
 
-### Inference
-The GPT model can be used for inference using the `inference.py` script. The script generates completions given a context. The completions are generated using the top-k sampling strategy. The maximum length of the completions, temperature and k value can be set in the script. The model can be loaded from a PyTorch checkpoint `torch.load('cache/logs/124M.pt', map_location=device)` or from a cached Hugging Face model `GPT.from_pretrained('cache/models')` after training. The model can then be used for inference as follows:
-
+## Inference
+The model can be used directly with a pipeline for text generation:
 ```python
-import torch
-from gpt import GPT
-from transformers import AutoTokenizer
+>>> from transformers import pipeline, set_seed
+>>> generator = pipeline('text-generation', model='fraserlove/gpt-124m')
+>>> set_seed(0)
+>>> generator('Once upon a time,', max_length=30, num_return_sequences=5, do_sample=True)
+
+[{'generated_text': 'Once upon a time, my father had some way that would help him win his first war. There was a man named John. He was the husband'},
+ {'generated_text': 'Once upon a time, this particular breed would be considered a “chicken fan”; today, the breed is classified as a chicken.'},
+ {'generated_text': 'Once upon a time, there was a famous English nobleman named King Arthur (in the Middle Ages, it was called ‘the Arthur’'},
+ {'generated_text': "Once upon a time, the Christian God created the world in the manner which, under different circumstances, was true of the world's existence. The universe"},
+ {'generated_text': 'Once upon a time, I wrote all of the letters of an alphabets in a single document. Then I read each letter of that alphabet'}]
+```
 
-device = 'cuda' if torch.cuda.is_available() else 'cpu'
+The model can also be used directly for inference:
 
-# Load the tokeniser and model
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
 tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-124m')
-model = GPT.from_pretrained('fraserlove/gpt-124m').to(device)
-
+model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-124m')
 context = 'Once upon a time,'
-context = torch.tensor(tokeniser.encode(context), dtype=torch.long).to(device)
-samples = model.generate(context, n_samples=2, max_tokens=64)
-samples = [samples[j, :].tolist() for j in range(samples.size(0))]
-print('\n'.join(tokeniser.decode(sample).split('<|endoftext|>')[0] for sample in samples))
+context = tokeniser.encode(context, return_tensors='pt')
+samples = model.generate(context, max_new_tokens=64, do_sample=True, num_return_sequences=2)
+decoded = tokeniser.batch_decode(samples)
+print('\n'.join(decoded))
+```
+
+To get the features of a given text:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-124m')
+model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-124m')
+text = 'Once upon a time,'
+encoded_input = tokeniser(text, return_tensors='pt')
+output = model(**encoded_input)
 ```
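The last example runs a plain forward pass, so `output` holds the language-modelling logits rather than feature vectors. If per-token features are wanted, the hidden states can be requested explicitly. A minimal sketch, assuming the checkpoint loads as a standard GPT-2-style causal LM so the generic Transformers `output_hidden_states` flag applies:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-124m')
model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-124m')

encoded_input = tokeniser('Once upon a time,', return_tensors='pt')
# Request per-layer hidden states alongside the logits.
output = model(**encoded_input, output_hidden_states=True)

logits = output.logits               # (batch, sequence, vocab): next-token scores
features = output.hidden_states[-1]  # (batch, sequence, hidden): final-layer features
```

`output.hidden_states` is a tuple with one tensor per layer plus the embedding output, so intermediate layers can be probed in the same way.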