---
license: apache-2.0
base_model: EleutherAI/pythia-160m-deduped
tags:
- generated_from_trainer
datasets:
- FineWebSentences
metrics:
- accuracy
model-index:
- name: pythia-finewebedu
  results:
  - task:
      name: Causal Language Modeling
      type: text-generation
    dataset:
      name: FineWebSentences
      type: FineWebSentences
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.24020533058796614
---

# pythia-finewebedu

- Generates semi-intelligible English sentences using a small GPT-like model.
- Outputs one sentence at a time.

This model is a fine-tuned version of [EleutherAI/pythia-160m-deduped](https://huggingface.co/EleutherAI/pythia-160m-deduped) on the FineWebSentences dataset.
It achieves the following results on the evaluation set:
- Loss: 4.7702
- Accuracy: 0.2402

## Model description

To generate 10 random sentences starting from an empty string on a CUDA device:

```python
from transformers import pipeline, set_seed

# Load the model; drop device='cuda' to run on CPU instead
generator = pipeline('text-generation', model='agentlans/pythia-finewebedu', device='cuda')

set_seed(1234)  # for reproducible samples
results = generator("", max_length=100, num_return_sequences=10, do_sample=True)

for x in results:
    print(x['generated_text'])
```

Output:
```text
They are also, you need to get great results at her school.
According to him the term of the Newer, as an entity of the country.
- To provide less information to help prevent and respond appropriately, it also seems to take action.
He was an important historical project that he was going to have a history, but the fact that he lived in the US and then he can move back to where he left.
By the use of the ESLP and INGELTS OF THE TRAIL ORD and REPORTANCE OR:
However, the system and the Internet have not been built.
To bridge your teeth with your teeth of the plaque build up with the new teeth and tartar attachments to the tissues, as those without an orthoker.
This is more difficult than other to learn the basics of the workbooks, where a few thousand notes the same idea that the author can be seen on the work of the project.)
This study was that by one of the six states, in the middle of a union that he had to marry or union union.
- A-Pangana and Pitta, P.A. L. T.C.
```
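Because sampling with `max_length` can run past a sentence boundary, a small post-processing step can keep only the first sentence of each sample. The helper below is a hypothetical sketch (not shipped with the model), using a rough punctuation heuristic rather than a full sentence tokenizer:

```python
import re

def first_sentence(text: str) -> str:
    """Return only the first sentence of a generated sample.

    Takes the first line, then cuts at the first sentence-ending
    punctuation mark followed by whitespace. A rough heuristic,
    not a full sentence tokenizer.
    """
    line = text.split("\n", 1)[0]
    match = re.search(r'[.!?](?=\s)', line)
    return line[:match.end()] if match else line

print(first_sentence("They are also great. And more text follows."))
```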

## Intended uses & limitations

- For generating short lines of English text
- Could be useful for
  - data augmentation
  - creative inspiration
  - entertainment
  - CAPTCHA
- Can be further fine-tuned on other data such as:
  - prompts
  - famous quotes
  - news headlines
  - blog post titles

Limitations include:

- Not guaranteed to make sensible, coherent, or grammatically correct sentences
- No regard for accuracy or truthfulness whatsoever
  - It's a bunch of words from a probability model, what do you expect?

## Training and evaluation data

Sentences from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
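The exact preprocessing script is not published with this card; the sketch below shows one plausible way documents could be split into standalone training sentences. The regex and the word-count filter are assumptions for illustration, not the actual pipeline:

```python
import re

def extract_sentences(document: str, min_words: int = 4, max_words: int = 40):
    """Split a document into candidate training sentences.

    Splits on sentence-ending punctuation followed by whitespace and
    keeps only sentences within a plausible word-count range.
    (Heuristic sketch; thresholds are illustrative assumptions.)
    """
    candidates = re.split(r'(?<=[.!?])\s+', document.strip())
    return [s for s in candidates if min_words <= len(s.split()) <= max_words]

doc = ("Plate tectonics shapes the Earth. Yes! "
       "The crust is broken into large plates that move slowly over time.")
for sentence in extract_sentences(doc):
    print(sentence)
```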

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
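The hyperparameters above map onto `transformers.TrainingArguments` keyword arguments roughly as follows. This is a sketch for orientation only; the original training command is not published with this card:

```python
# Hyperparameters from the list above, expressed as keyword arguments
# that could be passed to transformers.TrainingArguments.
training_args = dict(
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)
print(training_args["learning_rate"])
```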

### Training results

Training showed no signs of overfitting. As expected, Pythia-160m reached a lower loss than Pythia-70m on the same data.

### Framework versions

- Transformers 4.39.3
- Pytorch 2.3.0+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2