---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
- OpenAssistant/oasst1
- databricks/databricks-dolly-15k
language:
- en
- fr
- de
- es
- it
---
# Model Card for Alfred-40B-0723

`Alfred-40B-0723` is a finetuned version of [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b), obtained with Reinforcement Learning from Human Feedback (RLHF).
It is the first in a series of RLHF models based on Falcon-40B that will be released regularly. It is made available under the Apache 2.0 License.

## Model Details

### Model Description

- **Developed by:** [LightOn](https://www.lighton.ai/)
- **Model type:** Causal decoder-only
- **Language(s) (NLP):** English, German, Spanish, French (and limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish)
- **License:** Apache 2.0
- **Finetuned from model:** [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b)

## Uses

### Direct Use

`Alfred-40B-0723` can be used as an instruct or chat model. We also encourage its use for research on large language models finetuned with RLHF.

### Out-of-Scope Use

Production use without adequate assessment of risks and mitigations; any use case that may be considered irresponsible or harmful.

## Bias, Risks, and Limitations

`Alfred-40B-0723` is a finetune of Falcon-40B. As such, it is trained mostly on English, German, Spanish, and French, with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. It will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

### Recommendations

We recommend that users of `Alfred-40B-0723` implement appropriate guardrails and precautions in any production use.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import transformers
from transformers import AutoTokenizer

model = "lightonai/alfred-40b-0723"
tokenizer = AutoTokenizer.from_pretrained(model)

# Load the weights in bfloat16; device_map="auto" (requires `accelerate`) spreads
# the 40B parameters across the available GPUs.
# trust_remote_code=True is needed because the repository ships custom modeling code.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample one completion with top-k sampling (k=10); max_length counts prompt tokens too.
sequences = pipeline(
    "Write a short text to announce that the new transformer model Alfred is available in open-source on Huggingface, include emojis.",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```

## Training Details

### Training Data

Alfred-40B-0723 was trained on a mixture of publicly available and in-house curated datasets. 

| **Data source**    |
|--------------------|
| [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) | 
| [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) |
| [dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k)  |
| [NatInstV2](https://github.com/allenai/natural-instructions) |
| momentum-internal |

`momentum-internal` is a collection of prompts rated as gold quality by LightOn staff during their daily workflow.

### Training Procedure 

`Alfred-40B-0723` was trained on 128 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=4) combined with ZeRO. The value model is initialized from the reward model and does not have any shared parameters with the policy network.
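
The three parallelism degrees multiply to the GPU count: TP × PP × DP = 8 × 4 × 4 = 128. The sketch below maps each global rank to its (pipeline, data, tensor) coordinates. The rank ordering is an assumption, since the card does not state the one used by the custom codebase: the tensor-parallel index varies fastest, then the data-parallel index, then the pipeline stage. If ranks are assigned to the 8-GPU P4d nodes consecutively, each tensor-parallel group then stays within one node. `rank_to_coords` and `tp_groups` are hypothetical names.

```python
from collections import defaultdict

TP, PP, DP = 8, 4, 4          # parallelism degrees stated in this card
WORLD_SIZE = TP * PP * DP     # 8 * 4 * 4 = 128 GPUs


def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Return (pp_rank, dp_rank, tp_rank) for a global rank.

    Assumed ordering: tensor-parallel index varies fastest, then
    data-parallel index, then pipeline stage.
    """
    if not 0 <= rank < WORLD_SIZE:
        raise ValueError(f"rank {rank} is outside [0, {WORLD_SIZE})")
    tp_rank = rank % TP
    dp_rank = (rank // TP) % DP
    pp_rank = rank // (TP * DP)
    return pp_rank, dp_rank, tp_rank


# Ranks sharing a pipeline stage and a data-parallel replica form one
# tensor-parallel group that shards the same layers' weights.
tp_groups = defaultdict(list)
for r in range(WORLD_SIZE):
    pp_rank, dp_rank, _ = rank_to_coords(r)
    tp_groups[(pp_rank, dp_rank)].append(r)

print(WORLD_SIZE, len(tp_groups))   # 128 16
print(rank_to_coords(127))          # (3, 3, 7)
print(tp_groups[(0, 1)])            # [8, 9, 10, 11, 12, 13, 14, 15]
```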

#### Preprocessing

Samples from each of the datasets have been programmatically formatted into chat, instruction, and few-shot prompts.

#### Training Hyperparameters

##### Policy and Value Optimizer Config

| **Hyperparameter** | **Value**  | **Comment**                               |
|--------------------|------------|-------------------------------------------|
| Precision          | `bfloat16` |                                           |
| Optimizer          | AdamW      |                                                            |
| Learning rate      | 1.85e-6    | 10 warm-up steps, cosine decay over 100 steps to 1.85e-7 |
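
The learning-rate row can be read as the schedule sketched below. This is an interpretation, not code from the training codebase: it assumes a linear warm-up over the first 10 steps, followed by a cosine decay over the next 100 steps from 1.85e-6 down to 1.85e-7, with the rate held at the floor afterwards. `learning_rate` is a hypothetical helper.

```python
import math

PEAK_LR = 1.85e-6
MIN_LR = 1.85e-7
WARMUP_STEPS = 10
DECAY_STEPS = 100


def learning_rate(step: int) -> float:
    """Learning rate at optimizer step `step` (0-indexed).

    Assumed shape: linear warm-up reaching PEAK_LR at the last warm-up step,
    then cosine decay to MIN_LR over DECAY_STEPS, then constant at MIN_LR.
    """
    if step < WARMUP_STEPS:
        # (step + 1) so that the very first step does not use a zero rate
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(step - WARMUP_STEPS, DECAY_STEPS) / DECAY_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for s in (0, 9, 60, 110, 200):
    print(s, learning_rate(s))
```

In PyTorch, such a function could be wrapped as `torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: learning_rate(step) / PEAK_LR)`, since `LambdaLR` multiplies the optimizer's base learning rate (here assumed to be `PEAK_LR`) by the returned factor.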

##### Trainer Config

| **Hyperparameter** | **Value**  |
|--------------------|------------|
| Num Rollouts       | 1024       | 
| Policy Epochs      | 1          | 
| Value Epochs       | 1          | 
| KL Coef            | 0.01       | 
| Gamma              | 1.0        | 
| GAE Lambda         | 0.95       | 
| Clip Range Policy  | 0.2        | 
| Clip Range Value   | 0.2        | 
| Whiten Advantages  | `true`     | 
| Whiten Rewards     | `false`    | 
| Score on EOD       | `true`     | 
| Max Steps          | 200        | 
| PPO steps/epoch    | 1          | 
| Value steps/epoch  | 8          | 
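
Several trainer settings (KL Coef, Gamma, GAE Lambda, Whiten Advantages, Whiten Rewards, Clip Range Policy, Score on EOD) correspond to standard PPO-for-RLHF components. The NumPy sketch below shows how they commonly fit together for a single rollout: a per-token KL penalty against a reference model, the reward-model score added on the final token, Generalized Advantage Estimation (GAE), advantage whitening, and the clipped surrogate loss. It follows the common textbook formulation and is not taken from LightOn's codebase. The function names and the simple log-ratio KL estimate are assumptions, and whitening is applied to one trajectory here for brevity, whereas implementations usually whiten across the whole batch.

```python
import numpy as np

KL_COEF = 0.01      # "KL Coef"
GAMMA = 1.0         # "Gamma"
GAE_LAMBDA = 0.95   # "GAE Lambda"
CLIP_POLICY = 0.2   # "Clip Range Policy"


def shaped_rewards(score, logp_policy, logp_ref, kl_coef=KL_COEF):
    """KL penalty on every token plus the reward-model score on the final token
    (one reading of "Score on EOD"). Rewards are not whitened ("Whiten Rewards: false")."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    rewards = -kl_coef * (logp_policy - logp_ref)  # simple log-ratio KL estimate
    rewards[-1] += score
    return rewards


def gae(rewards, values, gamma=GAMMA, lam=GAE_LAMBDA):
    """GAE for one trajectory; the value after the last token is taken as 0."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = advantages + values  # regression targets for the value model
    return advantages, returns


def whiten(x, eps=1e-8):
    """Shift to zero mean and scale to unit variance ("Whiten Advantages: true")."""
    return (x - x.mean()) / (x.std() + eps)


def clipped_policy_loss(logp_new, logp_old, advantages, clip=CLIP_POLICY):
    """PPO clipped surrogate objective, negated so that it can be minimized."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


# Toy rollout of 5 tokens with random log-probabilities and value estimates.
rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.5, size=5)
logp_ref = logp_old + rng.normal(0.0, 0.1, size=5)
values = rng.normal(0.0, 0.1, size=5)

rewards = shaped_rewards(1.0, logp_old, logp_ref)
advantages, returns = gae(rewards, values)
advantages = whiten(advantages)
# At the first update on a fresh rollout the new and old log-probs coincide (ratio = 1).
loss = clipped_policy_loss(logp_old, logp_old, advantages)
print(loss)
```

The value loss is typically clipped in the same way, using "Clip Range Value" to bound how far new value predictions may move from the old ones.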

##### Trajectory Data Config

| **Hyperparameter**   | **Value**  |
|----------------------|------------|
| Continuation Max Len | 1024       |
| Continuation Min Len | 0          |
|                Top P | 1.0        |
|          Temperature | 1.0        |
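
With Top P = 1.0 and Temperature = 1.0, rollouts sample from the policy's full, unmodified next-token distribution. The sketch below is a generic temperature-plus-nucleus sampling step in NumPy (not the codebase's implementation; `nucleus_distribution` and `sample_next_token` are hypothetical names) and illustrates that, at these settings, neither the temperature scaling nor the top-p cut changes the distribution.

```python
import numpy as np


def nucleus_distribution(logits, temperature=1.0, top_p=1.0):
    """Token probabilities after temperature scaling and top-p (nucleus) filtering.

    temperature must be > 0; top_p is in (0, 1].
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = min(int(np.searchsorted(cumulative, top_p)) + 1, len(probs))
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Draw one token id from the filtered distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = nucleus_distribution(logits, temperature, top_p)
    return int(rng.choice(len(probs), p=probs))


logits = np.log([0.6, 0.3, 0.1])
# Rollout settings (Top P = 1.0, Temperature = 1.0): the distribution is unchanged.
print(nucleus_distribution(logits, temperature=1.0, top_p=1.0))
# A smaller top_p drops the tail: here only the two most likely tokens survive.
print(nucleus_distribution(logits, temperature=1.0, top_p=0.7))
```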


## Evaluation

![aggregated evaluation of RAW vs SFT vs PPO - including random baseline - PPO suffers in arithmetic due to effects on calibration](https://i.ibb.co/9yQFJ40/aggregated.png "aggregated evaluation of RAW vs SFT vs PPO - including random baseline")

First evaluation results, aggregated from the EleutherAI Language Model Evaluation Harness:
- Arithmetic capabilities become much worse after PPO
- Common sense, paraphrase, reasoning, and reading comprehension stay at about the same level
- NLI improves, while QA gets worse

Overall, these results are in line with the literature: these benchmarks correlate poorly with human preference.
All of these metrics use a Select methodology, and since RLHF models are far less calibrated than raw LLMs, they are penalized in these evaluations.
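
In a Select-style multiple-choice evaluation, the model generates no text: each candidate answer is scored by the log-likelihood the model assigns to it given the prompt, and the highest-scoring candidate is taken as the model's answer. The minimal sketch below shows only that selection step on precomputed per-token log-probabilities; the numbers are made up for illustration and `select_answer` is a hypothetical name. Because the decision rests entirely on the model's likelihoods, a model whose probabilities are poorly calibrated can be scored down even when its free-form answers are good.

```python
def select_answer(choice_logprobs):
    """Return the index of the candidate with the highest total log-likelihood.

    choice_logprobs[i] holds the per-token log-probabilities of candidate i,
    each conditioned on the prompt and the preceding answer tokens.
    """
    totals = [sum(lp) for lp in choice_logprobs]
    return max(range(len(totals)), key=totals.__getitem__)


# Illustrative, made-up log-probabilities for three candidate answers.
candidates = [
    [-1.2, -0.4],        # total -1.6
    [-0.3, -0.2, -0.5],  # total -1.0
    [-2.5],              # total -2.5
]
print(select_answer(candidates))  # 1
```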

Human evaluation is currently ongoing.

### Compute Infrastructure

#### Hardware

Alfred-40B-0723 was trained on AWS SageMaker, on 128 A100 40GB GPUs in P4d instances.

#### Software

Alfred-40B-0723 was trained with a custom RLHF codebase. Training leverages a 3D parallelism approach combined with ZeRO, as well as high-performance kernels such as FlashAttention.

## Model Card Contact

contact@lighton.ai