---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---

# WeatherGPT: Generate valid JSON from weather descriptions

Repository: https://github.com/BobaZooba/wgpt

This repository shows how to use the `xllm` library. It contains a solution to a common type of take-home assessment given to LLM engineers, a role that typically pays between $120,000 and $140,000 annually. The work took 6-7 hours to complete and is representative of real tasks in the field.

[<img src="https://cdn-uploads.huggingface.co/production/uploads/6074d5f1134c000d1ae10d42/JudU3rrPP5i87CfwINANO.png" alt="Powered by X—LLM" width="175" height="32"/>](https://github.com/BobaZooba/xllm)

# Task

Convert a weather description to valid JSON using an LLM.

### Example

**Description:** Today will be mostly sunny with temperatures reaching 25 degrees. There will be a strong wind blowing at 30 km/h. Humidity levels are unknown and there is no precipitation expected.

**JSON**

```json
{
  "weather": "sunny",
  "temperature": 25,
  "wind_speed": 30.0,
  "humidity": null,
  "precipitation": null,
  "visibility": "good",
  "air_quality": null,
  "real_feel_temperature": null
}
```
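
For clarity, the target schema can also be written down explicitly. Below is a minimal sketch using pydantic (v2); the class is illustrative and not part of the repository.

```python
# Illustrative schema only; the wgpt repository may define this differently.
from typing import Optional

from pydantic import BaseModel  # pydantic v2


class Weather(BaseModel):
    weather: Optional[str] = None
    temperature: Optional[int] = None
    wind_speed: Optional[float] = None
    humidity: Optional[float] = None
    precipitation: Optional[str] = None
    visibility: Optional[str] = None
    air_quality: Optional[str] = None
    real_feel_temperature: Optional[int] = None


# Weather.model_validate_json(response) both parses and type-checks a model output.
```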

# Installation

Run in terminal:

```bash
pip install -e .
```

## Environment

Python 3.8+, CUDA 11.8

**Suggested docker image:** huggingface/transformers-pytorch-gpu:latest

All entry points are in the **Makefile**

# Implementation

- Generate data
  - ChatGPT with varied few-shot examples
- Train the model
  - QLoRA
  - DeepSpeed Stage 2
  - 4-bit quantization
  - Gradient checkpointing
  - Mistral AI 7B
- Evaluate
  - Output can be parsed
  - ChatGPT labeling

# Reproduce

1. Install requirements

```sh
make install
```

2. Create a **.env** file and fill it with your values (please take a look at _.env.template_)

3. _[Optional]_ Generate data

```sh
make generate-data
```

Example data is also provided in the _data_ folder

4. Prepare the data and model

```sh
make prepare
```

5. Train the model

```sh
make train
```

Or train with DeepSpeed (if you have multiple GPUs but want to use only one, set `CUDA_VISIBLE_DEVICES`)

```sh
make train-deepspeed
```

6. Fuse the LoRA weights

```sh
make fuse
```

7. Evaluate

```sh
make evaluate
```

# Data generation

### Why generated?

- ChatGPT was chosen for data collection because I couldn't find similar existing datasets, and because this method scales to other domains; many companies will need to do this in one way or another
- NLP engineers used to suffer from a lack of datasets; now a dataset can be generated, which serves as a great starting point for problem-solving
- The cost of compiling an initial dataset has decreased from thousands of dollars to a few bucks

## Prompt

Please also check `wgpt/core/prompts.py`

### Example

```txt
Your task is to create diverse examples where a free-form description of weather is translated into a JSON file format.
Each description should be between 2 to 5 sentences long with as much diversity as possible. Feel free to omit some fields, add new information, or write in a variety of styles.
The JSON format requires the following fields: weather (str), temperature (int), wind_speed (float), humidity (float), precipitation (str), visibility (str), air_quality (str), and real_feel_temperature (int). If any value is unknown, use null.
The "temperature" and "real_feel_temperature" should be in degrees, wind_speed should be in kilometers per hour, and "humidity" is in percentage. The fields "weather", "precipitation", "visibility" should be single word descriptions.
The format of your answer should be: 1. Input: ...
Output: ...
2. Input: ...
Output: ...
Examples:
1. Input: The skies are clear with a temperature of about 25 degrees. The wind is blowing gently at around 7kph. Visibility is high and the air is quite dry with a humidity around 30%. There's no precipitation. Feels like it's exactly 25 degrees. The air quality is very good today.
Output: {"weather": "clear", "temperature": 25, "wind_speed": 7.0, "humidity": 30.0, "precipitation": "none", "visibility": "high", "air_quality": "good", "real_feel_temperature": 25}
2. Input: It's snowing outside and the temperature is -5 degrees. There's a strong wind blowing at 25kph. Visibility is very low because of the snow. Humidity is around 80%. Air quality is moderate today. The real feel is much lower at -10 degrees due to wind chill.
Output: {"weather": "snow", "temperature": -5, "wind_speed": 25.0, "humidity": 80.0, "precipitation": "snow", "visibility": "low", "air_quality": "moderate", "real_feel_temperature": -10}
3. Input: Expect a cloudy evening with a temperature of about 18 degrees. There is a slight chance of light showers, and the wind is gentle at 5 km/h.
Output: {"weather": "cloudy", "temperature": 18, "wind_speed": 5.0, "humidity": null, "precipitation": "light", "visibility": "good", "air_quality": null, "real_feel_temperature": null}
You need to create a dataset where plain text weather descriptions are converted into valid JSON files. Provide {num_samples} diverse samples similar to the example given.
```

<details>
<summary>Detailed explanation</summary>

#### Task

```txt
Your task is to create diverse examples where a free-form description of weather is translated into a JSON file format.
```

#### Description requirements

```txt
Each description should be between 2 to 5 sentences long with as much diversity as possible. Feel free to omit some fields, add new information, or write in a variety of styles.
```

#### JSON and fields requirements

```txt
The JSON format requires the following fields: weather (str), temperature (int), wind_speed (float), humidity (float), precipitation (str), visibility (str), air_quality (str), and real_feel_temperature (int). If any value is unknown, use null.

The "temperature" and "real_feel_temperature" should be in degrees, wind_speed should be in kilometers per hour, and "humidity" is in percentage. The fields "weather", "precipitation", "visibility" should be single word descriptions.
```

#### Format of response

```txt
The format of your answer should be:
1. Input: ...
Output: ...
2. Input: ...
Output: ...
```

#### Few-shot examples

From 3 to 5 examples are randomly selected from a pre-prepared pool. This provides variety and demonstrates the task.

```txt
Examples:
1. Input: The skies are clear with a temperature of about 25 degrees. The wind is blowing gently at around 7kph. Visibility is high and the air is quite dry with a humidity around 30%. There's no precipitation. Feels like it's exactly 25 degrees. The air quality is very good today.
Output: {"weather": "clear", "temperature": 25, "wind_speed": 7.0, "humidity": 30.0, "precipitation": "none", "visibility": "high", "air_quality": "good", "real_feel_temperature": 25}
2. Input: It's snowing outside and the temperature is -5 degrees. There's a strong wind blowing at 25kph. Visibility is very low because of the snow. Humidity is around 80%. Air quality is moderate today. The real feel is much lower at -10 degrees due to wind chill.
Output: {"weather": "snow", "temperature": -5, "wind_speed": 25.0, "humidity": 80.0, "precipitation": "snow", "visibility": "low", "air_quality": "moderate", "real_feel_temperature": -10}
3. Input: Expect a cloudy evening with a temperature of about 18 degrees. There is a slight chance of light showers, and the wind is gentle at 5 km/h.
Output: {"weather": "cloudy", "temperature": 18, "wind_speed": 5.0, "humidity": null, "precipitation": "light", "visibility": "good", "air_quality": null, "real_feel_temperature": null}
```

#### Direct call to action

This part also specifies the number of desired examples.

```txt
You need to create a dataset where plain text weather descriptions are converted into valid JSON files. Provide {num_samples} diverse samples similar to the example given.
```

</details>

### Example of output

```txt
1. Input: The sun is shining brightly with a temperature reaching a scorching 38 degrees. There is a moderate breeze blowing at a speed of 15kph. Visibility is clear with no obstructions. Humidity is quite low at around 20%. No precipitation is expected. The real feel temperature is similar to the actual temperature.
Output: {"weather": "sunny", "temperature": 38, "wind_speed": 15.0, "humidity": 20.0, "precipitation": "none", "visibility": "clear", "air_quality": null, "real_feel_temperature": 38}
2. Input: It's a hot and humid day with a temperature of 32 degrees. There is no wind present and the air is quite heavy. Visibility is reduced due to haze. Humidity is extremely high at around 90%. No precipitation is predicted. The real feel temperature is slightly higher at 34 degrees.
Output: {"weather": null, "temperature": 32, "wind_speed": null, "humidity": 90.0, "precipitation": null, "visibility": "reduced", "air_quality": null, "real_feel_temperature": 34}
3. Input: The weather today is cloudy with a temperature of 22 degrees. A light breeze is blowing at 10kph. Visibility is good and there is no precipitation expected. Humidity is moderate at around 50%. The real feel temperature is the same as the actual temperature.
Output: {"weather": "cloudy", "temperature": 22, "wind_speed": 10.0, "humidity": 50.0, "precipitation": "none", "visibility": "good", "air_quality": null, "real_feel_temperature": 22}
4. Input: It's a gloomy day with overcast skies. The temperature is a chilly 8 degrees. Strong winds are howling at 40kph. Visibility is reduced due to fog. Humidity is high at 85%. Light rain is expected. The real feel temperature is lower at 5 degrees due to wind chill.
Output: {"weather": "overcast", "temperature": 8, "wind_speed": 40.0, "humidity": 85.0, "precipitation": "rain", "visibility": "reduced", "air_quality": null, "real_feel_temperature": 5}
5. Input: Enjoy a beautiful spring day with clear blue skies and a temperature of 20 degrees. A gentle breeze is rustling the leaves at 12kph. Visibility is excellent with no obstructions. Humidity is moderate at 55%. No precipitation is expected. The real feel temperature matches the actual temperature.
Output: {"weather": "clear", "temperature": 20, "wind_speed": 12.0, "humidity": 55.0, "precipitation": "none", "visibility": "excellent", "air_quality": null, "real_feel_temperature": 20}
```
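
For reference, the generation loop can look roughly like the sketch below. The helper and the regex are hypothetical (the actual implementation lives in the wgpt repository); the prompt template is the one shown above, with `{num_samples}` filled in.

```python
import os
import re
from typing import List, Tuple

from openai import OpenAI  # assumes the v1 OpenAI Python client

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Matches "N. Input: ... Output: {...}" pairs in the model's reply.
PAIR_PATTERN = re.compile(
    r"\d+\.\s*Input:\s*(?P<input>.+?)\s*Output:\s*(?P<output>\{.*?\})",
    re.DOTALL,
)


def generate_samples(prompt_template: str, num_samples: int = 5) -> List[Tuple[str, str]]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_template.format(num_samples=num_samples)}],
    )
    text = response.choices[0].message.content
    return [(match["input"], match["output"]) for match in PAIR_PATTERN.finditer(text)]
```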

## Results of data generation

- 5848 examples were generated for training (including the validation set), which took about 10 minutes
- ChatGPT was used because it is about 30 times cheaper than GPT-4 and much faster. In real projects, I would use ChatGPT, GPT-4, and open models as well, to obtain as diverse a dataset as possible
- The examples turned out to be quite lively and met the requirements of the task

# Training

- Model: [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- Boilerplate (QLoRA, DeepSpeed Stage 2, 4-bit quantization, gradient checkpointing): [xllm](https://github.com/BobaZooba/xllm)

`xllm` is a user-friendly library that streamlines training optimization, so you can focus on enhancing your models and data. Equipped with cutting-edge training techniques, `xllm` is engineered for efficiency by engineers who understand your needs.

[<img src="https://github.com/BobaZooba/xllm/blob/main/static/images/xllm-badge.png" alt="Powered by X—LLM" width="175" height="32"/>](https://github.com/BobaZooba/xllm)

### Methods

- **QLoRA (with 4-bit bitsandbytes quantization)**. The preferred method of fine-tuning, as it usually ensures higher quality than full fine-tuning due to effective management of catastrophic forgetting. It also, of course, reduces the memory used during training. A configuration sketch follows this list
  - I apply LoRA to all linear layers except lm_head and the embeddings
  - The original LoRA paper does not investigate which layers it is best to apply LoRA to. Please check the [AdaLoRA paper](https://arxiv.org/pdf/2303.10512.pdf)
  - I also use a fairly high rank for the low-rank decomposition: 64 (alpha is 32)
- **DeepSpeed Stage 2 (with CPU offloading)**. I'm using DeepSpeed for training on multiple GPUs. Stage 2 was used because there are known issues when using Stage 3 with quantized models
- **Gradient checkpointing**. Substantially reduces memory usage at the cost of slower training
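
For illustration, here is roughly how this setup is expressed directly with `peft` and `bitsandbytes` (`xllm` configures this under the hood; the values mirror the text above, and the target module names assume the Mistral architecture):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen backbone (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA on all linear layers except lm_head and the embeddings
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```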

### `xllm` details

The xllm library has several important steps: prepare, train, fuse, quantize

- **Prepare**. Data preprocessing and the model download are separated into their own step to avoid repeating them on every process in distributed training
- **Train**. Trains the model and saves checkpoints to the HuggingFace Hub. Since I am training with LoRA, only those weights are saved
- **Fuse**. Fuses the LoRA weights into the backbone model and pushes the result to the HF Hub
- _[Optional]_ **Quantize**. GPTQ quantization of the model

### Task details

The loss is computed only on the JSON part of each sample, not on the description (a masking sketch follows).
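
A minimal sketch of this masking, assuming each sample is the description immediately followed by the JSON and that `tokenizer` is the model tokenizer; labels of -100 are ignored by the HuggingFace cross-entropy loss:

```python
# Tokenize the two parts separately so we know where the JSON starts.
description_ids = tokenizer(description, add_special_tokens=False).input_ids
json_ids = tokenizer(json_text, add_special_tokens=False).input_ids

input_ids = description_ids + json_ids
# -100 masks the description tokens out of the loss; only the JSON is learned.
labels = [-100] * len(description_ids) + json_ids
```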

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6074d5f1134c000d1ae10d42/kHeuMnIuTzfqfswD3Lg-8.jpeg)

### Results of the training

- [LoRA weights](https://huggingface.co/BobaZooba/WGPT-LoRA)
- [Fused model](https://huggingface.co/BobaZooba/WGPT)
- [W&B link](https://api.wandb.ai/links/bobazooba/8v7pqflf)

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6074d5f1134c000d1ae10d42/ILbN27kYCo9QKV2FwG8wL.jpeg)
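
To try the fused model, a standard `transformers` generation call should work roughly as below (the exact prompt format between description and JSON may differ from training; see the repository; generation parameters are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BobaZooba/WGPT")
model = AutoModelForCausalLM.from_pretrained("BobaZooba/WGPT", device_map="auto")

description = "Today will be mostly sunny with temperatures reaching 25 degrees."
inputs = tokenizer(description, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```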

# Evaluation

## Metrics

### Why no BLEU / ROUGE / etc?

I have been evaluating generative models for several years, and I believe that using n-gram methods to evaluate generative models is, at present, extremely poor practice. BERTScore is not a sufficiently good method either. Currently there are only two good ways to evaluate generative models: human evaluation and emulation of human evaluation (GPT-like instruction-tuned models, or rankers/classifiers trained on human judgments).

### Output can be parsed

A simple proxy metric: try to parse each model response as JSON and report the percentage of responses that parse successfully (a sketch follows).
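
A sketch of this metric, assuming the raw model responses are available as strings (the helper name is hypothetical):

```python
import json


def parse_rate(responses):
    """Fraction of model outputs that parse as valid JSON."""
    parsed = 0
    for response in responses:
        try:
            json.loads(response)
            parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / len(responses)
```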

### ChatGPT labeling

Emulation of human assessment. ChatGPT is given an instruction together with my model's output and must return one of three verdicts: correct, minor inaccuracies, or incorrect. In real projects I would use only GPT-4, but it is too expensive.

#### ChatGPT labeling instruction

```txt
Your task is to validate whether the model has correctly parsed the weather description into JSON. The model was given a free-form weather description in natural language. Its task was to transform this description into valid JSON. Your job: understand whether the model has correctly parsed what was stated in the text, whether it correctly filled in the fields, with the correct values.
The JSON format requires the following fields: weather (str), temperature (int), wind_speed (float), humidity (float), precipitation (str), visibility (str), air_quality (str), and real_feel_temperature (int). If any value is unknown, use null.
The "temperature" and "real_feel_temperature" should be in degrees, wind_speed should be in kilometers per hour, and "humidity" is in percentage. The fields "weather", "precipitation", "visibility" should be single word descriptions.
Weather description: {weather_description}
Model response: {model_response}
Ground truth: {ground_truth}
You need to consider whether the model has parsed the answer correctly and give your assessment. The rating options can only be: correct, minor inaccuracies, incorrect.
Format of your answer.
Reasoning: ...
Assessment: ...
```
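
The verdict can then be pulled out of the "Reasoning: ... Assessment: ..." reply with a small parsing helper, e.g. (hypothetical sketch):

```python
import re

VERDICTS = ("correct", "minor inaccuracies", "incorrect")


def extract_assessment(reply):
    """Pull the verdict out of a "Reasoning: ... Assessment: ..." reply."""
    match = re.search(r"Assessment:\s*(.+)", reply)
    if match is None:
        return None
    verdict = match.group(1).strip().rstrip(".").lower()
    return verdict if verdict in VERDICTS else None
```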
299
+
300
+ ## Results
301
+
302
+ Output can be parsed: **100%**
303
+
304
+ ### ChatGPT labeling
305
+
306
+ |Correct|Minor inaccuracies|Incorrect|
307
+ |-|-|-|
308
+ |48%|51%|1%|
309
+
310
+ # Future works
311
+
312
+ - Improve evaluation
313
+ - Need to add a method that compares the JSON response with the generated JSON. We know the types of fields. For numeric fields, you should use MSE, and for text fields, you should use the proximity of text embeddings, having previously selected the model
314
+ - If the quality of the current model is satisfactory, it should be deployed into production (in a quantized version) using TGI at least for a limited number of users.
315
+ - It is crucial to gather real data from production to fine-tune the model. Then these data need to be labeled, train a discriminator model (which would assess the quality of responses), filter the data and further train the model. For this task, I wouldn't apply RL, only the ReST ([link](It is crucial to gather real data from production to fine-tune the model. Then these data need to be labeled, train a discriminator model (which would assess the quality of responses), filter the data and further train the model. For this task, I wouldn't apply RL, only the ReST (link) method, which I would improve. Such actions on labeling and further training should be performed regularly. Ideally, an infrastructure for constant retraining should be developed. The recommended frequency depends on the traffic volumes. Usually, for manual re-learning the frequency is monthly, for automatic – weekly. Also, because we apply labeling, we can track model improvements. This will be particularly useful when the labeling instruction is stabilized. With the help of the discriminator we can adjust the hyperparameters for training and inference, for example, generation parameters. Also, with the discriminator, we can immediately assess several hypotheses from the generative model and deliver only the best one to the user. Currently, this method is not widely used due to the significantly increasing load on the generative model, so I would focus on the further training of the generative model using the discriminator.)) method, which I would improve. Such actions on labeling and further training should be performed regularly. Ideally, an infrastructure for constant retraining should be developed. The recommended frequency depends on the traffic volumes. Usually, for manual re-learning the frequency is monthly, for automatic – weekly. Also, because we apply labeling, we can track model improvements. This will be particularly useful when the labeling instruction is stabilized. With the help of the discriminator we can adjust the hyperparameters for training and inference, for example, generation parameters. Also, with the discriminator, we can immediately assess several hypotheses from the generative model and deliver only the best one to the user. Currently, this method is not widely used due to the significantly increasing load on the generative model, so I would focus on the further training of the generative model using the discriminator.
316
+ - If the quality of the current model is not satisfactory, similar steps will need to be taken with synthetic data, deploy it into production, and then perform the same steps with the data from production.
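
A sketch of that comparison, using sentence-transformers as one possible embedding backend (the model name is a placeholder; selecting it properly is part of the future work):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; model selection is future work


def field_scores(predicted: dict, ground_truth: dict) -> dict:
    """Per-field agreement: negated squared error for numbers, cosine similarity for text."""
    scores = {}
    for key, true_value in ground_truth.items():
        predicted_value = predicted.get(key)
        if true_value is None or predicted_value is None:
            # Exact match on missing values: 1.0 if both are null, else 0.0
            scores[key] = 1.0 if predicted_value == true_value else 0.0
        elif isinstance(true_value, (int, float)):
            scores[key] = -((float(predicted_value) - float(true_value)) ** 2)
        else:
            a, b = embedder.encode([str(predicted_value), str(true_value)])
            scores[key] = float(util.cos_sim(a, b))
    return scores
```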