# How we used ShareGPT to create our benchmark dataset

## sg_90k_part1_html_cleaned.json

### Download ShareGPT dataset
```
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
```
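If you prefer scripting the download, a minimal sketch using only Python's standard library (the output filename is just the last path component of the URL):
```
import urllib.request

URL = ("https://huggingface.co/datasets/anon8231489123/"
       "ShareGPT_Vicuna_unfiltered/resolve/main/"
       "HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json")

# Fetch the raw dataset into the current directory.
urllib.request.urlretrieve(URL, "sg_90k_part1_html_cleaned.json")
```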

### Install FastChat
```
pip install fschat
```

### Clean data
```
pip install polyglot pyicu pycld2
python -m fastchat.data.optional_clean --in sg_90k_part1_html_cleaned.json --out sg_90k_part1_html_cleaned_lang.json --keep-lang en
```
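`optional_clean` detects the language of each conversation and keeps only those in the requested language. A rough sketch of the idea, assuming the usual ShareGPT layout of `{"id": ..., "conversations": [{"from": ..., "value": ...}]}` and using `polyglot` for detection:
```
import json
from polyglot.detect import Detector

def keep_english(in_file, out_file):
    with open(in_file) as f:
        data = json.load(f)
    kept = []
    for sample in data:
        text = " ".join(turn["value"] for turn in sample["conversations"])
        try:
            # polyglot reports ISO codes, e.g. "en" for English.
            lang = Detector(text, quiet=True).language.code
        except Exception:
            continue  # drop samples whose language cannot be detected
        if lang == "en":
            kept.append(sample)
    with open(out_file, "w") as f:
        json.dump(kept, f, indent=2)

keep_english("sg_90k_part1_html_cleaned.json",
             "sg_90k_part1_html_cleaned_lang.json")
```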

### Extract first prompt
```
python extract_first.py --in-file sg_90k_part1_html_cleaned_lang.json --out-file sg_90k_part1_html_cleaned_lang_first.json
```
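`extract_first.py` is a small helper. One plausible implementation, assuming it truncates each conversation to its opening prompt:
```
import argparse
import json

# Keep only the first turn of each conversation: the prompt
# we later send to the model under benchmark.
parser = argparse.ArgumentParser()
parser.add_argument("--in-file", required=True)
parser.add_argument("--out-file", required=True)
args = parser.parse_args()

with open(args.in_file) as f:
    data = json.load(f)

for sample in data:
    sample["conversations"] = sample["conversations"][:1]

with open(args.out_file, "w") as f:
    json.dump(data, f, indent=2)
```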

### Sample data
```
python -m fastchat.data.sample --in sg_90k_part1_html_cleaned_lang_first.json --out sg_90k_part1_html_cleaned_lang_first_sampled.json --end 10000 --max-length 10000
```
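Roughly, this keeps up to 10,000 conversations whose total length stays under 10,000 characters. A sketch of the equivalent logic (the exact semantics live in `fastchat.data.sample`):
```
import json

END = 10_000         # --end: stop after this many samples
MAX_LENGTH = 10_000  # --max-length: skip overly long conversations

with open("sg_90k_part1_html_cleaned_lang_first.json") as f:
    data = json.load(f)

sampled = []
for sample in data:
    length = sum(len(turn["value"]) for turn in sample["conversations"])
    if length <= MAX_LENGTH:
        sampled.append(sample)
    if len(sampled) >= END:
        break

with open("sg_90k_part1_html_cleaned_lang_first_sampled.json", "w") as f:
    json.dump(sampled, f, indent=2)
```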

### Sort data
We sort the requests by sequence length, placing the longest sequences first. This minimizes the padding required within a batch and lets out-of-memory errors surface early, on the longest sequences.
```
python sort.py --data-dir sg_90k_part1_html_cleaned_lang_first_sampled.json --out-file sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
```
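`sort.py` is another small helper. A minimal sketch, assuming sequence length is measured in characters over all turns and the output is ordered longest-first:
```
import json

with open("sg_90k_part1_html_cleaned_lang_first_sampled.json") as f:
    data = json.load(f)

def seq_len(sample):
    # Total character length of the conversation; a tokenizer-based
    # count would work the same way with a different metric.
    return sum(len(turn["value"]) for turn in sample["conversations"])

# Longest first: batches of similar length need little padding, and
# any out-of-memory failure shows up on the very first batches.
data.sort(key=seq_len, reverse=True)

with open("sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json", "w") as f:
    json.dump(data, f, indent=2)
```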

## ShareGPT_V3_filtered.json

### Download ShareGPT dataset
```
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

### Install Transformers
```
pip install transformers
```

### Filter out conversations with overly long prompts/responses or not started by "human", extract the first turn, and randomly sample 500 prompts
```
python filter_dataset.py
```
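`filter_dataset.py` bundles all four operations. A sketch of the same pipeline; the tokenizer name and the length cutoff below are illustrative assumptions, the real values are set inside the script:
```
import json
import random
from transformers import AutoTokenizer

MAX_TOKENS = 1024   # illustrative cutoff; the script defines its own
NUM_SAMPLES = 500

# Assumed tokenizer; substitute whichever model the benchmark targets.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

filtered = []
for sample in data:
    conv = sample["conversations"]
    # Keep only conversations opened by the human side.
    if not conv or conv[0]["from"] != "human":
        continue
    prompt = conv[0]["value"]
    response = conv[1]["value"] if len(conv) > 1 else ""
    # Drop conversations whose prompt or response is too long.
    if len(tokenizer(prompt).input_ids) > MAX_TOKENS:
        continue
    if len(tokenizer(response).input_ids) > MAX_TOKENS:
        continue
    # Keep only the opening prompt/response pair.
    sample["conversations"] = conv[:2]
    filtered.append(sample)

random.seed(0)
sampled = random.sample(filtered, NUM_SAMPLES)

with open("ShareGPT_V3_filtered.json", "w") as f:
    json.dump(sampled, f, indent=2)
```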

### Compare the response length distribution of the sampled dataset against the initial dataset
```
pip install matplotlib numpy
python compare_distributions.py
```
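`compare_distributions.py` plots the two distributions side by side. A minimal sketch with matplotlib, assuming response length is measured in characters over the "gpt" turns:
```
import json
import matplotlib.pyplot as plt

def response_lengths(path):
    with open(path) as f:
        data = json.load(f)
    return [len(turn["value"])
            for sample in data
            for turn in sample["conversations"]
            if turn["from"] == "gpt"]

# Overlaid normalized histograms make a distribution shift easy to spot.
plt.hist(response_lengths("ShareGPT_V3_unfiltered_cleaned_split.json"),
         bins=50, density=True, alpha=0.5, label="initial")
plt.hist(response_lengths("ShareGPT_V3_filtered.json"),
         bins=50, density=True, alpha=0.5, label="sampled")
plt.xlabel("response length (characters)")
plt.ylabel("density")
plt.legend()
plt.savefig("response_length_distributions.png")
```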