Nanobit committed
Commit 88bba24
1 Parent(s): ba9ac72

Clean up data readme

Files changed (1)
  1. data/README.md +4 -4
data/README.md CHANGED
@@ -1,6 +1,5 @@
 
-- Download some datasets
--
+## Download some datasets
 ```shell
 curl https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_gpt4.json -o data/raw/alpaca_data_gpt4.json
 curl https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -L -o data/raw/vicuna_cleaned.json
@@ -8,7 +7,7 @@ curl https://github.com/teknium1/GPTeacher/blob/main/Instruct/gpt4-instruct-simi
 curl https://github.com/teknium1/GPTeacher/blob/main/Roleplay/roleplay-similarity_0.6-instruct-dataset.json?raw=true -L -o data/raw/roleplay-similarity_0.6-instruct-dataset.json
 ```
 
-- Convert the JSON data files to JSONL.
+## Convert the JSON data files to JSONL.
 
 ```shell
 python3 ./scripts/alpaca_json_to_jsonl.py --input data/alpaca_data_gpt4.json > data/alpaca_data_gpt4.jsonl
@@ -16,8 +15,9 @@ python3 ./scripts/alpaca_json_to_jsonl.py --input data/raw/vicuna_cleaned.json >
 python3 ./scripts/alpaca_json_to_jsonl.py --input data/raw/roleplay-similarity_0.6-instruct-dataset.json > data/roleplay-similarity_0.6-instruct-dataset.jsonl
 python3 ./scripts/alpaca_json_to_jsonl.py --input data/raw/gpt4-instruct-similarity-0.6-dataset.json > data/gpt4-instruct-similarity-0.6-dataset.jsonl
 ```
+---
 
-- Using JSONL makes it easier to subset the data if you want a smaller training set, i.e get 2000 random examples.
+Using JSONL makes it easier to subset the data if you want a smaller training set, i.e get 2000 random examples.
 
 ```shell
 shuf -n2000 data/vicuna_cleaned.jsonl > data/vicuna_cleaned.subset0.jsonl
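
The README invokes `scripts/alpaca_json_to_jsonl.py`, but the script itself is not part of this diff. A minimal, hypothetical sketch of what such a JSON-to-JSONL conversion involves is shown below, assuming the input file holds a top-level JSON array of records and that output is written to stdout (as the `>` redirections in the commands suggest); it is not the repository's implementation.

```python
# Hypothetical sketch (not the repository's script): read a JSON array of
# records and emit JSONL, i.e. one compact JSON object per line on stdout.
import argparse
import json
import sys


def main() -> None:
    parser = argparse.ArgumentParser(description="Convert a JSON array to JSONL.")
    parser.add_argument("--input", required=True,
                        help="Path to a .json file containing a list of records")
    args = parser.parse_args()

    with open(args.input, "r", encoding="utf-8") as f:
        records = json.load(f)  # assumes a top-level JSON array

    for record in records:
        # One object per line lets line-oriented tools like `shuf` or `head`
        # sample the data without re-parsing the whole file.
        sys.stdout.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    main()
```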