# How we used ShareGPT to create our benchmark dataset
## sg_90k_part1_html_cleaned.json
### Download ShareGPT dataset
```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
```
### Install FastChat
```
pip install fschat
```
### Clean data
```
pip install polyglot pyicu pycld2
python -m fastchat.data.optional_clean --in sg_90k_part1_html_cleaned.json --out sg_90k_part1_html_cleaned_lang.json --keep-lang en
```
### Extract first prompt
```
python extract_first.py --in-file sg_90k_part1_html_cleaned_lang.json --out-file sg_90k_part1_html_cleaned_lang_first.json
```
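`extract_first.py` is a script local to this repo. A minimal sketch of what this step does, assuming the standard ShareGPT schema in which each record carries an `id` and a `conversations` list of `{"from": "human" | "gpt", "value": ...}` turns:
```
# extract_first.py -- minimal sketch, not the actual repo script.
# Assumes the ShareGPT schema: each record has an "id" and a
# "conversations" list of {"from": "human"|"gpt", "value": ...} turns.
import argparse
import json


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", required=True)
    parser.add_argument("--out-file", required=True)
    args = parser.parse_args()

    with open(args.in_file) as f:
        data = json.load(f)

    # Keep only the opening human prompt of each conversation.
    first_prompts = []
    for sample in data:
        convs = sample.get("conversations", [])
        if convs and convs[0]["from"] == "human":
            first_prompts.append({"id": sample["id"], "conversations": convs[:1]})

    with open(args.out_file, "w") as f:
        json.dump(first_prompts, f, indent=2)


if __name__ == "__main__":
    main()
```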
### Sample data
```
python -m fastchat.data.sample --in sg_90k_part1_html_cleaned_lang_first.json --out sg_90k_part1_html_cleaned_lang_first_sampled.json --end 10000 --max-length 10000
```
### Sort data
We sort the requests by sequence length, placing the longest sequences first. This minimizes the amount of padding required and allows out-of-memory errors to be detected early.
```
python sort.py --data-dir sg_90k_part1_html_cleaned_lang_first_sampled.json --out-file sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
```
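`sort.py` is likewise repo-local. A minimal sketch of the sorting step described above, assuming the same schema and measuring sequence length in characters (the real script may count tokens instead):
```
# sort.py -- minimal sketch, not the actual repo script.
import argparse
import json


def seq_len(sample):
    # Length of the whole conversation; character count stands in
    # for token count here.
    return sum(len(turn["value"]) for turn in sample["conversations"])


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", required=True)
    parser.add_argument("--out-file", required=True)
    args = parser.parse_args()

    with open(args.data_dir) as f:
        data = json.load(f)

    # Longest sequences first: batches waste less padding and any
    # out-of-memory error shows up on the very first batches.
    data.sort(key=seq_len, reverse=True)

    with open(args.out_file, "w") as f:
        json.dump(data, f, indent=2)


if __name__ == "__main__":
    main()
```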
## ShareGPT_V3_filtered.json
### Download ShareGPT dataset
```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
### Install Transformers
```
pip install transformers
```
### Filter out conversations with overly long prompts or responses and conversations not started by "human", extract the first turn, and randomly sample 500 prompts
```
python filter_dataset.py
```
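`filter_dataset.py` is repo-local. A minimal sketch of the four operations named above; the file names, the `gpt2` tokenizer, the 1024-token cutoff, and the fixed seed are all assumptions:
```
# filter_dataset.py -- minimal sketch, not the actual repo script.
import json
import random

from transformers import AutoTokenizer

IN_FILE = "ShareGPT_V3_unfiltered_cleaned_split.json"
OUT_FILE = "ShareGPT_V3_filtered.json"
MAX_TOKENS = 1024   # assumed length cutoff
NUM_SAMPLES = 500

# Assumed tokenizer; substitute the tokenizer of the model under test.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

with open(IN_FILE) as f:
    data = json.load(f)

filtered = []
for sample in data:
    convs = sample.get("conversations", [])
    # Drop conversations not started by "human" or lacking a response.
    if len(convs) < 2 or convs[0]["from"] != "human":
        continue
    # Extract the first turn: the opening prompt and its response.
    prompt, response = convs[0], convs[1]
    # Drop pairs whose prompt or response is too long.
    if (len(tokenizer(prompt["value"]).input_ids) > MAX_TOKENS
            or len(tokenizer(response["value"]).input_ids) > MAX_TOKENS):
        continue
    filtered.append({"id": sample["id"], "conversations": [prompt, response]})

# Randomly sample 500 prompts, seeded for reproducibility.
random.seed(0)
sampled = random.sample(filtered, NUM_SAMPLES)

with open(OUT_FILE, "w") as f:
    json.dump(sampled, f, indent=2)
```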
### Compare the response length distribution of the sampled dataset with that of the initial dataset
```
pip install matplotlib numpy
python compare_distributions.py
```
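`compare_distributions.py` is repo-local. A minimal sketch that overlays normalized response-length histograms for the two datasets, assuming character-level lengths and the file names from the previous step:
```
# compare_distributions.py -- minimal sketch, not the actual repo script.
import json

import matplotlib.pyplot as plt
import numpy as np


def response_lengths(path):
    # Length of every assistant ("gpt") response in the file.
    with open(path) as f:
        data = json.load(f)
    return np.array([
        len(turn["value"])
        for sample in data
        for turn in sample.get("conversations", [])
        if turn["from"] == "gpt"
    ])


initial = response_lengths("ShareGPT_V3_unfiltered_cleaned_split.json")
sampled = response_lengths("ShareGPT_V3_filtered.json")

# Normalized histograms (density=True) keep the two datasets
# comparable despite their very different sizes.
bins = np.linspace(0, max(initial.max(), sampled.max()), 50)
plt.hist(initial, bins=bins, density=True, alpha=0.5, label="initial")
plt.hist(sampled, bins=bins, density=True, alpha=0.5, label="sampled")
plt.xlabel("response length (characters)")
plt.ylabel("density")
plt.legend()
plt.savefig("response_length_distribution.png")
```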