It talks "Wikipedia" with [1] "links"

#39
by kil3r - opened

Quite often the answer looks like:
What is the meaning of pi?

Pi (Greek: πίτσα, pítsa) is the ratio between the circumference of a circle and its diameter;
it is commonly represented by the Greek letter π (π). The number π is an irrational number
whose decimal expansion continues without pattern beyond the first few digits.[1] It is approximately equal to
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117068.

or
What can you tell me about Australia?

Australia, officially known as Commonwealth of Australia,[b] is a sovereign country comprising 
the mainland of the Australian continent, the island of Tasmania[c] and numerous smaller 
islands.[d][e] It is situated in the South Pacific Ocean between the North and South poles (between
latitudes 10°SI and 35°SI south), east of New Guinea on the north-east by Indonesia's Papua Province
on the west; north of New Zealand across the Tasman Sea with a maritime boundary extending out to
200 nautical miles from its coastline.[f] The population of 24 million people makes it the world’s sixth
largest nation, while its total area is 7,617,300 km2 (2,961,300 sq mi).[g] Its capital city is Canberra 
which was established in 1908 following the federation of six self-governing British colonies along with 
several associated territories into a single dominion under the name of "The Commonwealth of Australia".

Markers like [g] and [1] should be removed from the training data.

Hi @kil3r, indeed, that's coming from the databricks/databricks-dolly-15k dataset used to train MPT. Databricks mentions that those references should be removed in downstream applications, but they have not been removed from the training set (more information at https://huggingface.co/datasets/databricks/databricks-dolly-15k#dataset-overview). In the meantime, a simple regex to remove them should do the job until the dataset is cleaned and the models trained on that data are re-trained or fine-tuned.
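For illustration, a minimal sketch of such a regex in Python. The pattern and the function name `strip_citations` are assumptions, not something taken from Databricks or MosaicML. It targets numeric markers like [1] and short lowercase footnote markers like [b] or [g], so it may also remove legitimate bracketed text (for example, an index like x[1] inside code) and would need tuning for real use.

```python
import re

# Assumed pattern: numeric markers such as [1] or [12], and one- or
# two-letter lowercase footnote markers such as [b] or [g].
CITATION_RE = re.compile(r"\[(?:\d{1,3}|[a-z]{1,2})\]")


def strip_citations(text: str) -> str:
    """Remove Wikipedia-style citation markers from a model response."""
    return CITATION_RE.sub("", text)


if __name__ == "__main__":
    sample = "the island of Tasmania[c] and numerous smaller islands.[d][e] It is situated"
    print(strip_citations(sample))
    # prints: the island of Tasmania and numerous smaller islands. It is situated
```

This only post-processes generated text; it does not change what the model has learned from the training data.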

Mosaic ML, Inc. org

Thanks for pointing this out, @kil3r. @alvarobartt's response is correct. You can strip the fake citations out with a regex, and we will strip them in the next iteration of our instruct data.

sam-mosaic changed discussion status to closed

Great to know, @sam-mosaic!

We think you might be interested in this curated version of Dolly, which we have programmatically and manually curated with the community. The manual curation part covers 400 examples, which account for more than 10% of some types of tasks (e.g., summarization). The Wikipedia references have also been fixed since early versions of the curated dataset.

As a result, the dataset contains significantly fewer tokens for some fields because labelers tended to add very long context values, likely without knowing this was going to be used as prompt+input (like in Alpaca).

Here's the dataset and the readme, where we detail some of these aspects: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual

We have also kept track of the ids and the field values of the original Databricks dataset (for provenance and reproducibility).

Additionally, it contains automatic translations for es, fr, and de, in case it's also useful for you.

We'd love to hear back from you, answer any questions about it, or help in any way.
