Lacking documentation of datasets used, architecture, fine-tuning procedures, and source code

#15 opened by markding

Intriguing to see Solar described as "a great example of the progress enabled by open source". For a model making this claim, however, Solar is remarkably silent about its own sources. The dataset details are given only as:

- Orca-style dataset
- Alpaca-style dataset

It would be helpful to document exactly which datasets (and which versions) were used, and to specify the instruction-tuning process and overall architecture in more detail.

Currently, this model sits at the very bottom of the openness leaderboard: it is more closed and less documented than even Llama2 itself. Hoping to see this improve!

Well, they were enabled by open source, but they are clearly NOT open source. No dataset, no code, and no commercial use. "Weights available for evaluation/personal use" is not open. One might assume from "alpaca-style dataset" that they are using an actual Alpaca variant, and since Alpaca is CC BY-NC, they may feel obliged to restrict to CC BY-NC as well. But CC BY-NC also requires attribution, and "alpaca-style" is NOT attribution; it would mean, imo, that they are breaching the Alpaca terms. Strange days.
