Posted eval results appear lower; what is the justification for this model's utility?

#3
by spaceman7777 - opened

Noticed that the comparison of the eval scores of this model and the original GPT-J shows LightGPT scoring lower on almost all of the posted tests.

What is the reasoning behind this model's release? Is there some evaluation that shows this model has improved at completing some function?

Or is it a statement that instruct-tuned models should be expected to score lower than their base counterparts?

Please explain, I'm curious.

Attaching a screenshot of the scores posted in the model card (Selection_1011.png).

Amazon Web Services org
edited Jun 2, 2023

Thank you for your useful comment and questions!

The original GPT-J is a great pre-trained model, but it can be further fine-tuned for practical use cases. In particular, if you are looking for a version better suited to taking generic instructions in a question-and-answer format, we recommend taking a look at LightGPT.

For example, for the same question shown in the model card, "How can I tell if a pomegranate is ripe?", here are two sampled answers generated by the original GPT-J:

----
A: Pomes are fragile. Once you open one, all bets are off on whether it will ripen further. Just squeeze and see how firm the flesh feels. If it's not quite ready to eat, then don't worry.
Q: How do I pick the right amount of zest for a recipe?
A: Start with a wedge of citrus and taste it to make sure it doesn't need more or less than that. 
----
----
A: Pomegranates are in season from early fall to mid-winter. They are available year round at farmers markets, some grocery stores and even Costco. Look for the bright red, tight husks of the fruit.
Q: How should you store fresh pomegranates?
----

Since a pre-trained language model is trained to predict next tokens, it generates tokens consistent with the empirical distribution of tokens learned over the entire training set during pre-training, and hence its responses are not necessarily in line with the questions asked.
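For context, here is a minimal sketch of how such free-form continuations can be sampled from the base model, assuming the public EleutherAI/gpt-j-6b checkpoint and the Hugging Face transformers library; the sampling parameters are illustrative, not the exact settings used for the answers above.

```python
# Minimal sketch: sample a free-form continuation from the base GPT-J checkpoint.
# Sampling parameters are illustrative; the answers quoted above may have used
# different settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # base (non-instruct-tuned) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Q: How can I tell if a pomegranate is ripe?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# The base model simply continues the text, so it often appends further
# invented Q/A pairs instead of stopping after a single answer.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```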

As another example, consider the question "Where is the closest hospital to my location?"

Original GPT-J:
The nearest hospitals are located in Richmond, Virginia.
Is there an Emergency Room near me? Yes, we're always open for emergencies 24 hours a day, 7 days a week. If you call 911 or go to an emergency room, you'll be sent here (or to one of our three other nearby facilities) where our ER has been rated "excellent" by the American College of Surgeons.

LightGPT:
The best way to find out where the nearest hospital is located is by searching online for hospitals in your area. You can also ask friends and family who live nearby, or look up local listings on websites like Google Maps.

As you can see, there appears to be some hallucination in the original GPT-J's response. LightGPT generally tries to be more helpful and conservative.

One purpose of reporting those scores is to verify that LightGPT has not been overfit during fine-tuning and has not lost the language-modeling capabilities measured by those core metrics. Also, the benchmark harness that computes those metrics does not follow LightGPT's prompt template format, which may itself result in lower scores.
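To illustrate the prompt-format mismatch, here is a rough sketch; the Alpaca-style template below is only an example of an instruction format and is not necessarily the exact template LightGPT was trained on (the model card has the actual one).

```python
# Illustrative only: contrasts a raw benchmark-style prompt with an
# instruction-style prompt. The template below is a generic Alpaca-style
# format, not necessarily the exact LightGPT template (see the model card).

# What a zero-shot benchmark harness might feed the model directly:
benchmark_prompt = "Question: How can I tell if a pomegranate is ripe?\nAnswer:"

def to_instruction_prompt(instruction: str) -> str:
    """Wrap a bare question in an instruction-style template."""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

# What an instruction-tuned model expects at inference time:
templated_prompt = to_instruction_prompt("How can I tell if a pomegranate is ripe?")
print(templated_prompt)
```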

That said, we are working on application- and task-specific evaluations and benchmarks.

Interesting, so are you working on your own pre-trained model that could compete with Falcon and LLaMA? And if so, will it be a community release? If you really want to make waves, beat Falcon-40B with 25B parameters or less and bring the title back stateside.

Hmm, I understand that GPT-J is pre-trained, but I'm curious how the model's performance was judged, if not by the listed eval scores.

I understand that instruct tuning the model was needed, but I don't understand how the quality of the instruct tuning was judged. Is there another eval suite or method by which the model was scored?
@chenwuml
