lm_eval results are weird

#2
by xianf - opened

I tried to test the model on some benchmarks, but the scores are too low:

| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |0.2647|±  |0.0091|
|        |       |none  |     0|acc_norm|0.2597|±  |0.0090|

|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.5107|±  | 0.014|
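
For reference, a zero-shot run like the one above can be reproduced with the lm-evaluation-harness Python API. This is only a minimal sketch under the assumption of a v0.4-style `lm_eval` install; the model path is a placeholder:

```python
# Minimal sketch (assumed lm-eval v0.4-style API; model path is a placeholder).
import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="path/to/model")  # wraps a transformers causal LM
results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["arc_easy", "winogrande"],
    num_fewshot=0,  # zero-shot, as in the tables above
)
print(results["results"])
```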

You should use few-shot evaluation.
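
For example, the hedged sketch above can be rerun with in-context examples by changing `num_fewshot` (again assuming the v0.4-style API):

```python
# Same assumed lm_eval API as in the sketch above, but 5-shot instead of 0-shot.
results_fewshot = lm_eval.simple_evaluate(
    model=lm,
    tasks=["arc_easy", "winogrande"],
    num_fewshot=5,
)
print(results_fewshot["results"])
```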


If few-shot is a must for arc_easy, I think the model is not trained well.


A simple script for testing the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn import CrossEntropyLoss
import torch

model_path = "."

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

text = "Question: Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?\nAnswer: by freezing the leaves on"
loss_func = CrossEntropyLoss(reduction="none")

inputs = tokenizer(text, return_tensors="pt")

# Targets are the input shifted left by one: token t is predicted from tokens < t.
labels = inputs["input_ids"][:, 1:]

with torch.no_grad():
    output = model(**inputs)

# Drop the logits for the last position, which has no target.
logits = output.logits[:, :-1]
print(logits.size())

# Per-token cross-entropy; CrossEntropyLoss expects (batch, vocab, seq).
loss = loss_func(logits.transpose(1, 2), labels)

# Average over the sequence (note: the denominator is the full input length,
# including the first token, which has no prediction).
num_tokens = inputs["input_ids"].size(1)
avg_loss = torch.sum(loss).item() / num_tokens

print(avg_loss)
```

The avg_loss value is 4.48, which is too high for a language model.
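
To put that number on a more familiar scale (my own arithmetic, not part of the script above): an average cross-entropy of 4.48 nats corresponds to a per-token perplexity of roughly exp(4.48) ≈ 88.

```python
import math
print(math.exp(4.48))  # ~88.2: per-token perplexity implied by the 4.48 average loss
```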


Fine, I also tested the model on MMLU, and its zero-shot and few-shot capabilities were almost non-existent; for the answer options it only outputs 1, 2, 3, and 4. It's unclear how the posterior probabilities for options A, B, C, and D could have been trained to be so high.
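
One way to check this (a hypothetical probe, not something from the thread) is to compare the log-probabilities the model assigns to letter options versus digit options for the next token after an MMLU-style prompt:

```python
# Hypothetical probe: compare log-probs of letter options ("A".."D") vs. digits ("1".."4")
# as the next token after an MMLU-style prompt. Uses the same "." placeholder path
# as the script above; the prompt is a made-up example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "."
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

prompt = "Question: What is 2 + 2?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
log_probs = torch.log_softmax(next_token_logits, dim=-1)

for option in [" A", " B", " C", " D", " 1", " 2", " 3", " 4"]:
    # Take the first token of each option string (good enough for a rough comparison).
    token_id = tokenizer(option, add_special_tokens=False)["input_ids"][0]
    print(repr(option), round(log_probs[token_id].item(), 3))
```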
