khang119966 committed · c38c780 (parent: 0e819ae)

Update README.md

README.md CHANGED
@@ -50,7 +50,7 @@ The fine-tuning dataset was meticulously sampled in part from the following data
 
 ## Benchmarks 📈
 
-Since there are still many different metrics that need to be tested, we chose a quick and simple metric first to guide the development of our model. Our metric is inspired by Lavy[4]. For the time being, we are using GPT-4 to evaluate the quality of answers on two datasets: OpenViVQA and ViTextVQA. Detailed results can be found at the provided . The inputs are images, questions, labels, and predicted answers. The model will return a score from 0 to 10 for the corresponding answer quality. The results table is shown below.
+Since there are still many different metrics that need to be tested, we chose a quick and simple metric first to guide the development of our model. Our metric is inspired by Lavy [4]. For the time being, we use GPT-4 to evaluate answer quality on two datasets: OpenViVQA and ViTextVQA. Detailed results can be found [here](https://huggingface.co/datasets/5CD-AI/Vintern-1B-v2-Benchmark-gpt4o-score). The inputs are images, questions, labels, and predicted answers; GPT-4 returns a score from 0 to 10 for the corresponding answer quality. The results table is shown below.
 
 <table border="1" cellspacing="0" cellpadding="5">
   <tr align="center">