陈俊杰 committed
Commit 221547a • 1 Parent(s): 88915d0
cjj-leader
app.py
CHANGED
@@ -14,8 +14,8 @@ with st.sidebar:
     page = option_menu(
         "Navigation",
         ["LeaderBoard", "Introduction", "Methodology", "Datasets", "Important Dates",
-        "Evaluation[…]
-        icons=['trophy', 'house', 'book', 'database', 'calendar', 'clipboard', '[…]
+        "Evaluation Metrics", "Submit", "Organisers", "References"],
+        icons=['trophy', 'house', 'book', 'database', 'calendar', 'clipboard', 'upload', 'people', 'book'],
         menu_icon="cast",
         default_index=0,
         styles={
@@ -143,8 +143,8 @@ elif page == "Important Dates":
     <br />
 Before the Formal run begins (before Jan 15, 2025), we will release the reserved set. Participants need to submit their results for the reserved set before the Formal run ends (before Feb 1, 2025).</p>
     """,unsafe_allow_html=True)
-elif page == "Evaluation[…]
-    st.header("Evaluation[…]
+elif page == "Evaluation Metrics":
+    st.header("Evaluation Metrics")
     st.markdown("""
 - **Acc (Accuracy):** The proportion of identical preference results between the model and human annotations. Specifically, we first convert individual scores (ranks) into pairwise preferences and then calculate consistency with human annotations.
 
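The Acc metric in the hunk above reduces graded scores to pairwise preferences before measuring agreement with human annotations. Below is a minimal sketch of that computation, assuming numeric scores keyed by answerId; the function names and the tie handling are illustrative assumptions, not the task's official scorer.

```python
# Minimal sketch (not the official scorer): compute Acc by converting
# per-answer scores into pairwise preferences, then measuring agreement
# with human annotations. All names here are illustrative assumptions.
from itertools import combinations

def pairwise_preferences(scores: dict[str, float]) -> dict[tuple[str, str], int]:
    """Turn {answerId: score} into {(a, b): +1/0/-1} pairwise preferences."""
    prefs = {}
    for a, b in combinations(sorted(scores), 2):
        diff = scores[a] - scores[b]
        prefs[(a, b)] = 0 if diff == 0 else (1 if diff > 0 else -1)
    return prefs

def acc(model_scores: dict[str, float], human_scores: dict[str, float]) -> float:
    """Proportion of answer pairs where model and human preferences agree."""
    model_prefs = pairwise_preferences(model_scores)
    human_prefs = pairwise_preferences(human_scores)
    pairs = model_prefs.keys() & human_prefs.keys()
    agree = sum(model_prefs[p] == human_prefs[p] for p in pairs)
    return agree / len(pairs) if pairs else 0.0

# Example: three answers to one question; the model breaks a tie the
# human annotator kept, so 2 of 3 pairs agree.
print(acc({"a1": 5, "a2": 3, "a3": 1}, {"a1": 4, "a2": 4, "a3": 2}))  # ~0.667
```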
@@ -183,22 +183,30 @@ elif page == "Data and File format":
 elif page == "Submit":
     st.header("Submit")
     st.markdown("""
-[…]
+We will be following a format similar to the one used by most **TREC submissions**, which is repeated below.
+
+White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly five columns per line with at least one space between the columns.
+
+**taskId questionId answerId score rank**
+
+- the first column is taskId (indexes different tasks)
+- the second column is questionId (indexes different questions within the same task)
+- the third column is answerId (indexes the answers provided by different LLMs to the same question)
+- the fourth column is score (the score the participant assigns to the answer)
+- the fifth column is rank (the rank of the answer among all answers to the same question)
 
-[…]
+Please organize the answers in a **txt** file, name the file **teamId_methods.txt**, and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)
 
 Each team can submit up to 5 times per day, and only the latest submission will be considered.
 
-[…]
-[…]
-A baseline example can be found in the [baseline_example](https://huggingface.co/spaces/THUIR/AEOLLM/tree/main/baseline_example) folder, where the output folder provides an [example](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt) of the submission file content.
+An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
 """)
 elif page == "LeaderBoard":
     st.header("LeaderBoard")
     # # description
     st.markdown("""
     <p class='main-text'>
-NTCIR-18 Automatic Evaluation Methods of LLMs (AEOLLM) Leaderboard. […]
+NTCIR-18 Automatic Evaluation Methods of LLMs (AEOLLM) Leaderboard.
     </p>
     """, unsafe_allow_html=True)
     df = {
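The hunk above specifies the five-column, whitespace-separated submission line. A rough pre-submission sanity check is sketched below, assuming the format as described; the script, its numeric checks, and the blank-line handling are assumptions, not an official validator.

```python
# Rough sanity check (an assumption, not an official checker) for the
# five-column "taskId questionId answerId score rank" submission format.
import sys

def check_submission(path: str) -> list[str]:
    """Collect format problems found in a submission file."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            cols = line.split()  # any run of whitespace separates columns
            if not cols:
                continue  # ignore blank lines (an assumption)
            if len(cols) != 5:
                errors.append(f"line {lineno}: expected 5 columns, got {len(cols)}")
                continue
            task_id, question_id, answer_id, score, rank = cols
            if not score.replace(".", "", 1).lstrip("-").isdigit():
                errors.append(f"line {lineno}: score {score!r} is not numeric")
            if not rank.isdigit():
                errors.append(f"line {lineno}: rank {rank!r} is not a positive integer")
    return errors

if __name__ == "__main__":
    problems = check_submission(sys.argv[1])  # e.g. teamId_methods.txt
    print("\n".join(problems) or "Format looks OK.")
```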
@@ -309,9 +317,11 @@ NTCIR-18 Automatic Evaluation Methods of LLMs (AEOLLM) Leaderboard. <br/>To submit
     st.dataframe(df,use_container_width=True)
 
     st.markdown("""
-[…]
+To submit, refer to the "Submit" section in the left-hand navigation bar.🤗 A baseline example can be found in the [baseline_example](https://huggingface.co/spaces/THUIR/AEOLLM/tree/main/baseline_example) folder.
 
-[…]
+Refer to the other sections in the navigation bar for details on evaluation metrics, datasets, important dates and methodology.
+
+The Leaderboard will be updated daily around 24:00 Beijing Time.
 """)
     # get Beijing time
     time_placeholder = st.empty()
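The hunk ends with a translated comment ("get Beijing time") and an `st.empty()` placeholder, consistent with the note that the leaderboard updates around 24:00 Beijing Time. One plausible way such a placeholder gets filled is sketched below; this is a guess at the intent, assuming the app displays the current UTC+8 time, and is not necessarily the repo's actual code.

```python
# Sketch (assumed, not the repo's code): fill the st.empty() placeholder
# with the current Beijing time (UTC+8).
from datetime import datetime, timezone, timedelta
import streamlit as st

time_placeholder = st.empty()
beijing_now = datetime.now(timezone(timedelta(hours=8)))  # Beijing is UTC+8
time_placeholder.markdown(f"Current Beijing time: {beijing_now:%Y-%m-%d %H:%M:%S}")
```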