shezamunir committed
Commit c1ecfc3 · verified · 1 Parent(s): 502e097

Update src/streamlit_app.py

Files changed (1)
  1. src/streamlit_app.py +17 -4
src/streamlit_app.py CHANGED
@@ -89,11 +89,24 @@ with tab1:
     st.markdown(html, unsafe_allow_html=True)
 
 with tab2:
+    pipeline_image = Image.open("src/pipeline.png")
+    pipeline_image.save(buffered2, format="PNG")
+    img_data_pipeline = base64.b64encode(buffered2.getvalue()).decode("utf-8")
     st.markdown("## Abstract")
     st.write(
-        "<add final abstract here>"
+        """The paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications.
+        Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task includes rubrics, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR to support accurate evaluation of long-form model outputs on our benchmark.
+        For fine-grained, expert-aligned evaluation, CLEAR derives checklists from model outputs and reference outputs by extracting information corresponding to items on the task-specific rubrics.
+        Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation.
+        We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage."""
     )
+
     st.markdown("## Pipeline")
-    st.write(
-        "<add final pipeline figure here>"
-    )
+    st.markdown(
+        f"""
+        <div class="logo-container" style="display:flex; justify-content: center;">
+        <img src="data:image/png;base64,{img_data_pipeline}" style="width:50%; max-width:700px;"/>
+        </div>
+        """,
+        unsafe_allow_html=True
+    )
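
Review note: the added lines embed the pipeline figure as a base64 data URI inside raw HTML. The hunk does not show where `buffered2` is created; it is presumably an `io.BytesIO()` defined elsewhere in the file, which is an assumption. The sketch below illustrates the same embedding technique with the standard library only. It skips the PIL re-encode used in the commit and works on raw PNG bytes instead, and the helper names `png_bytes_to_data_uri` and `centered_image_html` are hypothetical, not part of the app.

```python
import base64
import io


def png_bytes_to_data_uri(png_bytes: bytes) -> str:
    """Return a data: URI that embeds the given PNG bytes inline."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return f"data:image/png;base64,{encoded}"


def centered_image_html(data_uri: str, width: str = "50%", max_width: str = "700px") -> str:
    """Build centered <div><img/></div> markup like the one added in the diff."""
    return (
        '<div class="logo-container" style="display:flex; justify-content: center;">'
        f'<img src="{data_uri}" style="width:{width}; max-width:{max_width};"/>'
        "</div>"
    )


# A fresh in-memory buffer per image avoids mixing bytes from earlier saves.
buffer = io.BytesIO()
buffer.write(b"\x89PNG\r\n\x1a\n")  # placeholder: only the PNG signature, not a real image
html = centered_image_html(png_bytes_to_data_uri(buffer.getvalue()))
# In the app, this string would be passed to st.markdown(html, unsafe_allow_html=True).
```

Creating a new buffer for each image, rather than reusing one shared buffer, keeps bytes from a previously saved image out of the encoded payload.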