Jay committed
Commit c5184ef · 1 Parent(s): c66aadd

fix: update text

Files changed (2)
  1. app.py +1 -1
  2. assets/text.py +19 -12
app.py CHANGED
@@ -194,7 +194,7 @@ with gr.Blocks() as demo:
  elem_id="leaderboard-table",
  )
 
- with gr.TabItem("🏅 Multiple Choice", elem_id="od-benchmark-tab-table", id=5):
+ with gr.TabItem("🏅 Perplexity", elem_id="od-benchmark-tab-table", id=5):
  dataframe_all_per = gr.components.Dataframe(
  elem_id="leaderboard-table",
  )
assets/text.py CHANGED
@@ -6,10 +6,12 @@ On this leaderboard, we share the evaluation results of LLMs obtained by develop
 
  # Dataset
  <span style="font-size:16px; font-family: 'Times New Roman', serif">
- To evaluate the conformity of large language models, we present ChineseSafe, a content moderation benchmark for Chinese (Mandarin).
- In this benchmark, we include 4 common types of safety issues: Crime, Ethic, Mental health, and their Variant/Homophonic words.
- In particular, the benchmark is constructed as a balanced dataset, containing safe and unsafe data collected from internet resources and public datasets [1,2,3].
- We hope the evaluation can provide a reference for researchers and engineers to build safe LLMs in Chinese. <br>
+ To evaluate the safety risk of LLMs of large language models, we present ChineseSafe, a Chinese safety benchmark to facilitate research
+ on the content safety of large language models for Chinese (Mandarin).
+ To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples
+ across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography,
+ and variant/homophonic words. In particular, the benchmark is constructed as a balanced dataset, containing safe and unsafe data collected from internet resources and public datasets [1,2,3].
+ We hope the evaluation can provides a guideline for developers and researchers to facilitate the safety of LLMs. <br>
 
  The leadboard is under construction and maintained by <a href="https://hongxin001.github.io/" target="_blank">Hongxin Wei's</a> research group at SUSTech.
  We will release the technical report in the near future.
@@ -30,7 +32,7 @@ where <b>std</b> indicates the standard deviation of the results obtained from d
  EVALUTION_TEXT= """
  # Evaluation
  <span style="font-size:16px; font-family: 'Times New Roman', serif">
- We evaluate the models using two methods: multiple choice (perplexity) and generation.
+ We evaluate the models using two methods: perplexity(multiple choice) and generation.
  For perplexity, we select the label which is the lowest perplexity as the predicted results.
  For generation, we use the content generated by the model to make prediction.
  The following are the results of the evaluation. 👇👇👇
@@ -48,16 +50,21 @@ REFERENCE_TEXT = """
 
  """
 
+ # ACKNOWLEDGEMENTS_TEXT = """
+ # # Acknowledgements
+ # <span style="font-size:16px; font-family: 'Times New Roman', serif">
+ # This research is supported by "Data+AI" Data Intelligent Laboratory,
+ # a joint lab constructed by Deepexi and Department of Statistics and Data Science at SUSTech.
+ # We gratefully acknowledge the contributions of Prof. Bingyi Jing, Prof. Lili Yang,
+ # and Asst. Prof.Guanhua Chen for their support throughout this project.
+ # """
+
  ACKNOWLEDGEMENTS_TEXT = """
- # Acknowledgements
- <span style="font-size:16px; font-family: 'Times New Roman', serif">
- This research is supported by "Data+AI" Data Intelligent Laboratory,
- a joint lab constructed by Deepexi and Department of Statistics and Data Science at SUSTech.
- We gratefully acknowledge the contributions of Prof. Bingyi Jing, Prof. Lili Yang,
- and Asst. Prof.Guanhua Chen for their support throughout this project.
+ This research is supported by the Shenzhen Fundamental Research Program (Grant No.
+ JCYJ20230807091809020). We gratefully acknowledge the support of "Data+AI" Data Intelligent Laboratory, a joint lab constructed by Deepexi and the Department of Statistics and Data Science
+ at Southern University of Science and Technology.
  """
 
-
  CONTACT_TEXT = """
  # Contact
  <span style="font-size:16px; font-family: 'Times New Roman', serif">
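
Note on the "Perplexity" tab renamed in this commit: the evaluation text says the predicted label is the one with the lowest perplexity. The sketch below illustrates that selection rule only; it is not taken from the repository. The function names, the candidate labels "safe" and "unsafe", and the log-probability values are all assumptions made for illustration, and a real evaluation would obtain the per-token log-probabilities from the model being scored.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is exp of the negative mean log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predict_label(label_logprobs):
    """Return (label with the lowest perplexity, all perplexity scores)."""
    scores = {label: perplexity(lps) for label, lps in label_logprobs.items()}
    return min(scores, key=scores.get), scores


# Hypothetical log-probabilities a model might assign to each label's tokens.
example = {
    "safe": [-0.2, -0.4],
    "unsafe": [-1.5, -0.9],
}

label, scores = predict_label(example)
print(label, scores)
```

With these made-up numbers the "safe" continuation has the higher average log-probability, so it has the lower perplexity and is selected.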