djstrong committed on
Commit 705e23f · 1 Parent(s): 59e91d5
Files changed (1)
  1. src/about.py +13 -2
src/about.py CHANGED
@@ -41,9 +41,11 @@ TITLE = """<h1 align="center" id="space-title">Open PL LLM Leaderboard (0-shot a
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-_g suffix means that a model needs to generate an answer (only suitable for instructions-based models)
+The leaderboard evaluates language models on a set of Polish tasks designed to test their ability to understand and generate Polish text. It is meant to serve as a benchmark for the Polish language model community and to help researchers and practitioners understand the capabilities of different models.
 
-_mc suffix means that a model is scored against every possible class (suitable also for base models)
+Almost every task has two versions: regex and multiple choice. The regex version is scored by exact match, while the multiple-choice version is scored by accuracy.
+* the _g suffix means that a model needs to generate an answer (only suitable for instruction-based models)
+* the _mc suffix means that a model is scored against every possible class (also suitable for base models)
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
@@ -54,6 +56,15 @@ Contact with me: [LinkedIn](https://www.linkedin.com/in/wrobelkrzysztof/)
 
 or join our [Discord SpeakLeash](https://discord.gg/3G9DVM39)
 
+## TODO
+
+* change metrics for DYK, PSC, CBD(?)
+* fix names of our models
+* add inference time
+* add metadata for models (e.g. #Params)
+* add more tasks
+* add baselines
+
 ## Evaluation metrics
 
 - **belebele_pol_Latn**: accuracy
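
To make the _g / _mc distinction in the new introduction text concrete, here is a minimal sketch of the two scoring modes. It assumes a hypothetical model interface: `DummyModel`, `generate`, and `loglikelihood` are illustrative names, not the leaderboard's actual evaluation code.

```python
# A minimal, hypothetical sketch of the two task variants described above.
# DummyModel, generate(), and loglikelihood() are illustrative assumptions,
# not this repository's actual evaluation API.
import re

class DummyModel:
    """Stand-in for a real LM, only so the sketch runs end to end."""
    def generate(self, prompt: str) -> str:
        return "The answer is yes."
    def loglikelihood(self, text: str) -> float:
        return -float(len(text))  # toy score: shorter strings score higher

def score_g(model, prompt: str, gold: str) -> bool:
    # _g variant: the model generates free text; the answer is extracted
    # with a regex and compared by exact match. Requires a model that
    # follows instructions well enough to produce a parseable answer.
    output = model.generate(prompt)
    match = re.search(r"\b(yes|no)\b", output.lower())
    return match is not None and match.group(1) == gold.lower()

def score_mc(model, prompt: str, classes: list[str], gold: str) -> bool:
    # _mc variant: every candidate class is scored (here by a toy
    # log-likelihood) and the best-scoring class is the prediction,
    # so even base models without instruction tuning can be evaluated.
    scores = {c: model.loglikelihood(prompt + " " + c) for c in classes}
    return max(scores, key=scores.get) == gold

model = DummyModel()
print(score_g(model, "Is Kraków in Poland? Answer yes or no.", "yes"))  # True
print(score_mc(model, "Is Kraków in Poland?", ["yes", "no"], "no"))     # True (toy score favours "no")
```

The key design point this illustrates: the _mc mode never needs parseable free-form output, which is why it also works for base models, while the _g mode depends on the model producing an answer the regex can extract.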