src/about.py CHANGED (+13 -2)
@@ -41,9 +41,11 @@ TITLE = """<h1 align="center" id="space-title">Open PL LLM Leaderboard (0-shot a
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-
+The leaderboard evaluates language models on a set of Polish tasks. The tasks are designed to test the models' ability to understand and generate Polish text. The leaderboard is designed to be a benchmark for the Polish language model community, and to help researchers and practitioners understand the capabilities of different models.
 
-
+Almost every task has two versions: regex and multiple choice. The regex version is scored based on exact match, while the multiple choice version is scored based on accuracy.
+* _g suffix means that a model needs to generate an answer (only suitable for instruction-based models)
+* _mc suffix means that a model is scored against every possible class (suitable also for base models)
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
@@ -54,6 +56,15 @@ Contact with me: [LinkedIn](https://www.linkedin.com/in/wrobelkrzysztof/)
 
 or join our [Discord SpeakLeash](https://discord.gg/3G9DVM39)
 
+## TODO
+
+* change metrics for DYK, PSC, CBD(?)
+* fix names of our models
+* add inference time
+* add metadata for models (e.g. #Params)
+* add more tasks
+* add baselines
+
 ## Evaluation metrics
 
 - **belebele_pol_Latn**: accuracy
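To make the _g / _mc distinction introduced in the new INTRODUCTION_TEXT concrete, here is a minimal, hypothetical Python sketch of the two scoring modes. The function names and example values are illustrative assumptions only and are not taken from the leaderboard's actual evaluation code.

```python
import re


def score_generative(model_output: str, gold: str) -> float:
    """_g variant: the model generates free text; credit is given only when
    the gold answer is found in the output via an exact (regex) match."""
    pattern = re.compile(rf"\b{re.escape(gold)}\b", re.IGNORECASE)
    return 1.0 if pattern.search(model_output) else 0.0


def score_multiple_choice(class_scores: dict[str, float], gold: str) -> float:
    """_mc variant: every candidate class is scored (e.g. by log-likelihood),
    the prediction is the argmax, and accuracy is 1.0 if it matches the gold label."""
    predicted = max(class_scores, key=class_scores.get)
    return 1.0 if predicted == gold else 0.0


# Example usage with made-up values:
print(score_generative("Odpowiedź: Warszawa.", "Warszawa"))      # 1.0
print(score_multiple_choice({"tak": -1.2, "nie": -0.4}, "nie"))  # 1.0
```

The generative variant requires the model to produce the answer itself, which is why it only suits instruction-tuned models, while the multiple-choice variant only compares per-class scores and therefore also works for base models.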