ai-forever committed · Commit 30a7a04 · verified · 1 Parent(s): 0fb3ba6

Update docs/description.md

Files changed (1): docs/description.md (+67 -63)
docs/description.md CHANGED

# LIBRA: Long Input Benchmark for Russian Analysis

<img src="https://i.imgur.com/BNleRrG.png" width="800" />

## Dataset Summary

LIBRA (Long Input Benchmark for Russian Analysis) is designed to evaluate the capabilities of large language models (LLMs) in understanding and processing long texts in Russian. This benchmark includes 21 datasets adapted for different tasks and complexities. The tasks are divided into four complexity groups and allow evaluation across various context lengths ranging from 4,000 up to 128,000 tokens.

## Tasks and Complexity Groups

### Group I: Simple Information Retrieval
- **Passkey**: Retrieve a hidden passkey number from a long text fragment (see the sketch after this list). Based on the original [PassKey test](https://github.com/CStanKonrad/long_llama/blob/main/examples/passkey.py) from the LongLLaMA GitHub repo.
- **PasskeyWithLibrusec**: Similar to Passkey but with added noise from Librusec texts.
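
To make the task concrete, below is a minimal sketch of how a passkey-style sample can be constructed. The filler sentence, prompt wording, and lengths are illustrative assumptions, not the actual LIBRA generation code.

```python
import random

def make_passkey_sample(n_filler: int = 400) -> tuple[str, str]:
    """Build an illustrative passkey prompt: a short secret number
    hidden inside repetitive filler text (hypothetical, not LIBRA's code)."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    position = random.randint(0, n_filler)  # hide the needle at a random depth
    context = filler * position + needle + filler * (n_filler - position)
    question = "What is the pass key? The pass key is"
    return context + question, passkey

prompt, answer = make_passkey_sample()
print(answer, "is hidden in", len(prompt), "characters of context")
```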

### Group II: Question Answering and Multiple Choice
- **MatreshkaNames**: Identify the person in dialogues based on the discussed topic. We used the [Matreshka](https://huggingface.co/datasets/zjkarina/matreshka) and [Russian Names](https://www.kaggle.com/datasets/rai220/russian-cyrillic-names-and-sex/data) datasets to create this task and the next one.
- **MatreshkaYesNo**: Indicate whether a specific topic was mentioned in the dialogue.
- **LibrusecHistory**: Answer questions based on historical texts. Ideologically similar to the [PassageRetrieval dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/passage_retrieval_en) from LongBench.
- **ruTREC**: Few-shot in-context learning for topic classification. Created by translating the [TREC dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/trec_e) from LongBench.
- **ruSciFi**: Answer true/false based on context and general world knowledge. Translation of the [SciFi dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/sci_f) from L-Eval, which was originally based on [SF-Gram](https://github.com/nschaetti/SFGram-dataset).
- **ruSciAbstractRetrieval**: Retrieve relevant paragraphs from scientific abstracts.
- **ruTPO**: Multiple-choice questions similar to TOEFL exams. Translation of the [TPO dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/tpo) from L-Eval.
- **ruQuALITY**: Multiple-choice QA tasks based on detailed texts. Created by translating the [QuALITY dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/quality) from L-Eval.

### Group III: Multi-hop Question Answering
- **ruBABILongQA**: 5 long-context reasoning tasks for QA using facts hidden among irrelevant information.
- **LongContextMultiQ**: Multi-hop QA based on Wikidata and Wikipedia.
- **LibrusecMHQA**: Multi-hop QA requiring information distributed across several text parts.
- **ru2WikiMultihopQA**: Translation of the [2WikiMultihopQA dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/2wikimqa_e) from LongBench.

### Group IV: Complex Reasoning and Mathematical Problems
- **ruSciPassageCount**: Count unique paragraphs in a long text. Uses the basic idea of the original [PassageCount dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/passage_count) from LongBench.
- **ruQasper**: Question Answering over academic research papers. Created by translating the [Qasper dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/qasper_e) from LongBench.
- **ruGSM100**: Solve math problems using Chain-of-Thought reasoning. Created by translating the [GSM100 dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/gsm100) from L-Eval.

## Dataset Structure

The datasets are divided into subsets based on context lengths: 4k, 8k, 16k, 32k, 64k, and 128k tokens. Each subset contains a different number of samples depending on the task complexity.
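
As a sketch, a subset can be loaded with the Hugging Face `datasets` library. The dataset path, config name, and split name below are assumptions for illustration; check the dataset card for the exact identifiers.

```python
from datasets import load_dataset

# Hypothetical identifiers: verify the actual dataset path and
# config/split names on the dataset card before running.
subset = load_dataset("ai-forever/LIBRA", name="passkey", split="4k")
print(len(subset), subset[0].keys())
```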

## Add your model

To place your model on the leaderboard, score it with the code in our repository, save the result as "<model_name>.json" in the "results" folder, and create a Pull Request.
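
For illustration, a minimal sketch of writing such a result file is shown below; the field names are hypothetical placeholders, since the actual schema is whatever the LIBRA evaluation scripts produce.

```python
import json
from pathlib import Path

# Hypothetical result payload: the real file comes from the LIBRA
# evaluation scripts, and these field names are placeholders only.
results = {"model_name": "my-model", "passkey": {"4k": 0.98, "8k": 0.95}}

out = Path("results") / "my-model.json"
out.parent.mkdir(exist_ok=True)
out.write_text(json.dumps(results, ensure_ascii=False, indent=2))
```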

## *GPT-4o

Due to limited resources, we evaluated GPT-4o on only 10% of each dataset in the benchmark, at every context length, so its results may be imprecise.

## Citation

_TODO_

```
@article{LIBRA2024,
  title={Long Input Benchmark for Russian Analysis},
  author={Anonymous},
  journal={ACL},
  year={2024}
}
```

## License

The datasets are published under the MIT license.

## Acknowledgments

For more details and code, please visit our [GitHub repository](https://github.com/ai-forever/LIBRA/).