Update src/tasks_content.py
src/tasks_content.py CHANGED (+11 -7)
@@ -27,10 +27,10 @@ TASKS_DESCRIPTIONS = {
 Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
 includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
 
-The benchmark clones the repo to the local
-and then the benchmark pushes the repo to
-We use the `Pass@1` rate metric
-Models can be evaluated in three
+The benchmark clones the repo to the local directory, the model fixes the issue according to the logs and the local repo state,
+and then the benchmark pushes the repo to GitHub and requests the result of the GitHub CI.
+We use the `Pass@1` rate metric to measure CI repair, indicating the ratio of data points for which the build passed successfully after the generated fix.
+Models can be evaluated in three settings:
 * `full` — **no** ground truth diffs are used for model evaluation;
 * `oracle: files` — ground truth diffs are used to select files that should be corrected to fix the issue;
 * `oracle: files, lines` — ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
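For illustration only (not part of this change): a minimal sketch of how the `Pass@1` rate described above can be computed from per-data-point CI verdicts. The helper name and the boolean-list input are assumptions, not the benchmark's own evaluation code.

```python
# Minimal sketch, assuming one generated fix per data point and a boolean CI verdict for each.
def pass_at_1(ci_passed: list[bool]) -> float:
    """Share of data points whose build passed after the single generated fix."""
    return sum(ci_passed) / len(ci_passed) if ci_passed else 0.0

# Hypothetical example: 42 of 77 builds fixed -> Pass@1 ~= 0.545
print(pass_at_1([True] * 42 + [False] * 35))
```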
@@ -45,7 +45,9 @@ TASKS_DESCRIPTIONS = {
 * `medium-context`: 224 data points,
 * `large-context`: 270 data points,
 * `huge-context`: 296 data points.
-
+Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
+and a repository snapshot that can be used to build the context.
+
 We use standard `Exact Match (EM)` metric for one-line code completion.
 We evaluate `Exact Match` for different line categories:
 * *infile* — functions and classes are from the completion file;
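For illustration only (not part of this change): a rough sketch of `Exact Match` computed separately per line category, as described above. The field names (`category`, `prediction`, `ground_truth`) are assumptions about the data layout, not the dataset's actual schema.

```python
from collections import defaultdict

def exact_match_by_category(samples: list[dict]) -> dict[str, float]:
    """samples: [{"category": "infile", "prediction": "...", "ground_truth": "..."}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["category"]] += 1
        # A completion counts as a hit only if it matches the ground-truth line exactly
        # (whitespace at the ends is ignored here; the real evaluation may differ).
        hits[s["category"]] += int(s["prediction"].strip() == s["ground_truth"].strip())
    return {cat: hits[cat] / totals[cat] for cat in totals}
```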
@@ -60,7 +62,7 @@ TASKS_DESCRIPTIONS = {
 
 "commit_message_generation": """# Commit message generation\n
 
-Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits from 34 Python projects.
+Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects.
 
 We use the following metrics for evaluation:
 * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
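For illustration only (not part of this change): the linked BLEU implementation is sacreBLEU, which can be loaded through the Hugging Face `evaluate` library as sketched below; the example strings are made up.

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
predictions = ["Fix off-by-one error in context builder"]
references = [["Fix off-by-one error in the context builder"]]  # one list of references per prediction
result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level BLEU on a 0-100 scale
```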
@@ -75,7 +77,8 @@ TASKS_DESCRIPTIONS = {
 
 "bug_localization": """# Bug localization\n
 
-Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about the pull requests that fix them for Python, Java, and Kotlin projects.
+The model needs to identify the files within the repository that need to be modified to address the reported bug.
 We used information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.
 
 For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
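For illustration only (not part of this change): a small sketch of file-level `R@k` and `P@k` as used above. The function names and the ranked-list input format are illustrative assumptions, not the benchmark's code.

```python
def recall_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    """Fraction of ground-truth files that appear among the top-k retrieved files."""
    retrieved = set(ranked_files[:k])
    return len(retrieved & gold_files) / len(gold_files) if gold_files else 0.0

def precision_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    """Fraction of the top-k retrieved files that are ground-truth files."""
    return sum(f in gold_files for f in ranked_files[:k]) / k if k else 0.0

# Hypothetical example with k = 2, as used in the benchmark
ranked = ["src/parser.py", "src/utils.py", "README.md"]
gold = {"src/parser.py", "src/io.py"}
print(recall_at_k(ranked, gold, 2), precision_at_k(ranked, gold, 2))
```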
@@ -84,6 +87,7 @@ TASKS_DESCRIPTIONS = {
 
 "module_summarization": """# Module summarization\n
 Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
+The model is required to generate such a description, given the relevant context code and the intent behind the documentation.
 
 We use a novel metric for evaluation:
 * `CompScore`: a new metric proposed for this task. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).