Update src/tasks_content.py
src/tasks_content.py CHANGED (+11 -7)
@@ -27,10 +27,10 @@ TASKS_DESCRIPTIONS = {
 Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
 includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
 
-The benchmark clones the repo to the local
-and then the benchmark pushes the repo to
-We use the `Pass@1` rate metric
-Models can be evaluated in three
+The benchmark clones the repo to the local directory, the model fixes the issue according to the logs and the local repo state,
+and then the benchmark pushes the repo to GitHub and requests the result of the GitHub CI.
+We use the `Pass@1` rate metric to measure CI repair, indicating the ratio of data points for which the build passed successfully after the generated fix.
+Models can be evaluated in three settings:
 * `full` — **no** ground truth diffs are used for model evaluation;
 * `oracle: files` — ground truth diffs are used to select files that should be corrected to fix the issue;
 * `oracle: files, lines` — ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
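For illustration only (not part of this change): a minimal sketch of how the `Pass@1` rate described above can be computed from per-data-point CI verdicts. The helper name and the boolean-list input are assumptions, not the benchmark's own evaluation code.

```python
# Minimal sketch, assuming one generated fix per data point and a boolean CI verdict for each.
def pass_at_1(ci_passed: list[bool]) -> float:
    """Share of data points whose build passed after the single generated fix."""
    return sum(ci_passed) / len(ci_passed) if ci_passed else 0.0

# Hypothetical example: 42 of 77 builds fixed -> Pass@1 ~= 0.545
print(pass_at_1([True] * 42 + [False] * 35))
```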
@@ -45,7 +45,9 @@ TASKS_DESCRIPTIONS = {
 * `medium-context`: 224 data points,
 * `large-context`: 270 data points,
 * `huge-context`: 296 data points.
-
+Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
+and a repository snapshot that can be used to build the context.
+
 We use standard `Exact Match (EM)` metric for one-line code completion.
 We evaluate `Exact Match` for different line categories:
 * *infile* — functions and classes are from the completion file;
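For illustration only (not part of this change): a rough sketch of `Exact Match` computed separately per line category, as described above. The field names (`category`, `prediction`, `ground_truth`) are assumptions about the data layout, not the dataset's actual schema.

```python
from collections import defaultdict

def exact_match_by_category(samples: list[dict]) -> dict[str, float]:
    """samples: [{"category": "infile", "prediction": "...", "ground_truth": "..."}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["category"]] += 1
        # A completion counts as a hit only if it matches the ground-truth line exactly
        # (whitespace at the ends is ignored here; the real evaluation may differ).
        hits[s["category"]] += int(s["prediction"].strip() == s["ground_truth"].strip())
    return {cat: hits[cat] / totals[cat] for cat in totals}
```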
@@ -60,7 +62,7 @@ TASKS_DESCRIPTIONS = {
 
 "commit_message_generation": """# Commit message generation\n
 
-Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits from 34 Python projects.
+Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects.
 
 We use the following metrics for evaluation:
 * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
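For illustration only (not part of this change): the linked BLEU implementation is sacreBLEU, which can be loaded through the Hugging Face `evaluate` library as sketched below; the example strings are made up.

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
predictions = ["Fix off-by-one error in context builder"]
references = [["Fix off-by-one error in the context builder"]]  # one list of references per prediction
result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level BLEU on a 0-100 scale
```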
@@ -75,7 +77,8 @@ TASKS_DESCRIPTIONS = {
 
 "bug_localization": """# Bug localization\n
 
-Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
+Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about the pull requests that fix them for Python, Java, and Kotlin projects.
+The model needs to identify the files within the repository that need to be modified to address the reported bug.
 We used information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.
 
 For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
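For illustration only (not part of this change): a small sketch of file-level `R@k` and `P@k` as used above. The function names and the ranked-list input format are illustrative assumptions, not the benchmark's code.

```python
def recall_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    """Fraction of ground-truth files that appear among the top-k retrieved files."""
    retrieved = set(ranked_files[:k])
    return len(retrieved & gold_files) / len(gold_files) if gold_files else 0.0

def precision_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    """Fraction of the top-k retrieved files that are ground-truth files."""
    return sum(f in gold_files for f in ranked_files[:k]) / k if k else 0.0

# Hypothetical example with k = 2, as used in the benchmark
ranked = ["src/parser.py", "src/utils.py", "README.md"]
gold = {"src/parser.py", "src/io.py"}
print(recall_at_k(ranked, gold, 2), precision_at_k(ranked, gold, 2))
```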
@@ -84,6 +87,7 @@ TASKS_DESCRIPTIONS = {
 
 "module_summarization": """# Module summarization\n
 Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
+The model is required to generate such a description, given the relevant context code and the intent behind the documentation.
 
 We use a novel metric for evaluation:
 * `CompScore`: a new metric proposed for this task. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).