from typing import Optional
TASKS_PRETTY = {
"library_based_code_generation": "Library-based code generation",
"ci_builds_repair": "CI builds repair",
"project_code_completion": "Project-level code completion",
"commit_message_generation": "Commit message generation",
"bug_localization": "Bug localization",
"module_summarization": "Module Summarization",
}
TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
TASKS_DESCRIPTIONS = {
"library_based_code_generation": """# Library-based code generation\n
Our Library-based code generation benchmark 🤗 [JetBrains-Research/lca-library-based-code-generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
For evaluation, we use two metrics:
* `ChrF`: textual similarity between the generated code and the reference program.
* `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code.
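
As an illustration only, `API Recall` can be sketched as a set-based recall over called names. The helper names `called_names` and `api_recall`, the `library_api` argument, and the use of Python's `ast` module to extract calls are assumptions made for this sketch, not the benchmark's actual implementation.

```python
import ast


def called_names(source: str) -> set[str]:
    # Collect the names of all functions and methods called in the source.
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)  # e.g. `lib.load(...)` gives "load"
            elif isinstance(func, ast.Name):
                names.add(func.id)  # e.g. `load(...)` gives "load"
    return names


def api_recall(reference: str, generated: str, library_api: set[str]) -> float:
    # Library-specific calls used in the reference program.
    reference_calls = called_names(reference) & library_api
    if not reference_calls:
        return 0.0
    # Share of those calls that also appear in the generated code.
    return len(reference_calls & called_names(generated)) / len(reference_calls)
```

Matching on bare names ignores which object a method is called on, so a real evaluation would likely need a stricter notion of a library-specific call.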
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com.
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"ci_builds_repair": """# CI builds repair\n
Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
The benchmark clones the repo to the local directory, the model fixes the issue according to logs and the local repo state,
and then the benchmark pushes the repo to GitHub and requests the result of the GitHub CI.
We use the `Pass@1` rate to measure CI repair: the ratio of data points for which the build passed after the generated fix was applied.
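
As a minimal sketch of this ratio, assuming each data point has already been reduced to a single boolean build outcome (the function name `pass_at_1` and the input format are hypothetical):

```python
def pass_at_1(build_passed: list[bool]) -> float:
    # One boolean per data point: True if the CI build passed after the fix.
    if not build_passed:
        return 0.0
    return sum(build_passed) / len(build_passed)
```

For example, two passing builds out of four data points give a rate of 0.5.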
Models can be evaluated in three settings:
* `full` – **no** ground truth diffs are used for model evaluation;
* `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
* `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue.
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com.
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"project_code_completion": """# Project-level code completion\n
Our Project-level code completion benchmark 🤗 [JetBrains-Research/lca-project-level-code-completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) includes four sets of samples:
* `small-context`: 144 data points,
* `medium-context`: 224 data points,
* `large-context`: 270 data points,
* `huge-context`: 296 data points.
Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
and a repository snapshot that can be used to build the context.
We use the standard `Exact Match (EM)` metric for one-line code completion.
We evaluate `Exact Match` for different line categories:
* *infile* – functions and classes are from the completion file;
* *inproject* – functions and classes are from the repository snapshot at the moment of completion;
* *committed* – functions and classes are from the files that were added on the completion file commit;
* *common* – functions and classes with common names, e.g., `main`, `get`, etc.;
* *non-informative* – short/long lines, import/print lines, or comment lines;
* *random* – lines that don't fit any of the previous categories.
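
A rough sketch of per-category `Exact Match` is shown below. The function name `exact_match_by_category`, the input triples, and the choice to strip surrounding whitespace before comparing are assumptions for illustration, not the benchmark's actual evaluation code.

```python
from collections import defaultdict


def exact_match_by_category(samples):
    # samples: iterable of (category, predicted_line, reference_line) triples.
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, reference in samples:
        totals[category] += 1
        # Assumption: surrounding whitespace is ignored when comparing lines.
        if predicted.strip() == reference.strip():
            hits[category] += 1
    # Share of exactly matched lines within each category.
    return {category: hits[category] / totals[category] for category in totals}
```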
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com.
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"commit_message_generation": """# Commit message generation\n
Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, which the model needs to generate commit messages for.
We use the following metrics for evaluation:
* [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
* [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)
* [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
* [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
**Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com.
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"bug_localization": """# Bug localization\n
Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions with information about the pull requests that fix them for Python, Java, and Kotlin projects.
The model needs to identify the files within the repository that need to be modified to address the reported bug.
We use information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, taking `k` equal to 1 and 2.
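
For illustration, `P@k` and `R@k` for a single issue could be computed as below. The function name `precision_recall_at_k` and its inputs are hypothetical, and dividing precision by `k` (rather than by the number of files actually returned) is one common convention assumed here.

```python
def precision_recall_at_k(ranked_files, changed_files, k):
    # ranked_files: files predicted by the model, most likely first.
    # changed_files: set of files actually modified in the fixing pull request.
    top_k = ranked_files[:k]
    hits = sum(1 for path in top_k if path in changed_files)
    precision = hits / k
    recall = hits / len(changed_files) if changed_files else 0.0
    return precision, recall
```

With the ranking `a.py`, `b.py`, `c.py`, ground truth files `a.py` and `c.py`, and `k` equal to 2, this gives a precision of 0.5 and a recall of 0.5.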
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com.
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"module_summarization": """# Module summarization\n
Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files containing documentation of permissively licensed open-source Python projects.
The model is required to generate such a description, given the relevant context code and the intent behind the documentation.
We use a novel metric for evaluation:
* `CompScore`: a new metric proposed for this task that uses an LLM as an assessor. Our approach involves feeding the LLM with relevant code and two versions of documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/).
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com.
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
}
def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str:
    # No task selected yet: ask the user to pick one first.
    if not task_pretty:
        return "Please select a specific task to see more detailed instructions regarding submitting files."
    # Map the human-readable task name back to its internal identifier.
    task_id = TASKS_PRETTY_REVERSE[task_pretty]
    if task_id == "commit_message_generation":
        return f"""**{task_pretty} Instructions:**\n\n* Please attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by 🏟️ Long Code Arena Team in 🤗 [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example; the rest are optional."""
    # Fallback for tasks without dedicated submission instructions.
    return f"**{task_pretty} Instructions:**\n\n* 🚧 There are no instructions for the current task yet."