|
from typing import Optional |
|
|
|
TASKS_PRETTY = { |
|
"library_based_code_generation": "Library-based code generation", |
|
"ci_builds_repair": "CI builds repair", |
|
"project_code_completion": "Project-level code completion", |
|
"commit_message_generation": "Commit message generation", |
|
"bug_localization": "Bug localization", |
|
"module_summarization": "Module Summarization", |
|
} |
|
TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()} |
|
|
|
TASKS_DESCRIPTIONS = { |
|
"library_based_code_generation": """# Library-based code generation\n |
|
|
|
Our Library-based code generation benchmark 🤗 [JetBrains-Research/lca-library-based-code-generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
|
|
|
For evaluation, we use two metrics: |
|
* `ChrF`: textual similarity between the generated code and the reference program. |
|
* `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code.
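
For intuition, here is a minimal sketch of how `API Recall` could be computed, assuming the sets of library-specific calls have already been extracted from the reference and generated programs (the `api_recall` helper below is illustrative, not the official implementation):

```python
# Illustrative sketch only: call sets are assumed to be extracted beforehand,
# e.g., by walking the AST of each program and collecting library-specific calls.
def api_recall(reference_calls: set[str], generated_calls: set[str]) -> float:
    # Share of reference API calls that also appear in the generated code.
    if not reference_calls:
        return 1.0
    return len(reference_calls & generated_calls) / len(reference_calls)

# Example: 2 of the 3 reference calls are reproduced -> recall of about 0.67
print(api_recall({"plt.plot", "plt.show", "np.linspace"}, {"plt.plot", "np.linspace"}))
```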
|
|
|
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
|
|
|
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com. |
|
|
|
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). |
|
""", |
|
|
|
"ci_builds_repair": """# CI builds repair\n |
|
|
|
Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)

includes 77 manually curated and assessed data points from 32 Python repositories, where the model is asked to fix a failing CI build.
|
|
|
The benchmark clones the repository to a local directory; the model fixes the issue based on the build logs and the local repository state;

the benchmark then pushes the repository to GitHub and requests the result of the GitHub CI.

We use the `Pass@1` metric to measure CI repair: the ratio of data points for which the build passes successfully after the generated fix is applied.
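
As a rough illustration (not the actual evaluation harness), `Pass@1` here reduces to a simple ratio over CI outcomes:

```python
# Illustrative sketch: Pass@1 as the share of data points whose CI build
# passed after applying the model-generated fix (hypothetical helper).
def pass_at_1(build_passed: list[bool]) -> float:
    return sum(build_passed) / len(build_passed) if build_passed else 0.0

# Example: 58 of 77 builds pass -> Pass@1 of roughly 0.75
print(pass_at_1([True] * 58 + [False] * 19))
```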
|
|
|
Models can be evaluated in three settings: |
|
* `full` – **no** ground truth diffs are used for model evaluation;

* `oracle: files` – ground truth diffs are used to select the files that should be corrected to fix the issue;

* `oracle: files, lines` – ground truth diffs are used to select the files and code blocks that should be corrected to fix the issue.
|
|
|
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
|
|
|
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com. |
|
|
|
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). |
|
""", |
|
|
|
"project_code_completion": """# Project-level code completion\n |
|
|
|
Our Project-level code completion benchmark 🤗 [JetBrains-Research/lca-project-level-code-completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) includes four sets of samples:
|
* `small-context`: 144 data points, |
|
* `medium-context`: 224 data points, |
|
* `large-context`: 270 data points, |
|
* `huge-context`: 296 data points. |
|
|
|
Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below), |
|
and a repository snapshot that can be used to build the context. |
|
|
|
We use the standard `Exact Match (EM)` metric for one-line code completion.

We evaluate `Exact Match` for different line categories (a minimal sketch of the computation follows the list):
|
* *infile* – functions and classes are from the completion file;

* *inproject* – functions and classes are from the repository snapshot at the moment of completion;

* *committed* – functions and classes are from the files that were added in the same commit as the completion file;

* *common* – functions and classes with common names, e.g., `main`, `get`, etc.;

* *non-informative* – short/long lines, import/print lines, or comment lines;

* *random* – lines that don't fit any of the previous categories.
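
Below is a minimal sketch of per-category `Exact Match`, assuming predictions and ground-truth lines are already paired up (the field names are hypothetical, not the dataset's actual schema):

```python
# Illustrative sketch: Exact Match computed separately for each line category.
from collections import defaultdict

def exact_match_by_category(samples: list[dict]) -> dict[str, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    for sample in samples:
        category = sample["category"]  # e.g. "infile", "inproject", ...
        totals[category] += 1
        if sample["prediction"].strip() == sample["ground_truth"].strip():
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}

print(exact_match_by_category([
    {"category": "infile", "prediction": "return x + 1", "ground_truth": "return x + 1"},
    {"category": "common", "prediction": "return None", "ground_truth": "return result"},
]))  # {'infile': 1.0, 'common': 0.0}
```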
|
|
|
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
|
|
|
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com. |
|
|
|
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). |
|
""", |
|
|
|
"commit_message_generation": """# Commit message generation\n |
|
|
|
Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, which the model needs to generate commit messages for.
|
|
|
We use the following metrics for evaluation: |
|
* [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) |
|
* [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) |
|
* [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf) |
|
* [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore) |
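
As a rough illustration, scores of this kind can be obtained with the Hugging Face `evaluate` library (a sketch assuming predictions and references are plain strings; the exact leaderboard pipeline lives in the baselines repository):

```python
import evaluate  # pip install evaluate sacrebleu rouge_score bert_score

predictions = ["Fix off-by-one error in pagination"]
references = ["Fix off-by-one bug in pagination logic"]

bleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
chrf = evaluate.load("chrf").compute(predictions=predictions, references=[[r] for r in references])
bert = evaluate.load("bertscore").compute(predictions=predictions, references=references, lang="en")

print(bleu["score"], rouge["rouge1"], chrf["score"], bert["f1"][0])
```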
|
|
|
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
|
|
|
**Note.** The leaderboard is sorted by the `ROUGE-1` metric by default. |
|
|
|
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com. |
|
|
|
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). |
|
|
|
""", |
|
|
|
"bug_localization": """# Bug localization\n |
|
|
|
Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions, along with information about the pull requests that fix them, for Python, Java, and Kotlin projects.

The model must identify the files within the repository that need to be modified to address the reported bug.

We use information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, with `k` equal to 1 and 2.
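
For intuition, here is a minimal sketch of `R@k` and `P@k` over a ranked list of candidate files (the helper names and example paths are illustrative only):

```python
# Illustrative sketch: recall@k and precision@k for a ranked list of candidate files.
def recall_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    hits = len(set(ranked_files[:k]) & gold_files)
    return hits / len(gold_files) if gold_files else 0.0

def precision_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    hits = len(set(ranked_files[:k]) & gold_files)
    return hits / k

# Example with k = 2: the top two predictions contain one of the two buggy files.
ranked = ["src/utils.py", "src/models.py", "tests/test_models.py"]
gold = {"src/models.py", "src/views.py"}
print(recall_at_k(ranked, gold, 2), precision_at_k(ranked, gold, 2))  # 0.5 0.5
```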
|
|
|
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
|
|
|
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com. |
|
|
|
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). |
|
""", |
|
|
|
"module_summarization": """# Module summarization\n |
|
Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files containing different kinds of documentation for permissively licensed open-source Python projects.

The model is required to generate such documentation, given the relevant code context and the intent behind the documentation.
|
|
|
We use a novel metric for evaluation:

* `CompScore`: a metric proposed for this task that uses an LLM as an assessor. Our approach involves feeding the LLM the relevant code and two versions of the documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
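
Purely as an illustration of this comparison setup (the prompt and helper below are hypothetical; the actual prompt, judge model, and score aggregation are described in the baselines README linked above):

```python
# Hypothetical illustration of the LLM-as-assessor setup: the judge sees the code
# together with both documentation variants and is asked to compare them.
def build_judge_prompt(code: str, gold_doc: str, generated_doc: str) -> str:
    return (
        "You are given source code and two candidate documentation texts. "
        "Code: " + code + " Documentation A: " + gold_doc +
        " Documentation B: " + generated_doc +
        " Which documentation describes the code better? Answer with 'A' or 'B'."
    )
```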
|
|
|
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/).
|
|
|
If you have any questions or requests concerning this dataset, please contact us at lca@jetbrains.com. |
|
|
|
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)). |
|
""", |
|
} |
|
|
|
|
|
def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str: |
|
if not task_pretty: |
|
return "Please, select a specific task to see more detailed instructions regarding submitting files." |
|
|
|
task_id = TASKS_PRETTY_REVERSE[task_pretty] |
|
|
|
if task_id == "commit_message_generation": |
|
return f"""**{task_pretty} Instructions:**\n\n* Please, attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by ποΈ Long Code Arena Team in π€ [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example, the rest are optional.""" |
|
|
|
return f"**{task_pretty} Instructions:**\n\n* π§ There are no instructions for the current task yet." |
|
|