---
title: Commit Rewriting Visualization
sdk: gradio
sdk_version: 4.25.0
app_file: change_visualizer.py
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Description

This project is the main artifact of the "Research on evaluation for AI Commit Message Generation" study.

# Structure (important components)

- ### Configuration: [config.py](config.py)
    - Grazie API JWT token and Hugging Face token must be stored as environment variables.
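The token lookup might look roughly like this (a minimal sketch; the environment-variable names below are assumptions, check `config.py` for the actual ones):

```python
import os

# Hypothetical variable names -- the real ones are defined in config.py.
GRAZIE_JWT_TOKEN = os.environ.get("GRAZIE_JWT_TOKEN", "")
HF_TOKEN = os.environ.get("HF_TOKEN", "")
```

Using `os.environ.get` with a default keeps imports side-effect free; validation (failing loudly when a token is missing) can then happen at the point where the token is actually used.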
- ### Visualization app -- a Gradio application that is currently deployed
  at https://huggingface.co/spaces/JetBrains-Research/commit-rewriting-visualization.
    - Shows
        - The "golden" dataset of manually collected samples; the dataset is downloaded on startup
          from https://huggingface.co/datasets/JetBrains-Research/commit-msg-rewriting
        - The entire dataset that includes the synthetic samples; the dataset is downloaded on startup
          from https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting
        - Some statistics collected for the dataset (and its parts); computed on startup

      _Note: if the datasets are updated, the app must be restarted to pick up the changes._
    - Files
        - [change_visualizer.py](change_visualizer.py)
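The per-partition statistics computed on startup can be illustrated with a small stand-in (the field names and `source` values below are hypothetical, not the actual dataset schema):

```python
# Toy rows standing in for the downloaded dataset (hypothetical schema).
samples = [
    {"source": "golden", "edited_msg": "Fix NPE in parser"},
    {"source": "synthetic", "edited_msg": "Update CI config"},
]

def dataset_stats(rows):
    """Per-source sample counts and average edited-message length."""
    stats = {}
    for row in rows:
        s = stats.setdefault(row["source"], {"count": 0, "total_len": 0})
        s["count"] += 1
        s["total_len"] += len(row["edited_msg"])
    return {k: {"count": v["count"], "avg_len": v["total_len"] / v["count"]}
            for k, v in stats.items()}
```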
- ### Data processing pipeline (_note: dataset and file names can be changed in the configuration file_)
    - Run the whole pipeline by running [run_pipeline.py](run_pipeline.py)
        - All intermediate results are stored as files defined in config
    - The intermediate steps can also be run separately via the corresponding files
      in [generation_steps](generation_steps); each step then takes its input from the previous step's artifact.
    - Generate the synthetic samples
        - Files [generation_steps/synthetic_end_to_start.py](generation_steps/synthetic_end_to_start.py)
          and [generation_steps/synthetic_start_to_end.py](generation_steps/synthetic_start_to_end.py)
        - The first generation step (end to start) downloads the `JetBrains-Research/commit-msg-rewriting`
          and `JetBrains-Research/lca-commit-message-generation` datasets from
          the Hugging Face Hub.
    - Compute metrics
        - File [generation_steps/metrics_analysis.py](generation_steps/metrics_analysis.py)
        - Implements the functions for all metrics
        - Downloads the `JetBrains-Research/lca-commit-message-generation` dataset from the Hugging Face Hub.
    - The resulting artifact (dataset with golden and synthetic samples, attached reference messages and computed
      metrics) is saved to the file [output/synthetic.csv](output/synthetic.csv). It should be uploaded
      to https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting **manually**.
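The step chaining described above can be sketched as follows (the step and artifact names are illustrative; the real ones are defined in `config.py`):

```python
# Hypothetical step/artifact pairs; the real file names come from config.py.
STEPS = [
    ("synthetic_end_to_start", "output/end_to_start.csv"),
    ("synthetic_start_to_end", "output/start_to_end.csv"),
    ("metrics_analysis", "output/synthetic.csv"),
]

def run_pipeline(run_step):
    """Run the steps in order; each step reads the previous step's artifact."""
    prev_artifact = None
    for name, artifact in STEPS:
        run_step(name, prev_artifact, artifact)
        prev_artifact = artifact
```

Because every intermediate result lands in a file, a step can also be launched on its own with the previous artifact as input, which matches how the individual scripts in `generation_steps` are run.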
- ### Data analysis
    - [analysis_util.py](analysis_util.py) -- helper functions used for data analysis, e.g., correlation computation.
    - [analysis.ipynb](analysis.ipynb) -- computes the correlations and produces the resulting tables.
    - [chart_processing.ipynb](chart_processing.ipynb) -- Jupyter Notebook that draws the charts that were used in the
      presentation/thesis.
    - [generated_message_length_comparison.ipynb](generated_message_length_comparison.ipynb) -- compares the average
      length of commit messages generated with the current prompt (the one used in the research) and the production
      prompt (the one used to generate the messages measured in FUS logs). _Unfinished because a Grazie token could not
      be obtained; once a token is available, the notebook can be run by following the instructions inside it._
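For reference, the correlation computation at the core of the analysis can be written out in a self-contained form (the actual notebook presumably relies on library implementations such as pandas or SciPy):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two metric columns of equal length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```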