Petr Tsvetkov committed on
Commit 3907263 • 1 Parent(s): a7bba68

Added some description to the README.md

  1. README.md +45 -1
README.md CHANGED
@@ -5,4 +5,48 @@ sdk_version: 4.25.0
  app_file: change_visualizer.py
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Description
+
+ This project is the main artifact of the "Research on evaluation for AI Commit Message Generation" study.
+
+ # Structure (important components)
+
+ - ### Configuration: [config.py](config.py)
+   - The Grazie API JWT token and the Hugging Face token must be stored as environment variables.
+ - ### Visualization app -- a Gradio application that is currently deployed
+   at https://huggingface.co/spaces/JetBrains-Research/commit-rewriting-visualization.
+   - Shows
+     - The "golden" dataset of manually collected samples; the dataset is downloaded on startup
+       from https://huggingface.co/datasets/JetBrains-Research/commit-msg-rewriting
+     - The entire dataset, including the synthetic samples; the dataset is downloaded on startup
+       from https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting
+     - Some statistics collected for the dataset (and its parts); computed on startup
+
+     _Note: after the datasets are updated, the app must be restarted to show the changes._
+   - Files
+     - [change_visualizer.py](change_visualizer.py)
+ - ### Data processing pipeline (_note: dataset and file names can be changed in the configuration file_)
+   - Run the whole pipeline by running [run_pipeline.py](run_pipeline.py)
+   - All intermediate results are stored as files defined in the config
+   - Intermediate steps (each can be run separately by running the corresponding file
+     from [generation_steps](generation_steps)). In that case, the input is taken from the previous step's artifact.
+     - Generate the synthetic samples
+       - Files: [generation_steps/synthetic_end_to_start.py](generation_steps/synthetic_end_to_start.py)
+         and [generation_steps/synthetic_start_to_end.py](generation_steps/synthetic_start_to_end.py)
+       - The first generation step (end to start) downloads the `JetBrains-Research/commit-msg-rewriting`
+         and `JetBrains-Research/lca-commit-message-generation` datasets from
+         Hugging Face.
+     - Compute metrics
+       - File: [generation_steps/metrics_analysis.py](generation_steps/metrics_analysis.py)
+       - Includes the functions for all metrics
+       - Downloads the `JetBrains-Research/lca-commit-message-generation` dataset from Hugging Face.
+   - The resulting artifact (dataset with golden and synthetic samples, attached reference messages, and computed
+     metrics) is saved to the file [output/synthetic.csv](output/synthetic.csv). It should be uploaded
+     to https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting **manually**.
+ - ### Data analysis
+   - [analysis_util.py](analysis_util.py) -- helper functions used for data analysis, e.g., computing correlations.
+   - [analysis.ipynb](analysis.ipynb) -- computes the correlations and produces the resulting tables.
+   - [chart_processing.ipynb](chart_processing.ipynb) -- Jupyter Notebook that draws the charts used in the
+     presentation/thesis.
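
The Configuration item above says that the Grazie API JWT token and the Hugging Face token must be stored as environment variables. The README does not show how `config.py` reads them, so the following is only a minimal sketch of one way to do it. The variable names `GRAZIE_JWT_TOKEN` and `HF_TOKEN`, as well as `read_token` and `load_tokens`, are assumptions made for illustration and may not match the repository.

```python
import os


class MissingTokenError(RuntimeError):
    """Raised when a required token is not set in the environment."""


# Hypothetical variable names; the real ones are defined in config.py.
GRAZIE_TOKEN_VAR = "GRAZIE_JWT_TOKEN"
HF_TOKEN_VAR = "HF_TOKEN"


def read_token(env_var: str) -> str:
    """Return the value of `env_var`, failing loudly if it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise MissingTokenError(f"Environment variable {env_var} is not set")
    return value


def load_tokens() -> dict[str, str]:
    """Collect both tokens so a missing one is reported before any work starts."""
    return {
        "grazie": read_token(GRAZIE_TOKEN_VAR),
        "hugging_face": read_token(HF_TOKEN_VAR),
    }
```

Failing at startup, rather than at the first API call, makes a missing token visible before a long pipeline run begins.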
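
The pipeline description above states that every intermediate result is stored as a file and that each step takes its input from the previous step's artifact. The sketch below illustrates that chaining pattern only. It is not the code of `run_pipeline.py`: the step functions `add_placeholder_samples` and `add_placeholder_metric`, the column names, and the file naming are all hypothetical stand-ins.

```python
import csv
from pathlib import Path
from typing import Callable

Row = dict[str, str]
Step = Callable[[Path, Path], None]  # reads the first path, writes the second


def read_rows(path: Path) -> list[Row]:
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_rows(path: Path, rows: list[Row]) -> None:
    # Ordered union of keys, so rows with extra columns can still be written.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)


def add_placeholder_samples(src: Path, dst: Path) -> None:
    """Toy stand-in for synthetic generation: appends a marked copy of each row."""
    rows = read_rows(src)
    copies = [{**row, "kind": "synthetic"} for row in rows]
    write_rows(dst, rows + copies)


def add_placeholder_metric(src: Path, dst: Path) -> None:
    """Toy stand-in for metric computation: stores the message length."""
    rows = read_rows(src)
    for row in rows:
        row["length"] = str(len(row["message"]))
    write_rows(dst, rows)


def run_pipeline(input_path: Path, work_dir: Path, steps: list[tuple[str, Step]]) -> Path:
    """Run the steps in order; each step's output file is the next step's input."""
    current = input_path
    for name, step in steps:
        output_path = work_dir / f"{name}.csv"
        step(current, output_path)
        current = output_path
    return current
```

Because every step only reads one file and writes another, any single step can be rerun on its own as long as the previous artifact is still on disk.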