## SRC Directory

### [abstractive_sum.py](/src/abstractive_sum.py)
- Contains all the necessary functions for the abstractive summarizer.
- Functions from this script are called when 'abstractive' or 'both' are selected as methods for summarization from the UI.

### [clean.py](/src/clean.py)
- Contains all the necessary functions for text cleaning and preprocessing. 
- Defines functions that remove HTML, PHP, JSON, and RTF markup, as well as other elements such as URLs and unwanted characters.
- Also defines the sentence and paragraph segmentation functions.
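
As a rough illustration of the kind of cleaning and segmentation described above, here is a minimal sketch using only the standard library. The function names and regular expressions are hypothetical and are not taken from `clean.py`; the real module covers more formats (PHP, JSON, RTF) than this example does.

```python
import html
import re

# Hypothetical helpers; the actual functions in clean.py may be named and built differently.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def strip_html(text):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def remove_urls(text):
    """Replace anything that looks like a URL with a space."""
    return URL_RE.sub(" ", text)


def normalize_whitespace(text):
    """Collapse runs of spaces and tabs, keeping blank lines as paragraph breaks."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def clean(text):
    """Apply the cleaning steps in order."""
    return normalize_whitespace(remove_urls(strip_html(text)))


def split_paragraphs(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in SENTENCE_RE.split(paragraph) if s.strip()]
```

The sentence splitter is deliberately naive: it will break on abbreviations such as "e.g.", which legal text contains often, so a real implementation likely needs something more careful.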

### [diff.py](/src/diff.py)
- Generates the diff shown when 'Display Cleaned License + Diff' is selected under the 'Cleaned License View' option in the UI.
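
The README does not say how the diff is computed. One plausible approach, shown here purely as an assumption, is a line-based unified diff from Python's standard `difflib` module; `license_diff` is a hypothetical name.

```python
import difflib


def license_diff(original, cleaned):
    """Return a unified diff between the raw and the cleaned license text.

    Assumption: a line-based diff is enough; diff.py may work differently.
    """
    diff = difflib.unified_diff(
        original.splitlines(),
        cleaned.splitlines(),
        fromfile="original",
        tofile="cleaned",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    raw = "MIT License\n<p>Permission is granted.</p>"
    print(license_diff(raw, "MIT License\nPermission is granted."))
```

Lines removed by cleaning are prefixed with `-` and their cleaned replacements with `+`, which makes it easy to highlight them in a UI.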

### [doc2vec.py](/src/doc2vec.py) 
- Preprocesses the cleaned input text so that it can be converted into a vector, then compares that vector against 41 other vectors representing the known licenses listed on [choosealicense.com](https://www.choosealicense.com/appendix).
- The 41 license vectors are pre-trained into a model and stored [here](/models/).
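
The comparison step can be sketched as a nearest-neighbour lookup. The example below assumes cosine similarity as the measure and uses three hand-written toy vectors in place of the 41 model-derived ones; the function names, license keys, and numbers are all illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, known, top_n=3):
    """Rank known license vectors by similarity to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in known.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real document vectors have many more dimensions.
    known_vectors = {
        "mit": [0.9, 0.1, 0.0],
        "apache-2.0": [0.2, 0.8, 0.1],
        "gpl-3.0": [0.0, 0.3, 0.9],
    }
    for name, score in most_similar([0.8, 0.2, 0.1], known_vectors):
        print(f"{name}: {score:.3f}")
```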

### [evaluate.py](/src/evaluate.py)
- Calculates the performance metrics of the custom TextRank algorithm, such as precision, recall, and F1 score @k.
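
For reference, the @k metrics can be computed as in the sketch below. It assumes the summarizer outputs a ranked list of sentence identifiers and that a set of reference ("relevant") sentences is available; the function name and data layout are hypothetical rather than taken from `evaluate.py`.

```python
def precision_recall_f1_at_k(ranked, relevant, k):
    """Score the top-k ranked items against a set of relevant items."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    ranked_sentences = ["s3", "s1", "s7", "s2", "s5"]  # model output, best first
    reference = {"s1", "s2", "s4"}                     # sentences a human marked as important
    print(precision_recall_f1_at_k(ranked_sentences, reference, k=3))
```

Precision here divides by `k` even when fewer than `k` items were ranked, which is one common convention; dividing by the number of items actually returned is another.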

### [parameters.py](/src/parameters.py)
- Contains the custom vocabulary and custom scores for the custom TextRank algorithm. 
- Contains the string macros for UI options and colors. 
- Also contains parameters and hyperparameters for the complete application.

### [read_data.py](/src/read_data.py)
- Contains functions for ingestion of data from files stored in the [data](/data/) folder. This includes information such as license names, properties and labels that are processed into suitable data structures such as dataframes and dictionaries.

### [textrank.py](/src/textrank.py)
- Contains functions needed to run the custom TextRank algorithm for extractive summarization.
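
To show the general idea, here is a minimal sketch of plain TextRank for sentence ranking. It is not the custom variant in `textrank.py`: it ignores the custom vocabulary and scores from `parameters.py`, and it assumes log-normalized word overlap as the sentence similarity. All names are illustrative.

```python
import math
import re


def tokenize(sentence):
    """Lowercase the sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared-word count normalized by the log of each sentence's length."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating the weighted PageRank update."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_n=2):
    """Return the top_n highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Sentences that share vocabulary with many others accumulate a higher score, while a sentence with no overlap stays at the baseline of `1 - damping`.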

### [tfidf.py](/src/tfidf.py)
- Used during EDA to calculate the TF-IDF scores to obtain the most important words while developing the custom TextRank algorithm.
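
A minimal TF-IDF sketch is shown below for illustration. It assumes whitespace tokenization and the unsmoothed formula tf × log(N / df); `tfidf.py` may rely on a library or a smoothed variant instead, so scores from this example need not match its output.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score_map, n=3):
    """Return the n highest-scoring terms of one document."""
    return sorted(score_map, key=score_map.get, reverse=True)[:n]


if __name__ == "__main__":
    docs = [
        "permission is granted free of charge",
        "the software is provided without warranty",
        "redistribution of the software is permitted",
    ]
    for doc_scores in tfidf(docs):
        print(top_terms(doc_scores))
```

Terms that occur in every document, such as "is" in this toy corpus, get a score of zero, which is why TF-IDF is useful for surfacing the words that distinguish one license from another.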