## SRC Directory

### [abstractive_sum.py](/src/abstractive_sum.py)
- Contains all the necessary functions for the abstractive summarizer.
- Functions from this script are called when 'abstractive' or 'both' is selected as the summarization method in the UI.

### [clean.py](/src/clean.py)
- Contains all the necessary functions for text cleaning and preprocessing.
- Cleaning steps, such as removal of HTML, PHP, JSON, and RTF markup as well as URLs and unwanted characters, are defined as functions here.
- Segmentation functions for sentences and paragraphs are defined here as well.

### [diff.py](/src/diff.py)
- Generates the diff shown when 'Display Cleaned License + Diff' is selected under the 'Cleaned License View' option in the UI.

### [doc2vec.py](/src/doc2vec.py)
- Preprocesses the (cleaned) input text so that it can be converted into a vector, then compares this vector against 41 other vectors representing known licenses from [choosealicense.com](https://www.choosealicense.com/appendix).
- The 41 license vectors are pre-trained into a model and stored [here](/models/).

### [evaluate.py](/src/evaluate.py)
- Contains the functions used to calculate performance metrics for the custom TextRank algorithm, such as precision, recall, and F1 score @k.

### [parameters.py](/src/parameters.py)
- Contains the custom vocabulary and custom scores for the custom TextRank algorithm.
- Contains the string macros for UI options and colors.
- Also contains parameters and hyperparameters for the complete application.

### [read_data.py](/src/read_data.py)
- Contains functions for ingesting data from files stored in the [data](/data/) folder. This includes information such as license names, properties, and labels, which are processed into suitable data structures such as dataframes and dictionaries.

### [textrank.py](/src/textrank.py)
- Contains the functions needed to run the custom TextRank algorithm for extractive summarization.

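For orientation, here is a minimal sketch of a plain (non-custom) TextRank sentence ranker. It is not the code in `textrank.py`: the function names `tokenize`, `similarity`, `textrank`, and `summarize` are hypothetical, the similarity measure is the word-overlap formula in the style of the original TextRank paper, and the custom vocabulary and custom scores from `parameters.py` are not modeled.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Word-overlap similarity, normalized by the log of each sentence's size."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the denominator unusable
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence via PageRank over a similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric edge weights; a sentence has no edge to itself.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbors' scores,
        # proportional to the edge weight between them.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, k=3)` on a list of already segmented sentences returns the three top-ranked sentences in document order. A sentence that shares no words with any other receives only the baseline score of `1 - damping`.
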
### [tfidf.py](/src/tfidf.py)
- Used during EDA to calculate the TF-IDF scores to obtain the most important words while developing the custom TextRank algorithm.
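
As a rough illustration of the idea, the sketch below computes TF-IDF scores with the standard library only. It is an assumption-laden example rather than the contents of `tfidf.py`: the names `tokenize`, `tfidf`, and `top_terms` are hypothetical, and the weighting used here (relative term frequency times `log(N / df)`) is one common variant that may differ from the one used during EDA.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(documents, k=3):
    """Return the k highest-scoring terms per document (ties broken alphabetically)."""
    return [
        [term for term, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for s in tfidf(documents)
    ]
```

Terms that appear in every document score zero under this variant, so `top_terms` surfaces the words that distinguish one license text from the others.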