Nihal D'Souza

SRC Directory

abstractive_sum.py

  • Contains all the necessary functions for the abstractive summarizer.
  • These functions are called when 'abstractive' or 'both' is selected as the summarization method in the UI.

clean.py

  • Contains all the necessary functions for text cleaning and preprocessing.
  • Defines functions that remove HTML, PHP, JSON, and RTF markup, as well as other elements such as URLs and unwanted characters.
  • Segmentation functions that split text into sentences and paragraphs are defined here as well.
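
As a rough illustration of the kind of cleaning and segmentation described above, the sketch below strips HTML-like tags and URLs and then splits text into paragraphs and sentences. The function names and regular expressions are assumptions made for this example and are not taken from clean.py.

```python
import re

# Hypothetical patterns for illustration; the real clean.py may use different rules.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text: str) -> str:
    """Remove HTML-like tags and URLs, then collapse runs of spaces and tabs."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

The sentence splitter is deliberately naive: abbreviations such as "Inc." or numbered clauses would be split incorrectly, so a real license cleaner likely needs more careful rules.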

diff.py

  • Responsible for generating the diff when 'Display Cleaned License + Diff' is selected under the 'Cleaned License View' option in the UI.
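
One way such a diff between the original and cleaned license could be produced is with Python's standard difflib module. This is only a sketch: the function name license_diff is hypothetical, and diff.py may use a different approach or output format.

```python
import difflib


def license_diff(original: str, cleaned: str) -> str:
    """Return a unified diff between the original and the cleaned license text."""
    lines = difflib.unified_diff(
        original.splitlines(),
        cleaned.splitlines(),
        fromfile="original",  # labels shown in the diff header
        tofile="cleaned",
        lineterm="",          # input lines carry no trailing newline
    )
    return "\n".join(lines)


if __name__ == "__main__":
    before = "<p>MIT License</p>\nPermission is granted."
    after = "MIT License\nPermission is granted."
    print(license_diff(before, after))
```

Lines removed by cleaning are prefixed with "-", added lines with "+", and unchanged context lines with a space.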

doc2vec.py

  • Preprocesses the (cleaned) input text so that it can be converted into a vector, then compares this vector against 41 other vectors representing known licenses from choosealicense.com.
  • The 41 license vectors are pre-trained into a model and stored here.
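
The comparison step can be sketched as follows. This example assumes cosine similarity as the measure, which the description above does not state, and uses short hand-written vectors in place of real Doc2Vec embeddings. The names cosine_similarity and most_similar_license are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_license(query_vec, known_vecs):
    """Return (name, similarity) for the known license closest to query_vec.

    known_vecs maps a license name to its vector; in the application these
    would be the 41 known-license vectors (an assumption about their layout).
    """
    scored = ((name, cosine_similarity(query_vec, vec)) for name, vec in known_vecs.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy 2-dimensional vectors, purely illustrative.
    known = {"MIT": [1.0, 0.0], "GPL-3.0": [0.0, 1.0]}
    print(most_similar_license([0.9, 0.1], known))
```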

evaluate.py

  • Contains functions used to calculate performance metrics for the custom TextRank algorithm, such as precision, recall, and F1 score @k.
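
For reference, precision, recall, and F1 at a cutoff k can be computed as in the sketch below. It assumes the summarizer returns a ranked list of sentence indices and that a set of reference ("relevant") indices is available. The function name and signature are illustrative and not taken from evaluate.py.

```python
def precision_recall_f1_at_k(ranked, relevant, k):
    """Compute precision@k, recall@k and F1@k.

    ranked:   items ordered from highest to lowest score
    relevant: set of items considered correct
    k:        number of top-ranked items to evaluate
    """
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Precision divides by k even if fewer than k items were returned.
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # One of the top 2 ranked sentences (index 0) is among the 3 relevant ones.
    print(precision_recall_f1_at_k([3, 0, 5, 1], {0, 1, 2}, k=2))
```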

parameters.py

  • Contains the custom vocabulary and custom scores for the custom TextRank algorithm.
  • Contains the string macros for UI options and colors.
  • Also contains parameters and hyperparameters for the complete application.

read_data.py

  • Contains functions for ingesting data from files stored in the data folder. This includes license names, properties, and labels, which are processed into suitable data structures such as dataframes and dictionaries.

textrank.py

  • Contains functions needed to run the custom TextRank algorithm for extractive summarization.
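
For readers unfamiliar with TextRank, the sketch below shows a plain, unmodified version of the idea: sentences are nodes, edge weights come from word overlap, and a PageRank-style iteration scores each sentence. It does not reproduce the custom vocabulary or custom scores mentioned under parameters.py, and all names and default values here are assumptions for illustration.

```python
import math
import re


def _tokens(sentence):
    """Lowercased set of alphabetic words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def _similarity(a, b):
    """Word overlap normalized by the log of each sentence's word count."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the denominator zero
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence using weighted PageRank-style iteration."""
    n = len(sentences)
    toks = [_tokens(s) for s in sentences]
    weights = [
        [0.0 if i == j else _similarity(toks[i], toks[j]) for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentences(sentences, k):
    """Pick the k highest-scoring sentences, returned in document order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

Returning the selected sentences in their original order keeps an extractive summary readable, since the ranking order rarely matches the flow of the license text.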

tfidf.py

  • Used during exploratory data analysis (EDA) to calculate TF-IDF scores and identify the most important words while developing the custom TextRank algorithm.
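
A minimal TF-IDF computation is sketched below, using only the standard library. It uses one common variant (term frequency as a share of the document's tokens, inverse document frequency as log(N / df)). The weighting and tokenization used in tfidf.py may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf-idf score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(documents, k=3):
    """Return the k highest-scoring terms for each document."""
    return [
        sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        for doc_scores in tfidf_scores(documents)
    ]
```

With this variant, a word that occurs in every document (for example "software" across a set of licenses) scores zero, which is why TF-IDF helps surface the terms that distinguish one license from another.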