# DM Project

The project consists of three parts: the classification, the GPT script and its evaluation (GPT Evaluation), and the news scraper with LDA modelling.

All the files can be run independently, without depending on one another or on any prior setup, except for those in the LDA folder.

For LDA, the topic modelling files depend on two intermediate files, so run the matching preprocessing file first:

`processed_data.parquet` contains the processed original data. It is generated by the file `basic_text_preprocessing` and is used in `topic_modelling_benchmark_using_headline`.

`processed_data1.parquet` contains the processed scraped news content. It is generated by the file `basic_text_preprocessing_on_scraped_data` and is used in the files `topic_modelling_minor`, `topic_modelling_severe` and `topic_modelling_moderate`.
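
The dependencies described above can be checked before starting a topic modelling run. The snippet below is only a rough sketch and not part of the project: the helper name `missing_lda_inputs` and the idea that the parquet files sit in a single folder are assumptions, while the file names are taken from the description above.

```python
from pathlib import Path

# Each intermediate file and the files that read it, as listed in this README.
# Assumption: both parquet files are written to the same folder.
LDA_INPUTS = {
    "processed_data.parquet": ["topic_modelling_benchmark_using_headline"],
    "processed_data1.parquet": [
        "topic_modelling_minor",
        "topic_modelling_severe",
        "topic_modelling_moderate",
    ],
}


def missing_lda_inputs(folder="."):
    """Return the intermediate files that are not yet present in `folder`."""
    base = Path(folder)
    return [name for name in LDA_INPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: report which preprocessing output is still missing.
    for name in missing_lda_inputs("."):
        users = ", ".join(LDA_INPUTS[name])
        print(f"{name} is missing; it is needed by: {users}")
```

If a file is reported as missing, running the corresponding preprocessing file should generate it.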

However, be careful when running the GPT and news scraper files: you may need your own API key for the GPT script to run properly, and the news scraper script takes a very long time to finish.
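
One common way to supply your own API key without writing it into a script is to read it from an environment variable. This is a minimal sketch under assumptions: the helper `get_api_key` and the variable name `OPENAI_API_KEY` are illustrative, and the GPT script in this project may expect the key somewhere else.

```python
import os


def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    `name` is an assumed variable name; adjust it to whatever the script uses.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the GPT script."
        )
    return key
```

Failing early with a clear message avoids starting a run that would only stop later at the first API call.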