# Vietnamese News Articles Dataset

## Overview

This dataset consists of Vietnamese news articles collected from various Vietnamese online news portals. It was originally sourced from a MongoDB dump containing over 20 million articles, from which our team extracted approximately 162,000 articles labeled with 13 distinct categories.

Dataset link: https://github.com/binhvq/news-corpus

## Sample Data

Here is an example of the original data structure. The `sapo` field holds the article's lead paragraph, and `publish` uses the MongoDB shell's `ISODate` notation, so the raw dump is not strictly valid JSON:

```json
{
  "source": "Thanh Niên",
  "title": "Đà Nẵng nghiên cứu tiện ích nhắn tin khi vi phạm đến chủ phương tiện",
  "sapo": "Theo thống kê của Phòng CSGT (PC67, Công an TP.Đà Nẵng), từ ngày 1.1.2016 đến hết tháng 1.2018, PC67 gửi 13.479 lượt thông báo đến chủ phương tiện vi phạm luật Giao thông đường bộ.",
  "body": "Xử lý phạt nguội qua camera giám sát tại Phòng CSGT Công an TP.Đà Nẵng - Nguyễn Tú ...",
  "id": 24858235,
  "publish": ISODate("2018-02-04T22:15:07Z"),
  "tags": [],
  "keywords": ["Công an TP.Đà Nẵng", "Phan Văn Thương", "Luật giao thông đường bộ", ...],
  "cates": ["Pháp luật"]
}
```
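Because of the `ISODate` values, the easiest way to read the original dump back is through MongoDB itself rather than a plain JSON parser. Below is a minimal sketch using `pymongo`; it assumes the dump has already been restored locally with `mongorestore`, and the database and collection names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["news"]["articles"]  # hypothetical database/collection names

# Project only the fields used downstream; pymongo returns `publish`
# as a Python datetime, so the ISODate notation is handled for us.
cursor = collection.find(
    {},
    {"title": 1, "sapo": 1, "body": 1, "keywords": 1, "cates": 1, "publish": 1},
).limit(5)

for doc in cursor:
    print(doc["cates"], "-", doc["title"])
```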
..." "id": 24858235, "publish": ISODate("2018-02-04T22:15:07Z"), "tags": [], "keywords": ["Công an TP.Đà Nẵng", "Phan Văn Thương", "Luật giao thông đường bộ", ...], "cates": ["Pháp luật"] } ``` ## Dataset Preprocessing The dataset was preprocessed as follows: - Extracted two main components: `content` and `category`. - `content` includes fields such as `title`, `sapo`, `body`, and `keywords`. - `category` represents the classification labels. - Split into train, test, and validation sets with a ratio of 70%, 15%, and 15% respectively. ## File Structure - `train_data_162k.json`: JSON file containing the training set. - `test_data_162k.json`: JSON file containing the test set. - `val_data_162k.json`: JSON file containing the validation set. - `processed_data`: Folder containing preprocessed data (including 1 million 4 articles) - `news_v2`: Folder containing data that has been cleaned of the `body` (including 1 million 4 articles) - `features_162k_phobertbase.pkl`: File containing features extracted from the PhoBert model