Commit da1f8dc (1 parent: c2a30b3) committed by hienntd
data/README.md ADDED
@@ -0,0 +1,40 @@
+ # Vietnamese News Articles Dataset
+
+ ## Overview
+ This dataset consists of Vietnamese news articles collected from various Vietnamese online news portals. It was originally sourced from a MongoDB dump containing over 20 million articles, from which our team extracted approximately 162,000 articles labeled with 13 distinct categories.
+
+ Dataset link: https://github.com/binhvq/news-corpus
+ ## Sample Data
+ Here is an example of the original data structure (shown in MongoDB shell notation, so `ISODate(...)` is not strict JSON):
+
+ ```json
+ {
+   "source": "Thanh Niên",
+   "title": "Đà Nẵng nghiên cứu tiện ích nhắn tin khi vi phạm đến chủ phương tiện",
+   "sapo": "Theo thống kê của Phòng CSGT (PC67, Công an TP.Đà Nẵng), từ ngày 1.1.2016 đến hết tháng 1.2018, PC67 gửi 13.479 lượt thông báo đến chủ phương tiện vi phạm luật Giao thông đường bộ.",
+   "body": "<p class=\"body-image\"><img src=\"https://photo-1-baomoi.zadn.vn/w700_r1/18/02/05/4/24858235/1_54839.jpg\"/></p><p class=\"body-text\"><em>Xử l&yacute; phạt nguội qua camera gi&aacute;m s&aacute;t tại Ph&ograve;ng CSGT C&ocirc;ng an TP.Đ&agrave; Nẵng - Nguyễn T&uacute;</em></p>...",
+   "id": 24858235,
+   "publish": ISODate("2018-02-04T22:15:07Z"),
+   "tags": [],
+   "keywords": ["Công an TP.Đà Nẵng", "Phan Văn Thương", "Luật giao thông đường bộ", ...],
+   "cates": ["Pháp luật"]
+ }
+ ```
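The `body` field stores escaped HTML with entities such as `&yacute;`. A minimal cleaning sketch using only the Python standard library (the function name and the tag-stripping regex are illustrative, not necessarily the cleaning actually applied to `news_v2`):

```python
import html
import re

def clean_body(raw_html: str) -> str:
    """Strip HTML tags and decode entities from an article body."""
    # Decode entities such as &yacute; and &agrave; first
    text = html.unescape(raw_html)
    # Remove remaining tags (a simple regex suffices for well-formed markup)
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()

sample = '<p class="body-text"><em>Xử l&yacute; phạt nguội qua camera</em></p>'
print(clean_body(sample))  # Xử lý phạt nguội qua camera
```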
+ ## Dataset Preprocessing
+
+ The dataset was preprocessed as follows:
+
+ - Extracted two main components: `content` and `category`.
+   - `content` combines the `title`, `sapo`, `body`, and `keywords` fields.
+   - `category` holds the classification label.
+ - Split the data into train, test, and validation sets with a 70% / 15% / 15% ratio.
+
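The two steps above can be sketched in Python. Field names follow the sample record; taking the first entry of `cates` as the single label and the fixed shuffle seed are assumptions, not details stated in this README:

```python
import random

def to_example(record):
    """Flatten a raw article into a {content, category} pair."""
    # content joins title, sapo, body, and keywords, as described above
    content = " ".join([
        record.get("title", ""),
        record.get("sapo", ""),
        record.get("body", ""),
        " ".join(record.get("keywords", [])),
    ]).strip()
    # cates may list several labels; the first one is taken here
    return {"content": content, "category": record["cates"][0]}

def split_70_15_15(examples, seed=42):
    """Shuffle and split into train (70%), test (15%), val (15%)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train, n_test = int(n * 0.70), int(n * 0.15)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val

raw = [{"title": f"t{i}", "sapo": "s", "body": "b",
        "keywords": ["k"], "cates": ["Pháp luật"]} for i in range(100)]
train, test, val = split_70_15_15([to_example(r) for r in raw])
print(len(train), len(test), len(val))  # 70 15 15
```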
+ ## File Structure
+ - `train_data_162k.json`: JSON file containing the training set.
+ - `test_data_162k.json`: JSON file containing the test set.
+ - `val_data_162k.json`: JSON file containing the validation set.
+ - `processed_data`: folder containing the preprocessed data (about 1.4 million articles).
+ - `news_v2`: folder containing data whose `body` field has been cleaned (about 1.4 million articles).
+ - `features_162k_phobertbase.pkl`: features extracted with the PhoBERT model.
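The `.pkl` file can be read back with Python's `pickle` module. The internal layout of the feature file is not documented in this README, so the list of (id, vector) pairs below is purely illustrative:

```python
import io
import pickle

# Hypothetical layout: (article id, feature vector) pairs; the real file's
# structure is not specified in this README.
features = [(24858235, [0.12, -0.05, 0.33])]

# Round-trip through an in-memory buffer, exactly as one would with
# open("features_162k_phobertbase.pkl", "rb") on the file itself.
buf = io.BytesIO()
pickle.dump(features, buf)
buf.seek(0)
restored = pickle.load(buf)

print(restored[0][0])  # 24858235
```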
data/test_data_162k.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b09d006f5f3f3bc50dffec503401a47475d4bf24a53fb7339c32c0fa5f1b4efa
+ size 78544060
data/train_data_162k.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d7810e472b4ce9aa33a2b4baf7b241a54f530e68a9536ca33f336f4aa7147215
+ size 368010064
data/val_data_162k.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5c6666d20f3b3f2440c7b576ac9dc12994d5f93a26313aefedb1678d853921b
+ size 77956028