Commit da1f8dc (1 parent: c2a30b3) committed by hienntd
data/README.md ADDED
@@ -0,0 +1,40 @@
+ # Vietnamese News Articles Dataset
+
+ ## Overview
+ This dataset consists of Vietnamese news articles collected from various Vietnamese online news portals. It was originally sourced from a MongoDB dump containing over 20 million articles, from which our team extracted approximately 162,000 articles labeled with 13 distinct categories.
+
+ Dataset link: https://github.com/binhvq/news-corpus
+ ## Sample Data
+ Here is an example of the original data structure (shown in MongoDB shell notation, so `ISODate(...)` is not strict JSON):
+
+ ```json
+ {
+   "source": "Thanh Niên",
+   "title": "Đà Nẵng nghiên cứu tiện ích nhắn tin khi vi phạm đến chủ phương tiện",
+   "sapo": "Theo thống kê của Phòng CSGT (PC67, Công an TP.Đà Nẵng), từ ngày 1.1.2016 đến hết tháng 1.2018, PC67 gửi 13.479 lượt thông báo đến chủ phương tiện vi phạm luật Giao thông đường bộ.",
+   "body": "<p class=\"body-image\"><img src=\"https://photo-1-baomoi.zadn.vn/w700_r1/18/02/05/4/24858235/1_54839.jpg\"/></p><p class=\"body-text\"><em>Xử l&yacute; phạt nguội qua camera gi&aacute;m s&aacute;t tại Ph&ograve;ng CSGT C&ocirc;ng an TP.Đ&agrave; Nẵng - Nguyễn T&uacute;</em></p>...",
+   "id": 24858235,
+   "publish": ISODate("2018-02-04T22:15:07Z"),
+   "tags": [],
+   "keywords": ["Công an TP.Đà Nẵng", "Phan Văn Thương", "Luật giao thông đường bộ", ...],
+   "cates": ["Pháp luật"]
+ }
+ ```
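The `body` field stores escaped HTML with entities such as `&yacute;`. A minimal cleaning sketch using only the Python standard library (the function name and the tag-stripping regex are illustrative, not necessarily the cleaning actually applied to `news_v2`):

```python
import html
import re

def clean_body(raw_html: str) -> str:
    """Strip HTML tags and decode entities from an article body."""
    # Decode entities such as &yacute; and &agrave; first
    text = html.unescape(raw_html)
    # Remove remaining tags (a simple regex suffices for well-formed markup)
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()

sample = '<p class="body-text"><em>Xử l&yacute; phạt nguội qua camera</em></p>'
print(clean_body(sample))  # Xử lý phạt nguội qua camera
```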
+ ## Dataset Preprocessing
+
+ The dataset was preprocessed as follows:
+
+ - Extracted two main components: `content` and `category`.
+   - `content` combines the `title`, `sapo`, `body`, and `keywords` fields.
+   - `category` holds the classification label.
+ - Split the data into train, test, and validation sets with a 70% / 15% / 15% ratio.
+
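The two steps above can be sketched in Python. Field names follow the sample record; taking the first entry of `cates` as the single label and the fixed shuffle seed are assumptions, not details stated in this README:

```python
import random

def to_example(record):
    """Flatten a raw article into a {content, category} pair."""
    # content joins title, sapo, body, and keywords, as described above
    content = " ".join([
        record.get("title", ""),
        record.get("sapo", ""),
        record.get("body", ""),
        " ".join(record.get("keywords", [])),
    ]).strip()
    # cates may list several labels; the first one is taken here
    return {"content": content, "category": record["cates"][0]}

def split_70_15_15(examples, seed=42):
    """Shuffle and split into train (70%), test (15%), val (15%)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train, n_test = int(n * 0.70), int(n * 0.15)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val

raw = [{"title": f"t{i}", "sapo": "s", "body": "b",
        "keywords": ["k"], "cates": ["Pháp luật"]} for i in range(100)]
train, test, val = split_70_15_15([to_example(r) for r in raw])
print(len(train), len(test), len(val))  # 70 15 15
```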
+ ## File Structure
+ - `train_data_162k.json`: JSON file containing the training set.
+ - `test_data_162k.json`: JSON file containing the test set.
+ - `val_data_162k.json`: JSON file containing the validation set.
+ - `processed_data`: folder containing the preprocessed data (about 1.4 million articles).
+ - `news_v2`: folder containing data whose `body` field has been cleaned (about 1.4 million articles).
+ - `features_162k_phobertbase.pkl`: features extracted with the PhoBERT model.
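The `.pkl` file can be read back with Python's `pickle` module. The internal layout of the feature file is not documented in this README, so the list of (id, vector) pairs below is purely illustrative:

```python
import io
import pickle

# Hypothetical layout: (article id, feature vector) pairs; the real file's
# structure is not specified in this README.
features = [(24858235, [0.12, -0.05, 0.33])]

# Round-trip through an in-memory buffer, exactly as one would with
# open("features_162k_phobertbase.pkl", "rb") on the file itself.
buf = io.BytesIO()
pickle.dump(features, buf)
buf.seek(0)
restored = pickle.load(buf)

print(restored[0][0])  # 24858235
```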
data/test_data_162k.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b09d006f5f3f3bc50dffec503401a47475d4bf24a53fb7339c32c0fa5f1b4efa
+ size 78544060
data/train_data_162k.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d7810e472b4ce9aa33a2b4baf7b241a54f530e68a9536ca33f336f4aa7147215
+ size 368010064
data/val_data_162k.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5c6666d20f3b3f2440c7b576ac9dc12994d5f93a26313aefedb1678d853921b
+ size 77956028