File size: 626 Bytes
ee21b96
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

# Install dependency
```bash
pip install -r requirement.txt
```

# Download the data set
```bash
export WORKDIR_ROOT=<a directory which will hold all working files>

```
The downloaded data will be at $WORKDIR_ROOT/ML50

# preprocess the data
Install SPM [here](https://github.com/google/sentencepiece)
```bash
export WORKDIR_ROOT=<a directory which will hold all working files>
export SPM_PATH=<a path pointing to sentencepice spm_encode.py>
```
* $WORKDIR_ROOT/ML50/raw: extracted raw data
* $WORKDIR_ROOT/ML50/dedup: dedup data
* $WORKDIR_ROOT/ML50/clean: data with valid and test sentences removed from the dedup data