# T5S

Natural Language Processing (NLP) is one of the areas where machine learning has been most effective. Whereas NLP traditionally required a great deal of human intervention, today this is no longer true: deep learning can be used for learning tasks related to language, such as translation, classification, entity recognition or, in this case, summarization.

Summarization is the task of condensing a piece of text into a shorter version, reducing the size of the original text while preserving its key informational elements and meaning. Since manual text summarization is time-consuming and generally laborious, automating the task is gaining popularity and constitutes a strong motivation for academic research.

Text summarization has important applications in various NLP-related tasks such as text classification, question answering, legal text summarization, news summarization, and headline generation. Moreover, summary generation can be integrated into these systems as an intermediate stage that helps reduce the length of a document.

Our goal here was to create a reproducible pipeline for text summarization: train the model, visualize the results, and upload the model. This project is built using the DVC cookiecutter template provided by DAGsHub.

The t5s package for text summarization can be installed with pip:

```
pip install t5s
```

## Pipeline

Once the package is installed we can use the training pipeline. Before describing how the package works, let's look at what each stage of the pipeline does.

![image](https://user-images.githubusercontent.com/49101362/129772732-438e700b-b0f0-4a74-832e-27628d8c2da3.png)

The first stage of our pipeline downloads data from the Hugging Face Hub. For training we used the CNN/DailyMail dataset. The download is configured by a parameter file, data_params.yml, which defines the dataset and the split we want to train on. Running the download_data stage downloads the data and stores it as raw data, which we then process.

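Under the hood, that download stage boils down to a call to the Hugging Face datasets library. A minimal sketch of the idea (the real stage reads its settings from data_params.yml; the output path here is illustrative):

```
from datasets import load_dataset

# Fetch the CNN/DailyMail dataset from the Hugging Face Hub;
# "3.0.0" is the standard configuration name for this dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

# Store the raw split on disk for the processing stage to pick up
dataset.to_csv("data/raw/train.csv")
```
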
Once the raw data is saved, we process it with our processing script: we change the column names and reshape the data to work with the training script. At this point the data is also split into three files: train.csv, validation.csv and test.csv.

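The exact transformations live in the repo's processing script; the sketch below shows the general idea, assuming the CNN/DailyMail column names and purely illustrative target names and paths:

```
import pandas as pd

# Load the raw split written by the download stage (illustrative path)
df = pd.read_csv("data/raw/train.csv")

# Rename the dataset's native columns ("article", "highlights") to the
# names the training script expects (illustrative target names)
df = df.rename(columns={"article": "input_text", "highlights": "output_text"})

# The same is done for the validation and test splits
df.to_csv("data/processed/train.csv", index=False)
```
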
Now we can move on to training the model. The training code is written in PyTorch Lightning, and the script can train T5, mT5 and byT5 models. All of the script's parameters are controlled by the model_params.yml file. The training stage returns a model that can be saved, along with training metrics that are logged using MLflow and DAGsHub.

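To give a feel for what such a training setup looks like, here is a minimal PyTorch Lightning sketch. This is not the repo's actual module; the class name, hyperparameters and batch layout are illustrative:

```
import torch
import pytorch_lightning as pl
from transformers import AutoModelForSeq2SeqLM

class SummarizationModule(pl.LightningModule):
    def __init__(self, model_name="t5-base", lr=2e-5):
        super().__init__()
        # AutoModelForSeq2SeqLM loads T5, mT5 and byT5 checkpoints alike
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch is assumed to hold tokenized tensors; T5-style models
        # compute the cross-entropy loss internally when labels are given
        outputs = self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```
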
Next we need to evaluate the model we have trained. To do so we use the ROUGE metric, computed on the test dataset. The evaluation metrics are also saved using DAGsHub, and once we commit all the models to Git we can compare them from the DAGsHub repo.

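ROUGE compares a generated summary against a reference by measuring n-gram overlap. As an illustration, here is how a single prediction can be scored with the rouge_score package (whether the repo uses this exact implementation is an assumption):

```
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the cat sat on the mat"
prediction = "a cat was sitting on the mat"

# Each entry holds precision, recall and F-measure for one ROUGE variant
scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure)
```
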
![image](https://user-images.githubusercontent.com/49101362/129772801-063ec2fd-feb2-401b-ab9c-0d9250447d1a.png)

We can also visualize and test the model's results using a Streamlit app, which can be accessed via Hugging Face Spaces. We also have the option of running the upload script to push the model to the Hugging Face Hub.

![image](https://user-images.githubusercontent.com/49101362/129772845-8a93b3ce-ad6b-44ce-aa41-0b6da65a8ac4.png)

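The repo ships its own app, but a Streamlit front end of this kind can be very small. A minimal sketch, assuming a transformers summarization pipeline (the checkpoint name is illustrative):

```
import streamlit as st
from transformers import pipeline

st.title("T5S summarization demo")
text = st.text_area("Paste the text to summarize")

if st.button("Summarize") and text:
    # Reloading the model on every click keeps the sketch short;
    # a real app would cache it instead
    summarizer = pipeline("summarization", model="t5-small")
    summary = summarizer(text, max_length=60, min_length=10)[0]["summary_text"]
    st.write(summary)
```
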
## T5S CLI

To make the pipeline easy to run, we have set up a CLI application.

To use it, first install t5s:

```
pip install t5s
```

The first step is to clone the repo containing the code. Before cloning, make sure you have forked the main repo, so that pushing and pulling will be faster:

```
t5s clone [-h] [-u USERNAME]
```

We then have to create the directories required to run the pipeline:

```
t5s dirs
```

Now, to define the parameters for the run, we run:

```
t5s start [-h] [-d DATASET] [-s SPLIT] [-n NAME] [-mt MODEL_TYPE]
          [-m MODEL_NAME] [-e EPOCHS] [-lr LEARNING_RATE]
          [-b BATCH_SIZE]
```

Then we need to pull the models from DVC:

```
t5s pull
```

Now we can run the training pipeline:

```
t5s run
```

Before pushing, make sure that the DVC remote is set up correctly:

```
dvc remote modify origin url https://dagshub.com/{user_name}/summarization.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user {user_name}
dvc remote modify origin --local password {your_token}
```

Finally, to push the model to DVC:

```
t5s push
```

To push the model to the Hugging Face Hub for inference, run:

```
t5s upload
```
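Once uploaded, the model can be loaded straight from the Hub, for example with the transformers pipeline API. The model id below is a placeholder for whatever name your upload used:

```
from transformers import pipeline

# "{user_name}/summarization" is a placeholder, not a real checkpoint id
summarizer = pipeline("summarization", model="{user_name}/summarization")
print(summarizer("Long article text goes here ...", max_length=60)[0]["summary_text"])
```
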
Next, if we would like to test the model and visualize the results, we can run:

```
t5s visualize
```