Bictole committed on
Commit a7a1ef5
1 Parent(s): dd09599

Update README.md

Files changed (1): README.md +37 -0
README.md CHANGED
@@ -1,3 +1,40 @@
  ---
+ language: en
  license: mit
+ datasets:
+ - imdb
  ---
+
+ # NLP Deep 2
+
+ Fine-tuned model from [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.) on the IMDB dataset.
+
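+ A minimal usage sketch with the 🤗 `transformers` pipeline; the repository id below is hypothetical, substitute the actual Hub id of this model.
+
+ ```python
+ from transformers import pipeline
+
+ # "Bictole/NLP_DEEP_2" is a hypothetical repo id; replace it with this model's actual Hub id.
+ classifier = pipeline("text-classification", model="Bictole/NLP_DEEP_2")
+ print(classifier("This movie was a masterpiece from start to finish."))
+ ```
+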
+ ## Model description
+
+ DistilBERT is a Transformer model, smaller and faster than BERT, that was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on raw texts only, with no human labelling (which is why it can use lots of publicly available data), with an automatic process that generates inputs and labels from those texts using the BERT base model.
+
+ ## Training data
+
+ The NLP Deep 2 model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers).
+
+ It was fine-tuned on the [IMDB](https://arxiv.org/abs/2005.14147) dataset, a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data. Both raw text and an already-processed bag-of-words format are provided; see the README file contained in the dataset release for more details.
+
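+ The original does not state how the data was loaded; a minimal sketch using the 🤗 `datasets` library:
+
+ ```python
+ from datasets import load_dataset
+
+ # IMDB from the Hugging Face Hub: 25,000 train and 25,000 test reviews,
+ # plus an "unsupervised" split of unlabeled texts.
+ imdb = load_dataset("imdb")
+ print(imdb["train"][0]["label"], imdb["train"][0]["text"][:80])
+ ```
+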
+ ## Training procedure
+
+ ### Preprocessing
+
+ The texts are tokenized using **DistilBertTokenizerFast**. The inputs to the model are then of the form:
+
+ ```
+ [CLS] Sentence A [SEP] Sentence B [SEP]
+ ```
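+
+ A short sketch of this tokenization, assuming the `distilbert-base-uncased` vocabulary (the exact checkpoint is not stated above):
+
+ ```python
+ from transformers import DistilBertTokenizerFast
+
+ # Assumption: the tokenizer vocabulary comes from distilbert-base-uncased.
+ tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
+ encoded = tokenizer("Sentence A", "Sentence B")
+ print(tokenizer.decode(encoded["input_ids"]))
+ # -> [CLS] sentence a [SEP] sentence b [SEP]
+ ```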
+
+ ## Evaluation results
+
+ // TODO