satyaalmasian committed on
Commit
f2dfb20
1 Parent(s): ece9d45

Create README.md

# BERT based temporal tagger

Token classifier for temporal tagging of plain text, using the BERT language model with an extra date embedding for the reference date of the document. The model is introduced in the paper BERT got a Date: Introducing Transformers to Temporal Tagging and released in this [repository](https://github.com/satya77/Transformer_Temporal_Tagger).

# Model description
BERT is a transformer model pretrained on a large corpus of English data in a self-supervised fashion. We use BERT for token classification to tag the tokens in the text with the following classes:
```
O -- outside of a tag
I-TIME -- inside tag of time
B-TIME -- beginning tag of time
I-DATE -- inside tag of date
B-DATE -- beginning tag of date
I-DURATION -- inside tag of duration
B-DURATION -- beginning tag of duration
I-SET -- inside tag of the set
B-SET -- beginning tag of the set
```
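For illustration, a hypothetical sentence annotated with this scheme could look as follows (the sentence and its tags are only an example we made up, not model output):

```
We       O
meet     O
every    B-SET
Monday   I-SET
at       O
10       B-TIME
am       I-TIME
.        O
```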
This model is similar to `satyaalmasian/temporal_tagger_BERT_tokenclassifier` but contains an additional date embedding layer for the reference date of the document. If your data contains such information, this model is preferred.

# Intended uses & limitations
This model is best used together with the code from the [repository](https://github.com/satya77/Transformer_Temporal_Tagger). Especially for inference, the direct output can be noisy and hard to decipher; the repository provides alignment functions and voting strategies for the final output.

# How to use
You can load the model as follows:
```
from transformers import AutoTokenizer, BertForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("satyaalmasian/temporal_tagger_DATEBERT_tokenclassifier", use_fast=False)
model = BertForTokenClassification.from_pretrained("satyaalmasian/temporal_tagger_DATEBERT_tokenclassifier")
date_tokenizer = NumBertTokenizer("../data/vocab_date.txt")  # from the repository
```
For inference, use:
```
import torch

# date_input holds the reference date of the document, input_text the text to tag
processed_date = torch.LongTensor(date_tokenizer(date_input, add_special_tokens=False)["input_ids"])
processed_text = tokenizer(input_text, return_tensors="pt")
processed_text["input_date_ids"] = processed_date
result = model(**processed_text)
classification = result[0]
```
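The `classification` tensor contains one logit vector per input token. A minimal sketch for reading off per-token labels (this assumes the model's `config.id2label` maps class indices to the tag names listed above; it is only a rough readout, not the alignment and voting post-processing from the repository):

```
import torch

# pick the highest-scoring class for each token in the first (only) sequence
predictions = torch.argmax(classification, dim=2)[0]
tokens = tokenizer.convert_ids_to_tokens(processed_text["input_ids"][0].tolist())
for token, label_id in zip(tokens, predictions.tolist()):
    print(token, model.config.id2label[label_id])
```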
For an example with post-processing, refer to the [repository](https://github.com/satya77/Transformer_Temporal_Tagger).
We provide a function `merge_tokens` to decipher the output.
To further fine-tune, use the `Trainer` from Hugging Face. An example of a similar fine-tuning can be found [here](https://github.com/satya77/Transformer_Temporal_Tagger/blob/master/run_token_classifier.py).
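As a rough sketch of such a fine-tuning setup (`train_dataset` is a placeholder for your own tokenized token-classification data, the output path is hypothetical, and the hyperparameters simply mirror those listed under Training procedure below; the linked script is the authoritative reference):

```
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./date_bert_finetuned",   # hypothetical output path
    per_device_train_batch_size=34,
    learning_rate=5e-05,
    seed=19,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
)
trainer.train()
```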

# Training data
We use three data sources:
[Tempeval-3](https://www.cs.york.ac.uk/semeval-2013/task1/index.php%3Fid=data.html), Wikiwars, and Tweets datasets. For the correct data versions, please refer to our [repository](https://github.com/satya77/Transformer_Temporal_Tagger).

# Training procedure
The model is trained from publicly available checkpoints on Hugging Face (`bert-base-uncased`) with a batch size of 34. We use a learning rate of 5e-05 with an Adam optimizer and linear weight decay.
We fine-tune with 5 different random seeds; this version of the model uses seed 19.
For training, we use 2 NVIDIA A100 GPUs with 40GB of memory.