Edited README.md

---
language: en
tags:
- SequenceClassification
license: mit
---

# BertForSequenceClassification model (Classical Chinese)

This BertForSequenceClassification model predicts whether a Classical Chinese sentence is a letter title (书信标题) or not. It inherits from the pre-trained BERT base Chinese model (an MLM), was fine-tuned on a large corpus of Classical Chinese text (a 3GB textual dataset), and is then combined with the BertForSequenceClassification architecture to perform a binary classification task.

### Labels: 0 = non-letter, 1 = letter

## Model description

The BertForSequenceClassification architecture inherits the BERT base model and concatenates a fully-connected linear layer on top to perform a binary classification task. More precisely, the model is trained with two objectives:

- Masked language modeling (MLM): the masked language modeling objective randomly masks 15% of the words in the input, and the model is trained to predict the masked words. The BERT base model uses this MLM objective and is pre-trained on a large corpus of data; BERT is proven to produce robust word embeddings and to capture rich contextual and semantic relationships. Our model inherits the publicly available pre-trained BERT Chinese model, which was trained on modern Chinese data. To perform the Classical Chinese letter classification task, we first fine-tuned this model on a large corpus of Classical Chinese data (3GB of text), and then connected it to the BertForSequenceClassification architecture for Classical Chinese letter classification.

- Sequence classification: the model concatenates a fully-connected linear layer to output the probability of each class. In our binary classification task, the final linear layer has two output classes (see the sketch after this list).
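
The snippet below is a minimal sketch (not part of the original card) of what this architecture looks like in Hugging Face Transformers: loading the public `bert-base-chinese` checkpoint into `BertForSequenceClassification` with `num_labels=2` attaches a single fully-connected linear layer with two outputs on top of BERT.

```python
from transformers import BertForSequenceClassification

# Illustrative only: start from the public BERT base Chinese checkpoint and
# attach a two-class classification head. The head is randomly initialized
# here; the released model ships fine-tuned weights for this layer.
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)

print(model.classifier)         # Linear(in_features=768, out_features=2, bias=True)
print(model.config.num_labels)  # 2
```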

## Intended uses & limitations

Note that this model is primarily aimed at predicting whether a Classical Chinese sentence is a letter title (书信标题) or not.

### How to use

Here is how to use this model to classify a given sentence in PyTorch:

```python
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
import torch
from numpy import exp

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model_path = '/content/drive/MyDrive/CBDB/Letter_Classifier/model/letter_classifer_epoch2'  # local path to the fine-tuned checkpoint
model = BertForSequenceClassification.from_pretrained(model_path,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

def softmax(vector):
    e = exp(vector)
    return e / e.sum()

def predict_class(test_sen):
    # tokenize the sentence and return PyTorch tensors
    tokens_test = tokenizer.encode_plus(
        test_sen,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,
        max_length=max_seq_len,
        return_tensors='pt',
        truncation=True
    )

    test_seq = tokens_test['input_ids']
    test_mask = tokens_test['attention_mask']

    # get predictions for the test sentence
    with torch.no_grad():
        outputs = model(test_seq, test_mask)
        outputs = outputs.logits.detach().cpu().numpy()

    # convert logits to class probabilities keyed by label name
    softmax_score = softmax(outputs)
    pred_class_dict = {k: v for k, v in zip(label2idx.keys(), softmax_score[0])}
    return pred_class_dict

max_seq_len = 512
label2idx = {'not-letter': 0, 'letter': 1}
idx2label = {v: k for k, v in label2idx.items()}

test_sen = '上丞相康思公書'
pred_class_dict = predict_class(test_sen)
print(pred_class_dict)
```
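
As a small follow-up (not from the original card), the dictionary returned by `predict_class` maps each label name to its softmax probability, so the predicted class can be read off directly:

```python
# Hypothetical continuation of the example above: pick the most probable label.
best_label = max(pred_class_dict, key=pred_class_dict.get)
print(best_label, pred_class_dict[best_label])  # label name and its probability
```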