---
language:
- zh
tags:
- SequenceClassification
- 古文
- 文言文
- ancient
- classical
license: cc-by-nc-sa-4.0
---
# BertForSequenceClassification model (Classical Chinese)
This BertForSequenceClassification model predicts whether a Classical Chinese sentence is a letter title (书信标题) or not. It starts from the BERT base Chinese model (MLM), is fine-tuned on a large corpus of Classical Chinese text (3GB), and is then combined with the BertForSequenceClassification architecture to perform binary classification.
#### Labels: 0 = non-letter, 1 = letter
## Model description
The BertForSequenceClassification architecture takes the BERT base model and adds a fully-connected linear layer on top to perform binary classification. More precisely, the model was trained with two objectives:
- Masked language modeling (MLM): masked language modeling randomly masks 15% of the tokens in the input, and the model is trained to predict the masked tokens. The BERT base model is pre-trained with this objective on a large corpus. BERT is known to produce robust embeddings that capture rich contextual and semantic relationships. Our model starts from the publicly available pre-trained BERT Chinese model, which was trained on modern Chinese data. To adapt it to Classical Chinese letter classification, we first fine-tuned the model on a large corpus of Classical Chinese text (3GB), and then connected it to the BertForSequenceClassification architecture.
- Sequence classification: the model adds a fully-connected linear layer that outputs one logit per class, and a softmax turns the logits into class probabilities. In our binary classification task, the final linear layer has two outputs.
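The two objectives above can be illustrated with a small, dependency-free sketch. It is not the training code of this model: `mask_tokens` and `classify` are hypothetical helper names, the feature vector, weights, and bias are made-up numbers, and the masking is simplified (the original BERT recipe replaces only part of the selected tokens with `[MASK]` and leaves or randomizes the rest).

```python
import math
import random

MASK = '[MASK]'

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Replace roughly `ratio` of the tokens with [MASK] (simplified MLM masking)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)

def classify(features, weights, bias):
    """Fully-connected linear layer followed by a softmax over the classes."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# MLM side: mask characters of a Classical Chinese sentence
tokens = list('上丞相康思公書')
masked, positions = mask_tokens(tokens)
print(masked, positions)

# Classification side: a toy 3-dimensional sentence feature and a 2-class head
features = [0.5, -1.0, 2.0]                     # made-up values
weights = [[0.1, 0.2, -0.3], [-0.1, 0.4, 0.3]]  # one row per class (made-up)
bias = [0.0, 0.1]
probs = classify(features, weights, bias)
print(probs)  # two probabilities: index 0 = not-letter, index 1 = letter
```

In the real model the feature vector is the 768-dimensional pooled BERT output rather than three hand-picked numbers, but the head performs the same kind of computation.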
## Intended uses & limitations
Note that this model is primarily aimed at predicting whether a Classical Chinese sentence is a letter title (书信标题) or not.
### How to use
Here is how to use this model to get the predicted class probabilities for a given text in PyTorch:
1. Import model
```python
import torch
from numpy import exp
from transformers import BertForSequenceClassification, BertTokenizer

# The tokenizer comes from the base Chinese BERT checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# Load the fine-tuned classifier; attentions and hidden states are not needed
model = BertForSequenceClassification.from_pretrained(
    'cbdb/ClassicalChineseLetterClassification',
    output_attentions=False,
    output_hidden_states=False,
)
```
2. Make a prediction
```python
max_seq_len = 512
label2idx = {'not-letter': 0, 'letter': 1}
idx2label = {v: k for k, v in label2idx.items()}

def softmax(vector):
    # Normalize the logits of a single sentence into probabilities
    e = exp(vector)
    return e / e.sum()

def predict_class(test_sen):
    # Tokenize the sentence, truncating to the maximum BERT input length
    tokens_test = tokenizer.encode_plus(
        test_sen,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,
        max_length=max_seq_len,
        return_tensors='pt',
        truncation=True
    )
    # encode_plus already returns tensors because return_tensors='pt'
    test_seq = tokens_test['input_ids']
    test_mask = tokens_test['attention_mask']
    # Get predictions for the test sentence
    with torch.no_grad():
        outputs = model(test_seq, attention_mask=test_mask)
    outputs = outputs.logits.detach().cpu().numpy()
    softmax_score = softmax(outputs)
    # Map each label name to its predicted probability
    pred_class_dict = {k: v for k, v in zip(label2idx.keys(), softmax_score[0])}
    return pred_class_dict
```
3. Try your own sentence
```python
test_sen = '上丞相康思公書'
pred_class_proba = predict_class(test_sen)
# Print the predicted probability of each class
for label, proba in pred_class_proba.items():
    print(f'The predicted probability for the {label} class: {proba}')
# The predicted class is the label with the highest probability
pred_class = max(pred_class_proba, key=pred_class_proba.get)
print(f'The predicted class is: {pred_class}')
print(pred_class_proba)
```
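The `softmax` helper in step 2 divides by the sum over the whole array, which suits one sentence at a time. If logits for several sentences are collected, each row would need to be normalized on its own. The following is a minimal sketch of that post-processing step in plain Python; `softmax_rows`, `label_rows`, and `LABELS` are hypothetical names, and the logits are made-up numbers, not real model output.

```python
import math

LABELS = ['not-letter', 'letter']  # index 0 and index 1, as in the label mapping above

def softmax_rows(logit_rows):
    """Apply softmax to each row (one row of logits per sentence)."""
    probs = []
    for row in logit_rows:
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(z - shift) for z in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def label_rows(logit_rows):
    """Return (label, probability) of the most likely class for each sentence."""
    results = []
    for row in softmax_rows(logit_rows):
        best = max(range(len(row)), key=row.__getitem__)
        results.append((LABELS[best], row[best]))
    return results

# Made-up logits for three sentences (illustration only)
example_logits = [[-2.0, 3.0], [1.5, -0.5], [0.0, 0.0]]
results = label_rows(example_logits)
for label, proba in results:
    print(f'{label}: {proba:.3f}')
```

With the real model, the rows could be taken from `outputs.logits.tolist()` after a batched forward pass, assuming the sentences were tokenized together with padding.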