---
language:
- zh
tags:
- SequenceClassification
- Lepton
- 古文
- 文言文
- ancient
- classical
- letter
- 书信标题
license: cc-by-nc-sa-4.0
---

# <font color="IndianRed"> LEPTON (Classical Chinese Letter Prediction)</font>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jVu2LrNwkLolItPALKGNjeT6iCfzF8Ic?usp=sharing/)

Our model <font color="cornflowerblue">LEPTON (Classical Chinese Letter Prediction)</font> is a BertForSequenceClassification model for Classical Chinese that predicts whether a Classical Chinese sentence is <font color="IndianRed"> a letter title (书信标题) </font> or not. The model starts from the pre-trained BERT base Chinese model (MLM), is fine-tuned on a large corpus of Classical Chinese (a 3GB textual dataset), and is then topped with the BertForSequenceClassification head to perform a binary classification task.
 * <font color="Salmon"> Labels: 0 = non-letter, 1 = letter </font>

## <font color="IndianRed"> Model description </font>

The BertForSequenceClassification architecture takes the BERT base model and adds a fully-connected linear layer on top to perform a binary classification task. More precisely, our model is built from two components:

- Masked language modeling (MLM): the MLM objective randomly masks 15% of the words in the input, and the model is trained to predict the masked words. The BERT base model uses this MLM objective and is pre-trained on a large corpus of data; BERT is proven to produce robust word embeddings that capture rich contextual and semantic relationships. Our model inherits the publicly available pre-trained BERT Chinese model, which was trained on modern Chinese data. To adapt it to Classical Chinese, we first fine-tuned the model on a large corpus of Classical Chinese data (a 3GB textual dataset), and then connected it to the BertForSequenceClassification architecture for Classical Chinese letter classification.

- Sequence classification: the model adds a fully-connected linear layer on top of the BERT encoder to output a probability for each class. In our binary classification task, this final linear layer has two output classes (see the sketch after this list).

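As an illustration, the snippet below inspects the classification head that BertForSequenceClassification attaches to the encoder. It is a minimal sketch assuming the model from this repository; the printed shapes reflect the standard BERT base configuration (hidden size 768, two labels).

```python
from transformers import BertForSequenceClassification

# Load the fine-tuned model and inspect its classification head.
model = BertForSequenceClassification.from_pretrained(
    'cbdb/ClassicalChineseLetterClassification')

print(model.config.num_labels)  # 2 (not-letter vs. letter)
print(model.classifier)         # Linear(in_features=768, out_features=2, bias=True)
```
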
## <font color="IndianRed"> Intended uses & limitations </font>

Note that this model is primarily aimed at predicting whether a Classical Chinese sentence is a letter title (书信标题) or not.

### <font color="IndianRed"> How to use </font>

Here is how to use this model to classify a given text in PyTorch:

<font color="cornflowerblue"> 1. Import model and packages </font>
```python
import numpy as np
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the tokenizer of the base Chinese BERT model and the fine-tuned
# classification weights from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForSequenceClassification.from_pretrained(
    'cbdb/ClassicalChineseLetterClassification',
    output_attentions=False,
    output_hidden_states=False,
)
```
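
Optionally, if a GPU is available, you can move the model to it before running predictions. This is a minimal sketch; the prediction code below assumes CPU tensors, so input tensors would need to be moved to the same device as well.

```python
import torch

# Optional: run inference on a GPU when one is available. Input tensors
# passed to the model must then be moved to the same device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```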

<font color="cornflowerblue"> 2. Make a prediction </font>
```python
max_seq_len = 512

# Map between class indices and human-readable labels. Defining these
# before predict_class avoids relying on a forward reference.
label2idx = {'not-letter': 0, 'letter': 1}
idx2label = {v: k for k, v in label2idx.items()}

def softmax(vector):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(vector - np.max(vector))
    return e / e.sum()

def predict_class(test_sen):
    # Tokenize the sentence; return_tensors='pt' already yields PyTorch tensors.
    tokens_test = tokenizer.encode_plus(
        test_sen,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,
        max_length=max_seq_len,
        return_tensors='pt',
        truncation=True,
    )

    test_seq = tokens_test['input_ids']
    test_mask = tokens_test['attention_mask']

    # Get predictions for the sentence without tracking gradients.
    with torch.no_grad():
        outputs = model(test_seq, attention_mask=test_mask)
        logits = outputs.logits.detach().cpu().numpy()

    softmax_score = softmax(logits)
    return {k: v for k, v in zip(label2idx.keys(), softmax_score[0])}
```

<font color="cornflowerblue"> 3. Change your sentence here </font>
```python
# Replace test_sen with your own Classical Chinese sentence.
test_sen = '上丞相康思公書'
pred_class_proba = predict_class(test_sen)
print(f'The predicted probability for the {list(pred_class_proba.keys())[0]} class: {list(pred_class_proba.values())[0]}')
print(f'The predicted probability for the {list(pred_class_proba.keys())[1]} class: {list(pred_class_proba.values())[1]}')
```
<font color="IndianRed"> Output: </font> The predicted probability for the not-letter class: 0.002029061783105135

<font color="IndianRed"> Output: </font> The predicted probability for the letter class: 0.9979709386825562

```python
pred_class = idx2label[np.argmax(list(pred_class_proba.values()))]
print(f'The predicted class is: {pred_class}')
```
<font color="IndianRed"> Output: </font> The predicted class is: letter
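
To classify several sentences at once, the tokenizer can pad a whole batch in one call. The sketch below makes the same assumptions as the code above; the example sentences are hypothetical inputs.

```python
# Hypothetical example inputs: a letter title and a non-letter title.
sentences = ['上丞相康思公書', '桃花源記']

batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=max_seq_len, return_tensors='pt')

with torch.no_grad():
    logits = model(**batch).logits

probs = torch.softmax(logits, dim=-1)
for sen, p in zip(sentences, probs):
    print(sen, '->', idx2label[int(p.argmax())], f'({float(p.max()):.4f})')
```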

### <font color="IndianRed">Authors </font>
Queenie Luo (queenieluo[at]g.harvard.edu)
<br>
Katherine Enright
<br>
Hongsu Wang
<br>
Peter Bol
<br>
CBDB Group

### <font color="IndianRed">License </font>
Copyright (c) 2023 CBDB

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or
send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.