---
language:
- zh
tags:
- LinkTransformer
- Office Title Disambiguation/Similarity
- 古代官职
- 古文
- 文言文
- ancient
- classical chinese
license: cc-by-nc-sa-4.0
---

# <font color="IndianRed"> OfficeTitleDis (Classical Chinese Office Title Disambiguation/Similarity)</font>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ql7NkLOGdEf2IaPg_9khGxev3OkZIaXu?usp=sharing)

This model was fine-tuned using the methodology from the paper ["LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models"](https://scholar.harvard.edu/sites/scholar.harvard.edu/files/dell/files/linkt.pdf) by Abhishek Arora and Melissa Dell of Harvard University.

### <font color="IndianRed">Model Description </font>
This model finds the top \(N\) most similar Classical Chinese office titles in a given DataFrame. Given an input DataFrame containing \(K\) office titles, it returns, for every office title, the \(N\) most similar office titles from that same DataFrame.
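
The top-\(N\) lookup described above can be pictured with a small, self-contained sketch. It assumes that each title is represented by an embedding vector and that similarity is measured by cosine similarity. The toy 2-D vectors, the `top_n_similar` helper, and the choice to exclude a title from its own result list are all hypothetical illustrations, not the model's actual embeddings or the LinkTransformer implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(embeddings, n):
    """For each title, list the n other titles with the highest cosine similarity."""
    result = {}
    for title, vec in embeddings.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != title  # assumption: a title is not matched to itself
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[title] = [other for other, _ in scored[:n]]
    return result


# Hypothetical 2-D vectors standing in for real model embeddings.
toy_embeddings = {
    "知州": [1.0, 0.1],
    "知縣": [0.9, 0.2],
    "翰林學士": [0.1, 1.0],
}

neighbours = top_n_similar(toy_embeddings, n=1)
```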

### <font color="IndianRed">Fine-tuning Data </font>
The fine-tuning data for this model is supported by the China Biographical Database (CBDB) at Harvard University. All office titles in the training data date from the Song, Ming, and Qing dynasties.

--- 

### <font color="IndianRed">Usage</font>

This section shows how to load the OfficeTitleDis model directly.

Make sure the model is downloaded and the required libraries are installed in your Python environment. If they are not, clone the model repository with Git LFS and install the libraries with pip:

```bash
git lfs install
git clone https://huggingface.co/cbdb/OfficeTitleDis
pip install linktransformer
pip install hanziconv
```

Now, let's load our model and make some predictions:

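```python
# Import necessary libraries from linktransformer
import linktransformer as lt

# df1 and df2 are pandas DataFrames that each contain an "office_name" column.
# The model path assumes the repository was cloned into /content (as on Colab).
df_lm_matched = lt.merge(
    df1, df2, merge_type="1:m", on="office_name",
    model="/content/OfficeTitleDis/model", left_on=None, right_on=None,
)
print(df_lm_matched.head())  # in a notebook, display(...) also works
```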
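
The merge call above expects two DataFrames that share an `office_name` column. As a minimal sketch of one way to prepare that input, the hypothetical helper `build_rows` below cleans a list of raw titles into row dictionaries. The helper and the sample titles are illustrative assumptions and are not taken from the CBDB training data.

```python
def build_rows(titles):
    """Strip whitespace, drop empty strings and exact duplicates, keep first-seen order."""
    seen = set()
    rows = []
    for raw in titles:
        title = raw.strip()
        if title and title not in seen:
            seen.add(title)
            rows.append({"office_name": title})
    return rows


# Illustrative titles only.
left_rows = build_rows(["知州", " 知縣 ", "知州", ""])
right_rows = build_rows(["知府", "縣令"])

# With pandas installed, each list converts directly into a DataFrame:
#   df1 = pandas.DataFrame(left_rows)
#   df2 = pandas.DataFrame(right_rows)
```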
---


### <font color="IndianRed">Authors </font>
Queenie Luo (queenieluo[at]g.harvard.edu)
<br>
Hongsu Wang
<br>
Peter Bol
<br>
CBDB Group

### <font color="IndianRed">License </font>
Copyright (c) 2023 CBDB

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or
send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.