---
language:
- zh
tags:
- LinkTransformer
- Office Title Disambiguation/Similarity
- 古代官职
- 古文
- 文言文
- ancient
- classical chinese
license: cc-by-nc-sa-4.0
---

# OfficeTitleDis (Classical Chinese Office Title Disambiguation/Similarity)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ql7NkLOGdEf2IaPg_9khGxev3OkZIaXu?usp=sharing)

This model was fine-tuned using the methods described in the paper ["LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models"](https://scholar.harvard.edu/sites/scholar.harvard.edu/files/dell/files/linkt.pdf) by Abhishek Arora and Melissa Dell from Harvard University.

### Model Description

This model is designed to find the top \(N\) most similar Classical Chinese office titles in a given DataFrame. Given an input DataFrame containing \(K\) office titles, the model outputs, for every office title, the top \(N\) most similar office titles in that DataFrame.

### Fine-tuning Data

The data used for fine-tuning this model is provided by the China Biographical Database (CBDB) at Harvard University. All office titles in the training data come from the Song, Ming, and Qing dynasties.

---

### Usage

This section shows how to load the OfficeTitleDis model directly. Make sure the model is downloaded and the required libraries are installed in your Python environment. If they are not, download the model with git and install the libraries with pip:

```bash
git lfs install
git clone https://huggingface.co/cbdb/OfficeTitleDis
pip install linktransformer
pip install hanziconv
```

Now, let's load the model and make some predictions:

```python
# Import necessary libraries
import linktransformer as lt

# df1 and df2 are pandas DataFrames that each contain an "office_name" column.
# The model path below points to the cloned repository (Colab location shown);
# adjust it to wherever you cloned OfficeTitleDis.
df_lm_matched = lt.merge(df1, df2, merge_type='1:m', on="office_name", model="/content/OfficeTitleDis/model", left_on=None, right_on=None)

# Inspect the first few matches
print(df_lm_matched.head())
```

---

### Authors

Queenie Luo (queenieluo[at]g.harvard.edu)
Hongsu Wang
Peter Bol
CBDB Group

### License

Copyright (c) 2023 CBDB

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
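
### Example: Preparing Input DataFrames

The Usage section passes two DataFrames, `df1` and `df2`, that share an `office_name` column, but it does not show how they are built. The following is a minimal sketch of one way to prepare them with pandas. The helper `make_title_frame` and the sample titles are hypothetical and chosen only for illustration; they are not part of LinkTransformer or of the CBDB data release.

```python
import pandas as pd


def make_title_frame(titles, column="office_name"):
    """Build a one-column DataFrame of unique, whitespace-trimmed office titles.

    Hypothetical helper for illustration; not part of LinkTransformer.
    """
    # Drop missing values, coerce to str, and trim surrounding whitespace.
    cleaned = pd.Series(list(titles), dtype="object").dropna().astype(str).str.strip()
    # Remove empty strings and repeated titles, keeping first occurrences.
    cleaned = cleaned[cleaned != ""].drop_duplicates()
    return pd.DataFrame({column: cleaned.tolist()})


# Illustrative sample titles (assumed inputs, not taken from the training data).
df1 = make_title_frame(["尚書", " 侍郎 ", "尚書", ""])
df2 = make_title_frame(["知縣", "知府"])

print(df1)
print(df2)
```

DataFrames built this way can be passed as `df1` and `df2` to the `lt.merge` call in the Usage section with `on="office_name"`. Deduplicating and trimming first is a design choice that avoids matching the same title against itself more than once.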