
Model Card for opendeid-160m-ft-full

The OpenDeid AICUP Suite is a collection of models developed to facilitate deidentification and temporal normalization research (see paper). It contains eight models of sizes 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B, based on the Pythia Scaling Suite. For the 70M size there are two models: one trained on the original OpenDeid-AICUP corpus, and one trained on the corpus generated by the previous model.

Model Details

Model Description

This model was trained on the full OpenDeid-AICUP corpus released in the AICUP 2023 competition.

  • Developed by: ISLab
  • Model type: Transformer-based Language Model
  • Language: English
  • License: Apache 2.0
  • Finetuned from model: EleutherAI/pythia-160m

Model Sources

  • Repository: ISLab-git
  • Paper: [More Information Needed]
  • Demo: [More Information Needed]

Uses

The primary intended use of the OpenDeid AICUP Suite is research on the behavior, functionality, and limitations of large language models for the deidentification and normalization tasks proposed in the AICUP 2023 competition. This suite is intended to provide a controlled setting for performing scientific experiments.

The models in the suite work with the Hugging Face Transformers library. You may also further fine-tune and adapt the model for deployment, as long as your use complies with the Apache 2.0 license and you conduct your own risk and bias assessment.
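
As a minimal sketch of Transformers usage (not the competition's pre-defined prompt format, which this card does not specify), the model can be loaded and queried as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "ISLabResearch/opendeid-160m-ft-full"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Greedy decoding for up to 50 new tokens; "Hello" is only a placeholder prompt.
inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))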

Direct Use

[More Information Needed]

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

Similar to the original Pythia Suite, the OpenDeid AICUP Suite is not intended for deployment. It is not in itself a product and should not be used for human-facing interactions. For example, the model may generate harmful or offensive text. Please evaluate the risks associated with your particular use case.

The OpenDeid models are English-language only, and are not suitable for translation or generating text in other languages.

OpenDeid-160M has been fine-tuned for the sensitive health information recognition and normalization tasks using a pre-defined output format. This means the OpenDeid AICUP Suite will not respond to a given prompt the way a product like ChatGPT does; ChatGPT was fine-tuned with methods such as Reinforcement Learning from Human Feedback (RLHF) to better “follow” human instructions.

Bias, Risks, and Limitations

The OpenDeid AICUP models are based on the Pythia models, which were pre-trained on the Pile and further fine-tuned on the OpenDeid AICUP corpus, a dataset compiled for the sensitive health information recognition and normalization tasks. The fine-tuned models tend to generate outputs following a pre-defined output layout, which may not be suitable for downstream tasks such as text summarization or translation.

How to Get Started with the Model

Use the code (based on vLLM) below to get started with the model.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = 'ISLabResearch/opendeid-160m-ft-full'

# Load the model with vLLM and its tokenizer with Transformers.
model = LLM(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
eos = tokenizer.eos_token
seed = 309

# Greedy decoding (temperature = 0), stopping at the EOS token.
params = SamplingParams(max_tokens = 50, include_stop_str_in_output = True, temperature = 0,
                        ignore_eos = False, stop = [eos], seed = seed)
preds = model.generate("Hello", params, use_tqdm = False)
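
generate returns a list of vLLM RequestOutput objects, one per prompt; the generated text of the first (and here only) prompt can be read from its first completion:

print(preds[0].outputs[0].text)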

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

Sensitive Health Information Recognition Results

Coding Type      Precision   Recall       F-measure    Support
PATIENT          0.6842105   0.0726257    0.1313131    716
IDNUM            0.7586207   0.3216981    0.4518052    2120
DATE             0.9660215   0.9133794    0.9389632    2459
MEDICALRECORD    0.6986667   0.3507363    0.4670232    747
CITY             0.9688473   0.8337802    0.8962536    373
STATE            0.9814242   0.9548193    0.967939     332
ZIP              0.9326923   0.5495751    0.6916221    353
DEPARTMENT       0.8773585   0.6658711    0.7571235    419
HOSPITAL         0.8676996   0.6076795    0.7147766    1198
DOCTOR           0.7387964   0.1734295    0.2809153    3327
STREET           0.7209302   0.09011628   0.1602067    344
DURATION         0.8333333   0.4166667    0.5555555    12
TIME             0.7876448   0.4340425    0.5596708    470
SET              0.6666667   0.4          0.5          5
AGE              0.8958333   0.8431373    0.8686869    51
LOCATION-OTHER   1           0.1666667    0.2857143    6
ORGANIZATION     0.1304348   0.04054054   0.06185567   74
PHONE            0           0            0            1
Micro-avg. F     0.8669685   0.4564465    0.5980358    13007
Macro-avg. F     0.75051     0.4352647    0.5509824    13007

Temporal Information Normalization Results

Temporal Type   Precision   Recall      F-measure   Support
DATE            0.8036509   0.7340382   0.7672688   2459
DURATION        1           0.4166667   0.5882353   12
TIME            0.4901961   0.212766    0.2967359   470
SET             1           0.4         0.5714286   5
Micro-avg.      0.7781848   0.6490156   0.707755    2946
Macro-avg.      0.8234617   0.4408677   0.574277    2946
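
Micro-averaging pools true positives, false positives, and false negatives across all types before computing a single precision, recall, and F-measure, while macro-averaging computes the F-measure per type and takes the unweighted mean. The sketch below illustrates both over just the DATE and TIME rows of the temporal table, with TP/FP/FN counts back-derived from the reported precision, recall, and support (assuming those figures are exact); the reported averages above cover all four types:

def f_measure(tp, fp, fn):
    # Precision, recall, and F1 from raw counts; zero when undefined.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# (TP, FP, FN) back-derived from the DATE and TIME rows above.
counts = {"DATE": (1805, 441, 654), "TIME": (100, 104, 370)}

# Micro-average: sum the counts across types, then compute one F-measure.
micro_f = f_measure(*(sum(c[i] for c in counts.values()) for i in range(3)))

# Macro-average: compute the F-measure per type, then take the mean.
macro_f = sum(f_measure(*c) for c in counts.values()) / len(counts)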

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]
