Anonymization of training data

#3
by cyfis - opened

Hello there

First of all, thanks a lot for making this model publicly available! I am very interested in how you performed the anonymization and removal of patient context in the training data. I've read that you used an entity recognition model for this. Could you give me any insight into which model was used, and whether you had to fine-tune the NER model to work with medical texts?

Thanks for your answer!

German MedBERT Initiative org

Hey,

we are fortunate that patient information is not typically saved in the report texts but in a separate file. So it was easy to remove. However, we still removed names of doctors from the data.

  1. We used FLAIR NLP to identify names and then manually checked the least frequent entries (the most frequent entries were not names but other nouns).
  2. We then also used RegEx to remove names. Doctors are usually referred to as "Dr. XYZ", "Prof. XYZ", or "OA XYZ", so a simple RegEx could identify most cases.

Together these methods allowed us to achieve a satisfactory anonymization of the texts.
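To illustrate the RegEx step only, here is a minimal sketch, not our actual pipeline. The title list, the placeholder `[NAME]`, and the function name `mask_doctor_names` are assumptions made for this example, and the pattern only covers a title followed directly by a capitalized surname.

```python
import re

# Hypothetical title list, taken from the examples mentioned above.
TITLE = r"(?:Prof\.|Dr\.|OA)"

# One or more titles, each followed by whitespace, then a capitalized
# surname (optionally hyphenated). Group 1 holds the titles, group 2 the name.
NAME_PATTERN = re.compile(rf"\b((?:{TITLE}\s+)+)([A-ZÄÖÜ][\w-]+)")


def mask_doctor_names(text, placeholder="[NAME]"):
    """Replace the surname after a known title with a placeholder.

    The title itself is kept so the sentence stays readable.
    """
    return NAME_PATTERN.sub(lambda m: m.group(1) + placeholder, text)


if __name__ == "__main__":
    report = "Befund besprochen mit Dr. Müller und OA Schmidt."
    print(mask_doctor_names(report))
```

A pattern like this misses names without a title and forms such as "Dr. med. XYZ", which is why it was combined with the NER pass and a manual check.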

Hello, first of all, thank you very much for sharing the model. I have a problem using it: when I try to fine-tune the model, I get this error:
"... in load_state_dict
with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge"

Could you please help me solve this error?
