---
license: cc-by-sa-4.0
datasets:
- bigbio/cas
language:
- fr
metrics:
- f1
- precision
- recall
library_name: transformers
tags:
- biomedical
- clinical
- pytorch
- camembert
pipeline_tag: token-classification
inference: false
---

# Privacy-preserving mimic models for clinical named entity recognition in French

In this [paper](https://doi.org/10.1016/j.jbi.2022.104073), we propose a Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach. The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data. The newly labeled public dataset is then used to train the *student models*. These *student models* can be shared without sharing the data itself or exposing the *private teacher model* that was built directly on this data.

# CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model

To generate the CAS Privacy-Preserving Mimic Model, we used a *private teacher model* to annotate the unlabeled [CAS clinical French corpus](https://aclanthology.org/W18-5614/). The *private teacher model* is an NER model trained on the [MERLOT clinical corpus](https://link.springer.com/article/10.1007/s10579-017-9382-y) and cannot be shared. Using the resulting [silver annotations](https://zenodo.org/records/6451361), we trained the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model. This procedure can be viewed as a knowledge transfer process between the *teacher* and the *student* model, conducted in a privacy-preserving manner. We share only the weights of the CAS *student model*, which is trained on silver-labeled, publicly released data. We argue that no potential attack could reveal information about sensitive private data using the silver annotations generated by the *private teacher model* on publicly available non-sensitive data.
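The mimic-learning workflow described above can be sketched as follows. This is an illustrative toy only: `DummyTeacher` and `DummyStudent` are hypothetical stand-ins for the actual NER models, which are neural taggers built with NLstruct.

```python
# Illustrative sketch of mimic learning. DummyTeacher / DummyStudent are
# hypothetical placeholders, not the real models.

class DummyTeacher:
    """Stands in for the private teacher trained on sensitive data."""
    def predict(self, text):
        # Placeholder behaviour: tag every token as outside any entity
        return ["O"] * len(text.split())

class DummyStudent:
    """Stands in for the shareable student model."""
    def __init__(self):
        self.training_data = None
    def fit(self, texts, labels):
        self.training_data = (texts, labels)

def mimic_learning(teacher, student, public_texts):
    # 1. The private teacher annotates unlabeled public data ("silver" labels).
    silver_labels = [teacher.predict(text) for text in public_texts]
    # 2. The student is trained only on the public texts and silver labels.
    student.fit(public_texts, silver_labels)
    # 3. Only the student's weights are shared; the teacher never leaves its silo.
    return student
```

The key property is that the student never sees the sensitive corpus: its entire training signal comes from public text plus teacher predictions.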
Our model is built on the [CamemBERT](https://huggingface.co/camembert) model using the Natural Language Structuring ([NLstruct](https://github.com/percevalw/nlstruct)) library, which implements NER models that handle nested entities.

- **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
- **Produced gold and silver annotations for the [DEFT](https://deft.lisn.upsaclay.fr/2020/) and [CAS](https://aclanthology.org/W18-5614/) French clinical corpora:** https://zenodo.org/records/6451361
- **Developed by:** [Nesrine Bannour](https://github.com/NesrineBannour), [Perceval Wajsbürt](https://github.com/percevalw), [Bastien Rance](https://team.inria.fr/heka/fr/team-members/rance/), [Xavier Tannier](http://xavier.tannier.free.fr/) and [Aurélie Névéol](https://perso.limsi.fr/neveol/)
- **Language:** French
- **License:** cc-by-sa-4.0

# Download the CAS Privacy-Preserving NER Mimic Model

```python
import urllib.request
from huggingface_hub import hf_hub_url

fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                          filename="CAS-privacy-preserving-model_fasttext.txt")
urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])

model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                       filename="CAS-privacy-preserving-model.ckpt")
urllib.request.urlretrieve(model_url, "path/to/your/folder/" + model_url.split('/')[-1])
path_checkpoint = "path/to/your/folder/" + model_url.split('/')[-1]
```

## 1. Load and use the model using only NLstruct

[NLstruct](https://github.com/percevalw/nlstruct) is the Python library we used to generate our CAS privacy-preserving NER mimic model; it handles nested entities.
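Before loading, the checkpoint must be on disk. The download snippet above fetches the files unconditionally; a small stdlib-only helper (hypothetical, not part of NLstruct or Medkit) can make the download idempotent by skipping files that are already present, as the Medkit example further down also does:

```python
import os
import urllib.request

def fetch_if_missing(url, dest_dir="."):
    """Download `url` into `dest_dir` only if the file is not already there.

    The local filename is taken from the last path component of the URL,
    mirroring the `url.split('/')[-1]` pattern used in this model card.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, url.split("/")[-1])
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Alternatively, `huggingface_hub.hf_hub_download` provides the same caching behaviour out of the box and returns the cached local path directly.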
### Install the NLstruct library

```
pip install nlstruct==0.1.0
```

### Use the model

```python
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

ner_model = load_pretrained(path_checkpoint)
test_data = load_from_brat("path/to/brat/test")
test_predictions = ner_model.predict(test_data)

# Export the predictions into the BRAT standoff format
export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
```

## 2. Load the model using NLstruct and use it with the Medkit library

[Medkit](https://github.com/TeamHeka/medkit) is a Python library that facilitates the extraction of features from various modalities of patient data, including textual data.

### Install the Medkit library

```
python -m pip install 'medkit-lib'
```

### Use the model

Our model can be implemented as a Medkit operation module as follows:

```python
import os
import urllib.request

from huggingface_hub import hf_hub_url
from nlstruct import load_pretrained
from medkit.io.brat import BratInputConverter, BratOutputConverter
from medkit.core import Attribute
from medkit.core.text import NEROperation, Entity, Span, Segment, span_utils


class CAS_matcher(NEROperation):
    def __init__(self):
        # Download the fastText embeddings file if it is not already present
        fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                                  filename="CAS-privacy-preserving-model_fasttext.txt")
        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
        # Download the model checkpoint if it is not already present
        model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                               filename="CAS-privacy-preserving-model.ckpt")
        if not os.path.exists("ner_model/CAS-privacy-preserving-model.ckpt"):
            urllib.request.urlretrieve(model_url, "ner_model/" + model_url.split('/')[-1])
        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
        self.model = load_pretrained(path_checkpoint)
        self.model.eval()

    def run(self, segments):
        """Return entities for each match in
        `segments`.

        Parameters
        ----------
        segments:
            List of segments into which to look for matches.

        Returns
        -------
        List[Entity]
            Entities found in `segments`.
        """
        # Collect all matches, grouped by segment
        entities = []
        for segment in segments:
            matches = self.model.predict({"doc_id": segment.uid, "text": segment.text})
            entities.extend(entity for entity in self._matches_to_entities(matches, segment))
        return entities

    def _matches_to_entities(self, matches, segment: Segment):
        for match in matches["entities"]:
            text_all, spans_all = [], []
            for fragment in match["fragments"]:
                text, spans = span_utils.extract(
                    segment.text, segment.spans, [(fragment["begin"], fragment["end"])]
                )
                text_all.append(text)
                spans_all.extend(spans)
            text_all = "".join(text_all)
            entity = Entity(
                label=match["label"],
                text=text_all,
                spans=spans_all,
            )
            score_attr = Attribute(
                label="confidence",
                value=float(match["confidence"]),
                # metadata=dict(model=self.model.path_checkpoint),
            )
            entity.attrs.add(score_attr)
            yield entity


brat_converter = BratInputConverter()
docs = brat_converter.load("path/to/brat/test")
matcher = CAS_matcher()
for doc in docs:
    entities = matcher.run([doc.raw_segment])
    for ent in entities:
        doc.anns.add(ent)

brat_output_converter = BratOutputConverter(attrs=[])
# Keep the same document names in the output folder
doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0]
             for doc in docs]
brat_output_converter.save(docs, dir_path="path/to/exported_brat", doc_names=doc_names)
```

## Environmental Impact

Carbon emissions were estimated using the [Carbontracker](https://github.com/lfwa/carbontracker) tool. The version used at the time of our experiments computed its estimates from the average carbon intensity of the European Union in 2017 rather than the value for France (294.21 gCO2eq/kWh vs. 85 gCO2eq/kWh). Our reported carbon footprint for training both the private model that generated the silver annotations and the CAS student model is therefore overestimated.
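Since the estimated emissions scale linearly with the assumed carbon intensity, the overestimation factor follows directly from the two intensity values quoted above:

```python
# Carbon-intensity values cited above (gCO2eq per kWh)
eu_2017_intensity = 294.21  # EU average used by Carbontracker at the time
france_intensity = 85.0     # value for France

# Emissions estimates scale linearly with intensity, so the overestimation
# factor is simply the ratio of the two values (roughly 3.5x).
overestimation_factor = eu_2017_intensity / france_intensity
print(f"Reported emissions are overestimated by ~{overestimation_factor:.1f}x")
```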
- **Hardware Type:** NVIDIA GTX 1080 Ti GPU
- **Compute Region:** Gif-sur-Yvette, Île-de-France, France
- **Carbon Emitted:** 292 gCO2eq

## Acknowledgements

We thank the institutions and colleagues who made it possible to use the datasets described in this study: the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus, and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank the ITMO Cancer Aviesan for funding our research, and the [HeKA research team](https://team.inria.fr/heka/) for integrating our model into their library [Medkit](https://github.com/TeamHeka/medkit).

## Citation

If you use this model in your research, please make sure to cite our paper:

```bibtex
@article{BANNOUR2022104073,
  title = {Privacy-preserving mimic models for clinical named entity recognition in French},
  journal = {Journal of Biomedical Informatics},
  volume = {130},
  pages = {104073},
  year = {2022},
  issn = {1532-0464},
  doi = {10.1016/j.jbi.2022.104073},
  url = {https://www.sciencedirect.com/science/article/pii/S1532046422000892}
}
```