---
license: cc-by-sa-4.0
datasets:
- bigbio/cas
language:
- fr
metrics:
- f1
- precision
- recall
library_name: transformers
tags:
- biomedical
- clinical
- pytorch
- camembert
pipeline_tag: token-classification
inference: false
---

# Privacy-preserving mimic models for clinical named entity recognition in French

In this [paper](https://doi.org/10.1016/j.jbi.2022.104073), we propose a Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach. The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data. The newly labeled public dataset is then used to train the *student models*. These *student models* can be shared without sharing the data itself or exposing the *private teacher model* that was built directly on this data.

# CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model

To generate the CAS Privacy-Preserving Mimic Model, we used a *private teacher model* to annotate the unlabeled [CAS clinical French corpus](https://aclanthology.org/W18-5614/). The *private teacher model* is an NER model trained on the [MERLOT clinical corpus](https://link.springer.com/article/10.1007/s10579-017-9382-y) and cannot be shared. Using the resulting [silver annotations](https://zenodo.org/records/6451361), we trained the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model. This procedure can be viewed as a knowledge transfer process between the *teacher* and the *student* model, conducted in a privacy-preserving manner. We share only the weights of the CAS *student model*, which is trained on silver-labeled, publicly released data. We argue that no potential attack could reveal information about sensitive private data using the silver annotations generated by the *private teacher model* on publicly available non-sensitive data.
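The mimic-learning workflow described above can be sketched as follows. This is an illustrative toy only: `DummyTeacher` and `DummyStudent` are hypothetical stand-ins for the actual NER models, which are neural taggers built with NLstruct.

```python
# Illustrative sketch of mimic learning. DummyTeacher / DummyStudent are
# hypothetical placeholders, not the real models.

class DummyTeacher:
    """Stands in for the private teacher trained on sensitive data."""
    def predict(self, text):
        # Placeholder behaviour: tag every token as outside any entity
        return ["O"] * len(text.split())

class DummyStudent:
    """Stands in for the shareable student model."""
    def __init__(self):
        self.training_data = None
    def fit(self, texts, labels):
        self.training_data = (texts, labels)

def mimic_learning(teacher, student, public_texts):
    # 1. The private teacher annotates unlabeled public data ("silver" labels).
    silver_labels = [teacher.predict(text) for text in public_texts]
    # 2. The student is trained only on the public texts and silver labels.
    student.fit(public_texts, silver_labels)
    # 3. Only the student's weights are shared; the teacher never leaves its silo.
    return student
```

The key property is that the student never sees the sensitive corpus: its entire training signal comes from public text plus teacher predictions.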
Our model is built on the [CamemBERT](https://huggingface.co/camembert) model using the Natural Language Structuring ([NLstruct](https://github.com/percevalw/nlstruct)) library, which implements NER models that handle nested entities.

- **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
- **Produced gold and silver annotations for the [DEFT](https://deft.lisn.upsaclay.fr/2020/) and [CAS](https://aclanthology.org/W18-5614/) French clinical corpora:** https://zenodo.org/records/6451361
- **Developed by:** [Nesrine Bannour](https://github.com/NesrineBannour), [Perceval Wajsbürt](https://github.com/percevalw), [Bastien Rance](https://team.inria.fr/heka/fr/team-members/rance/), [Xavier Tannier](http://xavier.tannier.free.fr/) and [Aurélie Névéol](https://perso.limsi.fr/neveol/)
- **Language:** French
- **License:** cc-by-sa-4.0

# Download the CAS Privacy-Preserving NER Mimic Model

```python
import urllib.request
from huggingface_hub import hf_hub_url

fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                          filename="CAS-privacy-preserving-model_fasttext.txt")
urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])

model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                       filename="CAS-privacy-preserving-model.ckpt")
urllib.request.urlretrieve(model_url, "path/to/your/folder/" + model_url.split('/')[-1])
path_checkpoint = "path/to/your/folder/" + model_url.split('/')[-1]
```

## 1. Load and use the model using only NLstruct

[NLstruct](https://github.com/percevalw/nlstruct) is the Python library we used to generate our CAS privacy-preserving NER mimic model; it handles nested entities.
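Before loading, the checkpoint must be on disk. The download snippet above fetches the files unconditionally; a small stdlib-only helper (hypothetical, not part of NLstruct or Medkit) can make the download idempotent by skipping files that are already present, as the Medkit example further down also does:

```python
import os
import urllib.request

def fetch_if_missing(url, dest_dir="."):
    """Download `url` into `dest_dir` only if the file is not already there.

    The local filename is taken from the last path component of the URL,
    mirroring the `url.split('/')[-1]` pattern used in this model card.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, url.split("/")[-1])
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Alternatively, `huggingface_hub.hf_hub_download` provides the same caching behaviour out of the box and returns the cached local path directly.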
### Install the NLstruct library

```
pip install nlstruct==0.1.0
```

### Use the model

```python
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

ner_model = load_pretrained(path_checkpoint)
test_data = load_from_brat("path/to/brat/test")
test_predictions = ner_model.predict(test_data)

# Export the predictions into the BRAT standoff format
export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
```

## 2. Load the model using NLstruct and use it with the Medkit library

[Medkit](https://github.com/TeamHeka/medkit) is a Python library that facilitates the extraction of features from various modalities of patient data, including textual data.

### Install the Medkit library

```
python -m pip install 'medkit-lib'
```

### Use the model

Our model can be implemented as a Medkit operation module as follows:

```python
import os
import urllib.request

from huggingface_hub import hf_hub_url
from nlstruct import load_pretrained
from medkit.io.brat import BratInputConverter, BratOutputConverter
from medkit.core import Attribute
from medkit.core.text import NEROperation, Entity, Span, Segment, span_utils


class CAS_matcher(NEROperation):
    def __init__(self):
        # Download the fastText embeddings file if it is not already present
        fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                                  filename="CAS-privacy-preserving-model_fasttext.txt")
        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
        # Download the model checkpoint if it is not already present
        model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model",
                               filename="CAS-privacy-preserving-model.ckpt")
        if not os.path.exists("ner_model/CAS-privacy-preserving-model.ckpt"):
            urllib.request.urlretrieve(model_url, "ner_model/" + model_url.split('/')[-1])
        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
        self.model = load_pretrained(path_checkpoint)
        self.model.eval()

    def run(self, segments):
        """Return entities for each match in
        `segments`.

        Parameters
        ----------
        segments:
            List of segments into which to look for matches.

        Returns
        -------
        List[Entity]
            Entities found in `segments`.
        """
        # Collect all matches, grouped by segment
        entities = []
        for segment in segments:
            matches = self.model.predict({"doc_id": segment.uid, "text": segment.text})
            entities.extend(entity for entity in self._matches_to_entities(matches, segment))
        return entities

    def _matches_to_entities(self, matches, segment: Segment):
        for match in matches["entities"]:
            text_all, spans_all = [], []
            for fragment in match["fragments"]:
                text, spans = span_utils.extract(
                    segment.text, segment.spans, [(fragment["begin"], fragment["end"])]
                )
                text_all.append(text)
                spans_all.extend(spans)
            text_all = "".join(text_all)
            entity = Entity(
                label=match["label"],
                text=text_all,
                spans=spans_all,
            )
            score_attr = Attribute(
                label="confidence",
                value=float(match["confidence"]),
                # metadata=dict(model=self.model.path_checkpoint),
            )
            entity.attrs.add(score_attr)
            yield entity


brat_converter = BratInputConverter()
docs = brat_converter.load("path/to/brat/test")
matcher = CAS_matcher()
for doc in docs:
    entities = matcher.run([doc.raw_segment])
    for ent in entities:
        doc.anns.add(ent)

brat_output_converter = BratOutputConverter(attrs=[])
# Keep the same document names in the output folder
doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0]
             for doc in docs]
brat_output_converter.save(docs, dir_path="path/to/exported_brat", doc_names=doc_names)
```

## Environmental Impact

Carbon emissions were estimated using the [Carbontracker](https://github.com/lfwa/carbontracker) tool. The version used at the time of our experiments computed its estimates from the average carbon intensity of the European Union in 2017 rather than the value for France (294.21 gCO2eq/kWh vs. 85 gCO2eq/kWh). Our reported carbon footprint for training both the private model that generated the silver annotations and the CAS student model is therefore overestimated.
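Since the estimated emissions scale linearly with the assumed carbon intensity, the overestimation factor follows directly from the two intensity values quoted above:

```python
# Carbon-intensity values cited above (gCO2eq per kWh)
eu_2017_intensity = 294.21  # EU average used by Carbontracker at the time
france_intensity = 85.0     # value for France

# Emissions estimates scale linearly with intensity, so the overestimation
# factor is simply the ratio of the two values (roughly 3.5x).
overestimation_factor = eu_2017_intensity / france_intensity
print(f"Reported emissions are overestimated by ~{overestimation_factor:.1f}x")
```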
- **Hardware Type:** NVIDIA GTX 1080 Ti GPU
- **Compute Region:** Gif-sur-Yvette, Île-de-France, France
- **Carbon Emitted:** 292 gCO2eq

## Acknowledgements

We thank the institutions and colleagues who made it possible to use the datasets described in this study: the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus, and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank the ITMO Cancer Aviesan for funding our research, and the [HeKA research team](https://team.inria.fr/heka/) for integrating our model into their library [Medkit](https://github.com/TeamHeka/medkit).

## Citation

If you use this model in your research, please make sure to cite our paper:

```bibtex
@article{BANNOUR2022104073,
  title = {Privacy-preserving mimic models for clinical named entity recognition in French},
  journal = {Journal of Biomedical Informatics},
  volume = {130},
  pages = {104073},
  year = {2022},
  issn = {1532-0464},
  doi = {10.1016/j.jbi.2022.104073},
  url = {https://www.sciencedirect.com/science/article/pii/S1532046422000892}
}
```