Scaling Self-Supervised Learning for Histology: Introducing Phikon

Community Article · Published October 31, 2023

Recent innovations in machine learning and computer vision have opened new opportunities for applying AI in medicine. Digital pathology is one field where image representation and classification can enable new research breakthroughs and more efficient disease diagnosis, both of which contribute to better patient outcomes.

We are thrilled to announce Phikon, a model developed by Owkin using iBOT, a self-supervised, transformer-based framework. In this article we elaborate on the datasets, training techniques, and frameworks used to build the model. We’ve also built a Colab notebook so that the community can easily leverage the model to extract features or fine-tune it on their own dataset using the Hugging Face Transformers and PEFT libraries.

Context

Histopathology is an important technique for cancer diagnosis and treatment planning. It requires pathologists to analyze diseased tissue on slides under a microscope, identifying patterns and markers of disease. Automating or augmenting parts of these workflows with deep learning could have a huge impact for patients, potentially allowing cancer to be diagnosed more quickly and more accurately.

As in other domains, however, the bottleneck for implementing AI is often the availability of high-quality data. The images needed to train these models must be annotated by trained pathologists, a costly and labor-intensive process.

In the past, this issue has been addressed through transfer learning with models trained on ImageNet. Convolutional Neural Networks (CNNs) trained on ImageNet serve as reasonably powerful feature extractors for histology images, but they suffer from the limitations of out-of-domain pre-training: histology images contain complex cellular structures, and different data sources vary in color, texture and staining. Models pre-trained on ImageNet therefore struggle to adequately capture details that are critical for disease diagnosis.

Recently, however, self-supervised learning has emerged as a promising way to leverage large volumes of unlabelled data. Our research shows the strengths of in-domain pre-training using Masked Image Modelling (MIM). MIM is a technique inspired by BERT: portions of an image (patches or pixels) are randomly masked, and the model learns meaningful representations by reconstructing the masked portions.
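
To make the masking mechanics concrete, here is a minimal PyTorch sketch. Note that this is not the iBOT objective itself (iBOT uses self-distillation with an online tokenizer rather than raw pixel reconstruction), and the 40% mask ratio is purely illustrative.

import torch

batch, num_patches, dim = 4, 196, 768
patches = torch.randn(batch, num_patches, dim)  # stand-in ViT patch embeddings

# randomly mask ~40% of patches per image (illustrative ratio)
mask = torch.rand(batch, num_patches) < 0.4  # True where a patch is masked

# replace masked patch embeddings with a learnable [MASK] token
mask_token = torch.nn.Parameter(torch.zeros(dim))
corrupted = torch.where(mask.unsqueeze(-1), mask_token, patches)

# an encoder would process `corrupted`; the MIM loss is computed only
# at the masked positions (a dummy L2 reconstruction is shown here)
predictions = corrupted  # stand-in for encoder(corrupted)
loss = ((predictions - patches)[mask] ** 2).mean()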

Training

The pre-training data for the model consists of publicly available datasets derived from The Cancer Genome Atlas (TCGA). We considered three different datasets: COAD4M (colorectal cancer), and PanCancer4M and PanCancer40M (multiple cancer sites). We applied our whole slide image (WSI) processing pipeline, consisting of matter detection, artefact removal and parallelized tiling, to extract tiles from the WSIs for efficient training.

Figure: How tiles are extracted from WSIs
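
As a rough illustration of the tiling step, the sketch below reads fixed-size tiles from a slide with the openslide-python library and keeps only tiles that contain tissue. The simple mean-intensity threshold, the tile size and the file name are assumptions for illustration, not our production pipeline, which also performs artefact removal and runs in parallel.

import numpy as np
import openslide

TILE_SIZE = 224  # illustrative; matches the model input resolution

slide = openslide.OpenSlide("slide.svs")  # hypothetical WSI path
width, height = slide.dimensions

tiles = []
for x in range(0, width - TILE_SIZE, TILE_SIZE):
    for y in range(0, height - TILE_SIZE, TILE_SIZE):
        tile = slide.read_region((x, y), 0, (TILE_SIZE, TILE_SIZE)).convert("RGB")
        # crude matter detection: keep tiles that are not mostly white background
        if np.asarray(tile).mean() < 220:
            tiles.append(tile)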

We used the iBOT framework to train our models on tiles extracted from WSIs. To assess scalability, we built five different models varying in size, architecture and pre-training data; details about each of them can be found in the full paper. The MoCoV2 model used for comparison was previously published by Owkin.

We then tested the model on 17 different downstream tasks, one of which is metastasis detection using Camelyon16, a dataset of H&E-stained slides from lymph node sections.

The code below shows how simple it is to use the model to extract features from a histology tile.


import requests
from PIL import Image

import torch
from transformers import AutoImageProcessor, ViTModel

# download an example image from the GTEx histology collection
url = "https://biospecimens.cancer.gov/gtexbiobank/images/histology_inset.jpg"
image = Image.open(requests.get(url, stream=True).raw).resize((224, 224))

# load Phikon (the pooling layer is dropped; we take the CLS token instead)
image_processor = AutoImageProcessor.from_pretrained("owkin/phikon")
model = ViTModel.from_pretrained("owkin/phikon", add_pooling_layer=False)

# process the image
inputs = image_processor(image, return_tensors="pt")

# get the features
with torch.no_grad():
    outputs = model(**inputs)
    features = outputs.last_hidden_state[:, 0, :]  # CLS token embedding, shape (1, 768)

As this feature extraction can be computationally costly, we’ve shared a dataset of Camelyon16 features already extracted with the model. In the notebook, these features are used to train a smaller downstream model for cancer classification.
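
As an example of what such a downstream model can look like, a lightweight classifier can be trained directly on the extracted features. The sketch below fits a logistic regression with scikit-learn; the .npy file names are hypothetical stand-ins for the shared Camelyon16 features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical arrays holding the pre-extracted Phikon features and tile labels
features = np.load("camelyon16_features.npy")  # shape (n_tiles, 768)
labels = np.load("camelyon16_labels.npy")      # 0 = normal, 1 = tumor

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")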

Fine-tuning

The real strength of this model is how it can be fine-tuned to improve performance on specific cancer subtypes. In our notebook we show how to specialize Phikon for colorectal cancer through both transfer learning and fine-tuning with LoRA, training on a subset of the NCT-CRC dataset, which we’ve also uploaded to Hugging Face.
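
A minimal sketch of a LoRA setup with the PEFT library is shown below. The rank, target modules and other hyperparameters are reasonable defaults rather than the exact notebook configuration (NCT-CRC contains 9 tissue classes).

from peft import LoraConfig, get_peft_model
from transformers import ViTForImageClassification

# load Phikon with a fresh classification head (9 NCT-CRC tissue classes)
model = ViTForImageClassification.from_pretrained("owkin/phikon", num_labels=9)

# inject low-rank adapters into the attention projections; the rest of the
# backbone stays frozen, while the new classifier head remains trainable
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    modules_to_save=["classifier"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train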

It remains to be investigated how best to optimize model performance: by fine-tuning the entire model, only the last transformer blocks, or low-rank adapters (LoRA).
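
For comparison with the LoRA sketch above, fine-tuning only the last transformer blocks amounts to freezing the rest of the backbone. In this sketch, the choice of unfreezing the final two layers is arbitrary.

from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("owkin/phikon", num_labels=9)

# freeze the entire backbone ...
for param in model.vit.parameters():
    param.requires_grad = False

# ... then unfreeze only the last two transformer blocks
for block in model.vit.encoder.layer[-2:]:
    for param in block.parameters():
        param.requires_grad = True
# the freshly initialised classification head (model.classifier) stays trainable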

We encourage anyone interested in histopathology to use the prepared notebook to fine-tune the model for their own use case, as the model also performs well on out-of-domain tasks.

Conclusion

To further evaluate the model, we explored the strengths of in-domain versus out-of-domain pre-training, as well as the advantage of the iBOT framework over purely contrastive methods.

We also explored the effect of scaling the model architecture on overall performance. Scaling from ViT-S (21.7M parameters) to ViT-B (85.8M) strongly improved performance, with an average gain of 2.5% across all tasks. Surprisingly, however, further increasing the architecture from ViT-B (85.8M) to ViT-L (307M) did not improve performance, and in fact led to an overall loss of 0.2%.

Figure: Performance comparison for different model sizes

This serves as an important example of how scaling model size does not always improve performance if dataset size isn’t also scaled accordingly.

We then built our best-performing Phikon model by retraining the ViT-B (85.8M) model while increasing the dataset size and diversity.

The trade-off between fully training and fine-tuning a model is always challenging, and that is particularly true in the medical domain, where marginal improvements in performance matter more than in other applications given the gravity of the use case. We invite you to read the full paper if you’re interested in more detail about the training procedure and the hyperparameters used.

Our findings would benefit from further exploration and validation, in particular with respect to dataset curation. We believe this work paves the way for an open-source foundation model for histopathology.