Bony & BonyWave Model Card
Self-Supervised Vision Transformers for Prostate Histopathology Analysis
Medium article: https://hpai-bsc.medium.com/medium-article-bony-744fa41b452d
Model Overview
This repository hosts two variants of the XCiT-medium model trained for prostate histopathology image analysis:
- Bony: Baseline XCiT model pre-trained with DINO.
- BonyWave: Enhanced variant incorporating 3D wavelet decomposition for improved feature extraction.
Both models process 224×224 RGB tiles and were trained on 2.8M image tiles from the PANDA dataset using 24× NVIDIA H100 GPUs.
Model Description
This XCiT-medium model was trained from scratch for prostate histopathology image analysis tasks, using images of size 224 × 224 pixels on 24 NVIDIA H100 GPUs. The XCiT architecture is a transformer model that uses cross-covariance attention to process images, thereby improving performance compared to traditional CNN architectures. It was pre-trained on a large dataset using the DINO self-supervised training method.
This model is designed as an encoder on top of which decoders can be applied for downstream tasks. It has been tested on various tasks such as classification and segmentation (see the benchmarks used for evaluation).
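As a usage illustration, here is a minimal sketch of loading the backbone as a frozen encoder and extracting a tile embedding. It assumes the `timm` implementation of XCiT-medium (`xcit_medium_24_p16_224`) and a hypothetical checkpoint file name; adapt both to the released weights.

```python
import timm
import torch

# XCiT-medium with 16x16 patches at 224x224 resolution (~84M parameters).
encoder = timm.create_model(
    "xcit_medium_24_p16_224",  # assumed timm model name for this backbone
    pretrained=False,          # load the Bony/BonyWave weights manually instead
    num_classes=0,             # drop the classification head -> encoder only
)
state_dict = torch.load("bony_xcit_medium.pth", map_location="cpu")  # hypothetical checkpoint name
encoder.load_state_dict(state_dict, strict=False)
encoder.eval()

# One normalized 224x224 RGB tile (batch size 1).
tile = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = encoder(tile)  # (1, 512) feature vector for downstream decoders
print(embedding.shape)
```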
Objective and Application Domain
This model was developed for the detection and classification of histopathological features in prostate biopsy images. It can be used for:
- Detection of prostate tumors and other anomalies.
- AI-assisted diagnosis for pathologists.
Specific tasks include cell segmentation and identifying relevant features for prostate histological classification.
Architecture
This medium XCiT model relies on transformer blocks, which are well suited to computer vision tasks thanks to their ability to capture complex spatial relationships. The architecture has been adapted to work with prostate histopathology images of size 224 × 224 pixels. The total number of parameters in this model is 84M.
Technical Details
The XCiT model is trained with DINO, a self-supervised framework that learns representations without explicit supervision by distilling a teacher network into a student network. The XCiT architecture combines the advantages of transformers with an efficient attention mechanism to handle the high-dimensional nature of histopathology images.
The loss function used during pre-training is the cross-entropy between the teacher and student output distributions:

\[
\mathcal{L} = - \sum_i p(t_i \mid \theta) \, \log q(s_i \mid \phi)
\]

where \( p(t_i \mid \theta) \) is the target distribution (t for teacher) and \( q(s_i \mid \phi) \) is the student distribution.
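A minimal sketch of this loss follows, assuming the standard DINO formulation (temperature-sharpened softmax for the student, centered and sharpened softmax for the teacher). The student temperature of 0.1 is DINO's default and is not listed in this card; multi-crop pairing and the running center update are omitted.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # Student distribution q(s | phi), sharpened with the student temperature.
    log_q = F.log_softmax(student_logits / student_temp, dim=-1)
    # Teacher distribution p(t | theta): centered, sharpened, and detached
    # so that gradients only flow through the student branch.
    p = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # L = - sum_i p_i * log q_i, averaged over the batch.
    return -(p * log_q).sum(dim=-1).mean()

# Example call with random projections over the 4096-dim DINO output head.
s = torch.randn(8, 4096)
t = torch.randn(8, 4096)
loss = dino_loss(s, t, center=torch.zeros(4096))
```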
Pre-training with DINO
The model was pre-trained using the DINO method, a self-supervised pre-training algorithm based on a self-distillation objective in which the model learns to produce matching representations for augmented views of the same image. This pre-training is performed without any labels, using only histopathology images. The model has been trained on 2.8 million image tiles (224 × 224 pixels).
Training Procedure
The model was trained with the Adam optimizer and an initial learning rate of 0.00075, adapted over the course of training. The pre-training was conducted on a prostate histopathology image dataset (the PANDA dataset), with 224 × 224-pixel images cropped without overlap from the high-resolution PANDA TIFF images.
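A minimal sketch of this tiling step is shown below, under the assumption that slides are read with `tifffile` and mostly-white background tiles are discarded; the original preprocessing pipeline may differ.

```python
import numpy as np
import tifffile

def tile_image(path, tile_size=224):
    image = tifffile.imread(path)  # whole image as an (H, W, 3) array
    h, w = image.shape[:2]
    tiles = []
    # Non-overlapping grid: the stride equals the tile size.
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            if tile.mean() < 240:  # skip mostly-white background tiles (assumed threshold)
                tiles.append(tile)
    return np.stack(tiles) if tiles else np.empty((0, tile_size, tile_size, 3), dtype=image.dtype)
```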
Here are all the hyperparameters:
- Architecture: XCiT_medium
- Patch size: 16
- Drop path rate: 0.1
- Output dimension (out_dim): 4096
- Number of local crops: 5
- Teacher temperature (teacher_temp): 0.07
- Teacher temperature during warmup (warmup_teacher_temp): 0.04
- Warmup epochs for teacher: 10
- Training epochs: 15
- Learning rate (lr): 0.00075
- Minimum learning rate (min_lr): 2e-06
- Warmup epochs for learning rate: 10
- Batch size per GPU: 64
- Weight decay: 0.05
- Weight decay at the end of training (weight_decay_end): 0.4
- Teacher momentum: 0.996
- Clip gradient: 3.0
- Batch size for DataLoader: 64
- Parameter norms: None (param_norms = None)
- Freeze last layer: Yes (freeze_last_layer = 1)
- Use FP16 scaler: Yes (fp16_scaler_b = True)
- Number of workers: 10
- Global crops scale (global_crops_scale): (0.25, 1.0)
- Local crops scale (local_crops_scale): (0.05, 0.25)
- Distribution URL: "env://"
Performance
The model achieved a classification accuracy of 81% on the PANDA test subset and a segmentation MSE of 2.9e-6 on the DeepGleason prostate histopathology dataset. It was also tested on the SICAPv2 benchmark. Its performance was compared to other models, such as Hibou, a ViT model trained on 1.2 billion tiles of 224 × 224 pixels. For DeepGleason and SICAPv2, segmentation performance is reported as Mean Squared Error (MSE, lower is better). The summary table is as follows:
| Model | PANDA test subset (Accuracy) ↑ | DeepGleason (MSE) ↓ | SICAPv2 (MSE) ↓ |
|---|---|---|---|
| Bony | 81.2% | 2.934e-06 | 8.0e-04 |
| BonyWave | 83.0% | 3.9e-04 | 7.9e-04 |
| Hibou | 83.1% | 1.455e-06 | 0.10 |
| Histoencoder | 81.6% | 1.003e-06 | - |
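For clarity on how these numbers are computed, here is a minimal sketch of the two metrics under assumed evaluation protocols: tile-level classification accuracy and pixel-wise MSE between predicted and reference segmentation maps.

```python
import torch

def classification_accuracy(pred_labels, true_labels):
    # pred_labels / true_labels: (N,) integer tensors of tile-level classes.
    return (pred_labels == true_labels).float().mean().item()

def segmentation_mse(pred_maps, true_maps):
    # pred_maps / true_maps: (N, H, W) tensors of per-pixel predictions and targets.
    return torch.mean((pred_maps.float() - true_maps.float()) ** 2).item()
```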
Wavelet Decomposition
As previously mentioned, histopathology images are highly discontinuous, noisy, and often visually similar. Applying a filter to these images might therefore help abstract their information, enabling more stable and potentially more effective training. This is why we believe that incorporating wavelet decomposition before the forward pass in our XCiT model could be a promising approach.
Overview of 3D Wavelet Decomposition
Wavelets are oscillating functions localized in time and space, used to decompose a signal \( f(x, y, z) \) into multiple scales and orientations. 3D wavelet decomposition is a method well suited to analyzing volumetric data, such as \( 224 \times 224 \times 3 \) image tensors, by extracting localized information at different spatial scales.
We conducted small-scale experiments using Haar wavelets, considering a single decomposition scale and keeping only the "Approximation" sub-band of the image. Despite these limitations, training revealed some potential: tested on the PANDA subset benchmark, BonyWave achieved 83% accuracy on the test set. For more details, see https://hpai-bsc.medium.com/medium-article-bony-744fa41b452d
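A minimal sketch of this pre-processing follows, assuming PyWavelets and a per-channel single-level Haar transform that keeps only the approximation sub-band; the actual BonyWave pipeline may differ.

```python
import numpy as np
import pywt

def haar_approximation(tile):
    # tile: (224, 224, 3) array -> (112, 112, 3) approximation sub-band.
    channels = []
    for c in range(tile.shape[2]):
        # Single-level 2D Haar transform; keep the approximation, drop the details.
        cA, (cH, cV, cD) = pywt.dwt2(tile[:, :, c], "haar")
        channels.append(cA)
    return np.stack(channels, axis=-1)
```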
Limitations and Biases
Although this model was trained for a specific prostate histopathology analysis task, there are several limitations and biases:
- Performance may be affected by the quality of input images, particularly in cases of low resolution or noise.
- The model may be biased by the distribution of the training data, which may not be representative of all patient populations.
- The model may struggle with images containing artifacts or specific conditions not encountered in the training dataset.
- This model should not be used on images other than prostate histopathology images, as it has only been trained on this kind of data.
- This model must not be used for diagnosis on its own; it is intended to assist, not replace, a pathologist.
About
Main model developed and trained by Emile Vaysse, under the supervision of Dario Garcia-Gasulla.