@misc{zimmermann2024virchow2scalingselfsupervisedmixed,
      title={Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology}, 
      author={Eric Zimmermann and Eugene Vorontsov and Julian Viret and Adam Casson and Michal Zelechowski and George Shaikovski and Neil Tenenholtz and James Hall and David Klimstra and Razik Yousfi and Thomas Fuchs and Nicolo Fusi and Siqi Liu and Kristen Severson},
      year={2024},
      eprint={2408.00738},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00738}, 
}

This is a collection of my notes about Virchow2.

Virchow2 provides three new models. All three are vision-only transformers (ViTs) for histopathology whole-slide images: self-supervised foundation models whose embeddings are used downstream for tasks such as disease classification.

Models:

  1. Virchow2G 1.9B
  2. Virchow2 632M
  3. Virchow2G Mini 22M

The smallest model, Virchow2G Mini, is a distillation of Virchow2G rather than a model trained independently.
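
A rough sketch of what such a distillation could look like (my generic assumption, not the paper's exact recipe): a frozen large teacher, e.g. Virchow2G, produces tile embeddings and the small student is trained to reproduce them.

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, projector, tiles, optimizer):
        # tiles: a batch of image tiles, shape (B, 3, H, W)
        with torch.no_grad():
            t_emb = teacher(tiles)            # embeddings from the frozen large model
        s_emb = projector(student(tiles))     # project student width up to teacher width
        # Cosine loss between student and teacher embeddings.
        loss = 1.0 - F.cosine_similarity(s_emb, t_emb, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()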

The training data consists exclusively of histopathology images: 3.1M whole-slide images covering different tissue types, sources, and stains.

All the Virchow models here use the same architecture as Google's ViT. For example, Virchow2 (632M) has essentially the same shape as Google's ViT-H, and Virchow2G has the same shape as ViT-G.
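
A quick back-of-envelope check of those parameter counts, using the standard ViT-H/14 and ViT-G/14 configurations as I recall them from the ViT scaling literature (biases, norms, and Virchow2's own tweaks such as SwiGLU and register tokens are ignored):

    def vit_params(depth, width, mlp_dim, patch=14, channels=3):
        attn = 4 * width * width        # q, k, v and output projections
        mlp = 2 * width * mlp_dim       # two linear layers
        patch_embed = patch * patch * channels * width
        return depth * (attn + mlp) + patch_embed

    print(vit_params(depth=32, width=1280, mlp_dim=5120) / 1e6)  # ~630M  -> ViT-H-like (Virchow2)
    print(vit_params(depth=48, width=1664, mlp_dim=8192) / 1e9)  # ~1.84B -> ViT-G-like (Virchow2G)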

Training details

Training images are whole-slide images (WSIs), and they come at multiple magnifications: the same tissue is scanned at 5x, 10x, 20x, and 40x. At the highest resolution these are gigapixel images, which are really hard to train on directly without some clever workarounds.
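
The usual workaround (my assumption about the general pipeline, not the authors' exact code) is to read fixed-size tiles from the slide pyramid instead of loading the whole image, e.g. with OpenSlide:

    import openslide

    slide = openslide.OpenSlide("example_slide.svs")   # hypothetical file name
    print(slide.level_dimensions)                      # pyramid levels (e.g. 40x, 20x, 10x, 5x)

    level = 0        # highest magnification stored in the file
    tile_size = 224  # the usual ViT input size

    # Read one 224x224 tile at (x, y) given in level-0 pixel coordinates.
    tile = slide.read_region((10_000, 10_000), level, (tile_size, tile_size)).convert("RGB")
    tile.save("tile_0.png")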

This also means that WSIs may need a different image-augmentation approach than regular 2D natural images.
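
For example (my own assumptions about what is safe for tissue, not the paper's recipe): tissue has no canonical orientation, so flips and 90-degree rotations are label-preserving, and mild color jitter can stand in for stain variation, while heavy geometric distortion is riskier because cell morphology carries the signal.

    import torchvision.transforms as T

    pathology_augment = T.Compose([
        T.RandomHorizontalFlip(p=0.5),
        T.RandomVerticalFlip(p=0.5),                                  # vertical flips are fine for tissue
        T.RandomApply([T.RandomRotation(degrees=(90, 90))], p=0.5),   # exact 90-degree turns
        T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.2, hue=0.05),  # crude stand-in for stain variation
        T.ToTensor(),
    ])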

Training data

Where do images come from?

Virchow (the original model) is trained on data from approximately 100,000 patients, corresponding to approximately 1.5 million H&E-stained WSIs acquired from Memorial Sloan Kettering Cancer Center (MSKCC).

How many individual images do we have?

From the original Virchow paper:

The training digital pathology dataset comprises 1,488,550 WSIs derived from 119,629 patients. These WSIs are all stained with H&E, a routine stain that stains the nuclei blue and the extracellular matrix and cytoplasm pink. The WSIs are scanned at ×20 resolution or 0.5 mpp (microns per pixel) using Leica scanners.

Virchow2 uses 3.1M WSIs. Each WSI contains the same tissue at several magnifications, and the 3.1M figure counts slides, not individual training images; since training runs on tiles cropped from the slides, the number of actual training images must be far larger.
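
Back-of-envelope (my own numbers, not from the paper): at 20x / 0.5 mpp, a 2 cm x 2 cm tissue area is about 40,000 x 40,000 pixels, so a single slide already yields tens of thousands of non-overlapping 224x224 tiles before any tissue filtering.

    side_px = int(2e-2 / 0.5e-6)             # 2 cm at 0.5 micron per pixel = 40,000 px
    tiles_per_slide = (side_px // 224) ** 2  # non-overlapping 224x224 tiles
    print(tiles_per_slide)                   # ~31,700 tiles from one slide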

How many GPUs?

Each model is trained on 512 NVIDIA 32GB V100 GPUs.

How is loss calculated for largest-model pretraining?

The training approach is DINOv2 (arXiv:2304.07193). DINO is far from trivial to explain. Long story short, the loss is based on self-distillation: a student network is trained to match the output of a momentum (EMA) teacher for the [CLS] token across different crops of the same image, and DINOv2 adds a patch-level masked-prediction (iBOT) term and a regularizer on top.
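
A simplified sketch of the image-level self-distillation term computed on the [CLS] token (naming and shapes are mine; the full DINOv2 objective also has the patch-level iBOT term and the regularizer, which are omitted here):

    import torch.nn.functional as F

    def dino_cls_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
        # student_logits, teacher_logits: (B, K) projection-head outputs for the [CLS] token
        # center: (K,) running mean of teacher outputs, used to prevent collapse
        teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
        student_logprobs = F.log_softmax(student_logits / student_temp, dim=-1)
        return -(teacher_probs * student_logprobs).sum(dim=-1).mean()

    # The teacher is an EMA copy of the student and receives no gradients;
    # teacher logits come from global crops, student logits from all crops.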