
(1)


Few-shot Medical Image Classification using Vision Transformers

Master Thesis

Author: Maxat Nurgazin, Master Student at NU

Supervisor: Nguyen Anh Tu, Assistant Professor, School of Engineering and Digital Sciences at NU

Co-Supervisor: Min-Ho Lee, Assistant Professor, School of Engineering and Digital Sciences at NU

1

(2)

Outline:

1. Introduction

Background

Motivation

Objectives

2. Related works

Proposed idea

3. Methodology

Problem definition

Few-shot learning

System Pipeline

Reptile

Prototypical Networks

Custom ViT

Advanced Augmentation techniques

4. Experiments and results

Datasets

Implementation details

Results analysis

5. Conclusions

2

(3)


Introduction

3

(4)

Medical Image Analysis (MIA) is critical for diagnosing diseases and conditions from medical imaging.

Machine learning, particularly deep learning, has shown promising results in MIA tasks, like medical image classification (MIC).

Convolutional Neural Networks (CNNs) have been state-of-the-art in computer vision, including medical imaging.

Vision Transformers (ViTs) have emerged as an alternative to CNNs, showing impressive performance on various tasks.

Background

4

Fig. 1. Deep learning application in medical image analysis. (A) Fundus detection; (B,C) hippocampus segmentation; (D) left ventricular segmentation; (E) pulmonary nodule classification; (F,G,H,I) gastric cancer pathology segmentation. Acquired from [27]

[27] L. Cai, J. Gao, and D. Zhao, “A review of the application of deep learning in medical image classification and segmentation,” Annals of translational medicine, vol. 8, no. 11, 2020.

(5)

● CNNs struggle with learning long-range pixel relationships due to locality, which ViTs can handle more effectively.

● ViTs lack the built-in inductive biases of CNNs, which allows them to learn more flexible representations when enough data is available.

● Medical imaging often has limited labeled data, making it difficult to train deep learning models.

● Few-shot learning (FSL) is a promising approach for handling limited labeled data.

● Investigating the use of ViTs in few-shot learning for MIC is the main motivation of this thesis.

Motivation

5

(6)

● Investigate the performance of ViTs in few-shot learning scenarios for MIC and compare it with traditional CNNs.

● Design a custom ViT architecture and evaluate its performance

● Use different few-shot learning algorithms and assess the performance of ViTs.

● Investigate the effects of advanced data augmentation techniques (Cutout, Mixup, and Cutmix) on ViT performance for FSL.

Research Objectives

6

(7)

● To our knowledge, ViT architectures have not been used in the field of medical image classification in few-shot learning scenarios.

● Therefore, given their success in other areas of computer vision, it is important to assess their performance in this area under various conditions.

General Objective

7

(8)

● MIA: Medical Image Analysis

● MIC: Medical Image Classification

● CNN: Convolutional Neural Network

● ViT: Vision Transformer

● FSL: Few-shot Learning

● ProtoNet: Prototypical Network

Terminology

8

(9)


Related Works

9

(10)

Limited number of papers on few-shot learning with ViT

● Hu et al. investigated a simple FSL pipeline with ViT and ResNet50 backbones. Their pipeline with a ViT outperformed the one with a CNN. [1]

● Chen et al. used a vanilla ViT with a masking operation and reported improved few-shot learning performance. [2]

However, these works do not consider medical datasets.

Few-shot Learning with ViT

[1] Shell Xu Hu, Da Li, Jan Stühmer, Minyoung Kim, and Timothy M Hospedales. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9077, 2022.

[2] Yuzhong Chen, Zhenxiang Xiao, Lin Zhao, Lu Zhang, Haixing Dai, David Weizhong Liu, Zihao Wu, Changhe Li, Tuo Zhang, Changying Li, et al. Mask-guided vision transformer (mg-vit) for few-shot learning. arXiv preprint arXiv:2205.09995, 2022.

10

(11)

Krishnan and Krishnan fine-tuned off-the-shelf CNN and ViT models for medical image classification. ViT achieved the highest accuracy. [3]

Perera et al. proposed a lightweight transformer architecture called POCFormer for COVID-19 detection on portable devices and reported comparable scores with bigger models. [4]

Duong et al. combined CNN and ViT for Tuberculosis detection in Chest X-ray images and reported high scores. [5]

Behrendt et al. systematically compared ViTs and CNNs for multi-label medical image classification. DeiT outperformed other models across all dataset sizes. [6]

These works and others show that ViTs can be successfully used for MIC. However, they do not consider FSL.

Medical Image Classification using Vision Transformers

[3] Koushik Sivarama Krishnan and Karthik Sivarama Krishnan. Vision transformer based covid-19 detection using chest x-rays. In 2021 6th International Conference on Signal Processing, Computing and Control (ISPCC), pages 644–648. IEEE, 2021.

[4] Shehan Perera, Srikar Adhikari, and Alper Yilmaz. Pocformer: A lightweight transformer architecture for detection of covid-19 using point of care ultrasound. In 2021 IEEE International Conference on Image Processing (ICIP), pages 195– 199. IEEE, 2021.

[5] Linh T Duong, Nhi H Le, Toan B Tran, Vuong M Ngo, and Phuong T Nguyen. Detection of tuberculosis from chest x-ray images: boosting the performance with vision transformer and transfer learning. Expert Systems with Applications, 184:115519, 2021.

[6] Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, and Alexander Schlaefer. Data-efficient vision transformers for multi-label disease classification on chest radiographs. Current Directions in Biomedical Engineering, 8(1):34–37, 2022.

11

(12)

Singh et al. proposed MetaMed, a meta-learning-based approach for few-shot medical image classification using Reptile and a simple CNN. [7]

Dai et al. introduced PFEMed, a novel few-shot classification method for medical images. This approach surpassed MetaMed on the Pap smear dataset by over 2.63%. [8]

Cherti and Jitsev investigated the effect of pre-training scale on intra- and inter-domain transfer settings and demonstrated that transfer learning benefits from larger pre-training scales. [9]

These works use CNNs for FSL in MIC, even though ViTs have been shown to outperform CNNs in other vision domains.

Medical Image Classification using Few-Shot Learning

[7] Rishav Singh, Vandana Bharti, Vishal Purohit, Abhinav Kumar, Amit Kumar Singh, and Sanjay Kumar Singh. Metamed: Few-shot medical image classification using gradient-based meta-learning. Pattern Recognition, 120:108111, 2021.

[8] Zhiyong Dai, Jianjun Yi, Lei Yan, Qingwen Xu, Liang Hu, Qi Zhang, Jiahui Li, and Guoqiang Wang. Pfemed: Few-shot medical image classification using prior guided feature enhancement. Pattern Recognition, 134:109108, 2023.

[9] Mehdi Cherti and Jenia Jitsev. Effect of pre-training scale on intra-and inter-domain full and few-shot transfer learning for natural and medical x-ray chest images. arXiv preprint arXiv:2106.00116, 2021. 12

(13)

In this work, we aimed to bridge the gap in knowledge by:

● Employing various ViT architectures in few-shot learning for medical image classification

● Evaluating their performance by comparing them with similar CNNs

● Examining the impact of advanced data augmentation techniques

We utilized two FSL algorithms: Prototypical Networks and Reptile.

Proposed idea

13

(14)


Methodology

14

(15)

● Let D = {D1, D2, ..., Dn} be a collection of n medical datasets, where each dataset Dk consists of pairs (xj, yj) representing an image and its label.

● Datasets are divided into meta-test set (Dmeta-test) and meta-train set (Dmeta-train)

● Utilize abundant data in Dmeta-train to learn better initial weights (Reptile) or develop effective embedding space (ProtoNet)

● Goal: Improve performance on problems Dmeta-test with limited data (novel class data)

Problem Definition

15

(16)

● Goal: Develop models that generalize effectively to new tasks with limited labeled examples

● Task difficulty: N-way-K-shot (N = number of classes, K = samples per class), e.g., 3-way-3-shot

● Support set for training, Query set for testing

Few-shot Learning

16

Fig 2. N-way-K-shot task example (3-way-3-shot)
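To make the episodic setup concrete, here is a minimal sketch of how an N-way-K-shot episode with support and query sets could be sampled from a labelled dataset; the function name, the per-class query count, and the plain-Python sampling are illustrative choices, not the exact code used in the thesis.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=3, k_shot=3, n_query=5):
    """Sample one N-way-K-shot episode (support + query sets) from (image, label) pairs."""
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    classes = random.sample(list(by_class), n_way)          # pick N classes for this episode
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        images = random.sample(by_class[cls], k_shot + n_query)
        support += [(img, episode_label) for img in images[:k_shot]]   # K labelled examples per class
        query += [(img, episode_label) for img in images[k_shot:]]     # held-out examples for evaluation
    return support, query
```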

(17)

Meta-learning: learn to solve new tasks by drawing on experience from previous tasks (sharing a learning method).

Models adapt quickly to a new task from a few examples.

Divided into meta-training and meta-testing phases.

Data is presented episodically (one episode is an N-way-K-shot task).

Transfer learning: pre-train on a large dataset, then fine-tune on limited novel data (sharing learned knowledge).

Less effective when there is a large domain gap between the source and target datasets.

Data augmentation: enhance the limited support set by generating new samples through various techniques.

Few-shot Learning Approaches

17

(18)

System Pipeline

Fig 3. Overall Pipeline

18

(19)

Meta-learning algorithm designed for few-shot learning

Finds initial parameters that can be quickly adapted to new tasks

Two-stage process: inner-loop updates and outer-loop meta-updates

Quicker training and simpler implementation compared to MAML

Cross-entropy loss is used during the meta-training and meta-testing phases

Reptile [10]


[10] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018. 19
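For illustration, a minimal PyTorch-style sketch of one Reptile meta-iteration as described above: a few inner SGD steps on a task-specific copy of the model, then an outer interpolation of the initial weights towards the adapted ones. The function name, default learning rates, and step count are placeholders rather than the exact thesis training code.

```python
import copy
import torch

def reptile_step(model, support_x, support_y, inner_lr=1e-3, outer_lr=1e-1, inner_steps=5):
    """One Reptile meta-iteration on a single task."""
    loss_fn = torch.nn.CrossEntropyLoss()
    fast_model = copy.deepcopy(model)                       # task-specific copy of the model
    inner_opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)

    for _ in range(inner_steps):                            # inner-loop adaptation on the support set
        inner_opt.zero_grad()
        loss = loss_fn(fast_model(support_x), support_y)
        loss.backward()
        inner_opt.step()

    with torch.no_grad():                                   # outer meta-update: move towards adapted weights
        for p, p_fast in zip(model.parameters(), fast_model.parameters()):
            p += outer_lr * (p_fast - p)
```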

(20)

● Learn a prototype for each class in the embedding space

● The figure on the right demonstrates the classification procedure

Prototypical Networks [11]


[11] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.

20
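A minimal sketch of the Prototypical Networks classification step: class prototypes are the mean support embeddings, and query logits are negative squared Euclidean distances to those prototypes. The backbone is assumed to be any feature extractor (e.g. a timm ViT with its head removed), and the function name is illustrative.

```python
import torch

def protonet_logits(backbone, support_x, support_y, query_x, n_way):
    """Classify query images by their distance to class prototypes in the embedding space."""
    z_support = backbone(support_x)                          # (N*K, dim) support embeddings
    z_query = backbone(query_x)                              # (Q, dim) query embeddings

    prototypes = torch.stack(                                # (N, dim): mean embedding per class
        [z_support[support_y == c].mean(dim=0) for c in range(n_way)]
    )

    dists = torch.cdist(z_query, prototypes) ** 2            # squared Euclidean distances
    return -dists                                            # logits; train with cross-entropy on queries
```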

(21)

● Add Squeeze & Excitation (SE) block to ViT_small [18] and ConViT_small [28]

● Adapted from [29], where it showed an improvement over vanilla ViT.

● We also apply the SE block to ConViT.

● ConViT is similar to the original ViT but uses gated positional self-attention in some layers to introduce a convolutional inductive bias of locality.

Custom ViT

21

[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[28] S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” in International Conference on Machine Learning, 2021, pp. 2286–2296.

[29] M. Aouayeb, W. Hamidouche, C. Soladie, K. Kpalma, and R. Seguier, “Learning vision transformer with squeeze and excitation for facial expression recognition,” arXiv preprint arXiv:2107.03107, 2021.

Fig. 4. Architecture of Custom ViTs. Figure is adapted from [18], [28], and [29].
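As a rough illustration of where the SE block sits, the sketch below applies squeeze-and-excitation to the pooled ViT feature before the classification head, following the idea adapted from [29]; the class names, reduction ratio, and exact placement are assumptions and may differ from the thesis architecture.

```python
import timm
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over the feature dimension of a pooled token embedding."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x):                                    # x: (batch, dim)
        return x * self.fc(x)                                # re-weight feature channels

class ViTSmallSE(nn.Module):
    """ViT_small backbone + SE block on the pooled feature, followed by a linear classifier."""
    def __init__(self, num_classes, pretrained=True):
        super().__init__()
        self.backbone = timm.create_model("vit_small_patch16_224",
                                          pretrained=pretrained, num_classes=0)
        self.se = SEBlock(self.backbone.num_features)
        self.head = nn.Linear(self.backbone.num_features, num_classes)

    def forward(self, x):
        return self.head(self.se(self.backbone(x)))
```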

(22)

● Encourage the model to learn more generalized representations by providing more diverse and robust training data

● Techniques used: Cutout, Mixup, and Cutmix

● Note: Only Cutout is compatible with ProtoNet algorithm

Advanced Augmentation Techniques

22

(23)

● Cutout is a data augmentation technique that randomly removes rectangular regions from input images during training.

Cutout [12]

[12] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

23
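A minimal sketch of Cutout on a batch of image tensors; the patch size is a tunable hyperparameter, not a value reported in the thesis.

```python
import torch

def cutout(images, size=32):
    """Zero out one randomly placed square patch per image in an (N, C, H, W) batch."""
    images = images.clone()
    n, _, h, w = images.shape
    for i in range(n):
        y = torch.randint(0, h - size + 1, (1,)).item()
        x = torch.randint(0, w - size + 1, (1,)).item()
        images[i, :, y:y + size, x:x + size] = 0.0
    return images
```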

(24)

● Mixup is a method that involves generating new training examples by taking a linear combination of two randomly chosen input images and their corresponding labels.

Mixup [13]

Example: two images with labels 0 and 1, mixed with λ = 0.5, produce a blended image with label 0.5.

[13] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

24
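A minimal sketch of Mixup on a batch, with the mixing coefficient drawn from a Beta distribution as in [13]; alpha is a hyperparameter, and the soft label is handled by returning both label sets together with the mixing weight.

```python
import torch

def mixup(images, labels, alpha=0.4):
    """Mix each image with a randomly permuted partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam

# The training loss mixes the two cross-entropy terms with the same weight:
#   loss = lam * ce(logits, y_a) + (1 - lam) * ce(logits, y_b)
```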

(25)

● Cutmix is a method that combines the strengths of both Cutout and Mixup. The idea behind Cutmix is to replace a portion of an input image x1 with another image x2, while also adjusting the corresponding labels accordingly.

Cutmix [14]


[14] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.

25
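A minimal sketch of CutMix on a batch: a random box from a shuffled partner image is pasted in, and the label weight is set to the remaining-area ratio, as in [14]; alpha is a hyperparameter.

```python
import math
import torch

def cutmix(images, labels, alpha=1.0):
    """Paste a random box from a shuffled partner image; labels are mixed by area ratio."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    h, w = images.shape[-2:]

    cut_h, cut_w = int(h * math.sqrt(1 - lam)), int(w * math.sqrt(1 - lam))
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)                # actual kept-area ratio
    return mixed, labels, labels[perm], lam
```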

(26)


Experiments and Results

26

(27)

● Three publicly available medical imaging datasets: BreakHis, ISIC 2018, and Pap Smear

● Images downsampled to 224x224 for compatibility with pre-trained models

● Each dataset contains at least six classes for 2- and 3-way n-shot learning

ISIC 2018 [15]:

● 10,015 dermoscopic images of skin lesions across 7 classes

● 4 meta-train and 3 meta-test classes

Datasets: ISIC 2018

[15] Jinyi Zou, Xiao Ma, Cheng Zhong, and Yao Zhang. Dermoscopic image analysis for isic challenge 2018. arXiv preprint arXiv:1807.08948, 2018. 27

(28)

BreakHis [16]:

● 9109 microscopic images of breast tumor tissues from 82 patients

● 8 classes, with 5 meta-train and 3 meta-test classes

Pap Smear [17]:

● 917 microscopic images of cervical smears

● 7 distinct classes, with 4 meta-train and 3 meta-test classes

BreakHis and Pap smear

[16] Fabio A Spanhol, Luiz S Oliveira, Caroline Petitjean, and Laurent Heutte. A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7):1455–1462, 2015.

[17] Jan Jantzen, Jonas Norup, Georgios Dounias, and Beth Bjerregaard. Pap-smear benchmark data for pattern classification.

Nature inspired Smart Information Systems (NiSIS 2005), pages 1–9, 2005.

28

(29)

● ViT family [18]: ViT_tiny, ViT_small, and ViT_base

● Other ViT architectures: Mobile_ViT (MViT_v2_0.5) [19], DeiT_base [20], and Swin_base [21]

● CNN models: ResNet50 [22] and VGG16 [23]

● All models pre-trained on the ImageNet1k dataset

Models

Model        | Dim  | Parameters
ViT_tiny     | 192  | 5.5M
MViT_v2_0.5  | 384  | 1.4M
ViT_small    | 384  | 22M
ViT_base     | 768  | 85M
DeiT_base    | 768  | 85M
Swin_base    | 1024 | 86M
ResNet50     | 2048 | 23.5M
VGG16        | 4096 | 134M

[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[19] Sachin Mehta and Mohammad Rastegari. Separable self-attention for mobile vision transformers, 2022.

[20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015. 29

(30)

● Python programming language

● PyTorch framework [24]

● Pre-trained models obtained from the timm library [25]

● ProtoNet experiments conducted using the easyfsl library [26]

● Hardware specifications:

PC: NVIDIA RTX 3060 Ti, Intel i5-10400 CPU, 16GB RAM

Google Colab Pro Platform: NVIDIA Tesla T4 or A100

Implementation details and Settings

[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[25] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

[26] Etienne Bennequin. easyfsl. https://github.com/sicara/easy-few-shot-learning, 2021.

30

(31)

● Utilized pre-trained model checkpoints

● Employed data augmentation techniques

● ProtoNet: 20 epochs, 500 episodes per epoch, SGD optimizer, learning rate of 1e-5 or 1e-6, cosine annealing learning rate schedule

● Reptile: SGD optimizer, learning rate of 1e-3 for inner optimization and 1e-1 for the outer meta-update, 1000 meta-iterations, batch size of 10 tasks, 5 or 50 adaptation steps per task

Training

31
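For concreteness, a sketch of how the ProtoNet training setup above could be wired up with timm and PyTorch (SGD with cosine annealing over 20 epochs of 500 episodes); the model name and learning rate are taken from the settings listed here, while the episode loop body is only indicated by comments.

```python
import timm
import torch

# ViT_small backbone with the classification head removed, used as a ProtoNet feature extractor.
backbone = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):                     # 20 epochs
    for episode in range(500):              # 500 episodes per epoch
        # sample an episode, compute the ProtoNet loss on the query predictions, then:
        # loss.backward(); optimizer.step(); optimizer.zero_grad()
        pass
    scheduler.step()                        # cosine annealing of the learning rate
```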

(32)

● Accuracy (%) as evaluation metric

● 400 episodes randomly selected from novel categories in the test set

● Average accuracy rate for image classification

● Tested 2- and 3-way 3-, 5-, and 10-shot few-shot learning scenarios

Evaluation

32
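A minimal sketch of the evaluation loop: average query accuracy over a set of randomly sampled test episodes (400 in the thesis); `predict_fn` stands for any few-shot classifier, e.g. the ProtoNet step sketched earlier, and the episode format is assumed.

```python
import torch

@torch.no_grad()
def mean_episode_accuracy(predict_fn, episodes):
    """Average query accuracy (%) over episodes of (support_x, support_y, query_x, query_y)."""
    correct, total = 0, 0
    for support_x, support_y, query_x, query_y in episodes:
        preds = predict_fn(support_x, support_y, query_x).argmax(dim=-1)
        correct += (preds == query_y).sum().item()
        total += query_y.numel()
    return 100.0 * correct / total
```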

(33)

● Investigates results of pre-trained models in few-shot classification tasks without meta-training

● Focus on ISIC 2018 and BreakHis x100 datasets

● Pure transfer learning, no fine-tuning to meta-datasets

● Models directly used as a backbone of a ProtoNet

● Serves as a baseline for further sections

Pretrained ViTs without Meta-training

33

(34)

● Models with more parameters generally show better performance

● Mobile ViT (MViT_v2_0.5) has the lowest score, followed by ViT_tiny

● ViT and CNN models show comparable results

● Results only serve as an initial baseline and should not be used to judge the overall performance of models in few-shot learning

Pretrained ViT without Meta-training: Observations

BreakHis X100, Pre-training Only (accuracy, %)

Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ViT_tiny     | 71.25 / 77.35 / 78.50 | 57.27 / 61.53 / 67.87
ViT_small    | 74.71 / 79.42 / 83.24 | 63.22 / 69.25 / 73.91
ViT_base     | 74.50 / 80.70 / 84.90 | 63.90 / 69.17 / 75.50
Swin_base    | 77.95 / 83.20 / 65.37 | 72.77 / 80.30 / 82.30
ResNet50     | 79.62 / 83.31 / 85.72 | 68.75 / 73.09 / 77.61
VGG16        | 70.40 / 79.15 / 81.75 | 60.70 / 65.40 / 71.67

ISIC 2018, Pre-training Only (accuracy, %)

Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ViT_tiny     | 70.25 / 74.60 / 76.15 | 54.83 / 59.51 / 65.41
MViT_v2_0.5  | 59.30 / 63.10 / 67.80 | 46.13 / 48.27 / 49.90
ViT_small    | 77.40 / 81.89 / 85.95 | 63.67 / 69.84 / 75.28
ViT_base     | 74.75 / 77.70 / 82.45 | 60.73 / 65.73 / 69.97
DeiT_base    | 71.75 / 79.40 / 81.75 | 58.33 / 61.87 / 69.47
Swin_base    | 75.10 / 80.15 / 82.00 | 62.27 / 67.67 / 71.50
ResNet50     | 72.66 / 76.17 / 79.15 | 56.69 / 62.31 / 65.81
VGG16        | 72.45 / 78.60 / 81.30 | 60.00 / 65.87 / 68.20

34

(35)

35

ISIC 2018

Algorithm                 | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet                  | MViT_v2_0.5  | 74.64 / 76.94 / 81.50 | 60.60 / 64.23 / 69.23
ProtoNet                  | ViT_tiny     | 81.03 / 83.61 / 86.52 | 67.84 / 71.82 / 77.68
ProtoNet                  | ViT_small    | 84.35 / 86.70 / 89.72 | 72.10 / 76.18 / 81.45
ProtoNet                  | ViT_base     | 83.94 / 86.02 / 90.26 | 72.75 / 77.69 / 81.99
ProtoNet                  | Swin_base    | 82.49 / 84.17 / 89.12 | 70.75 / 74.67 / 79.92
ProtoNet                  | ResNet50     | 66.62 / 68.65 / 72.81 | 51.43 / 53.83 / 58.34
ProtoNet                  | VGG16        | 72.32 / 76.04 / 80.69 | 57.81 / 61.86 / 66.92
ProtoNet w/o pre-training | ViT_small    | 56.19 / 57.55 / 60.17 | 39.87 / 41.08 / 41.88
Reptile, 5 steps          | ViT_small    | 71.23 / 76.65 / 81.38 | 66.20 / 72.23 / 78.10
Reptile, 5 steps          | ResNet50     | 59.50 / 62.80 / 65.78 | 42.62 / 43.22 / 44.13
Reptile, 50 steps         | ViT_small    | 76.05 / 80.30 / 85.55 | 67.50 / 73.15 / 77.37
Reptile, 50 steps         | ResNet50     | 66.68 / 72.13 / 77.03 | 53.63 / 57.03 / 60.18

BreakHis X100

Algorithm         | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet          | MViT_v2_0.5  | 76.89 / 79.60 / 84.65 | 64.51 / 71.43 / 77.05
ProtoNet          | ViT_tiny     | 75.34 / 79.44 / 83.53 | 62.64 / 69.88 / 75.18
ProtoNet          | ViT_small    | 80.64 / 83.80 / 87.62 | 69.39 / 75.91 / 81.47
ProtoNet          | ViT_base     | 79.33 / 81.65 / 84.62 | 68.52 / 73.27 / 76.38
ProtoNet          | Swin_base    | 79.46 / 82.86 / 86.26 | 68.34 / 74.28 / 80.51
ProtoNet          | ResNet50     | 68.62 / 72.12 / 73.31 | 55.80 / 60.28 / 61.88
ProtoNet          | VGG16        | 67.06 / 69.70 / 74.74 | 52.89 / 57.94 / 61.15
Reptile, 5 steps  | ViT_small    | 66.90 / 74.20 / 81.80 | 47.37 / 57.17 / 68.47
Reptile, 5 steps  | ResNet50     | 64.90 / 67.60 / 73.25 | 34.70 / 36.33 / 38.23
Reptile, 50 steps | ViT_small    | 73.45 / 77.90 / 86.18 | 55.05 / 63.38 / 75.92
Reptile, 50 steps | ResNet50     | 72.15 / 76.63 / 80.33 | 60.33 / 63.45 / 68.47

Pap Smear

Algorithm         | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet          | MViT_v2_0.5  | 80.84 / 84.36 / 86.88 | 68.04 / 73.24 / 78.37
ProtoNet          | ViT_tiny     | 84.65 / 86.96 / 88.86 | 74.33 / 77.92 / 81.17
ProtoNet          | ViT_small    | 92.40 / 94.05 / 94.90 | 86.38 / 89.09 / 90.62
ProtoNet          | ViT_base     | 92.05 / 93.26 / 93.94 | 85.21 / 88.48 / 89.47
ProtoNet          | Swin_base    | 85.42 / 87.56 / 89.78 | 75.73 / 79.88 / 82.46
ProtoNet          | ResNet50     | 70.49 / 71.75 / 69.61 | 57.74 / 58.48 / 59.60
ProtoNet          | VGG16        | 87.95 / 90.11 / 91.45 | 78.21 / 81.81 / 84.32
Reptile, 5 steps  | ViT_small    | 83.35 / 87.05 / 91.96 | 72.52 / 81.13 / 87.94
Reptile, 5 steps  | ResNet50     | 71.44 / 74.59 / 78.39 | 48.00 / 49.86 / 50.44
Reptile, 50 steps | ViT_small    | 85.85 / 88.33 / 92.55 | 76.75 / 82.58 / 86.92
Reptile, 50 steps | ResNet50     | 86.60 / 90.38 / 90.85 | 65.73 / 67.75 / 73.83

Meta-Training Results

(36)

Analyzing test results of few-shot classification models using the ProtoNet and Reptile meta-learning algorithms

ViTs paired with ProtoNet showed noticeable performance gains across all datasets and FSL tasks

Mobile ViT and ViT_tiny, being the smallest models, showed lower results

ViT_small demonstrated the highest results in most cases, often outperforming bigger models

CNNs perform worse after meta-training than before and are generally non-competitive

Meta-Training Results

36

(37)

Importance of pre-training highlighted

ProtoNet with a ViT_small backbone pre-trained on ImageNet1k achieves accuracy scores up to 30% higher than the same setup without pre-training

This indicates that a more discriminative feature representation space is learned

ViT_small outperforms ResNet50 in most tasks across datasets

Meta-Training Results: ProtoNet

37

ISIC 2018

Algorithm                 | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet w/o pre-training | ViT_small    | 56.19 / 57.55 / 60.17 | 39.87 / 41.08 / 41.88
ProtoNet                  | MViT_v2_0.5  | 74.64 / 76.94 / 81.50 | 60.60 / 64.23 / 69.23
ProtoNet                  | ViT_tiny     | 81.03 / 83.61 / 86.52 | 67.84 / 71.82 / 77.68
ProtoNet                  | ViT_small    | 84.35 / 86.70 / 89.72 | 72.10 / 76.18 / 81.45
ProtoNet                  | ViT_base     | 83.94 / 86.02 / 90.26 | 72.75 / 77.69 / 81.99
ProtoNet                  | Swin_base    | 82.49 / 84.17 / 89.12 | 70.75 / 74.67 / 79.92
ProtoNet                  | ResNet50     | 66.62 / 68.65 / 72.81 | 51.43 / 53.83 / 58.34
ProtoNet                  | VGG16        | 72.32 / 76.04 / 80.69 | 57.81 / 61.86 / 66.92

BreakHis X100

Algorithm | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet  | MViT_v2_0.5  | 76.89 / 79.60 / 84.65 | 64.51 / 71.43 / 77.05
ProtoNet  | ViT_tiny     | 75.34 / 79.44 / 83.53 | 62.64 / 69.88 / 75.18
ProtoNet  | ViT_small    | 80.64 / 83.80 / 87.62 | 69.39 / 75.91 / 81.47
ProtoNet  | ViT_base     | 79.33 / 81.65 / 84.62 | 68.52 / 73.27 / 76.38
ProtoNet  | Swin_base    | 79.46 / 82.86 / 86.26 | 68.34 / 74.28 / 80.51
ProtoNet  | ResNet50     | 68.62 / 72.12 / 73.31 | 55.80 / 60.28 / 61.88
ProtoNet  | VGG16        | 67.06 / 69.70 / 74.74 | 52.89 / 57.94 / 61.15

Pap Smear

Algorithm | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet  | MViT_v2_0.5  | 80.84 / 84.36 / 86.88 | 68.04 / 73.24 / 78.37
ProtoNet  | ViT_tiny     | 84.65 / 86.96 / 88.86 | 74.33 / 77.92 / 81.17
ProtoNet  | ViT_small    | 92.40 / 94.05 / 94.90 | 86.38 / 89.09 / 90.62
ProtoNet  | ViT_base     | 92.05 / 93.26 / 93.94 | 85.21 / 88.48 / 89.47
ProtoNet  | Swin_base    | 85.42 / 87.56 / 89.78 | 75.73 / 79.88 / 82.46
ProtoNet  | ResNet50     | 70.49 / 71.75 / 69.61 | 57.74 / 58.48 / 59.60
ProtoNet  | VGG16        | 87.95 / 90.11 / 91.45 | 78.21 / 81.81 / 84.32

(38)

Performance highly dependent on proper hyperparameter selection

Noticeable performance increase when task-adapted for more steps (5 to 50)

ViTs adapt faster (with fewer steps)

ViTs with Reptile still perform below ProtoNet-based ViTs

ResNet50 showed much better results with Reptile

ProtoNet with a ViT backbone is a better option than a CNN paired with Reptile due to easier use and training, better performance, and lower complexity

Meta-Training Results: Reptile

38

ISIC 2018

Algorithm                 | Model     | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet w/o pre-training | ViT_small | 56.19 / 57.55 / 60.17 | 39.87 / 41.08 / 41.88
Reptile, 5 steps          | ViT_small | 71.23 / 76.65 / 81.38 | 66.20 / 72.23 / 78.10
Reptile, 5 steps          | ResNet50  | 59.50 / 62.80 / 65.78 | 42.62 / 43.22 / 44.13
Reptile, 50 steps         | ViT_small | 76.05 / 80.30 / 85.55 | 67.50 / 73.15 / 77.37
Reptile, 50 steps         | ResNet50  | 66.68 / 72.13 / 77.03 | 53.63 / 57.03 / 60.18

BreakHis X100

Algorithm         | Model     | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
Reptile, 5 steps  | ViT_small | 66.90 / 74.20 / 81.80 | 47.37 / 57.17 / 68.47
Reptile, 5 steps  | ResNet50  | 64.90 / 67.60 / 73.25 | 34.70 / 36.33 / 38.23
Reptile, 50 steps | ViT_small | 73.45 / 77.90 / 86.18 | 55.05 / 63.38 / 75.92
Reptile, 50 steps | ResNet50  | 72.15 / 76.63 / 80.33 | 60.33 / 63.45 / 68.47

Pap Smear

Algorithm         | Model     | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
Reptile, 5 steps  | ViT_small | 83.35 / 87.05 / 91.96 | 72.52 / 81.13 / 87.94
Reptile, 5 steps  | ResNet50  | 71.44 / 74.59 / 78.39 | 48.00 / 49.86 / 50.44
Reptile, 50 steps | ViT_small | 85.85 / 88.33 / 92.55 | 76.75 / 82.58 / 86.92
Reptile, 50 steps | ResNet50  | 86.60 / 90.38 / 90.85 | 65.73 / 67.75 / 73.83

(39)

39

For ProtoNet:

Cutout resulted in lower scores for most tasks (ViT_small and ResNet50)

For Reptile:

Cutout led to lower performance in most cases, except for 3-way k-shot tasks of ResNet50

CutMix generally resulted in lower scores for the majority of tasks

Mixup showed an uplift in accuracy scores in 4 tasks (ResNet50) and 3 tasks (ViT_small) out of 6

Mixup performs better than the other techniques and is recommended as a good data augmentation technique

Data Augmentation Techniques - ISIC 2018 Dataset

ISIC 2018 (accuracy, %; each cell: Standard / Cutout / Mixup / Cutmix)

Algorithm | Model     | FSL task | 2-way                         | 3-way
ProtoNet  | ViT_small | 3-shot   | 84.35 / 81.73 / - / -         | 72.10 / 70.55 / - / -
ProtoNet  | ViT_small | 5-shot   | 86.70 / 85.89 / - / -         | 76.18 / 76.23 / - / -
ProtoNet  | ViT_small | 10-shot  | 89.72 / 89.22 / - / -         | 81.45 / 81.13 / - / -
ProtoNet  | ResNet50  | 3-shot   | 66.62 / 65.52 / - / -         | 51.43 / 49.32 / - / -
ProtoNet  | ResNet50  | 5-shot   | 68.65 / 68.75 / - / -         | 53.83 / 53.81 / - / -
ProtoNet  | ResNet50  | 10-shot  | 72.81 / 72.18 / - / -         | 58.34 / 57.74 / - / -
Reptile   | ViT_small | 3-shot   | 76.05 / 75.30 / 77.50 / 74.85 | 67.50 / 64.87 / 66.20 / 67.40
Reptile   | ViT_small | 5-shot   | 80.30 / 80.35 / 79.40 / 77.75 | 73.15 / 69.97 / 71.33 / 72.57
Reptile   | ViT_small | 10-shot  | 85.55 / 83.95 / 85.75 / 85.65 | 77.37 / 76.53 / 77.87 / 79.63
Reptile   | ResNet50  | 3-shot   | 70.28 / 68.73 / 70.75 / 70.10 | 54.47 / 55.70 / 55.00 / 53.70
Reptile   | ResNet50  | 5-shot   | 75.78 / 73.60 / 74.15 / 74.60 | 58.22 / 59.90 / 60.65 / 58.92
Reptile   | ResNet50  | 10-shot  | 78.83 / 76.58 / 78.03 / 77.95 | 61.58 / 64.67 / 64.95 / 63.62

(40)

Focused on ViT_small and ResNet50 models with both ProtoNet and Reptile

Meta-training without advanced augmentation techniques

Note: MetaMed used a simple CNN model with only 3840 parameters, making the comparison not entirely fair

Key Observations:

ViT_small outperforms other models in all cases with ProtoNet, and in most cases with Reptile

ResNet50 lags behind the performance of other models, including those presented in the MetaMed paper

Comparison with MetaMed [7]

ISIC 2018

Algorithm | Model     | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet  | ViT_small | 84.35 / 86.70 / 89.72 | 72.10 / 76.18 / 81.45
ProtoNet  | ResNet50  | 66.62 / 68.65 / 72.81 | 51.43 / 53.83 / 58.34
Reptile   | ViT_small | 76.05 / 80.30 / 85.55 | 67.50 / 73.15 / 77.37
Reptile   | ResNet50  | 70.28 / 75.78 / 78.83 | 54.47 / 58.22 / 61.58
Reptile   | MetaMed   | 72.75 / 75.62 / 81.37 | 54.83 / 59.33 / 69.75

BreakHis X100

Algorithm | Model     | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet  | ViT_small | 80.64 / 83.80 / 87.62 | 69.39 / 75.91 / 81.47
ProtoNet  | ResNet50  | 68.62 / 72.12 / 73.31 | 55.80 / 60.28 / 61.88
Reptile   | ViT_small | 73.45 / 77.90 / 86.18 | 55.05 / 63.38 / 75.92
Reptile   | ResNet50  | 72.15 / 76.63 / 80.33 | 60.33 / 63.45 / 68.47
Reptile   | MetaMed   | 78.75 / 81.38 / 83.88 | 63.08 / 66.42 / 74.08

Pap Smear

Algorithm | Model     | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet  | ViT_small | 92.40 / 94.05 / 94.90 | 86.38 / 89.09 / 90.62
ProtoNet  | ResNet50  | 70.49 / 71.75 / 69.61 | 57.74 / 58.48 / 59.60
Reptile   | ViT_small | 83.35 / 87.05 / 91.96 | 72.52 / 81.13 / 87.94
Reptile   | ResNet50  | 71.44 / 74.59 / 78.39 | 48.00 / 49.86 / 50.44
Reptile   | MetaMed   | 85.37 / 86.50 / 89.37 | 70.58 / 72.42 / 83.00

[7] Rishav Singh, Vandana Bharti, Vishal Purohit, Abhinav Kumar, Amit Kumar Singh, and Sanjay Kumar Singh. Metamed: Few-shot medical image classification using gradient-based meta-learning. Pattern Recognition, 120:108111, 2021.

40

(41)

● Custom ViTs were pre-trained on ImageNet1k, then on CIFAR-100.

● Generally lower results compared with the unmodified models.

● CIFAR-100 is too small for proper pre-training.

● Future work: use bigger pre-training datasets.

Custom ViT Preliminary Results

41

ISIC 2018

Algorithm | Model        | 2-way 3/5/10-shot     | 3-way 3/5/10-shot
ProtoNet  | ViT_small    | 84.35 / 86.70 / 89.72 | 72.10 / 76.18 / 81.45
ProtoNet  | ViT_small_SE | 77.84 / 80.66 / 84.36 | 64.30 / 68.24 / 74.66
ProtoNet  | ConViT       | 76.33 / 78.89 / 82.94 | 63.17 / 67.07 / 71.96
ProtoNet  | ConViT_SE    | 75.21 / 77.18 / 81.71 | 60.94 / 65.11 / 69.60

(42)


Conclusion

42

(43)

In this work we have shown that:

● ViTs can be effectively used for few-shot medical image classification, outperforming comparable CNNs, especially with the ProtoNet FSL algorithm.

Reptile performance depends highly on hyperparameter selection.

Additionally, we demonstrated that:

● ViTs reach high performance with a simpler ProtoNet.

● Tested a custom ViT architecture with an SE block for FSL. Only preliminary results so far, with lower performance than the unmodified models.

● Advanced augmentation techniques showed mixed results

Mixup improved accuracy scores in most cases.

Cutout and Cutmix showed positive results in less than 50% of tasks.

Conclusion and Key Contributions

43

(44)

● Designing a ViT architecture fit for FSL.

● Investigating the use of synthetic data augmentation/generation techniques (e.g., Variational Autoencoders, Generative Adversarial Networks) in the pipeline.

Future Directions

44

(45)

[1] Shell Xu Hu, Da Li, Jan Stühmer, Minyoung Kim, and Timothy M Hospedales. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9077, 2022.

[2] Yuzhong Chen, Zhenxiang Xiao, Lin Zhao, Lu Zhang, Haixing Dai, David Weizhong Liu, Zihao Wu, Changhe Li, Tuo Zhang, Changying Li, et al.

Mask-guided vision transformer (mg-vit) for few-shot learning. arXiv preprint arXiv:2205.09995, 2022.

[3] Koushik Sivarama Krishnan and Karthik Sivarama Krishnan. Vision transformer based covid-19 detection using chest x-rays. In 2021 6th International Conference on Signal Processing, Computing and Control (ISPCC), pages 644–648. IEEE, 2021.

[4] Shehan Perera, Srikar Adhikari, and Alper Yilmaz. Pocformer: A lightweight transformer architecture for detection of covid-19 using point of care ultrasound. In 2021 IEEE International Conference on Image Processing (ICIP), pages 195– 199. IEEE, 2021.

[5] Linh T Duong, Nhi H Le, Toan B Tran, Vuong M Ngo, and Phuong T Nguyen. Detection of tuberculosis from chest x-ray images: boosting the performance with vision transformer and transfer learning. Expert Systems with Applications, 184:115519, 2021.

[6] Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, and Alexander Schlaefer. Data-efficient vision transformers for multi-label disease classification on chest radiographs. Current Directions in Biomedical Engineering, 8(1):34–37, 2022.

[7] Rishav Singh, Vandana Bharti, Vishal Purohit, Abhinav Kumar, Amit Kumar Singh, and Sanjay Kumar Singh. Metamed: Few-shot medical image classification using gradient-based meta-learning. Pattern Recognition, 120:108111, 2021.

[8] Zhiyong Dai, Jianjun Yi, Lei Yan, Qingwen Xu, Liang Hu, Qi Zhang, Jiahui Li, and Guoqiang Wang. Pfemed: Few-shot medical image classification using prior guided feature enhancement. Pattern Recognition, 134:109108, 2023.

[9] Mehdi Cherti and Jenia Jitsev. Effect of pre-training scale on intra-and inter-domain full and few-shot transfer learning for natural and medical x-ray chest images.

arXiv preprint arXiv:2106.00116, 2021.

[10] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

[11] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.

[12] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[13] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[14] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.

Reference List

45

(46)

[15] Jinyi Zou, Xiao Ma, Cheng Zhong, and Yao Zhang. Dermoscopic image analysis for isic challenge 2018. arXiv preprint arXiv:1807.08948, 2018.

[16] Fabio A Spanhol, Luiz S Oliveira, Caroline Petitjean, and Laurent Heutte. A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7):1455–1462, 2015.

[17] Jan Jantzen, Jonas Norup, Georgios Dounias, and Beth Bjerregaard. Pap-smear benchmark data for pattern classification. Nature inspired Smart Information Systems (NiSIS 2005), pages 1–9, 2005.

[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[19] Sachin Mehta and Mohammad Rastegari. Separable self-attention for mobile vision transformers, 2022.

[20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.

[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.

Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[25] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

[26] Etienne Bennequin. easyfsl. https://github.com/sicara/easy-few-shot-learning, 2021.

[27] L. Cai, J. Gao, and D. Zhao, “A review of the application of deep learning in medical image classification and segmentation,” Annals of translational medicine, vol. 8, no.

11, 2020.

[28] S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” in International Conference on Machine Learning, 2021, pp. 2286–2296.

[29] M. Aouayeb, W. Hamidouche, C. Soladie, K. Kpalma, and R. Seguier, “Learning vision transformer with squeeze and excitation for facial expression recognition,” arXiv preprint arXiv:2107.03107, 2021.

Reference List

46

(47)


Q&A

47
