SaSPA: Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Computer Vision & Graphics Lab, Reichman University

SaSPA enhances fine-grained classification datasets by generating realistic, class-consistent image augmentations from the training set. This consistently leads to significant accuracy improvements.

Qualitative Comparison

Various generative augmentation methods applied to the Aircraft dataset. Text-to-image generation often compromises class fidelity, as seen in the unrealistic aircraft design (e.g., a tail at both ends). Img2Img trades fidelity off against diversity: a lower strength (e.g., 0.5) introduces minimal semantic changes, yielding high fidelity but limited diversity, whereas a higher strength (e.g., 0.75) introduces diversity but also inaccuracies, such as the incorrectly added engine. In contrast, SaSPA achieves both high fidelity and high diversity, which is critical for fine-grained visual classification tasks. D = diversity; F = fidelity.
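To make the strength trade-off concrete, here is a minimal sketch of SDEdit-style Img2Img augmentation using diffusers' StableDiffusionImg2ImgPipeline. The checkpoint, prompt, image path, and strength values are illustrative assumptions, not the exact setup from our experiments.

```python
# Sketch: SDEdit-style Img2Img augmentation (illustrative, not the paper's exact setup).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

real_image = load_image("aircraft.jpg").resize((512, 512))  # hypothetical path
prompt = "a photo of a Boeing 737-800"                      # class-conditioned prompt

# Lower strength keeps more of the real image (higher fidelity, less diversity);
# higher strength allows larger edits (more diversity, but risk of class errors).
for strength in (0.5, 0.75):
    augmented = pipe(prompt=prompt, image=real_image, strength=strength).images[0]
    augmented.save(f"aug_strength_{strength}.png")
```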

Abstract

Fine-grained visual classification (FGVC) involves classifying closely related subcategories. This task is inherently difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting image classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on text-to-image generation or Img2Img methods, such as SDEdit, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation techniques. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data.

SaSPA Overview

For a given FGVC dataset, we generate prompts via GPT-4 based on the meta-class. Each real image undergoes edge detection to provide a structural outline. Each edge map is used multiple times, paired each time with a different prompt and a different subject reference image from the same sub-class, as input to a ControlNet with BLIP-Diffusion as the base model. The generated images are then filtered using a dataset-trained model and CLIP to ensure relevance and quality. A sketch of this pipeline is shown below.
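Below is a minimal sketch of the generation and filtering steps, assuming diffusers' BlipDiffusionControlNetPipeline, controlnet_aux's CannyDetector, and the CLIP model from transformers. The checkpoints, prompts, file paths, Canny thresholds, class list, and filtering threshold are illustrative assumptions; the paper's exact configuration may differ.

```python
# Sketch of SaSPA-style generation + CLIP filtering (illustrative values throughout).
import torch
from controlnet_aux import CannyDetector
from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers.utils import load_image
from transformers import CLIPModel, CLIPProcessor

pipe = BlipDiffusionControlNetPipeline.from_pretrained(
    "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
).to("cuda")

# 1) Structure: edge map (e.g., Canny) of a real training image.
real_image = load_image("real_737.jpg")        # hypothetical path
edges = CannyDetector()(real_image, 30, 70, output_type="pil")

# 2) Subject: a reference image from the same sub-class.
subject_image = load_image("another_737.jpg")  # hypothetical path
subject = "Boeing 737-800"                     # sub-class name

# 3) Prompt: one of several GPT-4-generated, meta-class-level prompts.
prompt = "parked on a rainy runway at dusk"

generated = pipe(
    prompt,         # text prompt
    subject_image,  # subject reference image (BLIP-Diffusion)
    edges,          # ControlNet edge condition
    subject,        # source subject category
    subject,        # target subject category
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]

# 4) Filter: keep images that CLIP still assigns to the intended sub-class
# (the paper additionally filters with a classifier trained on the real data).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
class_names = ["Boeing 737-800", "Airbus A320", "Boeing 747-400"]  # illustrative
inputs = proc(text=class_names, images=generated, return_tensors="pt", padding=True)
probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
if probs[class_names.index(subject)] > 0.5:  # illustrative threshold
    generated.save("augmented_737.png")
```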

Example Augmentations

Example augmentations using our method (SaSPA). The {} placeholder represents the specific sub-class.
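As a concrete (hypothetical) illustration of the placeholder, a template like the one below would be instantiated once per sub-class:

```python
# Hypothetical prompt template; {} is filled with the sub-class name.
template = "a photo of a {} parked on a rainy runway at dusk"
print(template.format("Boeing 737-800"))
# -> a photo of a Boeing 737-800 parked on a rainy runway at dusk
```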

Results on Full FGVC Datasets

Our method achieves higher accuracy than both traditional and recent generative augmentation methods on all five datasets.

| Type        | Augmentation Method      | Aircraft | CompCars | Cars     | CUB      | DTD      |
|-------------|--------------------------|----------|----------|----------|----------|----------|
| Traditional | No Aug                   | 81.4     | 67.0     | 91.8     | 81.5     | 68.5     |
| Traditional | CAL-Aug                  | *84.9*   | 70.5     | 92.4     | *82.5*   | *69.7*   |
| Traditional | RandAug                  | 83.7     | 72.5     | 92.6     | 81.5     | 69.3     |
| Traditional | CutMix                   | 81.8     | 66.9     | 91.7     | 81.8     | 69.2     |
| Traditional | CAL-Aug + CutMix         | 84.5     | 70.2     | *92.7*   | 82.4     | *69.7*   |
| Traditional | RandAug + CutMix         | 84.0     | *72.6*   | *92.7*   | 81.2     | 69.2     |
| Generative  | Real Guidance            | 84.8     | 73.1     | 92.9     | 82.8     | 68.5     |
| Generative  | ALIA                     | 83.1     | 72.9     | 92.6     | 82.0     | 69.1     |
| Ours        | SaSPA w/o BLIP-Diffusion | **87.4** | 74.8     | 93.7     | 83.0     | 69.8     |
| Ours        | SaSPA                    | 86.6     | **76.2** | **93.8** | **83.2** | **71.9** |
Results on full FGVC datasets. This table presents the test accuracy of various augmentation strategies across five FGVC datasets. The highest value for each dataset is shown in bold, and the highest value achieved by traditional augmentation methods is italicized. ALIA and Real Guidance are both recent generative augmentation methods.

Results on Few-shot Scenarios

Few-shot test accuracy across three FGVC datasets using different augmentation methods.

BibTeX

@misc{michaeli2024advancing,
  title={Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation},
  author={Eyal Michaeli and Ohad Fried},
  year={2024},
  eprint={2406.14551},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}