SaSPA: Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Computer Vision & Graphics Lab, Reichman University

SaSPA enhances fine-grained classification datasets by generating realistic, class-consistent image augmentations from the training set. This consistently leads to significant accuracy improvements.

Qualitative Comparison

Various generative augmentation methods applied to the Aircraft dataset. Text-to-image generation often compromises class fidelity, as seen in the unrealistic aircraft design (e.g., a tail at both ends). Img2Img trades fidelity off against diversity: a lower strength (e.g., 0.5) introduces minimal semantic changes, yielding high fidelity but limited diversity, whereas a higher strength (e.g., 0.75) introduces diversity but also inaccuracies, such as the incorrectly added engine. In contrast, SaSPA achieves both high fidelity and high diversity, which is critical for fine-grained visual classification tasks. D = diversity; F = fidelity.
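To make the strength trade-off concrete, here is a minimal sketch of SDEdit-style Img2Img augmentation using diffusers' StableDiffusionImg2ImgPipeline. The checkpoint, prompt, image path, and strength values are illustrative assumptions, not the exact setup from our experiments.

```python
# Sketch: SDEdit-style Img2Img augmentation (illustrative, not the paper's exact setup).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

real_image = load_image("aircraft.jpg").resize((512, 512))  # hypothetical path
prompt = "a photo of a Boeing 737-800"                      # class-conditioned prompt

# Lower strength keeps more of the real image (higher fidelity, less diversity);
# higher strength allows larger edits (more diversity, but risk of class errors).
for strength in (0.5, 0.75):
    augmented = pipe(prompt=prompt, image=real_image, strength=strength).images[0]
    augmented.save(f"aug_strength_{strength}.png")
```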

Abstract

Fine-grained visual classification (FGVC) involves classifying closely related subcategories. This task is inherently difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting image classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on text-to-image generation or Img2Img methods, such as SDEdit, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation techniques. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data.

SaSPA Overview

For a given FGVC dataset, we generate prompts via GPT-4 based on the meta-class. Each real image undergoes edge detection to provide a structural outline. Each edge map is used multiple times, paired each time with a different prompt and a different subject reference image from the same sub-class, as input to a ControlNet with BLIP-Diffusion as the base model. The generated images are then filtered using a dataset-trained model and CLIP to ensure relevance and quality. A sketch of this pipeline is shown below.
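Below is a minimal sketch of the generation and filtering steps, assuming diffusers' BlipDiffusionControlNetPipeline, controlnet_aux's CannyDetector, and the CLIP model from transformers. The checkpoints, prompts, file paths, Canny thresholds, class list, and filtering threshold are illustrative assumptions; the paper's exact configuration may differ.

```python
# Sketch of SaSPA-style generation + CLIP filtering (illustrative values throughout).
import torch
from controlnet_aux import CannyDetector
from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers.utils import load_image
from transformers import CLIPModel, CLIPProcessor

pipe = BlipDiffusionControlNetPipeline.from_pretrained(
    "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
).to("cuda")

# 1) Structure: edge map (e.g., Canny) of a real training image.
real_image = load_image("real_737.jpg")        # hypothetical path
edges = CannyDetector()(real_image, 30, 70, output_type="pil")

# 2) Subject: a reference image from the same sub-class.
subject_image = load_image("another_737.jpg")  # hypothetical path
subject = "Boeing 737-800"                     # sub-class name

# 3) Prompt: one of several GPT-4-generated, meta-class-level prompts.
prompt = "parked on a rainy runway at dusk"

generated = pipe(
    prompt,         # text prompt
    subject_image,  # subject reference image (BLIP-Diffusion)
    edges,          # ControlNet edge condition
    subject,        # source subject category
    subject,        # target subject category
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]

# 4) Filter: keep images that CLIP still assigns to the intended sub-class
# (the paper additionally filters with a classifier trained on the real data).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
class_names = ["Boeing 737-800", "Airbus A320", "Boeing 747-400"]  # illustrative
inputs = proc(text=class_names, images=generated, return_tensors="pt", padding=True)
probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
if probs[class_names.index(subject)] > 0.5:  # illustrative threshold
    generated.save("augmented_737.png")
```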

Example Augmentations

Example augmentations using our method (SaSPA). The {} placeholder represents the specific sub-class.
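As a concrete (hypothetical) illustration of the placeholder, a template like the one below would be instantiated once per sub-class:

```python
# Hypothetical prompt template; {} is filled with the sub-class name.
template = "a photo of a {} parked on a rainy runway at dusk"
print(template.format("Boeing 737-800"))
# -> a photo of a Boeing 737-800 parked on a rainy runway at dusk
```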

Results on Full FGVC Datasets

Our method achieves higher accuracy than both traditional and recent generative augmentation methods on all five datasets.

| Type        | Augmentation Method      | Aircraft | CompCars | Cars     | CUB      | DTD      |
|-------------|--------------------------|----------|----------|----------|----------|----------|
| Traditional | No Aug                   | 81.4     | 67.0     | 91.8     | 81.5     | 68.5     |
| Traditional | CAL-Aug                  | *84.9*   | 70.5     | 92.4     | *82.5*   | *69.7*   |
| Traditional | RandAug                  | 83.7     | 72.5     | 92.6     | 81.5     | 69.3     |
| Traditional | CutMix                   | 81.8     | 66.9     | 91.7     | 81.8     | 69.2     |
| Traditional | CAL-Aug + CutMix         | 84.5     | 70.2     | *92.7*   | 82.4     | *69.7*   |
| Traditional | RandAug + CutMix         | 84.0     | *72.6*   | *92.7*   | 81.2     | 69.2     |
| Generative  | Real Guidance            | 84.8     | 73.1     | 92.9     | 82.8     | 68.5     |
| Generative  | ALIA                     | 83.1     | 72.9     | 92.6     | 82.0     | 69.1     |
| Ours        | SaSPA w/o BLIP-Diffusion | **87.4** | 74.8     | 93.7     | 83.0     | 69.8     |
| Ours        | SaSPA                    | 86.6     | **76.2** | **93.8** | **83.2** | **71.9** |
Results on full FGVC datasets. This table presents the test accuracy of various augmentation strategies across five FGVC datasets. The highest value for each dataset is shown in bold, and the highest value achieved by traditional augmentation methods is italicized. ALIA and Real Guidance are both recent generative augmentation methods.

Results on Few-shot Scenarios

Few-shot test accuracy across three FGVC datasets using different augmentation methods.

BibTeX

@misc{michaeli2024advancing,
  title={Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation},
  author={Eyal Michaeli and Ohad Fried},
  year={2024},
  eprint={2406.14551},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}