A Study on Text-to-image Model-based Dataset for Image Classification
Dabin Kang1
Chae-yeong Song1
Dong-hun Lee1
Dong-shin Lim1
Sang-hyo Park1*
(School of Computer Science and Engineering, Kyungpook National University, Daegu,
Korea {zefeni, yeongsong, hy05205, dslim, s.park}@knu.ac.kr)
Copyright © 2026 The Institute of Electronics and Information Engineers (IEIE)
Keywords
Image classification, Computer vision, Generative model, Object detection
1. Introduction
Text-to-image models are designed to generate images that correspond to the content
of text descriptions used as inputs. Following the advancements in diffusion models,
the development of text-to-image models has accelerated [1-3]. Imagen [1] utilizes a large-scale transformer-based language model to comprehend text and leverages
the strengths of a diffusion model to produce high-fidelity images. Stable Diffusion
[2] trains a diffusion model in a learned latent space, allowing for efficient image
generation from this space in a single network pass. DALLE-3 [3], based on the diffusion-driven DALLE-2 [4], improves inaccurate captions in the training dataset to generate images that better
match the prompts. The development of such text-to-image models has made it easier
and more convenient to create images that align with user intentions and are of high
quality.
Moreover, as the capability to generate high-quality images has improved, research
has not only focused on the creation of images but also on utilizing these generated
images in deep learning models to further enhance their applications. DIRE (Diffusion
Reconstruction Error) [12] measures the error between input images and their reconstructed counterparts generated
by diffusion models. LGIQA [14] was developed to assess the quality of generated images. GenImage [13] is a benchmark dataset created using diffusion models and generative adversarial
networks to synthesize images for the detection of generated images. However, previous
studies have primarily concentrated on comparing generated images with existing images,
with little effort directed toward utilizing these generated images as an alternative
source of image classification datasets.
Therefore, this study generated an image dataset using text-to-image models and
assessed image classification accuracy on it. The generation of images via text-to-image
models is significantly influenced by the composition of the prompts. Unlike prior
research [19], which crafted text manually and evaluated object detection qualitatively, this study
constructed prompts both with a Large Language Model (LLM), ChatGPT4 [17], and from COCO2017 [10] captions, and used classification for evaluation, significantly
enhancing the objectivity and reliability of the results. Furthermore, while constructing
a generated image dataset for ten animal categories adopted from the COCO2017 dataset,
this study evaluated the appropriateness for the classification task using the CLIP
Score [15] to refine the dataset. Additionally, to mitigate issues such as resolution degradation,
a deblurred dataset, created by applying deblurring to the generated images, was
also developed and tested. This research contributes by constructing datasets
along several axes: the type of generative model, the prompt composition method, and
the refinement strategy, using both qualitative and quantitative measures. Furthermore,
this work is significant as it employs a universal and objective approach by conducting
classification without the consideration of bounding boxes, eliminating subjective
interpretations of accuracy.
The main contributions of this study are summarized as follows:
-
Unlike previous studies, this research utilizes text-to-image models and text inputs
from LLM and COCO2017 captions to generate an image dataset. The generated images
are labeled appropriately to suit image classification tasks.
-
The study introduces a framework that employs the CLIP Score to refine the generated
images by ranking them based on class relevance.
-
A case study is conducted to evaluate the utility of the generated image dataset
for image classification.
2. Related Work
2.1. Text-to-Image Generation Models
Text-to-image models aim to generate images based on textual descriptions. Prominent
examples of such models include DALLE-3 [3] and Stable Diffusion [2]. Both employ diffusion-based methods but differ in their approaches:
DALLE-3 [3] predicts pixel values directly with a diffusion model, while Stable Diffusion
[2] performs diffusion in the latent space of an autoencoder. Images created by DALLE-3 often have a surrealistic quality
owing to its prompt-following capability [3], whereas Stable Diffusion tends to produce more realistic images. The outcome of
these generative models can vary significantly depending on the construction of the
prompt [5,
6]. Therefore, both the type of text-to-image model and the prompt composition affect
the diversity and quality of the images. Many explorations have been made to optimize
prompt composition using conversational LLMs [5] or by adapting user inputs to preferred prompts automatically [6]. This paper aims to use these generative models to construct diverse datasets by
employing various text-to-image models and prompt configurations.
2.2. Data for Image Classification
In image classification, commonly used datasets include CIFAR-100 [7], CIFAR-10 [7], Fashion-MNIST [8], ImageNet [9], and MSCOCO [10]. CIFAR-100 and CIFAR-10 are subsets of the Tiny Images dataset [11], consisting of 60,000 32x32 color images with 100 and 10 classes, respectively. Fashion-MNIST
has 70,000 28x28 grayscale images categorized into 10 fashion classes, while ImageNet
contains approximately 14 million color images of various resolutions, organized into
about 20,000 categories. MSCOCO is composed of 328,000 color images in 80 categories
including animals, people, and other subjects. The most recent update, COCO2017 [10], has expanded the dataset from the previous version, changing the training/validation
split from 83,000/41,000 to 118,000/5,000. Datasets constructed from real-world images
are often categorized into multiple classes based on established criteria. However,
datasets built from generated images focus primarily on distinguishing between generated
and real images. While metrics such as DIRE [12] and LGIQA [14] are used for differentiating between generated and real images, and datasets like
GenImage [13] exist, they were not proposed for use in image classification tasks. Building datasets
from generated images facilitates easy labeling and can help supplement data deficiencies
in specific categories. Thus, this paper presents a dataset constructed solely from
generated images, categorized into real-world classes.
3. Data
In this study, a two-step process was conducted using the text-to-image models as
shown in Fig. 1: first, generating a dataset (Subsection 3.1), and then evaluating it (Subsection
3.2). Additionally, experiments on the image classification task are detailed, with
the datasets divided into base and deblurred variants. The potential for extension
to segmentation tasks is also explored. A preview of the generated dataset can be seen in Fig. 2.
3.1. Data Generation
3.1.1 Environment
This section introduces the models used for constructing and preprocessing the
data. Among representative text-to-image models, this study utilized the diffusion-based
models Stable Diffusion [2] and DALLE-3 [3]. To generate the dataset, Stable Diffusion was run on a T4 GPU and DALLE-3
within Bing Image Creator, a free tool provided by Microsoft. Both models
generated images using their zero-shot capabilities. Restormer [16], used for deblurring, was also run in a T4 GPU environment.
3.1.2 Dataset for image classification
The generated images depend on the type of model used and the composition of the prompt.
Even with the same model and prompt, different images can be generated each time.
Considering these conditions, this paper constructed a dataset by generating multiple
images using two types of models and prompt configurations. Firstly, for the generative
models, this study used Stable Diffusion [2] and DALLE-3 [3], both diffusion-based models, but with different structures. Stable Diffusion utilizes
an autoencoder, while DALLE-3 directly predicts pixel values using a diffusion model.
Notably, DALLE-3 excels at prompt following, so the originality of the prompt
significantly influences the images it generates. This difference becomes more pronounced
when compared to Stable Diffusion, which is particularly strong with realistic prompts.
Fig. 1. Overview of generating and filtering process.
Secondly, to leverage the differences mentioned above, the prompts are structured
using two methods: one utilizing an LLM [17] and the other using captions from the COCO2017 [10] dataset.
To construct a dataset for classification, it is necessary first to define the classes
and incorporate them into the prompt text. Experiments are conducted on ten animal
categories from the COCO2017 dataset for classification: cat, dog, horse, bird, zebra,
sheep, cow, elephant, bear, and giraffe. For constructing prompts with the LLM, these
ten animal classes are defined as $C = \{c_1, c_2, c_3, \dots, c_{10}\}$. ChatGPT4
[17] was used as the LLM, prompted with an instruction such as "Create ten
text prompts for a text-to-image model. The object must be $C$, depicting $C$ performing
human actions realistically." to obtain prompts with greater freedom. In addition
to these original LLM-generated prompts, captions from COCO2017 were used as reliable
prompts that had been written and verified by humans. To align with the objectives
of the classification task, the captions had to be refined so that each specified
only one animal category and featured no overlapping categories. Thus, only those
COCO2017 captions that contained a single, distinct animal category from $C$ were
retained for the experiments. The sentences generated by the LLM or refined from the
captions were then used as inputs for Stable Diffusion and DALLE-3 to generate more
than four images per prompt; a sketch of the caption-based branch follows.
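As an illustration, the following is a minimal sketch of the caption-based branch: it keeps only COCO2017 captions mentioning exactly one of the ten categories and generates four images per surviving prompt with Stable Diffusion through the diffusers library. The checkpoint id, file paths, and the simple word-level matching (plural forms omitted) are assumptions for illustration, not the authors' exact configuration.

```python
import json
import os
import re

import torch
from diffusers import StableDiffusionPipeline

CLASSES = ["cat", "dog", "horse", "bird", "zebra",
           "sheep", "cow", "elephant", "bear", "giraffe"]

def single_class_captions(annotation_file):
    """Keep COCO2017 captions that mention exactly one of the ten categories."""
    with open(annotation_file) as f:
        anns = json.load(f)["annotations"]
    kept = []
    for idx, ann in enumerate(anns):
        words = set(re.findall(r"[a-z]+", ann["caption"].lower()))
        hits = [c for c in CLASSES if c in words]  # no overlapping categories allowed
        if len(hits) == 1:
            kept.append((hits[0], ann["caption"], idx))
    return kept

# Assumed checkpoint; the paper does not state the exact Stable Diffusion version.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for label, prompt, idx in single_class_captions("captions_train2017.json"):
    os.makedirs(f"dataset/{label}", exist_ok=True)
    for i in range(4):  # the paper generates more than four images per prompt
        pipe(prompt).images[0].save(f"dataset/{label}/{idx}_{i}.png")
```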
In CIFAR-100 [7], a representative dataset for image classification, the criterion is that the class an
image belongs to should rank at the top of the possible answers to the question "What
is in this picture?" Accordingly, the generated dataset was refined to this standard.
Firstly, the similarity between images and class labels was checked using the CLIP
Score [15] to ensure images fall into their intended classes. For the previously
defined ten animal classes $C$, the class label is defined as "a photo of a $C$".
An image is considered well generated if the class ranked highest by the CLIP Score
corresponds to the image's intended class. Secondly, images generated contrary
to the intended design were screened qualitatively; images rendered in
black or mixing humans and animals were removed. The process of dataset generation
and filtering can be observed in Fig. 1.
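A sketch of this quantitative filter: an image is kept only if its intended class ranks first among the ten "a photo of a $C$" labels under CLIP similarity. The specific backbone (ViT-B/32) is an assumption; the paper specifies only the label template.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CLASSES = ["cat", "dog", "horse", "bird", "zebra",
           "sheep", "cow", "elephant", "bear", "giraffe"]
LABELS = [f"a photo of a {c}" for c in CLASSES]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed backbone
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def passes_clip_filter(image_path: str, intended_class: str) -> bool:
    """True if the intended class ranks highest among the ten label prompts."""
    inputs = processor(text=LABELS, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image[0]  # similarity to each label
    return CLASSES[logits.argmax().item()] == intended_class
```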
3.1.3 Deblurring dataset for image classification
Stable Diffusion [2] and DALLE-3 [3] are diffusion-based models that regenerate images from
noise, a process that may raise resolution issues. Among the various
methods to address such issues, deblurring, commonly used to remove
blur that frequently occurs in real-world settings, was adopted.
Therefore, this study applied deblurring to existing generated datasets to assess
its impact when performing classification tasks. Using datasets refined through quantitative
and qualitative evaluations, deblurring was additionally carried out using the Restormer
[16] model.
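For reference, a hedged outline of this pass is given below. It assumes the Restormer class and a motion-deblurring checkpoint obtained from the official repository (github.com/swz30/Restormer); the import path, the checkpoint's weight key, and the padding convention are version-dependent assumptions, so treat this as a sketch rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.io import read_image
from torchvision.utils import save_image

from restormer_arch import Restormer  # architecture file copied from the official repo

model = Restormer().cuda().eval()
ckpt = torch.load("motion_deblurring.pth")  # assumed pretrained weights
model.load_state_dict(ckpt["params"])

@torch.no_grad()
def deblur(path_in: str, path_out: str) -> None:
    img = read_image(path_in).float().div(255).unsqueeze(0).cuda()
    h, w = img.shape[-2:]
    # pad to a multiple of 8, which the encoder-decoder architecture expects
    ph, pw = (8 - h % 8) % 8, (8 - w % 8) % 8
    out = model(F.pad(img, (0, pw, 0, ph), mode="reflect"))
    save_image(out[..., :h, :w].clamp(0, 1), path_out)
```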
3.2. Evaluating Data
3.2.1 Setup for evaluating data
The dataset used for classification can be expanded to be applicable to segmentation
or inpainting tasks as well. To explore the scalability of the dataset for broader
applications, this study adopted YOLOv8 [18], which can be universally applied from classification tasks to segmentation. The
evaluation was conducted in an RTX 3090 GPU environment. In the same environment,
different models were utilized depending on the task; YOLOv8 was used for the classification
task, while the YOLOv8-seg model was employed for the segmentation task.
3.2.2 Image classification
Previously, the dataset was generated in various ways depending on the type of generative
model and the composition of the prompts, and it underwent both quantitative and qualitative
refinement processes. The dataset created is composed of ten animal categories and
is divided into three types: data generated using Stable Diffusion [2] with text from an LLM [17], data generated using DALLE-3 [3] with text from an LLM, and data generated using Stable Diffusion with captions from
COCO2017 [10]. To verify if the dataset can be effectively used for classification tasks as intended,
experiments were conducted using YOLOv8 [18]. While YOLOv8 is commonly used for object detection, it can also
perform image classification efficiently and was chosen for its potential
scalability to segmentation tasks. For the image classification task
involving ten animal categories, the detector's classes are restricted to these specific animals
and max_det is set to 1 so that each image receives a single class prediction during evaluation, as sketched below. Fig. 1 demonstrates how the dataset was evaluated.
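A sketch of this evaluation protocol using the ultralytics API: the detector's classes are restricted to the ten animal ids in the 80-class COCO order, max_det=1 keeps a single detection, and that detection (or its absence) becomes the image-level prediction. The checkpoint size and folder layout are assumptions.

```python
from pathlib import Path

from ultralytics import YOLO

ANIMAL_IDS = list(range(14, 24))  # bird through giraffe in the COCO class order

model = YOLO("yolov8x.pt")  # assumed checkpoint size

def classification_accuracy(root: str) -> float:
    """Treat the single surviving detection as the image-level class prediction."""
    correct = total = 0
    for img in Path(root).glob("*/*.png"):  # layout: dataset/<class>/<image>.png
        truth = img.parent.name
        r = model.predict(str(img), classes=ANIMAL_IDS, max_det=1, verbose=False)[0]
        pred = model.names[int(r.boxes.cls[0])] if len(r.boxes) else None
        correct += int(pred == truth)
        total += 1
    return correct / total
```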
3.2.3 Ablation study for deblurring
The deblurred dataset, created by applying deblurring to the generated dataset,
was also subjected to the image classification task under the same conditions as described
in Subsection 3.2.2. This allows a more in-depth assessment of the impact
of deblurring on the classification of generated images.
4. Results
To determine whether the generated and refined datasets could be used practically
for image classification, classification tests were performed separately on the
base generated dataset and on its deblurred counterpart.
The experiments were conducted considering the type of generative model and the composition
of the prompts used to create the images, and the results were tabulated. Images for
ten different animal categories were generated using various models and prompt types,
and the dataset refined through quantitative and qualitative evaluations can be briefly
viewed in Fig. 2. The results of experiments conducted using this dataset are presented in Tables 1-4.
Table 1 presents the results of image classification performed on the generated dataset created
using prompts from a large language model (LLM). The classification accuracy for images
created using Stable Diffusion [2] and DALLE-3 [3] across ten animal categories averaged 99.18% and 87.25%, respectively, confirming
that Stable Diffusion had significantly higher classification accuracy. This trend
is also evident in Table 3, which presents the results under deblurring conditions. These results suggest that
images generated by DALLE-3, which better followed the unique prompts generated by
LLM, were relatively more challenging to classify. Table 2 shows the image classification results for the generated dataset created using captions
as prompts. With an average classification accuracy of 99.22% across ten animal categories,
the performance is comparably high and does not significantly differ from the 99.18%
achieved in Table 1 using LLM text with the same Stable Diffusion model.
Table 1. Image classification result for generated dataset based on LLM text.
| Class | Stable Diffusion Base$\uparrow$ (%) | DALLE-3 Base$\uparrow$ (%) |
|---|---|---|
| cat | 100 | 97.5 |
| dog | 99.28 | 85 |
| horse | 100 | 80 |
| bird | 99 | 95 |
| zebra | 100 | 95 |
| sheep | 100 | 80 |
| cow | 99.13 | 70 |
| elephant | 100 | 100 |
| bear | 94.4 | 75 |
| giraffe | 100 | 95 |
| Average | 99.18 | 87.25 |
Table 2. Image classification result for generated dataset based on caption text.
| Class | Stable Diffusion Base$\uparrow$ (%) |
|---|---|
| cat | 99.4 |
| dog | 98.7 |
| horse | 99.3 |
| bird | 98.6 |
| zebra | 100 |
| sheep | 99.33 |
| cow | 99.5 |
| elephant | 99.9 |
| bear | 98 |
| giraffe | 99.5 |
| Average | 99.22 |
Table 3. Image classification result for generated deblur dataset based on LLM text.
| Class | Stable Diffusion +Deblur$\uparrow$ (%) | DALLE-3 +Deblur$\uparrow$ (%) |
|---|---|---|
| cat | 100 | 97.5 |
| dog | 99.29 | 87.5 |
| horse | 100 | 82.5 |
| bird | 100 | 95 |
| zebra | 100 | 95 |
| sheep | 100 | 77.5 |
| cow | 100 | 70 |
| elephant | 100 | 100 |
| bear | 97.2 | 77.5 |
| giraffe | 100 | 95 |
| Average | 99.65 | 87.75 |
Table 4. Image classification result for generated deblur dataset based on caption text.

| Class | Stable Diffusion +Deblur$\uparrow$ (%) |
|---|---|
| cat | 99.5 |
| dog | 98.5 |
| horse | 99.4 |
| bird | 98.6 |
| zebra | 100 |
| sheep | 99.33 |
| cow | 99.5 |
| elephant | 99.9 |
| bear | 98.2 |
| giraffe | 99.5 |
| Average | 99.3 |
Tables 3 and 4 present the results of evaluations performed with added deblurring, motivated by the
resolution characteristics of the two generative models, which regenerate images from
noise. Compared with the non-deblurred data under the same conditions, Stable Diffusion
showed a 0.47% increase in average accuracy with LLM text and a 0.08% increase with
caption text, while DALLE-3 showed a 0.5% increase in average accuracy with LLM text
but inconsistent per-class results (e.g., sheep dropped from 80% to 77.5%). This study found
that deblurring had a modest positive effect on classification performance for images
generated by Stable Diffusion, which excels at creating realistic images, but no
consistent effect on images generated by DALLE-3, known for its strong
prompt following and originality. This indicates that deblurring techniques are better
suited to realistic images and may degrade individual classes when applied to generated
datasets that strongly follow prompts.
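For transparency, the deltas quoted above can be recomputed directly from the table averages (a sanity check, not part of the original pipeline):

```python
# Average accuracies taken from Tables 1-4.
base   = {"SD/LLM": 99.18, "SD/caption": 99.22, "DALLE-3/LLM": 87.25}
deblur = {"SD/LLM": 99.65, "SD/caption": 99.30, "DALLE-3/LLM": 87.75}
for k in base:
    print(k, round(deblur[k] - base[k], 2))  # +0.47, +0.08, +0.50
```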
Failure cases in the image classification task can be seen in Fig. 3. A failure case occurs when the class assigned qualitatively (the ground truth) and the class
predicted by the YOLOv8 model differ, i.e., the image was misclassified. The
example images visually appear to be a cat, dog, sheep, and zebra,
consistent with the ground truth; however, the model predicts dog, cat, dog, and None,
respectively, indicating either misclassification or the absence of any detected target class.
Fig. 3. Failure cases of classification.
Furthermore, to assess scalability to other tasks, an image segmentation test was
also conducted using the YOLOv8-seg model. Partial results of the segmentation can
be examined visually in Fig. 4. Across the variety of animal categories, generative models, and prompt types,
the example images consistently exhibit high segmentation quality, demonstrating the
dataset's potential for practical use.
Fig. 4. Visual results of segmentation task.
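A minimal sketch of this segmentation check, under the assumption of a standard ultralytics YOLOv8-seg checkpoint and a hypothetical sample path:

```python
import cv2
from ultralytics import YOLO

seg_model = YOLO("yolov8x-seg.pt")  # assumed checkpoint size
result = seg_model.predict("dataset/giraffe/0_0.png", max_det=1)[0]  # hypothetical path
cv2.imwrite("giraffe_seg.png", result.plot())  # save the mask overlay for inspection
```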
5. Conclusion
This study presents a novel framework to construct an image classification dataset
by using images generated from text-to-image models, employing textual inputs from
LLM prompts and COCO2017 captions. Unlike traditional datasets, this research leverages
generative models to produce labeled image data specifically tailored for classification
tasks, offering a cost-effective and scalable solution. The proposed framework refines
the generated dataset by ranking images by class relevance using a quantitative
metric, the CLIP Score. This approach enhances data quality while
reducing uncertainty. Evaluation results indicate that DALLE-3, which faithfully generates
images from original LLM prompts, makes image classification relatively more challenging.
Moreover, the study notes that the impact of deblurring generated images is modest
and inconsistent across models and that a specialized approach is necessary to improve the resolution of
generated images. This research suggests that the construction of the generated dataset
can be universally applied and potentially expanded from classification tasks to segmentation
tasks in future studies. By demonstrating the utility and limitations of using generative
models for dataset generation, this study contributes to advancing methodologies in
image classification.
Acknowledgment
This work was supported by the National Research Foundation of Korea (NRF) grant funded
by the Korea government (MSIT) (RS-2025-00520308).
References
[1] Saharia C., Chan W., Saxena S., 2022, Photorealistic text-to-image diffusion models with deep language understanding, Advances in Neural Information Processing Systems, Vol. 35, pp. 36479-36494.
[2] Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B., 2022, High-resolution image synthesis with latent diffusion models, pp. 10674-10685.
[3] Betker J., Goh G., Jing L., Brooks T., Wang J., Li L., Ouyang L., Zhuang J., Lee J., Guo Y., Manassra W., Dhariwal P., Chu C., Jiao Y., 2023, Improving image generation with better captions.
[4] Ramesh A., Dhariwal P., Nichol A., Chu C., Chen M., 2022, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125.
[5] Brade S., Wang B., Sousa M., Oore S., Grossman T., 2023, Promptify: Text-to-image generation through interactive prompt exploration with large language models.
[6] Hao Y., Chi Z., Dong L., Wei F., 2024, Optimizing prompts for text-to-image generation, Advances in Neural Information Processing Systems, Vol. 36.
[7] Krizhevsky A., Hinton G., 2009, Learning multiple layers of features from tiny images.
[8] Xiao H., Rasul K., Vollgraf R., 2017, Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747.
[9] Deng J., Dong W., Socher R., Li L.-J., Li K., Li F.-F., 2009, ImageNet: A large-scale hierarchical image database.
[10] Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C. L., 2014, Microsoft COCO: Common objects in context, Vol. 13, pp. 740-755.
[11] Torralba A., Fergus R., Freeman W. T., 2008, 80 million tiny images: A large data set for nonparametric object and scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, No. 11, pp. 1958-1970.
[12] Wang Z., Bao J., Zhou W., Wang W., Hu H., Chen H., Li H., 2023, DIRE for diffusion-generated image detection.
[13] Zhu M., Chen H., Yan Q., Huang X., Lin G., Li W., Tu Z., Hu H., Hu J., Wang Y., 2024, GenImage: A million-scale benchmark for detecting AI-generated image, Advances in Neural Information Processing Systems, Vol. 36.
[14] Gu S., Bao J., Chen D., Wen F., 2020, GIQA: Generated image quality assessment, pp. 369-385.
[15] Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G., Sutskever I., 2021, Learning transferable visual models from natural language supervision.
[16] Zamir S. W., Arora A., Khan S., Hayat M., Khan F. S., Yang M.-H., 2022, Restormer: Efficient transformer for high-resolution image restoration.
[17] Achiam J., Adler S., 2023, GPT-4 technical report, arXiv preprint arXiv:2303.08774.
[18] Jocher G., 2023, Ultralytics YOLO (Version 8.0.0) [Computer software].
[19] Kang D., Hong J., Kim J., Song M., Kim D., Park S., 2022, A case study of object detection via generated image using deep learning model based on image generation, pp. 203-206.

Dabin Kang is currently an M.S. student in computer science and engineering at Kyungpook
National University, South Korea. She received her B.S. degree in computer science
and engineering, Kyungpook National University, South Korea. Her research interests
include generative AI, multi-modal AI, text-to-video retrieval, 3D scene understanding,
and knowledge distillation.
Chae-yeong Song is currently an M.S. student in computer science and engineering at
Kyungpook National University, South Korea. She received her B.S. degree in business
administration with a minor in computer science and engineering from Kyungpook National
University. Her research interests include 3D and multi-modal models and model compression.
Dong-hun Lee is currently pursuing an integrated M.S. and Ph.D. degree in the School
of Computer Science and Engineering, Kyungpook National University, South Korea. He
received his B.S. degree from the School of Computer Science and Engineering, Kyungpook
National University, South Korea. His research interests include text-to-video retrieval,
video question answering, 3D scene graph generation, and knowledge distillation.
Dong-shin Lim is currently a Ph.D. candidate in computer science and engineering at
Kyungpook National University, South Korea. He received his B.S. degree in landscape
architecture from Pusan National University and his M.S. degree in information systems
from Yonsei University, Seoul. He is working as a researcher in the AI-Big Data Section
at the Korea Education and Research Information Service (KERIS) in Daegu, South Korea.
His research interests include video compression and video quality enhancement.
Sang-hyo Park received his Ph.D. degree in computer science from Hanyang University,
Seoul, South Korea, in 2017. From 2017 to 2018, he held a Postdoctoral position with
the Intelligent Image Processing Center, Korea Electronics Technology Institute, and
a Research Fellow with the Barun ICT Research Center, Yonsei University in 2018. From
2019 to 2020, he held a Postdoctoral position with the Department of Electronic and
Electrical Engineering, Ewha Womans University. In 2020, he joined the Kyungpook National
University at Daegu, where he is now an Associate Professor of computer science and
engineering. His research interests include VVC, encoding complexity, scene description,
and model compression. He had served as a Co-Editor of Internet Video Coding (IVC,
ISO/IEC 14496-33) for six years.