A Study on Text-to-image Model-based Dataset for Image Classification
Dabin Kang1
Chae-yeong Song1
Dong-hun Lee1
Dong-shin Lim1
Sang-hyo Park1*
(School of Computer Science and Engineering, Kyungpook National University, Daegu,
Korea {zefeni, yeongsong, hy05205, dslim, s.park}@knu.ac.kr)
Copyright © 2026 The Institute of Electronics and Information Engineers (IEIE)
Keywords
Image classification, Computer vision, Generative model, Object detection
1. Introduction
Text-to-image models are designed to generate images that correspond to the content
of text descriptions used as inputs. Following the advancements in diffusion models,
the development of text-to-image models has accelerated [1-3]. Imagen [1] utilizes a large-scale transformer-based language model to comprehend text and leverages
the strengths of a diffusion model to produce high-fidelity images. Stable Diffusion
[2] trains a diffusion model in a learned latent space, allowing for efficient image
generation from this space in a single network pass. DALLE-3 [3], based on the diffusion-driven DALLE-2 [4], improves inaccurate captions in the training dataset to generate images that better
match the prompts. The development of such text-to-image models has made it easier
and more convenient to create images that align with user intentions and are of high
quality.
Moreover, as the capability to generate high-quality images has improved, research
has not only focused on the creation of images but also on utilizing these generated
images in deep learning models to further enhance their applications. DIRE (Diffusion
Reconstruction Error) [12] measures the error between input images and their reconstructed counterparts generated
by diffusion models. LGIQA [14] was developed to assess the quality of generated images. GenImage [13] is a benchmark dataset created using diffusion models and generative adversarial
networks to synthesize images for the detection of generated images. However, previous
studies have primarily concentrated on comparing generated images with existing images,
with little effort directed toward utilizing these generated images as an alternative
source of image classification datasets.
Therefore, this study generated an image dataset using text-to-image models and
assessed image classification accuracy on it. The generation of images via text-to-image
models is significantly influenced by the composition of the prompts. Unlike prior
research [19], which crafted text manually and evaluated object detection qualitatively, this study
constructed prompts both with a Large Language Model (LLM), ChatGPT4 [17], and from COCO2017 [10] captions, and used classification for evaluation, significantly
enhancing the objectivity and reliability of the results. Furthermore, while constructing
a generated image dataset for ten animal categories adopted from the COCO2017 dataset,
this study evaluated the appropriateness for the classification task using the CLIP
Score [15] to refine the dataset. Additionally, to mitigate issues such as resolution degradation,
a deblurred dataset, created by applying deblurring to the generated images, was
also developed and tested. This research contributes by constructing datasets
along several axes: the type of generative model, the prompt composition method, and
the refinement strategy, using both qualitative and quantitative measures. Furthermore,
this work is significant as it employs a universal and objective approach by conducting
classification without the consideration of bounding boxes, eliminating subjective
interpretations of accuracy.
The main contributions of this study are summarized as follows:
-
Unlike previous studies, this research utilizes text-to-image models and text inputs
from LLM and COCO2017 captions to generate an image dataset. The generated images
are labeled appropriately to suit image classification tasks.
-
The study introduces a framework that employs the CLIP Score to refine the generated
images by ranking them based on class relevance.
-
A case study is conducted to evaluate the utility of the generated image dataset
for image classification.
2. Related Work
2.1. Text-to-Image Generation Models
Text-to-image models aim to generate images based on textual descriptions. Prominent
examples of such models include DALLE-3 [3] and Stable Diffusion [2]. Both employ diffusion-based methods but differ in their approaches:
DALLE-3 [3] predicts pixel values directly with a diffusion model, while Stable Diffusion
[2] performs diffusion in the latent space of an autoencoder. Images created by DALLE-3 often have a surrealistic quality
owing to its prompt-following capability [3], whereas Stable Diffusion tends to produce more realistic images. The outcome of
these generative models can vary significantly depending on the construction of the
prompt [5,
6]. Therefore, both the type of text-to-image model and the prompt composition affect
the diversity and quality of the images. Many explorations have been made to optimize
prompt composition using conversational LLMs [5] or by adapting user inputs to preferred prompts automatically [6]. This paper aims to use these generative models to construct diverse datasets by
employing various text-to-image models and prompt configurations.
2.2. Data for Image Classification
In image classification, commonly used datasets include CIFAR-100 [7], CIFAR-10 [7], Fashion-MNIST [8], ImageNet [9], and MSCOCO [10]. CIFAR-100 and CIFAR-10 are subsets of the Tiny Images dataset [11], consisting of 60,000 32x32 color images with 100 and 10 classes, respectively. Fashion-MNIST
has 70,000 28x28 grayscale images categorized into 10 fashion classes, while ImageNet
contains approximately 14 million color images of various resolutions, organized into
about 20,000 categories. MSCOCO is composed of 328,000 color images in 80 categories
including animals, people, and other subjects. The most recent update, COCO2017 [10], has expanded the dataset from the previous version, changing the training/validation
split from 83,000/41,000 to 118,000/5,000. Datasets constructed from real-world images
are often categorized into multiple classes based on established criteria. However,
datasets built from generated images focus primarily on distinguishing between generated
and real images. While metrics such as DIRE [12] and LGIQA [14] are used for differentiating between generated and real images, and datasets like
GenImage [13] exist, they were not proposed for use in image classification tasks. Building datasets
from generated images facilitates easy labeling and can help supplement data deficiencies
in specific categories. Thus, this paper presents a dataset constructed solely from
generated images, categorized into real-world classes.
3. Data
In this study, a two-step process was conducted using the text-to-image models as
shown in Fig. 1: first, generating a dataset (Subsection 3.1), and then evaluating it (Subsection
3.2). Additionally, experiments on the image classification task are detailed, with
the datasets divided into base and deblurred variants. The potential for extension
to segmentation tasks is also explored. A preview of the generated dataset can be seen in Fig. 2.
3.1. Data Generation
3.1.1 Environment
This section introduces the models used for constructing and preprocessing the
data. Among representative text-to-image models, this study utilized the diffusion-based
models Stable Diffusion [2] and DALLE-3 [3]. To generate the dataset, Stable Diffusion was run on a T4 GPU and DALLE-3
within Bing Image Creator, a free tool provided by Microsoft. Both models
generated images using their zero-shot capabilities. Restormer [16], used for deblurring, was also run in a T4 GPU environment.
3.1.2 Dataset for image classification
The generated images depend on the type of model used and the composition of the prompt.
Even with the same model and prompt, different images can be generated each time.
Considering these conditions, this paper constructed a dataset by generating multiple
images using two types of models and prompt configurations. Firstly, for the generative
models, this study used Stable Diffusion [2] and DALLE-3 [3], both diffusion-based models, but with different structures. Stable Diffusion utilizes
an autoencoder, while DALLE-3 directly predicts pixel values using a diffusion model.
Notably, DALLE-3 excels at prompt following, so the originality of the prompt
significantly influences the images it generates. This difference becomes more pronounced
when compared to Stable Diffusion, which is particularly strong with realistic prompts.
Fig. 1. Overview of generating and filtering process.
Secondly, to leverage the differences mentioned above, the prompts are structured
using two methods: one utilizing an LLM [17] and the other using captions from the COCO2017 [10] dataset.
To construct a dataset for classification, it is necessary first to define the classes
and incorporate them into the prompt text. Experiments are conducted on ten animal
categories from the COCO2017 dataset for classification: cat, dog, horse, bird, zebra,
sheep, cow, elephant, bear, and giraffe. For constructing prompts with the LLM, these
ten animal classes are defined as $C = \{c_1, c_2, c_3, \dots, c_{10}\}$. ChatGPT4
[17] was used as the LLM, prompted with an instruction such as "Create ten
text prompts for a text-to-image model. The object must be $C$, depicting $C$ performing
human actions realistically." to obtain prompts with greater freedom. In addition
to these original LLM-generated prompts, captions from COCO2017 were used as reliable
prompts that had been written and verified by humans. To align with the objectives
of the classification task, the captions had to be refined so that each specified
only one animal category and featured no overlapping categories. Thus, only those
COCO2017 captions that contained a single, distinct animal category from $C$ were
retained for the experiments. The sentences generated by the LLM or refined from the
captions were then used as inputs for Stable Diffusion and DALLE-3 to generate more
than four images per prompt; a sketch of the caption-based branch follows.
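As an illustration, the following is a minimal sketch of the caption-based branch: it keeps only COCO2017 captions mentioning exactly one of the ten categories and generates four images per surviving prompt with Stable Diffusion through the diffusers library. The checkpoint id, file paths, and the simple word-level matching (plural forms omitted) are assumptions for illustration, not the authors' exact configuration.

```python
import json
import os
import re

import torch
from diffusers import StableDiffusionPipeline

CLASSES = ["cat", "dog", "horse", "bird", "zebra",
           "sheep", "cow", "elephant", "bear", "giraffe"]

def single_class_captions(annotation_file):
    """Keep COCO2017 captions that mention exactly one of the ten categories."""
    with open(annotation_file) as f:
        anns = json.load(f)["annotations"]
    kept = []
    for idx, ann in enumerate(anns):
        words = set(re.findall(r"[a-z]+", ann["caption"].lower()))
        hits = [c for c in CLASSES if c in words]  # no overlapping categories allowed
        if len(hits) == 1:
            kept.append((hits[0], ann["caption"], idx))
    return kept

# Assumed checkpoint; the paper does not state the exact Stable Diffusion version.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for label, prompt, idx in single_class_captions("captions_train2017.json"):
    os.makedirs(f"dataset/{label}", exist_ok=True)
    for i in range(4):  # the paper generates more than four images per prompt
        pipe(prompt).images[0].save(f"dataset/{label}/{idx}_{i}.png")
```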
In CIFAR-100 [7], a representative dataset for image classification, the criterion is that the class an
image belongs to should rank at the top of the possible answers to the question "What
is in this picture?" Accordingly, the generated dataset was refined to this standard.
Firstly, the similarity between images and class labels was checked using the CLIP
Score [15] to ensure images fall into their intended classes. For the previously
defined ten animal classes $C$, the class label is defined as "a photo of a $C$".
An image is considered well generated if the class ranked highest by the CLIP Score
corresponds to the image's intended class. Secondly, images generated contrary
to the intended design were screened qualitatively; images rendered in
black or mixing humans and animals were removed. The process of dataset generation
and filtering can be observed in Fig. 1.
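A sketch of this quantitative filter: an image is kept only if its intended class ranks first among the ten "a photo of a $C$" labels under CLIP similarity. The specific backbone (ViT-B/32) is an assumption; the paper specifies only the label template.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CLASSES = ["cat", "dog", "horse", "bird", "zebra",
           "sheep", "cow", "elephant", "bear", "giraffe"]
LABELS = [f"a photo of a {c}" for c in CLASSES]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed backbone
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def passes_clip_filter(image_path: str, intended_class: str) -> bool:
    """True if the intended class ranks highest among the ten label prompts."""
    inputs = processor(text=LABELS, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image[0]  # similarity to each label
    return CLASSES[logits.argmax().item()] == intended_class
```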
3.1.3 Deblurring dataset for image classification
Stable Diffusion [2] and DALLE-3 [3] are diffusion-based models that regenerate images from
noise, a process that may raise resolution issues. Among the various
methods to address such issues, deblurring, commonly used to remove
blur that frequently occurs in real-world settings, was adopted.
Therefore, this study applied deblurring to existing generated datasets to assess
its impact when performing classification tasks. Using datasets refined through quantitative
and qualitative evaluations, deblurring was additionally carried out using the Restormer
[16] model.
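For reference, a hedged outline of this pass is given below. It assumes the Restormer class and a motion-deblurring checkpoint obtained from the official repository (github.com/swz30/Restormer); the import path, the checkpoint's weight key, and the padding convention are version-dependent assumptions, so treat this as a sketch rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.io import read_image
from torchvision.utils import save_image

from restormer_arch import Restormer  # architecture file copied from the official repo

model = Restormer().cuda().eval()
ckpt = torch.load("motion_deblurring.pth")  # assumed pretrained weights
model.load_state_dict(ckpt["params"])

@torch.no_grad()
def deblur(path_in: str, path_out: str) -> None:
    img = read_image(path_in).float().div(255).unsqueeze(0).cuda()
    h, w = img.shape[-2:]
    # pad to a multiple of 8, which the encoder-decoder architecture expects
    ph, pw = (8 - h % 8) % 8, (8 - w % 8) % 8
    out = model(F.pad(img, (0, pw, 0, ph), mode="reflect"))
    save_image(out[..., :h, :w].clamp(0, 1), path_out)
```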
3.2. Evaluating Data
3.2.1 Setup for evaluating data
The dataset used for classification can be expanded to be applicable to segmentation
or inpainting tasks as well. To explore the scalability of the dataset for broader
applications, this study adopted YOLOv8 [18], which can be universally applied from classification tasks to segmentation. The
evaluation was conducted in an RTX 3090 GPU environment. In the same environment,
different models were utilized depending on the task; YOLOv8 was used for the classification
task, while the YOLOv8-seg model was employed for the segmentation task.
3.2.2 Image classification
Previously, the dataset was generated in various ways depending on the type of generative
model and the composition of the prompts, and it underwent both quantitative and qualitative
refinement processes. The dataset created is composed of ten animal categories and
is divided into three types: data generated using Stable Diffusion [2] with text from an LLM [17], data generated using DALLE-3 [3] with text from an LLM, and data generated using Stable Diffusion with captions from
COCO2017 [10]. To verify if the dataset can be effectively used for classification tasks as intended,
experiments were conducted using YOLOv8 [18]. While YOLOv8 is commonly used for object detection, it can also
perform image classification efficiently and was chosen for its potential
scalability to segmentation tasks. For the image classification task
involving ten animal categories, the detector's classes are restricted to these specific animals
and max_det is set to 1 so that each image receives a single class prediction during evaluation, as sketched below. Fig. 1 demonstrates how the dataset was evaluated.
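A sketch of this evaluation protocol using the ultralytics API: the detector's classes are restricted to the ten animal ids in the 80-class COCO order, max_det=1 keeps a single detection, and that detection (or its absence) becomes the image-level prediction. The checkpoint size and folder layout are assumptions.

```python
from pathlib import Path

from ultralytics import YOLO

ANIMAL_IDS = list(range(14, 24))  # bird through giraffe in the COCO class order

model = YOLO("yolov8x.pt")  # assumed checkpoint size

def classification_accuracy(root: str) -> float:
    """Treat the single surviving detection as the image-level class prediction."""
    correct = total = 0
    for img in Path(root).glob("*/*.png"):  # layout: dataset/<class>/<image>.png
        truth = img.parent.name
        r = model.predict(str(img), classes=ANIMAL_IDS, max_det=1, verbose=False)[0]
        pred = model.names[int(r.boxes.cls[0])] if len(r.boxes) else None
        correct += int(pred == truth)
        total += 1
    return correct / total
```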
3.2.3 Ablation study for deblurring
The deblurred dataset, created by applying deblurring to the generated dataset,
was also subjected to the image classification task under the same conditions as described
in Subsection 3.2.2. This allows a more in-depth assessment of the impact
of deblurring on the classification of generated images.
4. Results
To determine whether the generated and refined datasets could be used practically
for image classification, classification tests were performed separately on the
base generated dataset and on its deblurred counterpart.
The experiments were conducted considering the type of generative model and the composition
of the prompts used to create the images, and the results were tabulated. Images for
ten different animal categories were generated using various models and prompt types,
and the dataset refined through quantitative and qualitative evaluations can be briefly
viewed in Fig. 2. The results of experiments conducted using this dataset are presented in Tables 1-4.
Table 1 presents the results of image classification performed on the generated dataset created
using prompts from a large language model (LLM). The classification accuracy for images
created using Stable Diffusion [2] and DALLE-3 [3] across ten animal categories averaged 99.18% and 87.25%, respectively, confirming
that Stable Diffusion had significantly higher classification accuracy. This trend
is also evident in Table 3, which presents the results under deblurring conditions. These results suggest that
images generated by DALLE-3, which better followed the unique prompts generated by
LLM, were relatively more challenging to classify. Table 2 shows the image classification results for the generated dataset created using captions
as prompts. With an average classification accuracy of 99.22% across ten animal categories,
the performance is comparably high and does not significantly differ from the 99.18%
achieved in Table 1 using LLM text with the same Stable Diffusion model.
Table 1. Image classification result for generated dataset based on LLM text.
| Class | Stable Diffusion Base$\uparrow$ (%) | DALLE-3 Base$\uparrow$ (%) |
|---|---|---|
| cat | 100 | 97.5 |
| dog | 99.28 | 85 |
| horse | 100 | 80 |
| bird | 99 | 95 |
| zebra | 100 | 95 |
| sheep | 100 | 80 |
| cow | 99.13 | 70 |
| elephant | 100 | 100 |
| bear | 94.4 | 75 |
| giraffe | 100 | 95 |
| Average | 99.18 | 87.25 |
Table 2. Image classification result for generated dataset based on caption text.
| Class | Stable Diffusion Base$\uparrow$ (%) |
|---|---|
| cat | 99.4 |
| dog | 98.7 |
| horse | 99.3 |
| bird | 98.6 |
| zebra | 100 |
| sheep | 99.33 |
| cow | 99.5 |
| elephant | 99.9 |
| bear | 98 |
| giraffe | 99.5 |
| Average | 99.22 |
Table 3. Image classification result for generated deblur dataset based on LLM text.
| Class | Stable Diffusion +Deblur$\uparrow$ (%) | DALLE-3 +Deblur$\uparrow$ (%) |
|---|---|---|
| cat | 100 | 97.5 |
| dog | 99.29 | 87.5 |
| horse | 100 | 82.5 |
| bird | 100 | 95 |
| zebra | 100 | 95 |
| sheep | 100 | 77.5 |
| cow | 100 | 70 |
| elephant | 100 | 100 |
| bear | 97.2 | 77.5 |
| giraffe | 100 | 95 |
| Average | 99.65 | 87.75 |
Table 4. Image classification result for generated deblur dataset based on caption text.

| Class | Stable Diffusion +Deblur$\uparrow$ (%) |
|---|---|
| cat | 99.5 |
| dog | 98.5 |
| horse | 99.4 |
| bird | 98.6 |
| zebra | 100 |
| sheep | 99.33 |
| cow | 99.5 |
| elephant | 99.9 |
| bear | 98.2 |
| giraffe | 99.5 |
| Average | 99.3 |
Tables 3 and 4 present the results of evaluations performed with added deblurring, motivated by the
resolution characteristics of the two generative models, which regenerate images from
noise. Compared with the non-deblurred data under the same conditions, Stable Diffusion
showed a 0.47% increase in average accuracy with LLM text and a 0.08% increase with
caption text, while DALLE-3 showed a 0.5% increase in average accuracy with LLM text
but inconsistent per-class results (e.g., sheep dropped from 80% to 77.5%). This study found
that deblurring had a modest positive effect on classification performance for images
generated by Stable Diffusion, which excels at creating realistic images, but no
consistent effect on images generated by DALLE-3, known for its strong
prompt following and originality. This indicates that deblurring techniques are better
suited to realistic images and may degrade individual classes when applied to generated
datasets that strongly follow prompts.
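For transparency, the deltas quoted above can be recomputed directly from the table averages (a sanity check, not part of the original pipeline):

```python
# Average accuracies taken from Tables 1-4.
base   = {"SD/LLM": 99.18, "SD/caption": 99.22, "DALLE-3/LLM": 87.25}
deblur = {"SD/LLM": 99.65, "SD/caption": 99.30, "DALLE-3/LLM": 87.75}
for k in base:
    print(k, round(deblur[k] - base[k], 2))  # +0.47, +0.08, +0.50
```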
Failure cases in the image classification task can be seen in Fig. 3. A failure case occurs when the class assigned qualitatively (the ground truth) and the class
predicted by the YOLOv8 model differ, i.e., the image was misclassified. The
example images visually appear to be a cat, dog, sheep, and zebra,
consistent with the ground truth; however, the model predicts dog, cat, dog, and None,
respectively, indicating either misclassification or the absence of any detected target class.
Fig. 3. Failure cases of classification.
Furthermore, to assess scalability to other tasks, an image segmentation test was
also conducted using the YOLOv8-seg model. Partial results of the segmentation can
be examined visually in Fig. 4. Across the variety of animal categories, generative models, and prompt types,
the example images consistently exhibit high segmentation quality, demonstrating the
dataset's potential for practical use.
Fig. 4. Visual results of segmentation task.
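A minimal sketch of this segmentation check, under the assumption of a standard ultralytics YOLOv8-seg checkpoint and a hypothetical sample path:

```python
import cv2
from ultralytics import YOLO

seg_model = YOLO("yolov8x-seg.pt")  # assumed checkpoint size
result = seg_model.predict("dataset/giraffe/0_0.png", max_det=1)[0]  # hypothetical path
cv2.imwrite("giraffe_seg.png", result.plot())  # save the mask overlay for inspection
```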
5. Conclusion
This study presents a novel framework to construct an image classification dataset
by using images generated from text-to-image models, employing textual inputs from
LLM prompts and COCO2017 captions. Unlike traditional datasets, this research leverages
generative models to produce labeled image data specifically tailored for classification
tasks, offering a cost-effective and scalable solution. The proposed framework refines
the generated dataset by ranking images by class relevance using a quantitative
metric, the CLIP Score. This approach enhances data quality while
reducing uncertainty. Evaluation results indicate that DALLE-3, which faithfully generates
images from original LLM prompts, makes image classification relatively more challenging.
Moreover, the study notes that the impact of deblurring generated images is modest
and inconsistent across models and that a specialized approach is necessary to improve the resolution of
generated images. This research suggests that the construction of the generated dataset
can be universally applied and potentially expanded from classification tasks to segmentation
tasks in future studies. By demonstrating the utility and limitations of using generative
models for dataset generation, this study contributes to advancing methodologies in
image classification.
Acknowledgment
This work was supported by the National Research Foundation of Korea (NRF) grant funded
by the Korea government (MSIT) (RS-2025-00520308).
References
[1] Saharia C., Chan W., Saxena S., 2022, Photorealistic text-to-image diffusion models with deep language understanding, Advances in Neural Information Processing Systems, Vol. 35, pp. 36479-36494.
[2] Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B., 2022, High-resolution image synthesis with latent diffusion models, pp. 10674-10685.
[3] Betker J., Goh G., Jing L., Brooks T., Wang J., Li L., Ouyang L., Zhuang J., Lee J., Guo Y., Manassra W., Dhariwal P., Chu C., Jiao Y., 2023, Improving image generation with better captions.
[4] Ramesh A., Dhariwal P., Nichol A., Chu C., Chen M., 2022, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125.
[5] Brade S., Wang B., Sousa M., Oore S., Grossman T., 2023, Promptify: Text-to-image generation through interactive prompt exploration with large language models.
[6] Hao Y., Chi Z., Dong L., Wei F., 2024, Optimizing prompts for text-to-image generation, Advances in Neural Information Processing Systems, Vol. 36.
[7] Krizhevsky A., Hinton G., 2009, Learning multiple layers of features from tiny images.
[8] Xiao H., Rasul K., Vollgraf R., 2017, Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747.
[9] Deng J., Dong W., Socher R., Li L.-J., Li K., Li F.-F., 2009, ImageNet: A large-scale hierarchical image database.
[10] Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C. L., 2014, Microsoft COCO: Common objects in context, Vol. 13, pp. 740-755.
[11] Torralba A., Fergus R., Freeman W. T., 2008, 80 million tiny images: A large data set for nonparametric object and scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, No. 11, pp. 1958-1970.
[12] Wang Z., Bao J., Zhou W., Wang W., Hu H., Chen H., Li H., 2023, DIRE for diffusion-generated image detection.
[13] Zhu M., Chen H., Yan Q., Huang X., Lin G., Li W., Tu Z., Hu H., Hu J., Wang Y., 2024, GenImage: A million-scale benchmark for detecting AI-generated image, Advances in Neural Information Processing Systems, Vol. 36.
[14] Gu S., Bao J., Chen D., Wen F., 2020, GIQA: Generated image quality assessment, pp. 369-385.
[15] Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G., Sutskever I., 2021, Learning transferable visual models from natural language supervision.
[16] Zamir S. W., Arora A., Khan S., Hayat M., Khan F. S., Yang M.-H., 2022, Restormer: Efficient transformer for high-resolution image restoration.
[17] Achiam J., Adler S., 2023, GPT-4 technical report, arXiv preprint arXiv:2303.08774.
[18] Jocher G., 2023, Ultralytics YOLO (Version 8.0.0) [Computer software].
[19] Kang D., Hong J., Kim J., Song M., Kim D., Park S., 2022, A case study of object detection via generated image using deep learning model based on image generation, pp. 203-206.

Dabin Kang is currently an M.S. student in computer science and engineering at Kyungpook
National University, South Korea. She received her B.S. degree in computer science
and engineering, Kyungpook National University, South Korea. Her research interests
include generative AI, multi-modal AI, text-to-video retrieval, 3D scene understanding,
and knowledge distillation.
Chae-yeong Song is currently an M.S. student in computer science and engineering at
Kyungpook National University, South Korea. She received her B.S. degree in business
administration with a minor in computer science and engineering from Kyungpook National
University. Her research interests include 3D and multi-modal models and model compression.
Dong-hun Lee is currently pursuing an integrated M.S. and Ph.D. degree in the School
of Computer Science and Engineering, Kyungpook National University, South Korea. He
received his B.S. degree from the School of Computer Science and Engineering, Kyungpook
National University, South Korea. His research interests include text-to-video retrieval,
video question answering, 3D scene graph generation, and knowledge distillation.
Dong-shin Lim is currently a Ph.D. candidate in computer science and engineering at
Kyungpook National University, South Korea. He received his B.S. degree in landscape
architecture from Pusan National University and his M.S. degree in information systems
from Yonsei University, Seoul. He is working as a researcher in the AI-Big Data Section
at the Korea Education and Research Information Service (KERIS) in Daegu, South Korea.
His research interests include video compression and video quality enhancement.
Sang-hyo Park received his Ph.D. degree in computer science from Hanyang University,
Seoul, South Korea, in 2017. From 2017 to 2018, he held a Postdoctoral position with
the Intelligent Image Processing Center, Korea Electronics Technology Institute, and
a Research Fellow with the Barun ICT Research Center, Yonsei University in 2018. From
2019 to 2020, he held a Postdoctoral position with the Department of Electronic and
Electrical Engineering, Ewha Womans University. In 2020, he joined the Kyungpook National
University at Daegu, where he is now an Associate Professor of computer science and
engineering. His research interests include VVC, encoding complexity, scene description,
and model compression. He had served as a Co-Editor of Internet Video Coding (IVC,
ISO/IEC 14496-33) for six years.