With the advancement of Large Language Models like ChatGPT and generative models like DALL-E, and the rise in popularity of generative Artificial Intelligence, generating content like a human is no longer a dream. Question answering, code completion, content generation from textual descriptions, and image creation from both text and images are all now feasible, and the quality of AI-generated output increasingly rivals human work. OpenAI’s well-known chatbot, ChatGPT, is built on the GPT-3.5 transformer architecture and has seen widespread adoption. The latest version, GPT-4, is multimodal, unlike GPT-3.5, which restricts ChatGPT to textual inputs.
The quality of generative content has increased significantly as a result of the development of diffusion models. Because of these advances, AI-Generated Content (AIGC) platforms such as DALL-E, Stability AI, Runway, and Midjourney have become increasingly popular, as they let users create high-quality images from text prompts written in natural language. Despite progress in multimodal understanding, vision-language models still struggle to interpret generated visuals: compared with real data, synthetic images display greater variability in both content and style, making them considerably harder for models to understand properly.
To address these issues, a team of researchers has introduced JourneyDB, a large-scale dataset curated specifically for multimodal visual understanding of generated images. JourneyDB contains 4 million unique, high-quality generated images created from a wide range of text prompts. The dataset covers both content and style interpretation and aims to offer a comprehensive resource for training and evaluating models on their ability to comprehend generated images.
The proposed benchmark includes the following four tasks.
Prompt inversion – Prompt inversion requires the model to recover the text prompt the user used to generate an image. This tests the model’s comprehension of both the content and the style of the generated image.
Style retrieval – In style retrieval, the model must identify and retrieve similar generated images based on their stylistic attributes. This assesses the model’s proficiency in discerning stylistic nuances within generated images (a rough sketch of such a retrieval setup appears after this list of tasks).
Image captioning – In image captioning, the model is tasked with generating descriptive captions that accurately represent the content of the generated image, which evaluates its ability to comprehend and express the visual elements of generated content in natural language.
Visual Question Answering – In Visual Question Answering (VQA), the model must provide accurate answers to questions about a generated image, demonstrating that it comprehends both the visual content and the style and can produce relevant responses to the given questions.
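To make the style-retrieval task more concrete, here is a minimal sketch, assuming an off-the-shelf CLIP model from Hugging Face Transformers, of ranking generated images by embedding similarity to a query image as a crude stand-in for stylistic similarity. This is only an illustration of how retrieval over generated images can be framed, not the method or evaluation protocol used in the JourneyDB paper, and the file names below are hypothetical.

```python
# Hypothetical sketch: retrieve generated images similar to a query image
# using CLIP embeddings and cosine similarity. Not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return L2-normalized CLIP image embeddings for a list of image files."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical file names; replace with real generated images.
gallery_paths = ["gen_001.png", "gen_002.png", "gen_003.png"]
gallery = embed_images(gallery_paths)
query = embed_images(["query.png"])

# Cosine similarity between the query and every gallery image,
# then rank the gallery from most to least similar.
scores = (query @ gallery.T).squeeze(0)
for idx in scores.argsort(descending=True):
    print(gallery_paths[idx], float(scores[idx]))
```

In practice, a benchmark like this would also need ground-truth style labels to judge whether the retrieved images truly share a style, which is part of what JourneyDB’s curated annotations are meant to provide.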
The team gathered 4,692,751 image-text prompt pairs and divided them into three sets: a training set, a validation set, and a test set. For evaluation, the team conducted extensive experiments using the benchmark dataset. The results showed that current state-of-the-art multimodal models do not perform as well on JourneyDB as they do on real datasets, but fine-tuning them on the proposed dataset greatly improved their performance.
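For readers curious how such image-prompt pairs are typically organized, here is a minimal, hypothetical sketch of shuffling pairs and partitioning them into train/validation/test splits. The record format, file names, and split ratios are illustrative assumptions, not the actual JourneyDB release, whose splits are defined by the authors.

```python
# Hypothetical sketch: partition image-text prompt pairs into splits.
import random

# Each record pairs a generated image file with the prompt that produced it.
pairs = [
    {"image": "img_000001.png", "prompt": "a watercolor fox in a misty forest"},
    {"image": "img_000002.png", "prompt": "cyberpunk city street at night, neon"},
    # ... millions more pairs in a dataset of JourneyDB's scale
]

random.seed(0)
random.shuffle(pairs)

n = len(pairs)
n_train = int(0.98 * n)   # assumed ratios, for illustration only
n_val = int(0.01 * n)

train_set = pairs[:n_train]
val_set = pairs[n_train:n_train + n_val]
test_set = pairs[n_train + n_val:]

print(len(train_set), len(val_set), len(test_set))
```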
Check out the Paper, Code, and Project.