Google Research Introduces SPAE: An AutoEncoder For Multimodal Generation With Frozen Large Language Models (LLMs)

Large Language Models (LLMs) have rapidly gained enormous popularity thanks to their extraordinary capabilities in Natural Language Processing and Natural Language Understanding. This recent development in the field of Artificial Intelligence has revolutionized the way humans and computers interact. The most prominent recent model, developed by OpenAI, is the well-known ChatGPT. Built on GPT’s transformer architecture, it is known for holding realistic, human-like conversations and handles everything from question answering and content generation to code completion, machine translation, and text summarization.

LLMs are exceptional at capturing deep conceptual knowledge about the world through their lexical embeddings. However, researchers are still working to make frozen LLMs capable of handling visual-modality tasks when given the right visual representations as input. One proposed approach uses a vector quantizer that maps an image into the token space of a frozen LLM. This translates the image into a language the LLM can comprehend, enabling its generative abilities to be used for conditional image understanding and generation without any training on image-text pairs.
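Conceptually, such a quantizer assigns each image feature to its nearest word embedding in the frozen LLM's vocabulary, so the image becomes a sequence of real words. The following is a minimal illustrative sketch, not the paper's actual implementation; the function name, toy vocabulary, and random features are all assumptions made for demonstration.

```python
import numpy as np

def quantize_to_vocab(image_features, vocab_embeddings, vocab_words):
    """Map each image feature vector to its nearest word embedding
    from a frozen LLM's vocabulary (by cosine similarity)."""
    # Normalize both sides so a dot product equals cosine similarity.
    feats = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    vocab = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    nearest = np.argmax(feats @ vocab.T, axis=1)  # closest word index per patch
    return [vocab_words[i] for i in nearest]

# Toy example: 4 "patch" features against a 5-word vocabulary of 8-dim embeddings.
rng = np.random.default_rng(0)
vocab_words = ["dog", "sky", "tree", "car", "grass"]
vocab_embeddings = rng.standard_normal((5, 8))
image_features = rng.standard_normal((4, 8))
tokens = quantize_to_vocab(image_features, vocab_embeddings, vocab_words)
print(tokens)  # a sequence of real words the frozen LLM can consume
```

Because the output tokens are ordinary vocabulary words, they can be dropped directly into a text prompt for the frozen model.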

To facilitate this cross-modal task, a team of researchers from Google Research and Carnegie Mellon University has introduced the Semantic Pyramid AutoEncoder (SPAE), an autoencoder for multimodal generation with frozen large language models. SPAE produces a lexical word sequence that carries rich semantics while retaining fine details for signal reconstruction. The team combines an autoencoder architecture with a hierarchical pyramid structure, and, in contrast to previous approaches, SPAE encodes images into an interpretable discrete latent space, i.e., words.

The pyramid-shaped representation of the SPAE tokens has multiple scales: the lower layers of the pyramid prioritize appearance representations that capture fine details for image reconstruction, while the upper layers contain semantically central concepts. This design enables dynamic adjustment of the token length to suit different tasks, using fewer tokens for understanding tasks and more tokens for generation tasks. The model is trained independently, without backpropagating through any language model.
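The dynamic token length can be pictured as taking only the top layers of the pyramid for understanding tasks and the full pyramid for generation. This sketch is purely illustrative, assuming a hypothetical three-level pyramid of word tokens; the layer contents and function name are invented for the example.

```python
def tokens_for_task(pyramid, num_layers):
    """Flatten the top `num_layers` of a token pyramid into one sequence.
    The top layer holds a few semantic tokens; deeper layers add appearance
    detail, so generation tasks take more layers than understanding tasks."""
    selected = pyramid[:num_layers]
    return [tok for layer in selected for tok in layer]

# Hypothetical 3-level pyramid: core semantics first, fine detail last.
pyramid = [
    ["dog"],                                      # level 1: central concept
    ["brown", "grass", "sunny"],                  # level 2: mid-level attributes
    ["ear", "tail", "blur", "shadow", "fence"],   # level 3: appearance detail
]
print(tokens_for_task(pyramid, 1))  # understanding task: ['dog']
print(tokens_for_task(pyramid, 3))  # generation task: all 9 tokens
```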

To evaluate the effectiveness of SPAE, the team conducted experiments on image understanding tasks, including image classification, image captioning, and visual question answering. The results demonstrate how well LLMs can handle visual modalities, with compelling applications such as content generation, design support, and interactive storytelling. The researchers also used in-context denoising methods to illustrate the image-generation capabilities of LLMs.

The team summarizes their contributions as follows:

This work provides a method for directly generating visual content through in-context learning with a frozen language model that has been trained only on language tokens.

The Semantic Pyramid AutoEncoder (SPAE) is proposed to generate interpretable representations of both semantic concepts and fine-grained details. The multilingual linguistic tokens the tokenizer produces have customizable lengths, giving it greater flexibility and adaptability in capturing and communicating the subtleties of visual information.

A progressive prompting method is also introduced, enabling the seamless integration of language and visual modalities and allowing the generation of comprehensive, coherent cross-modal sequences with improved quality and accuracy.

The approach outperforms the state-of-the-art few-shot image classification accuracy under identical in-context conditions by an absolute margin of 25%.
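The progressive prompting idea can be sketched as asking the frozen LLM for one pyramid scale at a time, feeding each completed scale back into the context before requesting the next. This is a conceptual sketch only: the loop structure, function names, and the stub standing in for a real frozen LLM are all assumptions, not the paper's API.

```python
def progressive_generate(llm, prompt, num_scales):
    """Request tokens from a frozen LLM one pyramid scale at a time,
    appending each completed scale to the context for the next call."""
    sequence = []
    for scale in range(1, num_scales + 1):
        context = prompt + " ".join(sequence)
        sequence.append(llm(context, scale))  # tokens for this scale
    return sequence

# Stub standing in for a real frozen model (illustrative only).
def toy_llm(context, scale):
    return f"<scale-{scale}-tokens>"

out = progressive_generate(toy_llm, "Generate image tokens: ", 3)
print(out)  # ['<scale-1-tokens>', '<scale-2-tokens>', '<scale-3-tokens>']
```

The design point is that later scales are conditioned on earlier ones, so coarse semantics are fixed before fine appearance detail is generated.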

In conclusion, SPAE is a significant breakthrough in bridging the gap between language models and visual understanding. It shows the remarkable potential of LLMs in handling cross-modal tasks. 

Check out the Paper.


The post Google Research Introduces SPAE: An AutoEncoder For Multimodal Generation With Frozen Large Language Models (LLMs) appeared first on MarkTechPost.