Google has introduced an artificial intelligence system called Imagen, a model that can produce highly realistic images from brief text prompts. Imagen turns simple phrases into vivid pictures, such as a tiny cactus wearing a straw hat and neon sunglasses in the Sahara desert, or a Pomeranian perched on a king's throne wearing a crown, flanked by two tiger guards.
Key to this capability is T5, the Text-To-Text Transfer Transformer, a model first showcased in 2020 that originally mapped text inputs to text outputs. In Imagen, a frozen T5 encoder is repurposed: the embeddings it produces from a textual description condition the image-generation process.
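To make this step concrete, here is a minimal sketch of prompt encoding with a frozen T5 encoder using the Hugging Face transformers library. The small t5-small checkpoint stands in for the far larger T5-XXL encoder Imagen actually uses, and the code is illustrative rather than Imagen's own pipeline.

```python
# Minimal sketch: encoding a prompt with a frozen T5 encoder.
# Imagen uses T5-XXL; "t5-small" is used here only to keep the example light.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()  # frozen: the text encoder is not trained on image data

prompt = "a tiny cactus wearing a straw hat and neon sunglasses in the Sahara"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # (1, sequence_length, hidden_size) embeddings; instead of being decoded
    # back to text, they condition the downstream image-generation models.
    text_embeddings = encoder(**tokens).last_hidden_state

print(text_embeddings.shape)  # e.g. torch.Size([1, n_tokens, 512]) for t5-small
```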
The base model generates images at a modest resolution of 64 by 64 pixels. The system then scales up step by step, first to 256 by 256 and then to 1024 by 1024, using diffusion-based super-resolution models that refine the image progressively at each stage.
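The control flow of that cascade can be sketched in a few lines. In the toy example below, a placeholder denoiser stands in for a trained network; every function here is a hypothetical stand-in under the assumption of a simple noise-then-denoise sampling loop, not Imagen's actual code.

```python
# Illustrative sketch of a three-stage cascade: a base diffusion model
# produces a 64x64 image, then two diffusion super-resolution stages
# upscale it to 256x256 and 1024x1024, conditioned on the text embedding.
import numpy as np

def denoise_step(x, t, text_emb, cond_image):
    # Placeholder for a trained U-Net denoiser; it merely shrinks the noise
    # so the loop terminates with a well-formed array.
    return 0.95 * x

def sample_diffusion(shape, text_emb, cond_image=None, steps=50, rng=None):
    """Toy ancestral-sampling loop: start from noise, repeatedly denoise."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)
    for t in reversed(range(steps)):
        x = denoise_step(x, t, text_emb, cond_image)
    return x

def upsample(image, factor):
    # Nearest-neighbor upsampling feeds the low-res result into the
    # next super-resolution stage as conditioning.
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

text_emb = np.zeros(512)                               # stand-in embedding
base = sample_diffusion((64, 64, 3), text_emb)         # stage 1: 64x64
mid = sample_diffusion((256, 256, 3), text_emb,
                       cond_image=upsample(base, 4))   # stage 2: 256x256
final = sample_diffusion((1024, 1024, 3), text_emb,
                         cond_image=upsample(mid, 4))  # stage 3: 1024x1024
print(final.shape)  # (1024, 1024, 3)
```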
Imagen is designed to outperform other text-to-image tools, such as VQGAN+CLIP and DALL-E 2, in detail and fidelity, delivering sharper textures and more accurate spatial relationships in the generated scenes.
To evaluate its capabilities, Google built DrawBench, a benchmark that measures how faithfully text descriptions are translated into visuals. It probes factors such as how well the composition matches the description, the fidelity of the rendering, the ability to count objects, and the spatial relationships among scene elements. The developers also highlight architectural progress, including a more efficient U-Net design that reduces the compute and memory demands of image synthesis.
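To make the evaluation protocol concrete, here is a hedged sketch of a DrawBench-style pairwise comparison harness, in which human raters see one prompt plus an image from each of two models and vote on alignment and fidelity. The category names, prompts, and collect_vote function below are illustrative assumptions, not the benchmark's actual contents.

```python
# Sketch of a DrawBench-style head-to-head evaluation between two models.
from collections import Counter

# Illustrative prompt categories; DrawBench's real categories and prompts differ.
PROMPTS = {
    "counting":  ["three red apples on a wooden table"],
    "spatial":   ["a cup to the left of a laptop"],
    "rendering": ["a storefront with the word 'diffusion' on its sign"],
}

def collect_vote(prompt, image_a, image_b, criterion):
    """Stand-in for a human rating interface; returns 'A', 'B', or 'tie'."""
    return "tie"  # replace with an actual rater interface

def run_drawbench_style_eval(images_a, images_b):
    tallies = {c: Counter() for c in ("alignment", "fidelity")}
    for category, prompts in PROMPTS.items():
        for prompt in prompts:
            for criterion in tallies:
                vote = collect_vote(prompt, images_a[prompt],
                                    images_b[prompt], criterion)
                tallies[criterion][vote] += 1
    return tallies

# Demo with placeholder images, one per prompt and model.
images_a = {p: None for ps in PROMPTS.values() for p in ps}
images_b = dict(images_a)
print(run_drawbench_style_eval(images_a, images_b))
```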
Still under development
At present, Imagen is neither open source nor widely accessible. The decision reflects concerns about potential misuse and the need to manage safety risks as the technology evolves. Training on large-scale data scraped from the internet accelerates algorithmic progress, but many aspects of the system still require refinement.
The developers acknowledge that this data may not fully reflect real-world diversity and can carry biased or harmful connotations tied to stereotypes and marginalized groups. Although filters were applied to screen the data during initial experiments, the dataset draws on large, loosely curated web collections that still pose safety challenges.
Elsewhere, OpenAI has introduced DALL-E 2, a rival system that also converts text into realistic images. It can additionally edit existing images on written instruction, removing or altering elements while accounting for shadows, reflections, and textures, illustrating the expanding range of capabilities in text-to-image generation.