Kandinsky 2.1 is the latest advance from Sber AI: a neural network that generates high-quality images in seconds from natural-language prompts. It can also create new visuals by blending and adjusting multiple input images according to detailed descriptions, fill in missing regions of a scene (inpainting), and extend pictures beyond their original borders (outpainting) on an effectively unlimited canvas. The bank’s press service announced the rollout.
Developed and trained by Sber AI researchers in collaboration with scientists from the AIRI Artificial Intelligence Institute, Kandinsky 2.1 builds on the solid groundwork of its predecessor. The project relied on a joint dataset assembled by Sber AI and SberDevices, designed to push the model toward broader capabilities and practical utility. The new version integrates fresh data and refined training strategies to expand its real-world applications while preserving the strengths that came from earlier work.
The model inherits the weights of the prior Kandinsky iteration, which was trained on a wide foundation of image and text pairs. The original training set included a billion paired examples, with a significant portion composed of high resolution text-image pairs to sharpen the model’s ability to render fine details and complex textures. This broader exposure informs Kandinsky 2.1’s versatility across different genres and subjects.
User feedback during development played a pivotal role. The team pursued bold ideas and tested unconventional concepts to craft a flexible, capable solution that can tackle a wide range of tasks with performance levels approaching the best global equivalents. Alexander Vedyakhin, a senior executive at Sberbank, highlighted how this technology could transform creative workflows, industrial design processes, and public access to sophisticated AI tools. The focus remained on delivering tangible benefits for business users and the general public alike.
A new autoencoder model accompanies Kandinsky 2.1, serving as the decoder for vector representations of images. This upgrade contributes to higher-resolution rendering and smoother detail transfer, enabling sharper outputs at larger scales. Additionally, Kandinsky 2.1 uses a dual-input strategy that combines traditional text prompts with an image-based representation produced by a CLIP-like mechanism. The system first grasps the requested concept through textual cues, then passes an image-level representation of that concept to the main generative network, which turns it into a vivid, coherent visual result.
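The dual-input scheme described above can be sketched schematically: a prior maps a CLIP-like text embedding to a CLIP-like image embedding, and the generative decoder is conditioned on both. The toy code below is a conceptual illustration only, not Sber AI's actual implementation; every function, dimension, and weight matrix here is an invented stand-in for the real (diffusion-based) components.

```python
import numpy as np

# Conceptual sketch of Kandinsky 2.1-style dual conditioning.
# All names and dimensions are illustrative assumptions, not the real model.
rng = np.random.default_rng(0)
TEXT_DIM, IMG_DIM, LATENT = 64, 64, 16

# Stand-ins for frozen encoders / the learned prior, stubbed as linear maps.
W_text = rng.standard_normal((TEXT_DIM, TEXT_DIM)) / np.sqrt(TEXT_DIM)
W_prior = rng.standard_normal((TEXT_DIM, IMG_DIM)) / np.sqrt(TEXT_DIM)

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for a CLIP-like text encoder: hash tokens into a vector."""
    v = np.zeros(TEXT_DIM)
    for tok in prompt.lower().split():
        v[hash(tok) % TEXT_DIM] += 1.0
    return v @ W_text

def prior(text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: text embedding -> image-based embedding."""
    return np.tanh(text_emb @ W_prior)

def decode(text_emb: np.ndarray, img_emb: np.ndarray, steps: int = 10) -> np.ndarray:
    """Stand-in for the generative decoder: iteratively refine a noisy
    latent conditioned on BOTH the text and image embeddings."""
    cond = np.concatenate([text_emb, img_emb])
    W_dec = rng.standard_normal((cond.size, LATENT)) / np.sqrt(cond.size)
    x = rng.standard_normal(LATENT)        # start from pure noise
    for _ in range(steps):                 # crude iterative refinement
        x = 0.9 * x + 0.1 * (cond @ W_dec)
    return x

text_emb = encode_text("a red cat in the style of Kandinsky")
img_emb = prior(text_emb)                  # image-based representation
latent = decode(text_emb, img_emb)         # dual-input conditioning
print(latent.shape)                        # -> (16,)
```

In the real system the prior and decoder are diffusion models and the final latent is turned into pixels by the new autoencoder decoder; the sketch only captures the data flow, in which the text signal reaches the generator twice, once directly and once via the image-embedding pathway.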
By leveraging this combined representation, Kandinsky 2.1 better aligns generated visuals with user intent, producing images that reflect nuanced descriptions while maintaining artistic coherence. The system is designed to handle a broad range of styles, from lifelike depictions to abstract compositions, and to adapt to the varying requirements of interior designers, digital artists, advertisers, and researchers. As a result, creators can explore new ideas quickly, iterate on concepts with minimal effort, and access high fidelity imagery suitable for diverse applications.