- Stable Cascade is a newly released text-to-image model, available under a non-commercial license and built on the Würstchen architecture; its three-stage design makes it easy to train and fine-tune on consumer-grade hardware.
- By generating in a hierarchically compressed latent space, the model achieves high-quality outputs while keeping hardware requirements low.
- Stable Cascade extends beyond standard text-to-image generation, offering image variation, image-to-image generation, and training scripts for ControlNet and LoRA, showcasing its flexibility and versatility.
Stable Cascade is an innovative text-to-image model that achieves high-quality output by working in a highly compressed latent space, using a unique three-stage architecture that reduces hardware requirements. The model and associated training scripts are available on the Stability GitHub page, supporting further customization and experimentation.
A new era in text-to-image generation
Stable Cascade, built on the Würstchen architecture, is an innovative text-to-image model released in a research preview with a non-commercial license. This model features a unique three-stage approach, simplifying the training and fine-tuning process on consumer hardware. The release includes checkpoints, inference scripts, and additional training scripts for ControlNet and LoRA, all available on the Stability GitHub page. This model is also accessible for inference via the diffusers library. By focusing on a hierarchical compression of images, Stable Cascade achieves high-quality outputs with a highly compressed latent space, setting new benchmarks for quality and efficiency in text-to-image generation.
Unveiling the technical details
Stable Cascade’s architecture comprises three stages, each playing a crucial role in generating high-quality images. Stage C, the latent generator, transforms user inputs into compact 24×24 latents. These latents are then passed to Stages B and A, the latent decoders, which reconstruct the full-resolution image; together they play a role similar to the VAE in Stable Diffusion but achieve much higher compression. This decoupling allows additional training or fine-tuning, including ControlNets and LoRAs, to be done on Stage C alone, reducing training costs by a factor of 16 compared to training a similarly sized Stable Diffusion model. The modular approach ensures efficient training and inference, making it a significant advancement in the field.
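A quick back-of-envelope comparison makes the compression concrete. The 24×24 Stage C latent comes from the description above; the 1024×1024 output resolution and the factor-8 Stable Diffusion VAE are illustrative assumptions:

```python
# Back-of-envelope latent sizes. The 24x24 Stage C grid is from the text;
# the 1024x1024 output size and factor-8 SD VAE are illustrative assumptions.
image_side = 1024
cascade_latent_side = 24                        # Stage C latent grid
sd_vae_factor = 8                               # typical SD VAE downscaling
sd_latent_side = image_side // sd_vae_factor    # 128

# Spatial compression: ~42.7x for Stable Cascade vs 8x for the SD VAE.
cascade_factor = image_side / cascade_latent_side

# The diffusion model works on ~28x fewer latent positions than SD would.
area_ratio = (sd_latent_side ** 2) / (cascade_latent_side ** 2)

print(f"{cascade_factor:.1f}x compression, {area_ratio:.1f}x fewer latents")
```

Operating the expensive diffusion stage on a grid this small is what drives the training-cost reduction described above.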
Beyond text-to-image generation
Stable Cascade extends its capabilities beyond standard text-to-image generation, offering image variation and image-to-image generation. By extracting image embeddings from a given image using CLIP, the model can generate multiple variations of the original, showcasing its flexibility and versatility. Additionally, the release includes training and fine-tuning scripts for ControlNet and LoRA, enabling users to experiment further with the architecture. Specific ControlNets for inpainting and outpainting are also provided, highlighting the model’s potential for creative and practical applications.
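The variation workflow hinges on the CLIP image-embedding step. A hedged sketch of that step using the transformers library (the checkpoint name is an assumption for illustration; Stable Cascade's own pipeline handles this conditioning internally):

```python
# Sketch: extract a CLIP image embedding of the kind a prior stage can be
# conditioned on for variations. Checkpoint name is an illustrative assumption.
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModelWithProjection


def embed_image(path: str) -> torch.Tensor:
    model = CLIPVisionModelWithProjection.from_pretrained(
        "openai/clip-vit-large-patch14"
    )
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        # Projected image embedding, shape (1, projection_dim).
        return model(**inputs).image_embeds


if __name__ == "__main__":
    print(embed_image("input.png").shape)
```

Feeding such an embedding to the latent generator in place of (or alongside) a text prompt is what turns a single source image into a family of variations.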
Community and non-commercial focus
Stable Cascade is currently available for non-commercial use only. However, Stability AI offers other image models for commercial purposes through their Membership page or Developer Platform. The release encourages community engagement and experimentation, with all training and inference code available on the Stability GitHub page. Stability AI invites users to stay updated on their progress through social media platforms like Twitter, Instagram, LinkedIn, and their Discord Community. This approach fosters a collaborative environment, aiming to advance the field of text-to-image generation while maintaining accessibility and innovation.