A powerful, large-scale, multimodal Text-to-Image generation model, fully open-source and commercial-grade.
gen-image3.0 is a state-of-the-art AI model that generates high-quality images from textual descriptions. It uses a unified autoregressive multimodal framework that models text and images jointly, turning detailed textual descriptions into visually compelling outputs.
In simple terms:
- What it is: An AI model that turns text prompts into images.
- Who made it: Your team (inspired by open-source breakthroughs).
- Why it matters: It’s large, intelligent, accurate in rendering details (including text), and fully open-source for commercial use.
- Unified Multimodal Architecture: Integrates text and image modalities for contextually rich image generation.
- Largest Open-Source Image Generation Model: ~80 billion parameters with a Mixture-of-Experts (MoE) design (13B active per token).
- World-Knowledge Reasoning: Can intelligently fill missing details using common sense.
- Ultra-Long Prompt Understanding: Handles text prompts over 1,000 characters for fine-grained scene control.
- Accurate Text Rendering: Supports precise generation of titles, logos, annotations, and multilingual text.
- Commercial Use: Fully open-source for developers and businesses (some geographic restrictions may apply).
Due to its size, gen-image3.0 requires high-end hardware:
- GPU Memory: ≥3 × 80GB VRAM (4 × 80GB recommended, e.g., NVIDIA A100/H100)
- Disk Space: 170GB for model weights
- Operating System: Linux with CUDA 12.8
- Python: 3.12+
- PyTorch: 2.7.1 with CUDA 12.8
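To sanity-check the environment before downloading the weights, a short script like the one below can report the GPU count and per-device memory (this is a convenience sketch, not part of the repository):

```python
import torch

# Report the PyTorch/CUDA build and the memory of each visible GPU.
print(f"PyTorch {torch.__version__}, built for CUDA {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
```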
- Install Dependencies:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
- Optional Performance Optimizations:
pip install flash-attn==2.8.3 --no-build-isolation
pip install flashinfer-python

⚡ Tip: Ensure the PyTorch CUDA version matches the system CUDA version. The first inference with FlashInfer may be slower (~10 min) due to kernel compilation.
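To confirm the optional backends are actually importable before enabling them, a small check like the following can help (not part of the repository; flash_attn and flashinfer are the usual module names installed by these packages):

```python
import importlib.util

# If a backend is missing, keep the defaults: attn_implementation="sdpa", moe_impl="eager".
for pkg in ("flash_attn", "flashinfer"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not installed'}")
```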
from transformers import AutoModelForCausalLM
model_id = "./gen-image3"
kwargs = dict(
attn_implementation="sdpa", # "flash_attention_2" if installed
trust_remote_code=True,
torch_dtype="auto",
device_map="auto",
moe_impl="eager", # "flashinfer" if installed
)
model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)
prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")git clone https://github.com/kantkrishan0206-crypto/gen-image3.0.git
cd gen-image3.0
# Download weights from HuggingFace or your storage
# Run demo
python3 run_image_gen.py --model-id ./gen-image3 --prompt "Your prompt here"

Command-line arguments:
| Argument | Description | Default |
|---|---|---|
| --prompt | Input text prompt | (Required) |
| --model-id | Model path | (Required) |
| --attn-impl | Attention type: sdpa / flash_attention_2 | sdpa |
| --moe-impl | MoE type: eager / flashinfer | eager |
| --image-size | Image resolution | auto |
| --save | Output image path | image.png |
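For example, a full invocation that overrides the defaults might look like the following (the flag names come from the table above; the value format accepted by --image-size is an assumption, since its default is auto):

```bash
python3 run_image_gen.py \
  --model-id ./gen-image3 \
  --prompt "A lighthouse on a cliff at sunset, photorealistic, warm light" \
  --attn-impl flash_attention_2 \
  --moe-impl flashinfer \
  --image-size 1024x1024 \
  --save lighthouse.png
```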
- Install Gradio:
pip install "gradio>=4.21.0"
- Configure Environment:
export MODEL_ID="path/to/your/model"
export GPUS="0,1,2,3"
export HOST="0.0.0.0"
export PORT="443"
- Launch Demo:
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
- Open Web Interface:
http://localhost:443
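run_app.sh launches the bundled demo. For orientation, a minimal Gradio wrapper around the Python API from the Quick Start might look like the sketch below (this is not the repository's app code; it assumes generate_image returns a PIL image, as the earlier example implies):

```python
import os
import gradio as gr
from transformers import AutoModelForCausalLM

model_id = os.environ.get("MODEL_ID", "./gen-image3")
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model.load_tokenizer(model_id)

def txt2img(prompt: str):
    # generate_image is the method shown in the Quick Start; assumed to return a PIL image.
    return model.generate_image(prompt=prompt, stream=True)

demo = gr.Interface(
    fn=txt2img,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Image(label="Generated image"),
)
demo.launch(
    server_name=os.environ.get("HOST", "0.0.0.0"),
    server_port=int(os.environ.get("PORT", "443")),
)
```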
- Manual Prompts: Describe the main subject first, then the environment, style, perspective, lighting, and technical parameters (see the example below).
- System Prompts: Prebuilt templates can automatically enhance user inputs for better results.
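A prompt written in that order (subject, then environment, style, perspective, lighting, technical parameters) might look like this illustrative example:

```text
A weathered fisherman repairing a blue wooden boat (subject), on a pebble beach beside a small harbor village (environment), impressionist oil-painting style with visible brushstrokes (style), low-angle view from the waterline (perspective), warm golden-hour light with long soft shadows (lighting), high detail, 4K, 3:2 aspect ratio (technical parameters).
```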
- Machine Evaluation (SSAE): Scores images against text prompts using semantic alignment metrics.
- Human Evaluation (GSB): Professionals rate image quality using the Good/Same/Bad (GSB) pairwise comparison method.
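GSB results are commonly summarized as the share of Good votes minus the share of Bad votes; the sketch below shows that aggregation (the exact formula used for gen-image3.0 is not specified here, so treat this as an assumption):

```python
from collections import Counter

def gsb_score(ratings):
    """Net win rate from pairwise Good/Same/Bad ratings of model A vs. model B."""
    counts = Counter(ratings)           # ratings is an iterable of "G", "S", "B"
    total = sum(counts.values()) or 1
    return (counts["G"] - counts["B"]) / total

# Example: 120 Good, 200 Same, 80 Bad -> (120 - 80) / 400 = +0.10 in favor of model A
print(gsb_score(["G"] * 120 + ["S"] * 200 + ["B"] * 80))
```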
We thank the open-source community for its invaluable contributions:
- 🤗 Transformers
- 🎨 Diffusers
- 🌐 HuggingFace
- ⚡ FlashAttention
- 🚀 FlashInfer
⭐ If you like this project, give it a star!
