
🎨 gen-image3.0

A large-scale, multimodal Text-to-Image generation model — fully open-source and commercial-grade.


📖 Introduction

gen-image3.0 is a state-of-the-art AI model that generates high-quality images from textual descriptions. It uses a unified autoregressive multimodal framework, meaning it deeply understands both text and image data to create visually compelling outputs.

In plain terms:

  • What it is: An AI model that turns text prompts into images.
  • Who made it: This project's team, building on recent open-source breakthroughs.
  • Why it matters: It is large, reasons about the world, renders fine details (including text) accurately, and is fully open-source for commercial use.

✨ Key Features

  • Unified Multimodal Architecture: Integrates text and image modalities for contextually rich image generation.
  • Largest Open-Source Image Generation Model: ~80 billion parameters with a Mixture-of-Experts (MoE) design that activates only ~13B parameters per token (see the routing sketch after this list).
  • World-Knowledge Reasoning: Can intelligently fill missing details using common sense.
  • Ultra-Long Prompt Understanding: Handles text prompts over 1,000 characters for fine-grained scene control.
  • Accurate Text Rendering: Supports precise generation of titles, logos, annotations, and multilingual text.
  • Commercial Use: Fully open-source for developers and businesses (some geographic restrictions may apply).
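
The MoE design above can be illustrated with a toy top-k routing layer: a router scores all experts for each token, but only the top k actually run, so the active parameter count stays far below the total. This is a generic sketch, not gen-image3.0's actual architecture code; every name and dimension here is made up for illustration.

import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy top-k Mixture-of-Experts layer: each token runs only k of the experts."""

    def __init__(self, dim: int = 64, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # per-token expert scores
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # top-k experts per token
        weights = weights.softmax(dim=-1)                   # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(ToyMoE()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])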

💻 System Requirements

Due to its size, gen-image3.0 requires high-end hardware:

  • GPU Memory: ≥3 × 80GB VRAM (4 × 80GB recommended, e.g., NVIDIA A100/H100)
  • Disk Space: 170GB for model weights
  • Operating System: Linux with CUDA 12.8
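
Before downloading the weights, a quick PyTorch check (a minimal sketch, assuming torch is already installed) can confirm that enough GPUs and VRAM are visible:

import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # total_memory is reported in bytes; the model needs >= 3 GPUs with ~80 GB each
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")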

📦 Environment Setup

  1. Python: 3.12+
  2. PyTorch: 2.7.1 with CUDA 12.8
  3. Install Dependencies:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
  4. Optional Performance Optimizations:
pip install flash-attn==2.8.3 --no-build-isolation
pip install flashinfer-python

⚡ Tip: Ensure PyTorch CUDA version matches system CUDA. First inference with FlashInfer may be slower (~10 min) due to kernel compilation.
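
To verify the match, print the CUDA version your PyTorch build targets and compare it with the output of nvidia-smi or nvcc --version:

import torch

print("PyTorch:", torch.__version__)           # expect 2.7.1+cu128
print("Built with CUDA:", torch.version.cuda)  # expect 12.8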


🚀 Usage

Quick Start with Transformers

from transformers import AutoModelForCausalLM

model_id = "./gen-image3"

kwargs = dict(
    attn_implementation="sdpa",    # "flash_attention_2" if installed
    trust_remote_code=True,        # the model class ships with the checkpoint
    torch_dtype="auto",
    device_map="auto",             # shard weights across all visible GPUs
    moe_impl="eager",              # "flashinfer" if installed
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)     # tokenizer is loaded via the custom model class

prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")
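
The same call extends naturally to multiple prompts. In this sketch only prompt= and stream= come from the example above; the loop and file names are illustrative:

prompts = [
    "A watercolor lighthouse at dawn",
    "A poster titled 'Open Source Day' in bold lettering",
]
for i, p in enumerate(prompts):
    img = model.generate_image(prompt=p, stream=True)  # same API as above
    img.save(f"image_{i}.png")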

Local Installation & Usage

git clone https://github.com/kantkrishan0206-crypto/gen-image3.0.git
cd gen-image3.0
# Download weights from HuggingFace or your storage
# Run demo
python3 run_image_gen.py --model-id ./gen-image3 --prompt "Your prompt here"

Command-line arguments:

| Argument     | Description                              | Default    |
|--------------|------------------------------------------|------------|
| --prompt     | Input text prompt                        | (Required) |
| --model-id   | Model path                               | (Required) |
| --attn-impl  | Attention type: sdpa / flash_attention_2 | sdpa       |
| --moe-impl   | MoE type: eager / flashinfer             | eager      |
| --image-size | Image resolution                         | auto       |
| --save       | Output image path                        | image.png  |
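
For example, combining both optional accelerators with a custom output path (the prompt and file name are illustrative; the flags are exactly those listed above):

python3 run_image_gen.py \
    --model-id ./gen-image3 \
    --prompt "A neon sign that reads 'OPEN' on a rainy night street" \
    --attn-impl flash_attention_2 \
    --moe-impl flashinfer \
    --save neon.png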

🎨 Interactive Gradio Demo

  1. Install Gradio (quote the version spec so the shell does not treat >= as a redirection):
pip install "gradio>=4.21.0"
  2. Configure Environment:
export MODEL_ID="path/to/your/model"
export GPUS="0,1,2,3"
export HOST="0.0.0.0"
export PORT="443"
  3. Launch Demo:
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
  4. Open the Web Interface: http://localhost:443

📝 Prompt Guide

  • Manual Prompts: Describe main subject first, then environment, style, perspective, lighting, and technical parameters.
  • System Prompts: Prebuilt templates can automatically enhance user inputs for better results.
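
An illustrative prompt following that order (subject, then environment, style, perspective, lighting, and technical parameters) might read:

"A golden retriever puppy sitting in a sunlit meadow of wildflowers, vintage oil painting style, shot from a low angle, warm golden-hour backlighting, highly detailed, 4K."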

📊 Evaluation

  • Machine Evaluation (SSAE): Scores images against text prompts using semantic alignment metrics.
  • Human Evaluation (GSB): Professional raters compare image quality using the Good/Same/Bad (GSB) method.
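
GSB comparisons are often summarized as a net win rate. A minimal sketch of one common convention, which is an assumption here rather than this project's documented formula:

def gsb_net_win_rate(good: int, same: int, bad: int) -> float:
    """Net win rate under one common GSB convention: (Good - Bad) / total votes."""
    return (good - bad) / (good + same + bad)

print(f"{gsb_net_win_rate(120, 100, 80):.1%}")  # 13.3%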


🙏 Acknowledgements

We thank the open-source community for invaluable contributions:

  • 🤗 Transformers
  • 🎨 Diffusers
  • 🌐 HuggingFace
  • ⚡ FlashAttention
  • 🚀 FlashInfer

⭐ If you like this project, give it a star!