# Stable Diffusion — full corpus


<!-- ===== stable-diffusion/README.md ===== -->

# LLM Wiki

An open-source template for building LLM-powered knowledge bases, following [Andrej Karpathy's "LLM Wiki" pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).

You provide raw sources. The LLM reads them, writes structured wiki pages, cross-links everything, and maintains it over time. You never edit the wiki directly — you curate sources and ask questions.

## How It Works

The system has three layers:

```
raw/              Sources you collect (articles, transcripts, notes, PDFs)
wiki/             LLM-written & maintained pages (summaries, concepts, entities, syntheses)
CLAUDE.md         Schema that tells the LLM how to structure everything
```

Three operations drive the workflow:

| Operation | Trigger | What happens |
|-----------|---------|--------------|
| **Ingest** | "ingest raw/my-source.txt" | LLM reads the source, creates a summary page, creates/updates concept and entity pages, adds cross-links, updates the index and log |
| **Query** | Ask any question | LLM searches the wiki, synthesizes an answer with citations, optionally creates a synthesis page for novel insights |
| **Lint** | "lint" or "health check" | LLM audits all pages for orphans, contradictions, missing links, incomplete sections, and low-confidence claims — fixes what it can, reports the rest |
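
One of the lint checks above, orphan detection, can be sketched in a few lines. This is a minimal illustration, assuming pages cross-link with `[[page]]` or `[[page|label]]` wikilinks; `find_orphans` and the sample page names are hypothetical and not part of the template.

```python
import re

# Matches [[target]] and [[target|label]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def find_orphans(pages, roots=("index",)):
    """Return the names of pages that no other page links to.

    `pages` maps a page name (e.g. "concepts/sdxl") to its markdown body.
    `roots` are entry points (assumed here: the index) that are never
    reported as orphans.
    """
    linked = set()
    for name, body in pages.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target != name:  # a self-link does not rescue a page
                linked.add(target)
    return sorted(p for p in pages if p not in linked and p not in roots)


pages = {
    "index": "- [[concepts/sdxl]] and [[concepts/prompting|Prompting]]",
    "concepts/sdxl": "See [[concepts/prompting]].",
    "concepts/prompting": "Related: [[concepts/sdxl]].",
    "concepts/draft": "Nothing links here yet.",
}
orphans = find_orphans(pages)
print(orphans)
```

In practice the LLM performs this audit itself; a script like this is only a way to cross-check its report.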

## Quick Start

1. **Clone this repo**
   ```bash
   git clone https://github.com/YOUR_USERNAME/llm-wiki.git my-knowledge-base
   cd my-knowledge-base
   ```

2. **Customize CLAUDE.md** for your domain
   - Update the Purpose section with your topic
   - Replace the placeholder tagging taxonomy with your own categories
   - Adjust confidence level descriptions if needed
   - Everything else (workflows, page formats, linking rules) works as-is

3. **Drop sources into `raw/`**
   - Text files, transcripts, articles, notes — any plain text
   - These are immutable once added; the LLM never modifies them

4. **Tell the LLM to ingest**
   ```
   ingest raw/my-first-source.txt
   ```
   The LLM will create summary pages, concept pages, entity pages, cross-links, and update the index.

5. **Ask questions**
   ```
   What are the key differences between X and Y?
   ```
   The LLM answers from the wiki, citing specific pages.

6. **Run health checks**
   ```
   lint
   ```
   The LLM audits the wiki and fixes issues.

## Directory Structure

```
.
├── CLAUDE.md                      # Schema — the LLM's instructions
├── raw/                           # Your source documents (immutable)
└── wiki/
    ├── index.md                   # Master catalog of all pages
    ├── log.md                     # Append-only activity log
    ├── dashboard.md               # Dataview dashboard (Obsidian)
    ├── analytics.md               # Charts View analytics (Obsidian)
    ├── flashcards.md              # Spaced repetition cards
    ├── summaries/                 # One page per source document
    ├── concepts/                  # Concept and framework pages
    ├── entities/                  # People, tools, organizations, etc.
    ├── syntheses/                 # Cross-cutting analyses and comparisons
    ├── journal/                   # Research/session journal entries
    │   └── template.md            # Journal entry template
    └── presentations/             # Marp slide decks
```

## Enhancements

This template includes several extras beyond the core wiki pattern:

### Dataview Dashboard (`wiki/dashboard.md`)
Live queries that surface low-confidence pages, recent updates, concepts by tag, and pages with the most sources. Requires the [Dataview](https://github.com/blacksmithgu/obsidian-dataview) Obsidian plugin.

### Charts View Analytics (`wiki/analytics.md`)
Visual analytics with pie charts, bar charts, and word clouds. Requires the [Charts View](https://github.com/caronchen/obsidian-chartsview-plugin) Obsidian plugin.

### Mermaid Diagrams
Use Mermaid code blocks in any wiki page to create flowcharts, sequence diagrams, or concept maps. Native support in Obsidian and GitHub.

### Marp Slides (`wiki/presentations/`)
Create slide decks from markdown using [Marp](https://marp.app/). Drop presentation files in this directory.

### Research Journal (`wiki/journal/`)
Track your research sessions, experiments, or applied work with the included template. The LLM can reference journal entries when answering queries.

### Spaced Repetition (`wiki/flashcards.md`)
Flashcards in the format used by the [Spaced Repetition](https://github.com/st3v3nmw/obsidian-spaced-repetition) Obsidian plugin. Ask the LLM to generate flashcards from any wiki page.

### MCP Server
This repo works with Claude Code's MCP server capabilities. Point an MCP-compatible client at this repo and the LLM can read/write the wiki programmatically.

## Customizing for Your Domain

The schema in `CLAUDE.md` is domain-agnostic. To adapt it:

1. **Purpose** — Describe your knowledge domain in one paragraph
2. **Tagging taxonomy** — Replace placeholder categories with your own (e.g., for a cooking KB: `cuisine`, `technique`, `ingredient`, `equipment`)
3. **Confidence levels** — Adjust the descriptions to match your domain's evidence standards
4. **Entity types** — Update the entity page description to match what entities mean in your domain (people, tools, companies, etc.)
5. **Journal template** — Customize `wiki/journal/template.md` for your workflow

Everything else — page format, linking conventions, workflows, rules — is universal and works across domains.

## Example Domains

This template works for any knowledge-intensive topic:

- **Research notes** — papers, experiments, methodologies
- **Book analysis** — themes, characters, author techniques
- **Competitive analysis** — companies, products, market trends
- **Course notes** — lectures, readings, key concepts
- **Personal development** — frameworks, habits, book summaries
- **Technical documentation** — APIs, architectures, design patterns
- **Hobby deep-dives** — any subject you want to master

## License

MIT


<!-- ===== stable-diffusion/wiki/index.md ===== -->

---
title: "Stable Diffusion KB — Master Index"
type: index
updated: 2026-06-23
diffusers_version: "0.38.0"
---

# Stable Diffusion KB — Master Index

**Domain:** Stable Diffusion — open-weight latent text-to-image diffusion models (SD 1.5 / SDXL / SD3.x) and how to run, prompt, optimize, and fine-tune them.
**Corpus:** 106 provenance-stamped sources in `raw/` — the Hugging Face Diffusers docs (llms.txt-curated, the de-facto SD toolkit), the AUTOMATIC1111 web UI wiki, and Stability AI / Hugging Face model cards.
**Pages:** 16 (11 concepts · 2 entities · 1 summary · 2 syntheses) — the user ring plus the operator/developer ring.

## Concepts (core ideas + operational how-tos)

- [[concepts/what-is-stable-diffusion]] — latent diffusion explained; the model families (SD 1.4/1.5, SD 2.x, SDXL, SD3/SD3.5) and how they differ
- [[concepts/installation-and-setup]] — install `diffusers`/PyTorch, load a pipeline, first generation, device selection (CUDA/MPS)
- [[concepts/text-to-image]] — `DiffusionPipeline`/`AutoPipelineForText2Image`, `guidance_scale`, `num_inference_steps`, seeds and reproducibility
- [[concepts/image-to-image-and-inpainting]] — img2img, inpainting (and outpainting/depth2img), the `strength` parameter
- [[concepts/prompting]] — prompt construction, negative prompts, emphasis/weighting syntax
- [[concepts/sdxl]] — SDXL base+refiner two-stage, micro-conditioning, SDXL-Turbo (few-step)
- [[concepts/controlnet-and-adapters]] — ControlNet, T2I-Adapter, IP-Adapter, InstructPix2Pix
- [[concepts/schedulers-and-samplers]] — swapping schedulers, Karras sigmas, the step/quality tradeoff, LCM
- [[concepts/loras-for-inference]] — loading and blending LoRA adapters at inference (`load_lora_weights`, `set_adapters`, `fuse_lora`)
- [[concepts/optimization-and-memory]] — memory (offload/slicing/tiling) and speed (xFormers/attention backends, `torch.compile`, fp16/bf16); A1111 `--medvram`/`--xformers`
- [[concepts/fine-tuning]] — LoRA vs DreamBooth vs Textual Inversion vs Custom Diffusion: what to train and how

## Entities

- [[entities/diffusers-library]] — the Hugging Face `diffusers` library: `DiffusionPipeline`, models + schedulers, `from_pretrained`, AutoPipeline
- [[entities/automatic1111-webui]] — AUTOMATIC1111 stable-diffusion-webui: features and key launch flags

## Summaries

- [[summaries/model-and-feature-catalog]] — map of SD model versions (with Hub IDs) plus the larger Diffusers reference space this wiki maps rather than pages (other pipelines, optimization backends, quantization)

## Syntheses (decisions & casebooks)

- [[syntheses/choosing-model-and-pipeline]] — which SD model (quality vs speed vs license vs VRAM) and which task pipeline
- [[syntheses/troubleshooting-and-quality]] — symptom → cause → fix: CUDA OOM, black/NaN images, poor quality, non-reproducible seeds, slow generation

## Statistics

- **Total pages**: 16
- **Concepts**: 11 · **Entities**: 2 · **Summaries**: 1 · **Syntheses**: 2
- **Sources ingested**: 106 (raw/, immutable)
- **High confidence**: 14 · **Medium confidence**: 2 · **Low confidence**: 0

## Coverage notes

Strong: running SD with Diffusers (txt2img/img2img/inpaint/ControlNet/LoRA), schedulers, prompting, optimization, and the three main fine-tuning methods; SDXL and SD3.5 facts from model cards; A1111 web UI orientation. Spine is Diffusers `v0.38.0`; SD evolves by model release, so freshness = source fetch date (2026-06-23) and model claims are cited to their model cards.

Mapped, not paged (see [[summaries/model-and-feature-catalog]]): the full Diffusers per-pipeline/per-API reference, non-SD pipelines (Kandinsky, Würstchen, video — SVD/CogVideoX), optimization backends (ONNX/OpenVINO/Core ML/MPS), and quantization (bitsandbytes/torchao/GGUF/quanto). For live model availability, licenses, and post-date releases, use the Hugging Face Hub and web search.


<!-- ===== stable-diffusion/wiki/concepts/controlnet-and-adapters.md ===== -->

---
title: "ControlNet and Adapters (T2I-Adapter, IP-Adapter, InstructPix2Pix)"
type: concept
tags: [controlnet, t2i-adapter, ip-adapter, instructpix2pix, conditioning]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-controlnet-2.md, raw/llms_txt_doc-controlnet.md, raw/llms_txt_doc-t2i-adapter.md, raw/llms_txt_doc-ip-adapter.md, raw/llms_txt_doc-instructpix2pix.md]
---
# ControlNet and Adapters

Adapters add controllable conditioning on top of a frozen base model. Use ControlNet for structural control (edges, depth, pose), T2I-Adapter as a lighter alternative, IP-Adapter for image-prompt guidance, and InstructPix2Pix for instruction-based editing. See [[concepts/text-to-image]] and [[concepts/image-to-image-and-inpainting]].

## ControlNet

A ControlNet adds "zero convolution" layers conditioned on a structural control (canny edge, depth map, human pose, etc.). Load a `ControlNetModel`, pass it to the pipeline, and weight it with `controlnet_conditioning_scale`.

```py
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Load the ControlNet, then attach it to a base Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained("path/to/controlnet", torch_dtype=torch.float16)
pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/base/model", controlnet=controlnet, torch_dtype=torch.float16).to("cuda")
# `prompt` is a text prompt; `control_image` is the preprocessed control (e.g. a canny edge map).
image = pipeline(prompt, num_inference_steps=20, image=control_image).images[0]
```

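For SDXL use `StableDiffusionXLControlNetPipeline` (and `...Img2ImgPipeline` / `...InpaintPipeline`) with a ControlNet checkpoint such as `diffusers/controlnet-canny-sdxl-1.0`. The text-to-image pipeline takes the structural image as `image`; the img2img and inpaint variants take it as `control_image`, because `image` there is the picture being edited.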

**Multi-ControlNet:** pass lists of `ControlNetModel`s and scales — `StableDiffusionXLControlNetPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, ...)`, then `pipeline(prompt, image=images, controlnet_conditioning_scale=[0.5, 0.5], strength=0.7)`.

**guess_mode:** `guess_mode=True` generates from only the control input, no prompt (early `DownBlock` scaled `0.1`, `MidBlock` fully `1.0`).
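
The two endpoints quoted above imply a ramp across the ControlNet's residual outputs. The sketch below assumes that the ramp is log-spaced and that there are 12 down-block residuals plus one mid-block residual. Both are assumptions made for illustration, not values taken from the source, and `guess_mode_scales` is a hypothetical helper.

```python
def guess_mode_scales(num_down_residuals, conditioning_scale=1.0):
    """Per-residual scales rising from 0.1 (first down block) to 1.0 (mid block).

    Assumes a log-spaced ramp; the real pipeline may space them differently.
    """
    if num_down_residuals < 1:
        raise ValueError("need at least one down-block residual")
    count = num_down_residuals + 1  # the extra entry is the mid block
    return [
        conditioning_scale * 10 ** (-1 + i / (count - 1))
        for i in range(count)
    ]


scales = guess_mode_scales(num_down_residuals=12)
print([round(s, 3) for s in scales])
```

The effect is that shallow features from the control image contribute weakly while the deepest features contribute fully, which is what lets the control input drive generation without a prompt.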

## T2I-Adapter

A lightweight adapter (~77M params, ~300MB) that inserts weights into the UNet instead of copying it — smaller than a ControlNet. Load with `T2IAdapter`, use `StableDiffusionXLAdapterPipeline`, weight with `adapter_conditioning_scale`.

```py
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter

# Load the adapter, then pass it to the SDXL adapter pipeline.
adapter = T2IAdapter.from_pretrained("path/to/adapter", torch_dtype=torch.float16)
pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16)
```

## IP-Adapter

A lightweight adapter (~100MB) integrating **image**-based guidance via an image encoder and new cross-attention layers. Load a base model, then `load_ip_adapter(...)`, and pass `ip_adapter_image` with the text prompt.

```py
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipeline.set_ip_adapter_scale(0.8)
pipeline(prompt="a polar bear...", ip_adapter_image=image).images[0]
```

`set_ip_adapter_scale()`: `1.0` conditions only on the image prompt, `0.5` balances text and image. Variants: **Plus** (patch embeddings, ViT-H encoder) and **FaceID** (InsightFace embeddings). Call `enable_model_cpu_offload()` **after** loading the IP-Adapter; otherwise the IP-Adapter's image encoder is also offloaded to the CPU and the pipeline raises an error. To use several IP-Adapters at once, pass lists of weight names and scales; combine with a ControlNet for structure or LCM for speed.

## InstructPix2Pix

A Stable Diffusion model trained to edit images from instructions (e.g. "turn the clouds rainy"), conditioned on the instruction + input image. Use `StableDiffusionInstructPix2PixPipeline`, tuning `image_guidance_scale` and `guidance_scale`:

```py
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained("your_cool_model", torch_dtype=torch.float16).to("cuda")
# `prompt` is the edit instruction; `image` is the input image to edit.
edited_image = pipeline(prompt, image=image, num_inference_steps=20,
                        image_guidance_scale=1.5, guidance_scale=10).images[0]
```
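
How the two scales interact is easier to see in the guidance formula from the InstructPix2Pix paper, which combines three noise predictions. The sketch below applies that formula to plain lists of floats; whether the pipeline implements exactly this combination is an assumption, and `instruct_pix2pix_guidance` is a hypothetical helper name.

```python
def instruct_pix2pix_guidance(eps_uncond, eps_image, eps_full,
                              guidance_scale, image_guidance_scale):
    """Combine three noise predictions with two guidance scales.

    eps_uncond: prediction with neither image nor instruction
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on image and instruction
    """
    return [
        u + image_guidance_scale * (i - u) + guidance_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


guided = instruct_pix2pix_guidance(
    eps_uncond=[0.0, 1.0],
    eps_image=[0.5, 1.5],
    eps_full=[1.0, 1.0],
    guidance_scale=10,
    image_guidance_scale=1.5,
)
print(guided)
```

Raising `image_guidance_scale` pulls the result toward the input image, while raising `guidance_scale` pushes it toward the instruction, so the two are usually tuned against each other.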

Related: [[concepts/loras-for-inference]], [[summaries/model-and-feature-catalog]], [[syntheses/choosing-model-and-pipeline]].


<!-- ===== stable-diffusion/wiki/concepts/fine-tuning.md ===== -->

---
title: "Fine-Tuning Stable Diffusion: LoRA, DreamBooth, Textual Inversion, Custom Diffusion"
type: concept
tags: [fine-tuning, lora, dreambooth, textual-inversion, custom-diffusion, training]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-dreambooth.md, raw/llms_txt_doc-dreambooth-2.md, raw/llms_txt_doc-textual-inversion.md, raw/llms_txt_doc-textual-inversion-2.md, raw/llms_txt_doc-custom-diffusion.md, raw/llms_txt_doc-lora.md, raw/llms_txt_doc-train-a-diffusion-model.md, raw/llms_txt_doc-overview.md, raw/llms_txt_doc-stable-diffusion-xl.md]
---
# Fine-Tuning Stable Diffusion: LoRA, DreamBooth, Textual Inversion, Custom Diffusion

The Diffusers training scripts live in [`diffusers/examples`](https://github.com/huggingface/diffusers/tree/main/examples) — each self-contained and single-purpose, exposing the data-preprocessing code and training loop. Install from source first:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
cd examples/<technique>
pip install -r requirements.txt
```

Launch with `accelerate launch <script>.py ...` after initializing the environment via `accelerate config` (or `accelerate config default`). Use PyTorch 2.0+ (SDPA) and [xFormers](https://github.com/facebookresearch/xformers) for memory-efficient attention. See also [[concepts/loras-for-inference]] and [[concepts/optimization-and-memory]].

## Which technique to use

| Technique | What it trains | Output size | Trigger | Best for |
|---|---|---|---|---|
| **Textual Inversion** | text embeddings only | a few KBs | placeholder token e.g. `<cat-toy>` | one new object/style from 3-5 images |
| **LoRA** | low-rank weights in UNet (+ optional text encoder) | a few hundred MBs | depends on training prompt | fast, cheap, shareable style/subject |
| **DreamBooth** | the entire model | a few GBs | unique identifier e.g. `sks` | high-fidelity subject personalization |
| **Custom Diffusion** | cross-attention key/value weights + modifier token | small (`.bin`) | `modifier_token` e.g. `<new1>` | multi-concept at once |

All four need only a few example images (~3-5; Custom Diffusion ~4-5). LoRA can be **combined** with DreamBooth to speed up training.

## Canonical launch (LoRA text-to-image)

All four scripts share the same `accelerate launch ... --pretrained_model_name_or_path=$MODEL_NAME` shape; the LoRA text-to-image run is the canonical example:

```bash
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export DATASET_NAME="lambdalabs/naruto-blip-captions"
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=4 \
  --max_train_steps=15000 --learning_rate=1e-04 --lr_scheduler="cosine" \
  --checkpointing_steps=500 --validation_prompt="A naruto with blue eyes." --seed=1337
```
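
The batch flags in this command interact: the optimizer sees `train_batch_size × gradient_accumulation_steps` samples per update on each process. A small sketch of that arithmetic follows, assuming a single process and assuming `--max_train_steps` counts optimizer updates; the helper names are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to each optimizer update across all processes."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def samples_seen(max_train_steps, batch_size):
    """Total training samples consumed, assuming one update per step."""
    return max_train_steps * batch_size


# Values taken from the launch command above.
batch = effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4)
total = samples_seen(max_train_steps=15000, batch_size=batch)
print(batch, total)
```

Gradient accumulation trades wall-clock time for memory: the per-device batch stays at 1, so VRAM use is unchanged, but each update averages over 4 samples.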

The other three swap the script name and a few distinctive flags:

| Technique | Script | Distinctive flags | Saves |
|---|---|---|---|
| **DreamBooth** | [`train_dreambooth.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) | `--instance_data_dir`, `--instance_prompt="a photo of sks dog"`, `--learning_rate=5e-6 --lr_scheduler="constant" --max_train_steps=400 --push_to_hub` | full weights |
| **Textual Inversion** | [`textual_inversion.py`](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) | `--train_data_dir`, `--learnable_property="object"`, `--placeholder_token="<cat-toy>" --initializer_token="toy"`, `--max_train_steps=3000 --learning_rate=5.0e-04 --scale_lr` | `learned_embeds.bin` |
| **Custom Diffusion** | [`train_custom_diffusion.py`](https://github.com/huggingface/diffusers/blob/main/examples/custom_diffusion/train_custom_diffusion.py) | `--freeze_model` (`crossattn_kv`/`crossattn`), `--modifier_token="<new1>"` (`<new1>+<new2>` + `--concepts_list` for multi), `--real_prior` | `.bin` |

## Per-technique notes

**LoRA** trains only inserted low-rank weights — fast, small output. Saves `pytorch_lora_weights.safetensors`; ~5 hours on a 2080 Ti (11GB VRAM). `--rank` = inner dimension (higher = more params); `--learning_rate` default `1e-4` (LoRA tolerates higher). Uses `LoraConfig` from [PEFT](https://hf.co/docs/peft) with `target_modules=["to_k", "to_q", "to_v", "to_out.0"]` for the UNet. Supported for DreamBooth, SDXL, text-to-image, Kandinsky 2.2, Wuerstchen; for SDXL use [`train_dreambooth_lora_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py).
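
To make "higher rank = more params" concrete, the sketch below counts the weights a LoRA adds to one linear layer: a rank-`r` update factors into an `r × d_in` matrix and a `d_out × r` matrix. The 320-wide layer is an illustrative assumption, not a figure from the source.

```python
def lora_param_count(d_in, d_out, rank):
    """Weights added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Weights in the frozen linear layer itself (bias ignored)."""
    return d_in * d_out


for rank in (4, 16, 64):
    added = lora_param_count(320, 320, rank)
    print(rank, added, f"{added / full_param_count(320, 320):.1%}")
```

The count grows linearly with rank, which is why doubling `--rank` roughly doubles the size of the saved adapter.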

**DreamBooth** fine-tunes the **entire** model; very hyperparameter-sensitive, easy to overfit. Key params: `--instance_data_dir`, `--instance_prompt`, `--train_text_encoder`, `--checkpointing_steps`. **Prior preservation loss** (`--with_prior_preservation`, `--prior_loss_weight=1.0`, `--class_data_dir`, `--class_prompt`) retains class knowledge. `--train_text_encoder` improves faces but needs ≥24GB VRAM. `--snr_gamma=5.0` (Min-SNR) speeds convergence. VRAM tiers: 16GB → `--gradient_checkpointing --use_8bit_adam` (bitsandbytes); 12GB → add `--enable_xformers_memory_efficient_attention --set_grads_to_none`; 8GB → DeepSpeed stage 2 + fp16 + CPU offload (~25GB system RAM).
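
Prior preservation adds a second loss term computed on class images, weighted by `--prior_loss_weight`. A minimal sketch of that combination on plain lists, assuming both terms are mean squared errors; the function names are hypothetical and the real script works on tensors.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_loss_weight=1.0):
    """Instance loss plus the weighted prior-preservation loss."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss


loss = dreambooth_loss(
    instance_pred=[1.0, 2.0], instance_target=[0.0, 2.0],
    class_pred=[0.0, 0.0], class_target=[2.0, 0.0],
)
print(loss)
```

Setting the weight to `0.0` reduces this to plain DreamBooth; larger weights favor keeping the model's existing notion of the class over fitting the new subject.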

**Textual Inversion** updates only **text embeddings** of a placeholder token (~1 hour on a V100); `--num_vectors` sets embedding count. Can learn **negative embeddings** (e.g. EasyNegative) to steer away from "blurry"/"ugly".

**Custom Diffusion** trains **only cross-attention key/value weights** + modifier token, and uniquely learns **multiple concepts at once**. ~4-5 images, ~16GB VRAM with `--enable_xformers_memory_efficient_attention`; regularization downloads ~200 real images via `clip-retrieval`. For faces: `--learning_rate=5e-6`, `--max_train_steps` 1000-2000, `--freeze_model=crossattn`, 15-20+ images.

## Training a model from scratch

For unconditional generation, train a [`UNet2DModel`](/docs/diffusers/v0.38.0/en/api/models/unet2d#diffusers.UNet2DModel) from scratch with `DDPMScheduler(num_train_timesteps=1000)` and MSE loss on predicted noise via an Accelerate loop — the pattern all fine-tuning scripts build on.
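
The training target can be sketched without any framework. The code below builds the cumulative noise schedule for 1000 steps, noises a sample at a chosen timestep, and scores a noise prediction with MSE. It assumes a linear beta schedule from `1e-4` to `0.02` and uses plain lists instead of tensors; the helper names are hypothetical.

```python
import math


def linear_alphas_cumprod(num_train_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alphas_cumprod, running = [], 1.0
    for t in range(num_train_timesteps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_timesteps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]


def mse_loss(predicted_noise, true_noise):
    """The loss minimized during training: MSE on the predicted noise."""
    return sum((p - n) ** 2 for p, n in zip(predicted_noise, true_noise)) / len(true_noise)


alphas_cumprod = linear_alphas_cumprod()
x0, noise = [1.0, -1.0], [0.5, 0.25]
x_t = add_noise(x0, noise, alphas_cumprod[500])
print(x_t, mse_loss([0.0, 0.0], noise))
```

In a real loop the timestep is sampled at random per example and the prediction comes from the UNet given `x_t` and `t`.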


<!-- ===== stable-diffusion/wiki/concepts/image-to-image-and-inpainting.md ===== -->

---
title: "Image-to-Image and Inpainting"
type: concept
tags: [image-to-image, inpainting, outpainting, depth2img, strength]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-image-to-image.md, raw/llms_txt_doc-inpainting.md, raw/llms_txt_doc-outpainting.md, raw/llms_txt_doc-text-guided-depth-to-image-generation.md]
---
# Image-to-Image and Inpainting

These tasks condition generation on an existing image: img2img transforms a whole image; inpainting edits a masked region; outpainting extends beyond the borders; depth-to-image preserves spatial structure.

## Image-to-image

Like text-to-image, but you also pass an initial image: it is encoded to latent space, noise is added, and the model denoises it guided by the prompt.

```py
import torch
from diffusers import AutoPipelineForImage2Image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
pipeline.enable_model_cpu_offload()
# `prompt` is a text prompt; `init_image` is the starting image (e.g. a PIL image).
image = pipeline(prompt, image=init_image, strength=0.8).images[0]
```

### strength

`strength` is the most important img2img parameter: it sets how far the output may depart from the initial image. Higher = more "creativity" (`1.0` more or less ignores the initial image); lower stays closer to it. It also determines how many noise steps are added: `num_inference_steps=50` with `strength=0.8` adds 40 (50 × 0.8) steps of noise, then denoises for 40. Combine with `guidance_scale` for finer control.
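
The step arithmetic above can be checked with a short helper. This is a sketch of the stated rule only; rounding down to a whole step is an assumption, and `img2img_steps` is a hypothetical name.

```python
def img2img_steps(num_inference_steps, strength):
    """Number of denoising steps actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.0, 0.3, 0.8, 1.0):
    print(strength, img2img_steps(50, strength))
```

A practical consequence is that low `strength` values also make generation faster, since fewer steps are run.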

## Inpainting

Inpainting edits areas using a **mask**: white pixels mark the area to fill (by the prompt); black pixels mark the area to keep. Load with `AutoPipelineForInpainting` and pass `image` and `mask_image`.

```py
import torch
from diffusers import AutoPipelineForInpainting

pipeline = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16")
pipeline.enable_model_cpu_offload()
# `mask_image` is white where the prompt should fill and black where `init_image` is kept.
image = pipeline(prompt=prompt, negative_prompt=negative_prompt,
                 image=init_image, mask_image=mask_image, strength=0.6).images[0]
```

Popular inpaint checkpoints: `stable-diffusion-v1-5/stable-diffusion-inpainting`, `diffusers/stable-diffusion-xl-1.0-inpainting-0.1`, `kandinsky-community/kandinsky-2-2-decoder-inpaint`. Regular checkpoints like `stable-diffusion-v1-5/stable-diffusion-v1-5` work too, but inpaint-specific ones give cleaner transitions. Options:
- **`blur_factor`** — `pipeline.mask_processor.blur(mask, blur_factor=33)` softens mask edges.
- **`padding_mask_crop`** — crops, upscales, re-overlays the masked area for quality, e.g. `padding_mask_crop=32`.
- **`apply_overlay()`** — `pipeline.image_processor.apply_overlay(...)` forces the unmasked area unchanged.

`strength` and `guidance_scale` behave as in img2img; inpainting just restricts changes to the masked region.

## Outpainting

Extends an image beyond its boundaries: fill the white area (outside the original), keep the original (black-pixel mask). Use an inpainting model, a [[concepts/controlnet-and-adapters|ControlNet]], or Differential Diffusion; the docs show an SDXL inpaint pipeline plus a ZoeDepth estimator (`pip install controlnet_aux`). SDXL works best at 1024x1024.

## Depth-to-image

`StableDiffusionDepth2ImgPipeline` conditions on a prompt + initial image, optionally with a `depth_map` to preserve structure (otherwise predicted via MiDaS):

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline

pipeline = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
# No `depth_map` is passed here, so the depth is predicted from `init_image`.
image = pipeline(prompt="two tigers", image=init_image,
                 negative_prompt="bad, deformed, ugly, bad anatomy", strength=0.7).images[0]
```

Related: [[concepts/text-to-image]], [[concepts/prompting]], [[concepts/sdxl]], [[concepts/controlnet-and-adapters]].


<!-- ===== stable-diffusion/wiki/concepts/installation-and-setup.md ===== -->

---
title: "Installation and Setup"
type: concept
tags: [installation, setup, diffusers, pytorch]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-installation.md, raw/llms_txt_doc-quickstart.md, raw/llms_txt_doc-basic-performance.md]
---
# Installation and Setup

## Install

Tested on Python 3.8+ and PyTorch 1.4+ (Windows: Python 3.8-3.11). Install [PyTorch](https://pytorch.org/get-started/locally/) first, then Diffusers in a venv:

```bash
uv venv my-env && source my-env/bin/activate
uv pip install "diffusers[torch]" transformers accelerate
# conda: conda install -c conda-forge diffusers
# from source: uv pip install git+https://github.com/huggingface/diffusers
```

[Accelerate](https://huggingface.co/docs/accelerate/index) is recommended for loading/running models.
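
After installing, it can help to confirm which versions actually landed in the environment. The sketch below uses only the standard library, so it runs even if an install failed; the package list is an assumption based on the command above.

```python
from importlib import metadata


def installed_versions(packages=("torch", "diffusers", "transformers", "accelerate")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions()
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```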

## First generation

Load a model with `from_pretrained()`, then access the result via `.images[0]`:

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16, device_map="cuda")
pipeline("cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California").images[0]
```

`DiffusionPipeline` packages the components (text encoder(s), scheduler, UNet/DiT, VAE) into one class. Several `__call__()` arguments, such as `num_inference_steps`, affect the diffusion process.

## Device placement and precision

- Add `device_map="cuda"` to place the pipeline on a GPU (parallel computation, big speedup), or call `pipeline.to("cuda")`.
- Set `torch_dtype=torch.bfloat16` (or `torch.float16`) for half-precision — less memory, more speed. If limited by GPU memory (e.g. less than 10GB), load in float16 instead of the default float32.
- On Apple Silicon, place the pipeline on `mps` (Metal Performance Shaders) instead of `cuda`.
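
The device choice described in the list above can be written as a small fallback chain. The helper below is a hypothetical sketch that takes the availability flags as arguments so it runs without PyTorch; the comment shows the PyTorch calls that would normally supply them.

```python
def pick_device(cuda_available, mps_available):
    """Prefer an NVIDIA GPU, then Apple Silicon, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags would come from:
#   torch.cuda.is_available() and torch.backends.mps.is_available()
device = pick_device(cuda_available=False, mps_available=True)
print(device)
```

The returned string can then be passed to `pipeline.to(device)`.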

## Memory and speed tips

If a model doesn't fit, move idle components to CPU with `pipeline.enable_model_cpu_offload()` instead of `.to("cuda")`. For faster sampling, use a faster scheduler (`DPMSolverMultistepScheduler`, ~20-25 steps) or lower `num_inference_steps`:

```py
from diffusers import DPMSolverMultistepScheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
```

## Cache and offline use

Weights download from the Hub to a cache (usually your home directory). Override with `HF_HOME` / `HF_HUB_CACHE` or the `cache_dir` parameter; set `HF_HUB_OFFLINE=1` for offline:

```bash
export HF_HOME="/path/to/your/cache"
export HF_HUB_OFFLINE=1
```
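
The precedence between the two variables can be sketched as follows: `HF_HUB_CACHE` wins if set, otherwise the cache is the `hub` folder under `HF_HOME`, which itself defaults to `~/.cache/huggingface`. This is a simplified illustration that ignores other variables such as `XDG_CACHE_HOME`; `resolve_hub_cache` is a hypothetical helper, not a library function.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the directory where Hub downloads would be cached."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache({"HF_HOME": "/path/to/your/cache"}))
print(resolve_hub_cache({}))
```

When setting these from Python instead of the shell, assign to `os.environ` before importing `diffusers`, since the values are typically read at import time.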

See [[concepts/text-to-image]] for the generation loop, [[concepts/optimization-and-memory]] for deeper tuning, [[entities/diffusers-library]] for the library overview.


<!-- ===== stable-diffusion/wiki/concepts/loras-for-inference.md ===== -->

---
title: "LoRAs for Inference"
type: concept
tags: [lora, peft, adapters, set-adapters, fuse-lora]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-lora-2.md]
---
# LoRAs for Inference

LoRA (Low-Rank Adaptation) adds a few trainable weights to a frozen base model — fast adaptation, small checkpoints (a couple hundred MBs). This page covers *loading and blending* adapters at inference; for training see [[concepts/fine-tuning]]. See also [[concepts/text-to-image]] and [[concepts/sdxl]].

## Loading a LoRA

`load_lora_weights()` loads into both UNet and text encoder (handles weights with or without separate identifiers); specify `weight_name` and `adapter_name`:

```py
import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
# `adapter_name` is the handle used later by `set_adapters()` and `delete_adapters()`.
pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora",
    weight_name="cereal_box_sdxl_v1.safetensors", adapter_name="cereal")
pipeline("bears, pizza bites").images[0]
```

To load only into the UNet at model level, use `pipeline.unet.load_lora_adapter(..., prefix="unet")`.

## Weight scale

The `scale` parameter controls how much of a LoRA to apply: `0` = base model only, `1` = fully apply. For simple cases pass `cross_attention_kwargs={"scale": 1.0}`. For per-component control, pass a scale dict to `set_adapters` (e.g. scale the UNet `"down"` block by 0.9). `set_adapters()` only scales attention weights; ResNets and up/downsamplers stay at `1.0`.
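
What `scale` does to a single weight matrix can be shown with toy numbers: the LoRA contributes a low-rank product that is multiplied by `scale` and added to the frozen weights. This is a conceptual sketch on nested lists, not the library's implementation; real LoRAs also apply an alpha/rank factor that is omitted here, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(base, lora_b, lora_a, scale):
    """Return base + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
lora_b = [[1.0], [2.0]]          # 2x1, rank 1
lora_a = [[0.5, 0.5]]            # 1x2, rank 1

for scale in (0.0, 0.5, 1.0):
    print(scale, apply_lora(base, lora_b, lora_a, scale))
```

At `scale=0.0` the result equals the base weights, and at `1.0` the full low-rank update is applied, matching the two endpoints described above.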

## Multiple adapters with set_adapters

`set_adapters()` merges LoRAs by concatenating weighted matrices (`adapter_weights` scales each) and activates which are in use:

```py
pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
pipeline("A bowl of ramen ..., by Feng Zikai", cross_attention_kwargs={"scale": 1.0}).images[0]
```

`set_adapters("feng")` switches the active LoRA; `disable_lora()` disables all (keeping them loaded); `unload_lora_weights()` restores base weights; `delete_adapters("ikea")` removes one entirely. Inspect with `get_active_adapters()` and `get_list_adapters()`.

## Fusing (fuse_lora) and torch.compile

`fuse_lora()` fuses LoRA weights into the UNet/text encoder, lowering memory and raising speed. `lora_scale` controls scaling at fuse time (`scale` via `cross_attention_kwargs` won't work after fusing).

```py
pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
pipeline.unload_lora_weights()
pipeline.save_pretrained("path/to/fused-pipeline")
```

`unfuse_lora()` restores the base weights, but only when a single LoRA is fused. Before `torch.compile`, LoRA weights must be fused into the base model and unloaded first:

```py
pipeline.set_adapters("ikea", adapter_weights=0.7)
pipeline.fuse_lora(adapter_names=["ikea"], lora_scale=1.0)
pipeline.unload_lora_weights()
pipeline.unet.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
```

## Hotswapping

Hotswapping replaces an existing loaded LoRA's weights in place, avoiding accumulated memory and (for compiled models) recompilation. Set `hotswap=True` in `load_lora_weights()`. For compiled models, call `enable_lora_hotswap(target_rank=max_rank)` *before* loading the first LoRA and `torch.compile` *after*. Hotswapping is unsupported for LoRAs that target the text encoder. Separately, PEFT's `add_weighted_adapter` enables merge methods like TIES or DARE (the LoRAs must have identical ranks).

Related: [[concepts/schedulers-and-samplers]], [[concepts/controlnet-and-adapters]], [[concepts/optimization-and-memory]].


<!-- ===== stable-diffusion/wiki/concepts/optimization-and-memory.md ===== -->

---
title: "Optimization and Memory"
type: concept
tags: [memory, offload, torch-compile, xformers, automatic1111]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-reduce-memory-usage.md, raw/llms_txt_doc-accelerate-inference.md, raw/llms_txt_doc-basic-performance.md, raw/llms_txt_doc-xformers.md, raw/llms_txt_doc-attention-backends.md, raw/web_community-optimizations-md.md]
---
# Optimization and Memory

Diffusion is iterative and compute-heavy; balance memory and speed. Reducing memory also helps a model fit on device. See [[concepts/sdxl]] and [[entities/diffusers-library]].

## Reduce memory (diffusers)

- **Model CPU offload** — `pipeline.enable_model_cpu_offload()` moves whole models (text encoder, UNet, VAE) on/off the GPU; faster than sequential, smaller savings. Single GPU.
- **Sequential CPU offload** — `pipeline.enable_sequential_cpu_offload()` moves submodules; large savings but **extremely slow**. Don't `.to("cuda")` first.
- **VAE slicing** — `pipeline.enable_vae_slicing()` decodes one image at a time; best for multi-image batches.
- **VAE tiling** — `pipeline.enable_vae_tiling()` splits into overlapping tiles to lower peak memory; disabled below a configurable limit (512x512 for the SD VAE).
- **Group offloading** — `pipeline.enable_group_offload(onload_device=..., offload_device=..., offload_type="leaf_level", use_stream=True)` offloads layer groups; uses less memory than model offload and is faster than sequential CPU offload.
- **channels_last** — `pipeline.unet.to(memory_format=torch.channels_last)`.
- **Multi-GPU** — `device_map="balanced"` splits the pipeline evenly; `max_memory={0:"1GB",1:"1GB"}` caps per-device usage. Call `pipeline.reset_device_map()` before `.to()`/offload on a device-mapped pipeline.
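The options above can be combined in a small helper. The sketch below is one possible way to apply them, using only the method names listed in this section; the function `apply_memory_savers`, its parameters, and the `StubPipeline` class are hypothetical. The stub stands in for a real pipeline so the example runs without `diffusers` or a GPU.

```py
def apply_memory_savers(pipeline, offload=None, vae_slicing=False, vae_tiling=False):
    """Call the memory-saving methods named above on a pipeline-like object.

    offload: None, "model", or "sequential" (treated here as alternatives).
    Returns the names of the methods that were called, in order.
    """
    called = []
    if offload == "model":
        pipeline.enable_model_cpu_offload()
        called.append("enable_model_cpu_offload")
    elif offload == "sequential":
        # Per the note above, do not move the pipeline to CUDA before this call.
        pipeline.enable_sequential_cpu_offload()
        called.append("enable_sequential_cpu_offload")
    elif offload is not None:
        raise ValueError(f"unknown offload mode: {offload!r}")
    if vae_slicing:
        pipeline.enable_vae_slicing()
        called.append("enable_vae_slicing")
    if vae_tiling:
        pipeline.enable_vae_tiling()
        called.append("enable_vae_tiling")
    return called

class StubPipeline:
    """Hypothetical stand-in that records calls instead of touching a model."""
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        # Any unknown attribute behaves like a no-argument method that logs its name.
        return lambda: self.calls.append(name)

applied = apply_memory_savers(StubPipeline(), offload="model", vae_slicing=True)
```

With a real pipeline, the same helper would be called right after `from_pretrained(...)` and before the first generation.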

## Speed up (diffusers)

- **Precision** — set `torch_dtype=torch.bfloat16` (more robust) or `torch.float16` at load. On NVIDIA Ampere or newer GPUs, enable tf32 matmul: `torch.backends.cuda.matmul.allow_tf32 = True`.
- **SDPA** — Scaled dot product attention is on by default for PyTorch >= 2.0 and auto-selects the best backend (FlashAttention, xFormers, native C++). Force one with `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):`.
- **torch.compile** — compiles to optimized kernels. `"max-autotune"` is fastest (CUDA graph); always pass `fullgraph=True`. Combine with channels_last:

```py
pipeline.unet.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(pipeline.unet, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
```

Compilation is slow the first time; reuse the compiled pipeline on the same image size (a different size retriggers it). Add `dynamic=True` to reduce recompilation across resolutions. Other speedups: faster scheduler (`DPMSolverMultistepScheduler`, ~20-25 steps), lower `num_inference_steps`, `pipeline.fuse_qkv_projections()`.

## Attention backends (xFormers etc.)

Diffusers routes attention through a dispatcher. Install xFormers with `pip install xformers` (requires the latest PyTorch), then select a backend with `model.set_attention_backend(...)`, e.g. `"flash"` for FlashAttention or `"_flash_3_hub"` for FlashAttention-3 on Hopper. Restore the default with `reset_attention_backend()`. xFormers is recommended for both inference and training (faster, lower memory).

## AUTOMATIC1111 webui flags

VRAM/attention flags (full list in [[entities/automatic1111-webui]]):

- `--xformers` — xFormers; great memory & speed gain, Nvidia only. `--force-enable-xformers` bypasses detection.
- `--opt-sdp-attention` — may beat xFormers but uses more VRAM (non-deterministic); `--opt-sdp-no-mem-attention` is the deterministic variant.
- `--opt-split-attention` — on by default for `torch.cuda`; `--opt-sub-quad-attention` and `--opt-split-attention-v1` are lower-memory alternatives.
- `--medvram` — splits model into cond / first_stage / unet, one in VRAM at a time (slight perf hit). `--lowvram` splits the unet into many modules (devastating for performance).
- `--opt-channelslast` — channels-last format. `--upcast-sampling` for cards otherwise forced to `--no-half`.

Related: [[concepts/schedulers-and-samplers]], [[concepts/loras-for-inference]], [[syntheses/troubleshooting-and-quality]].


<!-- ===== stable-diffusion/wiki/concepts/prompting.md ===== -->

---
title: "Prompting and Prompt Weighting"
type: concept
tags: [prompting, negative-prompt, prompt-weighting, emphasis]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-prompting.md, raw/web_community-negative-prompt-md.md, raw/web_community-features-md.md]
---
# Prompting and Prompt Weighting

A prompt describes what the model should generate. Good prompts are detailed, specific, and structured.

## Writing good prompts

Every effective prompt needs three core elements:

1. **Subject** — what you want to generate. Start here.
2. **Style** — the medium or aesthetic.
3. **Context** — actions, setting, and mood.

Use a structured narrative, not a keyword list — modern models understand language better than keyword matching. Start simple, then add details; context (lighting, artistic details, mood) matters most, and photography terms (lens type, focal length, camera angles, depth of field) help.

## Negative prompts

A *negative prompt* specifies what you don't want. It is commonly used to remove deformities (extra limbs, etc.) and improve quality without spending the prompt's 75-token allowance. Mechanically, the negative prompt replaces the empty string normally used to compute `unconditional_conditioning` during sampling, so the sampler moves toward the prompt and away from the negative prompt.

```py
image = pipeline(prompt="Astronaut in a jungle, ...",
    negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy").images[0]
```

Common quality negatives: `low quality`, `blurry`, `poor details`. Negative prompts work the same in text-to-image, image-to-image, and inpainting.
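The "toward the prompt, away from the negative" behaviour can be sketched with the standard classifier-free guidance combination, `uncond + guidance_scale * (cond - uncond)`. The snippet below is a toy illustration on three-element lists rather than real latent tensors, and the function name `guided_noise` is hypothetical.

```py
def guided_noise(noise_cond, noise_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(noise_cond, noise_uncond)]

# Toy "noise predictions"; real ones are latent-shaped tensors from the UNet.
noise_prompt = [1.0, 2.0, 3.0]      # conditioned on the prompt
noise_negative = [0.0, 2.0, 4.0]    # conditioned on the negative prompt ("" if none given)

no_guidance = guided_noise(noise_prompt, noise_negative, guidance_scale=1.0)
strong = guided_noise(noise_prompt, noise_negative, guidance_scale=7.5)
```

Where the two predictions agree (the middle element), guidance changes nothing; where they differ, a larger `guidance_scale` pushes the result further from the negative-prompt prediction. At `guidance_scale=1.0` the negative prompt drops out entirely.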

## Prompt weighting (emphasis)

Prompt weighting makes words stronger or weaker by scaling attention scores, controlling each concept's influence.

### AUTOMATIC1111 syntax

In the AUTOMATIC1111 web UI, `()` increases attention to enclosed words and `[]` decreases it:

- `a (word)` — increase attention to `word` by a factor of 1.1
- `a ((word))` — increase by 1.21 (= 1.1 × 1.1)
- `a [word]` — decrease attention by a factor of 1.1
- `a (word:1.5)` — increase attention by a factor of 1.5
- `a (word:0.25)` — decrease by a factor of 4 (= 1 / 0.25)
- `a \(word\)` — use literal `()` characters in the prompt

With `()`, a weight is specified as `(text:1.4)`; if omitted it is assumed to be 1.1. Weights only work with `()`, not `[]`.
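To make the arithmetic concrete, here is a minimal sketch that turns one simple emphasis expression into a `(text, weight)` pair following the rules listed above. It is not the web UI's actual parser: the function `emphasis_weight` is hypothetical, and escaped brackets or expressions embedded in longer prompts are out of scope.

```py
import re

def emphasis_weight(expr):
    """Return (text, weight) for one emphasis expression such as "((word))"."""
    # Explicit form: (text:1.5)
    explicit = re.fullmatch(r"\((.+):([0-9]*\.?[0-9]+)\)", expr)
    if explicit:
        return explicit.group(1), float(explicit.group(2))
    depth = 0
    # Each enclosing () multiplies the weight by 1.1 ...
    while len(expr) >= 2 and expr[0] == "(" and expr[-1] == ")":
        expr, depth = expr[1:-1], depth + 1
    # ... and each enclosing [] divides it by 1.1.
    while len(expr) >= 2 and expr[0] == "[" and expr[-1] == "]":
        expr, depth = expr[1:-1], depth - 1
    return expr, 1.1 ** depth

examples = {e: emphasis_weight(e) for e in ["(word)", "((word))", "[word]", "(word:0.25)"]}
```

Because nesting multiplies, `((word))` comes out at roughly 1.21 and `[word]` at roughly 0.909, matching the factors in the list above.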

### In Diffusers

Diffusers handles weighting via the `prompt_embeds` (and `pooled_prompt_embeds` / `negative_prompt_embeds`) arguments, which take scaled text-embedding vectors. The [sd_embed](https://github.com/xhinker/sd_embed) library generates these (same parenthesis/multiplier syntax) and supports longer prompts:

```py
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sdxl
prompt = "A (cute cat:1.4) lounges on a (floating leaf:1.2) in a (sparkling pool:1.1)..."
prompt_embeds, _, pooled_prompt_embeds, *_ = get_weighted_text_embeddings_sdxl(pipeline, prompt=prompt)
image = pipeline(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds).images[0]
```

`sd_embed` supports Stable Diffusion, SDXL, SD3, Stable Cascade, and Flux. (The [[entities/diffusers-library|Diffusers]] docs also reference Compel.) Weighting works with Textual Inversion and DreamBooth adapters, but may not help newer models like Flux with strong prompt adherence.

Related: [[concepts/text-to-image]], [[concepts/image-to-image-and-inpainting]], [[entities/automatic1111-webui]], [[syntheses/troubleshooting-and-quality]].


<!-- ===== stable-diffusion/wiki/concepts/schedulers-and-samplers.md ===== -->

---
title: "Schedulers and Samplers"
type: concept
tags: [schedulers, samplers, karras, lcm, timesteps]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-schedulers.md, raw/llms_txt_doc-latent-consistency-model.md, raw/llms_txt_doc-guiders.md]
---
# Schedulers and Samplers

A scheduler (sampler) instructs the denoising process — how much noise to remove per step. Different schedulers trade speed vs. accuracy. Diffusers lets you swap schedulers and customize schedules, spacing, and sigmas for high quality in fewer steps. See [[concepts/text-to-image]].

## Swapping the scheduler

Rebuild a scheduler from the existing config (view it via `pipeline.scheduler`):

```py
from diffusers import DPMSolverMultistepScheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
```

You can also `from_pretrained(..., subfolder="scheduler")` and pass `scheduler=` to the pipeline. Most schedulers accept extra `from_config` kwargs, e.g. `algorithm_type="sde-dpmsolver++"`, `timestep_spacing="trailing"`, `use_karras_sigmas=True`.

## Timestep spacing

Pass to the scheduler's `timestep_spacing` argument:

| spacing strategy | spacing calculation | example timesteps |
|---|---|---|
| `leading` | evenly spaced steps | `[900, 800, 700, ..., 100, 0]` |
| `linspace` | include first and last steps, evenly divide the rest | `[1000, 888.89, 777.78, ..., 111.11, 0]` |
| `trailing` | include last step, evenly divide remaining beginning from the end | `[999, 899, 799, 699, 599, 499, 399, 299, 199, 99]` |

`trailing` typically gives higher-quality, more detailed images in fewer steps. For `v_prediction` models, set `rescale_betas_zero_snr=True` and `timestep_spacing="trailing"`, then `guidance_rescale` (e.g. `0.7`) to avoid overexposed images.
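The three strategies in the table can be reproduced with a few lines of arithmetic. The sketch below assumes 1000 training timesteps and 10 inference steps and is written to match the table's example values; the function names are hypothetical, and actual scheduler implementations may differ in offsets, endpoints, and rounding.

```py
def leading(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced steps starting from 0, returned in descending order."""
    step = num_train_timesteps // num_inference_steps
    return [i * step for i in range(num_inference_steps)][::-1]

def linspace(num_inference_steps, num_train_timesteps=1000):
    """Include both endpoints and divide the rest evenly (needs at least 2 steps)."""
    last = num_inference_steps - 1
    values = [round(num_train_timesteps * i / last, 2) for i in range(num_inference_steps)]
    return values[::-1]

def trailing(num_inference_steps, num_train_timesteps=1000):
    """Count back from the final timestep in even strides."""
    step = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * step) - 1 for i in range(num_inference_steps)]

schedules = {
    "leading": leading(10),
    "linspace": linspace(10),
    "trailing": trailing(10),
}
```

Note that only `trailing` starts at the last training timestep (999) in this setup, while `leading` starts at 900.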

## Karras sigmas

`use_karras_sigmas=True` resamples the noise schedule, clustering sigmas densely in the middle (where structure reconstruction matters), increasing detail. Only for models trained with Karras sigmas — e.g. `DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, algorithm_type="sde-dpmsolver++", use_karras_sigmas=True)`.

You can also pass custom `sigmas`/`timesteps` arrays to the pipeline call (only some schedulers/pipelines support this). The Align Your Steps (AYS) schedule `AysSchedules["StableDiffusionXLTimesteps"]` is `[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]`, which produces a high-quality image in 10 steps.
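For reference, the Karras resampling is usually written as an interpolation in `sigma ** (1 / rho)` space with `rho = 7`. The sketch below implements that formula in plain Python; the function name `karras_sigmas` is hypothetical, and the `sigma_min`/`sigma_max` values are illustrative assumptions rather than values read from any particular checkpoint.

```py
def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    """Return n noise levels from sigma_max down to sigma_min (needs n >= 2)."""
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    return [
        (max_inv_rho + i / (n - 1) * (min_inv_rho - max_inv_rho)) ** rho
        for i in range(n)
    ]

# Illustrative range only; a real scheduler derives these from its beta schedule.
sigmas = karras_sigmas(10, sigma_min=0.03, sigma_max=14.6)
```

A list like this is the kind of array the `sigmas` argument mentioned above expects, for the schedulers and pipelines that accept one.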

## Choosing a scheduler

- DPM++ 2M SDE Karras — good all-purpose default.
- `TCDScheduler` — distilled models.
- `FlowMatchEulerDiscreteScheduler` / `FlowMatchHeunDiscreteScheduler` — FlowMatch models.
- `EulerDiscreteScheduler` / `EulerAncestralDiscreteScheduler` — anime-style.
- DPM++ 2M or `LCMScheduler` on SDXL — realistic images.

Fewer steps reduce computation but can lower quality. `DPMSolverMultistepScheduler` needs only ~20-25 steps.

## LCM for few-step generation

LCMs generate high-quality images in **2-4 steps** instead of 20-30. Replace the scheduler with `LCMScheduler`, then load an LCM UNet checkpoint or an LCM-LoRA via `load_lora_weights(...)` (see [[concepts/loras-for-inference]]).

```py
import torch
from diffusers import DiffusionPipeline, LCMScheduler
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
image = pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
```

A full LCM UNet applies guidance through embeddings, so negative prompts have no effect on it; its ideal `guidance_scale` range is [3., 13.], though 1.0 also works. For LCM-LoRAs the ideal `guidance_scale` range is [1.0, 2.0].

## Guiders (Modular Diffusers)

Classifier-free guidance steers generation to better match a prompt. In Modular Diffusers these are *guiders*; `ClassifierFreeGuidance` is the default (`guidance_scale=7.5`). Switch via `update_components(guider=...)`, e.g. `PerturbedAttentionGuidance`, and adjust parameters with `get_component_spec("guider")` + `create(guidance_scale=10)`.

Related: [[concepts/sdxl]], [[concepts/optimization-and-memory]], [[syntheses/troubleshooting-and-quality]].


<!-- ===== stable-diffusion/wiki/concepts/sdxl.md ===== -->

---
title: "Stable Diffusion XL (SDXL)"
type: concept
tags: [sdxl, refiner, sdxl-turbo, micro-conditioning]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-stable-diffusion-xl-2.md, raw/llms_txt_doc-stable-diffusion-xl-turbo.md]
---
# Stable Diffusion XL (SDXL)

SDXL iterates on Stable Diffusion three ways: the UNet is 3x larger; it adds a second text encoder (OpenCLIP ViT-bigG/14) alongside the original; and the *base* model output feeds a *refiner* that adds high-quality details. SDXL defaults to **1024x1024**; `height`/`width` can be 768x768 or 512x512, but below 512x512 is unlikely to work. See [[concepts/text-to-image]] and [[concepts/optimization-and-memory]].

## Load the base (and refiner)

```py
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
# Named `base` to match the refiner examples below.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")
```

The base runs standalone. When the invisible-watermark library is installed, the pipeline watermarks its output by default; pass `add_watermarker=False` to disable this. `AutoPipelineForText2Image`/`AutoPipelineForImage2Image` with `from_pipe(...)` reuse a loaded checkpoint without extra memory.

## Two-stage refiner

**Ensemble of expert denoisers** (faster, fewer total steps): the base denoises high-noise timesteps, the refiner low-noise timesteps. Control the split with `denoising_end` (base) and `denoising_start` (refiner), each a float between 0 and 1. The base output must be in **latent** space.

```py
image = base(prompt=prompt, num_inference_steps=40, denoising_end=0.8, output_type="latent").images
image = refiner(prompt=prompt, num_inference_steps=40, denoising_start=0.8, image=image).images[0]
```
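As a rough guide to how the work is divided, the sketch below estimates how many of the scheduled steps each model runs for a given boundary. It assumes `denoising_end` and `denoising_start` are set to the same value and that the split is proportional to the step count; the function `split_steps` is hypothetical, and the exact count in a real run depends on how the scheduler maps the fraction onto its timesteps.

```py
def split_steps(num_inference_steps, boundary):
    """Approximate (base_steps, refiner_steps) for a shared denoising boundary."""
    if not 0.0 <= boundary <= 1.0:
        raise ValueError("boundary must be between 0 and 1")
    base_steps = round(num_inference_steps * boundary)
    return base_steps, num_inference_steps - base_steps

base_steps, refiner_steps = split_steps(40, 0.8)
```

For the example above (40 steps, boundary 0.8) this gives 32 steps on the base and 8 on the refiner.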

**Base to refiner** (image-to-image on the fully-denoised image):

```py
image = base(prompt=prompt, output_type="latent").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]
```

## Micro-conditioning

SDXL was trained with `original_size`, `target_size`, and crop conditioning. At inference, `original_size`/`target_size` default to `(1024, 1024)` (the default gives higher quality); `crops_coords_top_left=(0, 0)` (default) correlates with centered subjects and complete faces. Negative variants `negative_original_size`, `negative_target_size`, `negative_crops_coords_top_left` steer away from given resolutions/crops via classifier-free guidance.

## Two prompts (dual text encoders)

Pass a different prompt to each encoder: `prompt` → OAI CLIP-ViT/L-14, `prompt_2` → OpenCLIP-ViT/bigG-14 (with `negative_prompt`/`negative_prompt_2`).

## SDXL-Turbo (1-step)

`stabilityai/sdxl-turbo` is an adversarial time-distilled SDXL running in as little as **1 step**, defaulting to **512x512** (best at that size). Set `guidance_scale=0.0` (trained without guidance):

```py
import torch
from diffusers import AutoPipelineForText2Image
pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16").to("cuda")
image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
```

2-4 steps improve quality. For image-to-image, ensure `num_inference_steps * strength >= 1`. Keep the default VAE in `float32` via `pipe.upcast_vae()` (or use 16-bit `madebyollin/sdxl-vae-fp16-fix`).
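The `num_inference_steps * strength >= 1` rule can be checked before calling the pipeline. The sketch below assumes the image-to-image pipeline runs `int(num_inference_steps * strength)` denoising steps, so a product below 1 would leave it with nothing to do; both helper names are hypothetical.

```py
import math

def effective_steps(num_inference_steps, strength):
    """Denoising steps actually run, assuming int(num_inference_steps * strength)."""
    return int(num_inference_steps * strength)

def min_steps_for(strength):
    """Smallest num_inference_steps with num_inference_steps * strength >= 1."""
    return math.ceil(1 / strength)

steps_ok = effective_steps(2, 0.5)     # 2 * 0.5 = 1 step
steps_too_few = effective_steps(1, 0.5)  # 1 * 0.5 rounds down to 0 steps
```

For example, with `strength=0.5` at least 2 steps are needed, and with `strength=0.3` at least 4.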

## Optimizations

- Out-of-memory: `base.enable_model_cpu_offload()` / `refiner.enable_model_cpu_offload()`.
- ~20% speed-up (`torch>=2.0`): `base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)`.
- The SDXL VAE is unstable in fp16; use `madebyollin/sdxl-vae-fp16-fix` if needed.

Related: [[concepts/schedulers-and-samplers]], [[concepts/loras-for-inference]], [[syntheses/choosing-model-and-pipeline]].


<!-- ===== stable-diffusion/wiki/concepts/text-to-image.md ===== -->

---
title: "Text-to-Image Generation"
type: concept
tags: [text-to-image, inference, guidance-scale, seeds, diffusers]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-text-to-image-2.md, raw/llms_txt_doc-autopipeline.md, raw/llms_txt_doc-reproducibility.md, raw/llms_txt_doc-diffusionpipeline.md]
---
# Text-to-Image Generation

Text-to-image generates an image from a text description (the *prompt*). The model takes the prompt and random initial noise, iteratively removes the noise guided by the prompt, then decodes the final latent into an image.

## The core loop

Load a checkpoint into `AutoPipelineForText2Image`, which auto-detects the right pipeline class, then pass a prompt:

```py
import torch
from diffusers import AutoPipelineForText2Image
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16").to("cuda")
image = pipeline("stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k").images[0]
```

For SDXL, swap the model id to `stabilityai/stable-diffusion-xl-base-1.0`.

### AutoPipeline vs DiffusionPipeline

- `AutoPipelineForText2Image` is a *task-and-model* pipeline: it returns the task-specific subclass and is the recommended entry point for text-to-image (it also supports PAG via `from_pipe()`).
- `DiffusionPipeline` is a *model-only* pipeline — picks the subclass from the checkpoint's `model_index.json` (e.g. `StableDiffusionXLPipeline`), which can do text-to-image, image-to-image, or inpainting depending on inputs.

## Key parameters

- **`num_inference_steps`** — denoising steps. Default 50; more improves quality but is slower, fewer is faster but degrades quality.
- **`guidance_scale`** — classifier-free guidance; how strongly the prompt influences the image. Lower = more "creativity"; higher follows the prompt closely but can add artifacts. 7-8.5 is usually good; default 7.5. `guidance_scale == 1` means no classifier-free guidance.
- **`height` / `width`** — output size in pixels. SD v1.5 defaults to 512x512 (both must be multiples of 8); SDXL defaults to 1024x1024, lower may reduce quality — check the model's API reference.
- **`negative_prompt`** — steers away from unwanted features (see [[concepts/prompting]]).

Example: `pipeline(prompt, guidance_scale=3.5, height=768, width=512).images[0]`.

## Seeds and reproducibility

Diffusion is random — a different image each run. Pass a seeded `torch.Generator` for deterministic output:

```py
generator = torch.Generator(device="cuda").manual_seed(30)
image = pipeline(prompt, generator=generator).images[0]
```

For CPU/GPU portability, prefer a **CPU Generator** (`torch.Generator(device="cpu").manual_seed(0)`); perf loss is negligible. The `Generator` holds a *random state* consumed when used, so the same object yields different results on subsequent calls — recreate it each loop iteration (`generator=torch.manual_seed(0)`) to regenerate. For strict determinism use `enable_full_determinism()`.
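The "consumed random state" behaviour is easy to see with any seeded generator. The sketch below uses the standard library's `random.Random` as an analogy so it runs without PyTorch; it is not `torch.Generator`, but the reuse-versus-recreate pattern is the same.

```py
import random

def sample(generator, n=3):
    """Draw n values, advancing the generator's internal state."""
    return [generator.random() for _ in range(n)]

shared = random.Random(0)
first = sample(shared)
second = sample(shared)               # same object: its state has already advanced

fresh_a = sample(random.Random(0))    # re-created with the same seed each time
fresh_b = sample(random.Random(0))
```

Reusing `shared` gives different draws on the second call, while re-creating the generator with the same seed reproduces the first draw, which is why the generator should be rebuilt inside the loop when identical images are wanted.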

## Safety checker

Older models include a safety checker that screens output against hardcoded harmful concepts. Disable it with `safety_checker=None` in `from_pretrained()` (keep it enabled for public-facing use).

Related: [[concepts/installation-and-setup]], [[concepts/prompting]], [[concepts/image-to-image-and-inpainting]], [[concepts/schedulers-and-samplers]].


<!-- ===== stable-diffusion/wiki/concepts/what-is-stable-diffusion.md ===== -->

---
title: "What Is Stable Diffusion"
type: concept
tags: [stable-diffusion, latent-diffusion, models, overview]
updated: 2026-06-23
confidence: high
sources: [raw/web_community-stable-diffusion-with-diffusers.md, raw/llms_txt_doc-understanding-pipelines-models-and-schedulers.md, raw/llms_txt_doc-diffusers.md, raw/web_community-stable-diffusion-3-stability-ai.md, raw/web_community-stabilityai-stable-diffusion-3-5-large-hugging-face.md, raw/web_community-stabilityai-stable-diffusion-xl-base-1-0-hugging-face.md]
---
# What Is Stable Diffusion

Stable Diffusion is a text-to-image **latent diffusion model** created by researchers and engineers from CompVis, Stability AI and LAION, trained on 512x512 images from a subset of the LAION-5B database.

## Latent diffusion

A diffusion model is trained to *denoise* random Gaussian noise step by step into a sample. Standard diffusion runs in slow, memory-heavy pixel space; latent diffusion runs over a lower-dimensional **latent** space. The SD autoencoder has a reduction factor of 8 per spatial side, so a `(3, 512, 512)` image becomes a `(4, 64, 64)` latent: 8 × 8 = 64 times fewer spatial positions (48 times fewer values once the channel counts are included). Three components:

- **Autoencoder (VAE)** — encoder compresses an image to a latent; decoder reverses it. At inference you only need the decoder.
- **U-Net** (or, in newer models, a diffusion transformer) — predicts the noise residual at each step, conditioned on text via cross-attention.
- **Text-encoder** — e.g. CLIP's `CLIPTextModel` — turns the prompt into embeddings.

At inference the U-Net iteratively denoises random latents (64×64) conditioned on text embeddings (77×768), about 50 times by default; the VAE decoder then converts the final latent to a 512×512 image. A **scheduler** computes each step.
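The shape arithmetic behind these numbers is small enough to check by hand. The sketch below assumes a reduction factor of 8 and 4 latent channels, as described for the SD autoencoder; the helper names are hypothetical.

```py
def latent_shape(height, width, factor=8, latent_channels=4):
    """Latent shape (channels, h, w) for an image of the given pixel size."""
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (latent_channels, height // factor, width // factor)

def num_values(shape):
    """Total number of scalar values in a tensor of this shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

pixels = (3, 512, 512)
latent = latent_shape(512, 512)
spatial_reduction = (512 * 512) // (latent[1] * latent[2])
value_reduction = num_values(pixels) / num_values(latent)
```

The U-Net therefore works on 16,384 values per sample instead of 786,432, which is where most of the memory and speed advantage over pixel-space diffusion comes from.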

## Model versions

- **v1-4 / v1.5** — original latent diffusion checkpoints. v1.5 is initialized from the v1-2 checkpoint and finetuned for 595K steps on 512x512 images (default output 512×512). Id: `stable-diffusion-v1-5/stable-diffusion-v1-5`.
- **v2 / 2.1** — successor checkpoints usable with minimal code changes.
- **SDXL** (`stabilityai/stable-diffusion-xl-base-1.0`) — latent diffusion with two fixed text encoders (OpenCLIP-ViT/G and CLIP-ViT/L); an *ensemble of experts* where the base generates latents that a separate refiner (`stabilityai/stable-diffusion-xl-refiner-1.0`) further denoises. Default 1024×1024. See [[concepts/sdxl]].
- **SD3** — announced by Stability AI as its most capable text-to-image model at the time, combining a *diffusion transformer* and *flow matching*; the suite ranges from 800M to 8B parameters.
- **SD 3.5 Large** (`stabilityai/stable-diffusion-3.5-large`) — a Multimodal Diffusion Transformer (MMDiT) with three fixed text encoders (OpenCLIP-ViT/G, CLIP-ViT/L, T5-xxl) and QK-normalization. Run via `StableDiffusion3Pipeline`.

## Using it

Most commonly run through the Hugging Face [[entities/diffusers-library]] (`diffusers`), built around the `DiffusionPipeline` — few-line inference, component swapping, and adapter (LoRA) support.

Next: [[concepts/installation-and-setup]], [[concepts/text-to-image]], [[syntheses/choosing-model-and-pipeline]].


<!-- ===== stable-diffusion/wiki/entities/automatic1111-webui.md ===== -->

---
title: "AUTOMATIC1111 Stable Diffusion web UI"
type: entity
tags: [automatic1111, webui, gradio, command-line-arguments, features]
updated: 2026-06-23
confidence: high
sources: [raw/web_community-features-md.md, raw/web_community-command-line-arguments-and-settings-md.md, raw/web_community-optimizations-md.md, raw/web_community-troubleshooting-md.md]
---
# AUTOMATIC1111 Stable Diffusion web UI

[AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) is a Gradio browser UI for Stable Diffusion — the de-facto enthusiast front-end and a primary target for single-file `.safetensors`/`.ckpt` checkpoints (alongside ComfyUI). Unlike the [[entities/diffusers-library]] (a Python library), it's an application configured by command-line flags and config files. Tested on **Python 3.10.6** (others not recommended).

## Configuration model

Launch settings come from `webui-user.bat` / `webui-user.sh` via `set COMMANDLINE_ARGS=...` (Windows) / `export COMMANDLINE_ARGS="..."` (Linux), e.g.:

```
set COMMANDLINE_ARGS=--xformers --skip-torch-cuda-test --no-half-vae --api --ckpt-dir A:\stable-diffusion-checkpoints
```
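If launch flags are generated by a script, the two line formats can be produced programmatically. The sketch below only formats the string shown above for each platform; the function `commandline_args_line` and its `platform` parameter are hypothetical, and it does not validate the flags themselves.

```py
def commandline_args_line(flags, platform="linux"):
    """Format the COMMANDLINE_ARGS line for webui-user.sh (linux) or webui-user.bat (windows)."""
    joined = " ".join(flags)
    if platform == "windows":
        return f"set COMMANDLINE_ARGS={joined}"
    if platform == "linux":
        return f'export COMMANDLINE_ARGS="{joined}"'
    raise ValueError(f"unknown platform: {platform!r}")

linux_line = commandline_args_line(["--xformers", "--medvram", "--api"])
windows_line = commandline_args_line(["--xformers", "--no-half-vae"], platform="windows")
```

The resulting line would replace the existing `COMMANDLINE_ARGS` line in the corresponding `webui-user` file.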

UI settings save to `config.json`; element defaults to `ui-config.json`. Paths: models `models/Stable-Diffusion`, LoRAs `models/Lora`, embeddings `embeddings`, hypernetworks `models/hypernetworks`.

## Key features

- **Extra networks**: unified UI for Textual Inversion (use embedding filename), **LoRA** (`<lora:filename:multiplier>` in the prompt, multiplier ~0-1; not allowed in negative prompt; SD2.0+ LoRAs unsupported), and Hypernetworks (`<hypernet:filename:multiplier>`).
- **SDXL** support added in `1.5.0`; built-in refiner inference in `1.6.0`.
- **img2img / inpainting / outpainting**: draw or upload masks; "Inpaint area: Only masked" renders the masked region at higher resolution. See [[concepts/image-to-image-and-inpainting]].
- **Hires. fix**: render low-res, upscale, then a second high-res pass (latent and GAN upscalers, e.g. RealESRGAN/ESRGAN).
- **Prompt syntax**: attention `(word)` ×1.1, `[word]` ÷1.1, `(word:1.5)` explicit weight; prompt editing `[from:to:when]`; alternating `[cow|horse]`; `BREAK` keyword; Composable Diffusion `AND`. Infinite prompt length past 75 tokens. See [[concepts/prompting]].
- **Clip Skip** slider (all SDXL models trained on the penultimate layer).
- **X/Y/Z plot**, prompt matrix, Prompt S/R, face restoration (GFPGAN / CodeFormer), CLIP interrogator, checkpoint merger, TAESD lightweight VAE, PNG info, styles (`styles.csv`).

## Important command-line / launch flags

Networking & access:
- `--listen` — bind 0.0.0.0 for LAN access; `--port PORT` (default 7860, <1024 needs admin).
- `--share` — public Gradio `xxx.app.gradio` link; `--gradio-auth username:password`.
- `--api` — launch with the API; `--nowebui` — API only; `--api-auth user:pass`.
- `--autolaunch`, `--theme dark`, `--allow-code` (custom script code).

Device & precision:
- `--skip-torch-cuda-test` — bypass the CUDA check (needed on AMD/Mac); `--device-id`, `--use-cpu all`.
- `--no-half` (don't use fp16), `--no-half-vae` (keep VAE fp32), `--precision {full,half,autocast}`, `--upcast-sampling`.
- CPU-only requires: `--use-cpu all --precision full --no-half --skip-torch-cuda-test`.

Performance / memory (see [[concepts/optimization-and-memory]] and [[syntheses/troubleshooting-and-quality]]):
- `--xformers` — xFormers cross-attention (Nvidia only; great memory & speed improvement; deterministic as of xFormers 0.0.19).
- `--opt-sdp-attention` / `--opt-sdp-no-mem-attention` (the latter is deterministic; both need PyTorch 2.*).
- `--opt-sub-quad-attention`, `--opt-split-attention` (on by default for CUDA), `--opt-channelslast`.
- `--medvram` (split model into cond/first_stage/unet, only one in VRAM), `--medvram-sdxl`, `--lowvram` (split UNet into many modules — devastating for performance), `--lowram`.

Env vars: `COMMANDLINE_ARGS`, `VENV_DIR` (`-` disables the venv), `PYTHON`, `CUDA_VISIBLE_DEVICES`, `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512` (reduces long-run fragmentation).

RTX 3060 benchmark: xFormers "fastest and low memory"; `--medvram` decent savings, small speed hit; `--lowvram` "extremely slow due to constant swapping."


<!-- ===== stable-diffusion/wiki/entities/diffusers-library.md ===== -->

---
title: "Hugging Face Diffusers Library"
type: entity
tags: [diffusers, library, diffusionpipeline, autopipeline, from_pretrained]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-diffusers.md, raw/llms_txt_doc-diffusionpipeline.md, raw/llms_txt_doc-autopipeline.md, raw/llms_txt_doc-understanding-pipelines-models-and-schedulers.md, raw/llms_txt_doc-installation.md]
---
# Hugging Face Diffusers Library

Diffusers is Hugging Face's library of pretrained diffusion models for videos, images, and audio, built around the `DiffusionPipeline` — easy inference in a few lines, mix-and-match components (models, schedulers), and adapters like LoRA. Ships offloading and quantization for memory-constrained devices; supports `torch.compile`. This wiki documents Diffusers `v0.38.0`.

## Installation

Tested on Python 3.8+ and PyTorch 1.4+ (Python 3.8-3.11 on Windows). Install into a venv (full setup in [[concepts/installation-and-setup]]):

```bash
uv pip install "diffusers[torch]" transformers
# or: conda install -c conda-forge diffusers
# from source: uv pip install git+https://github.com/huggingface/diffusers
```

Also install [Accelerate](https://huggingface.co/docs/accelerate/index). Weights cache to your home directory; relocate with `HF_HOME` / `HF_HUB_CACHE`, set `HF_HUB_OFFLINE=1` for offline.

## Architecture: pipelines, models, schedulers

Diffusion models have multiple components — UNets/DiTs, text encoders, VAEs, schedulers. The `DiffusionPipeline` wraps these into one API while keeping them swappable; at the core are **models** and **schedulers**, which you can unbundle to build your own system. See [[concepts/schedulers-and-samplers]] and [[concepts/what-is-stable-diffusion]].

## `from_pretrained` and `DiffusionPipeline`

`DiffusionPipeline` is a *model-only* base class: it scans `model_index.json` and returns the correct subclass (e.g. `StableDiffusionXLPipeline`).

```py
import torch
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, device_map="cuda")
```

Key `from_pretrained` arguments:
- `torch_dtype` — a single dtype, or a dict per model with a `"default"` key (defaults to `torch.float32`).
- `device_map` — `"cuda"` (one accelerator) or `"balanced"` (split evenly across GPUs); pair with `max_memory`. Inspect via `pipeline.hf_device_map`; reset with `reset_device_map()`.
- `vae=` / `unet=` etc. — replace individual components (e.g. `madebyollin/sdxl-vae-fp16-fix`).
- `safety_checker=None` — disable the safety checker on older SD models.

Reuse models across pipelines without extra memory via `from_pipe()`. For local/offline use, `snapshot_download` then pass the folder path.

## AutoPipeline

`AutoPipeline` is a *task-and-model* pipeline: it picks the right subclass for a task automatically. Three classes — `AutoPipelineForText2Image`, `AutoPipelineForImage2Image`, `AutoPipelineForInpainting`.

```py
import torch
from diffusers import AutoPipelineForImage2Image
pipeline = AutoPipelineForImage2Image.from_pretrained(
  "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.bfloat16, device_map="cuda")  # -> StableDiffusionXLImg2ImgPipeline
```

`DiffusionPipeline.from_pretrained` on the same model returns the general `StableDiffusionXLPipeline`. AutoPipeline is also required for PAG; unsupported models raise `ValueError`.

## Related pages

This library is the backbone for [[concepts/text-to-image]], [[concepts/image-to-image-and-inpainting]], [[concepts/sdxl]], [[concepts/controlnet-and-adapters]], [[concepts/loras-for-inference]], and [[concepts/fine-tuning]]. The parts of the Diffusers reference that this wiki does NOT page individually are mapped in [[summaries/model-and-feature-catalog]].


<!-- ===== stable-diffusion/wiki/log.md ===== -->

---
title: "Activity Log"
type: log
---

# Activity Log

Append-only record of all wiki changes.

## Format

Each entry follows this format:
```
### YYYY-MM-DD HH:MM — [Action Type]
- **Source/Trigger**: what initiated the action
- **Pages created**: list of new pages
- **Pages updated**: list of updated pages
- **Notes**: any contradictions flagged, decisions made
```

---

### 2026-04-08 00:00 — Setup

- **Source/Trigger**: Repository initialized
- **Pages created**: index.md, log.md, dashboard.md, analytics.md, flashcards.md
- **Pages updated**: none
- **Notes**: Empty knowledge base ready for first source ingestion

---

### 2026-06-23 — Initial curation (factory build)

- **Source/Trigger**: `new_wiki.py init stable-diffusion` — 106 sources gathered into `raw/` (Hugging Face Diffusers via llms.txt, AUTOMATIC1111 webui wiki, Stability AI / HF model cards)
- **Pages created**: 16 — 11 concepts (what-is-stable-diffusion, installation-and-setup, text-to-image, image-to-image-and-inpainting, prompting, sdxl, controlnet-and-adapters, schedulers-and-samplers, loras-for-inference, optimization-and-memory, fine-tuning), 2 entities (diffusers-library, automatic1111-webui), 1 summary (model-and-feature-catalog), 2 syntheses (choosing-model-and-pipeline, troubleshooting-and-quality)
- **Pages updated**: index.md (master catalog + stats), log.md
- **Notes**: Spine = Diffusers v0.38.0. Curated to the medium rung per RECIPE. Disambiguated the two LoRA docs (training vs inference). SD3 base has no single Hub ID in sources → listed as preview. The full Diffusers per-pipeline/per-API reference, non-SD pipelines, optimization backends, and quantization are mapped in the catalog, not paged.


<!-- ===== stable-diffusion/wiki/summaries/model-and-feature-catalog.md ===== -->

---
title: "Model and Feature Catalog (Map)"
type: summary
tags: [catalog, models, model-ids, pipelines, quantization, backends, map]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt-llms-txt-index.md, raw/llms_txt_doc-overview.md, raw/llms_txt_doc-model-formats.md, raw/llms_txt_doc-gguf.md, raw/web_community-stabilityai-stable-diffusion-xl-base-1-0-hugging-face.md, raw/web_community-stabilityai-stable-diffusion-3-5-large-hugging-face.md, raw/web_community-stable-diffusion-3-stability-ai.md]
---
# Model and Feature Catalog (Map)

A MAP page: the Stable Diffusion model IDs an agent will load, plus the larger Diffusers reference space this wiki does NOT page individually. Load exact values from the linked sources.

## Stable Diffusion model versions and IDs

| Model | Hub model ID | Notes |
|---|---|---|
| SD 1.4 | `CompVis/stable-diffusion-v1-4` | 512×512; used in Custom Diffusion examples |
| SD 1.5 | `stable-diffusion-v1-5/stable-diffusion-v1-5` | 512×512; most common base |
| SD 1.5 inpainting | `stable-diffusion-v1-5/stable-diffusion-inpainting` | inpainting variant |
| SD 2.x | (e.g. `stabilityai/stable-diffusion-2`, `stabilityai/stable-diffusion-2-base`) | 768/512; more sensitive to fp16 instability |
| SDXL base 1.0 | `stabilityai/stable-diffusion-xl-base-1.0` | 1024×1024; two text encoders (OpenCLIP-ViT/G + CLIP-ViT/L); CreativeML Open RAIL++-M license |
| SDXL refiner 1.0 | `stabilityai/stable-diffusion-xl-refiner-1.0` | second-stage detail refinement (ensemble of experts) |
| SDXL Turbo | `stabilityai/sdxl-turbo` | adversarial time-distilled SDXL; 1-step inference, `guidance_scale=0.0`, 512×512 |
| SD 3 | (preview / `StableDiffusion3Pipeline`) | MMDiT + flow matching; model sizes range from 800M to 8B parameters |
| SD 3.5 Large | `stabilityai/stable-diffusion-3.5-large` | MMDiT, three text encoders (CLIP-G, CLIP-L, T5-xxl); Stability Community License |
| LCM (SDXL) | `latent-consistency/lcm-sdxl` | distilled UNet, 2-4 step inference |
| LCM-LoRA | `latent-consistency/lcm-lora-sdxl`, `latent-consistency/lcm-lora-sdv1-5` | plug-in LoRA for fast inference |

Decision guidance: [[syntheses/choosing-model-and-pipeline]], [[concepts/sdxl]], [[concepts/what-is-stable-diffusion]].

## Model formats

- **Diffusers format** — each component (UNet/transformer, text encoder, VAE) in its own subfolder with `model_index.json` / `config.json`; load via `from_pretrained()`.
- **Single-file format** — all weights in one file; better compatibility with [[entities/automatic1111-webui]] and ComfyUI; load via `from_single_file()`.
- **File types** — `safetensors` (default, safe & fast), `ckpt` (legacy pickle, potentially unsafe), `dduf` (experimental). Conversion scripts in [`diffusers/scripts`](https://github.com/huggingface/diffusers/tree/main/scripts).
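
As a rough sketch of how the two layouts map to the two loaders, the helper below picks a loader from the file extension. `pick_loader` and `load_pipeline` are hypothetical names, and dispatching on the extension is an assumption made for illustration; Diffusers does not do this for you.

```python
from pathlib import PurePosixPath

# Single-file weight extensions named above; anything else is treated as a
# Diffusers-format repo ID or folder (an assumption made for illustration).
SINGLE_FILE_SUFFIXES = {".safetensors", ".ckpt"}


def pick_loader(path_or_id: str) -> str:
    """Return the name of the loader method that fits the given path or Hub ID."""
    suffix = PurePosixPath(path_or_id).suffix.lower()
    return "from_single_file" if suffix in SINGLE_FILE_SUFFIXES else "from_pretrained"


def load_pipeline(path_or_id: str):
    """Load an SD pipeline with the matching loader (requires diffusers)."""
    from diffusers import StableDiffusionPipeline  # imported lazily: heavy dependency

    loader = getattr(StableDiffusionPipeline, pick_loader(path_or_id))
    return loader(path_or_id)
```

For example, `pick_loader("stabilityai/stable-diffusion-xl-base-1.0")` selects `from_pretrained`, while a local `.safetensors` or `.ckpt` path selects `from_single_file`.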

## Task pipelines for Stable Diffusion

Text-to-image, image-to-image, inpainting, depth-to-image, super-resolution/latent-upscale, image-variation, plus adapters: ControlNet (SD / SDXL / SD3), T2I-Adapter, IP-Adapter, InstructPix2Pix, DiffEdit. See [[concepts/text-to-image]], [[concepts/image-to-image-and-inpainting]], [[concepts/controlnet-and-adapters]].

## Other generative pipelines (not paged individually)

The Diffusers `v0.38.0` reference (see `llms.txt` index) covers far more than Stable Diffusion:
- **Other image families**: Kandinsky / Kandinsky 2.1 / 2.2 / 3 / 5.0, Würstchen, Stable Cascade, Flux / Flux2, PixArt-α/Σ, Sana, Kolors, AuraFlow, HiDream, Qwen-Image, OmniGen, DeepFloyd IF, Lumina.
- **Video**: Stable Video Diffusion (SVD), CogVideoX, AnimateDiff, Mochi, LTX-Video, HunyuanVideo, Wan, Allegro, Latte.
- **Audio / 3D**: Stable Audio, AudioLDM 2, Shap-E.

## Optimization backends (not paged individually)

ONNX Runtime, OpenVINO, Core ML (Apple), Metal Performance Shaders (MPS), AWS Neuron, Intel Gaudi, xFormers, attention backends, token merging, DeepCache / CacheDiT / caching, T-GATE, ParaAttention, xDiT, Pruna. For SD-specific memory/speed work see [[concepts/optimization-and-memory]] and [[syntheses/troubleshooting-and-quality]].

## Quantization (not paged individually)

bitsandbytes (4-bit NF4, as shown for SD 3.5 Large), torchao, **GGUF**, quanto, NVIDIA ModelOpt. GGUF is a single-file format with block-wise quantization: load individual model classes via `from_single_file` with a `GGUFQuantizationConfig`; supported types are BF16, Q4_0/Q4_1/Q5_0/Q5_1/Q8_0, and Q2_K-Q6_K; loading a whole pipeline from a GGUF file is not supported. Fine-tuning techniques are mapped in [[concepts/fine-tuning]] and [[concepts/loras-for-inference]].
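
A minimal sketch of the GGUF path, assuming an SD3-family transformer checkpoint. `quant_type_from_filename` and `load_gguf_transformer` are hypothetical helpers, the `name-QUANT.gguf` naming convention is an assumption, and the checkpoint path is left as a parameter because the sources name no specific GGUF file.

```python
# Quant types listed above as supported (Q2_K-Q6_K expanded).
GGUF_QUANT_TYPES = {
    "BF16", "Q4_0", "Q4_1", "Q5_0", "Q5_1", "Q8_0",
    "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K",
}


def quant_type_from_filename(filename: str) -> str:
    """Guess the quant type from a name like 'model-Q4_0.gguf' (assumed convention)."""
    stem = filename.rsplit(".", 1)[0]
    quant = stem.rsplit("-", 1)[-1].upper()
    if quant not in GGUF_QUANT_TYPES:
        raise ValueError(f"unrecognized GGUF quant type: {quant!r}")
    return quant


def load_gguf_transformer(ckpt_path: str):
    """Load a GGUF file into a model class; a whole pipeline cannot be loaded this way."""
    import torch
    from diffusers import GGUFQuantizationConfig, SD3Transformer2DModel

    return SD3Transformer2DModel.from_single_file(
        ckpt_path,
        quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
        torch_dtype=torch.bfloat16,
    )
```

The returned transformer would then be passed to the pipeline's `from_pretrained(..., transformer=...)` so the remaining components load in their normal format.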


<!-- ===== stable-diffusion/wiki/syntheses/choosing-model-and-pipeline.md ===== -->

---
title: "Choosing a Model and Pipeline"
type: synthesis
tags: [decision-guide, model-selection, sdxl, turbo, lcm, license, vram]
updated: 2026-06-23
confidence: medium
sources: [raw/web_community-stabilityai-stable-diffusion-xl-base-1-0-hugging-face.md, raw/web_community-stabilityai-stable-diffusion-3-5-large-hugging-face.md, raw/web_community-stable-diffusion-3-stability-ai.md, raw/llms_txt_doc-stable-diffusion-xl.md, raw/llms_txt_doc-stable-diffusion-xl-turbo.md, raw/llms_txt_doc-latent-consistency-model.md]
---
# Choosing a Model and Pipeline

A decision guide weighing quality vs. speed vs. license vs. VRAM, then mapping the task to a pipeline. For exact model IDs see [[summaries/model-and-feature-catalog]].

## Which model?

**Quality at 1024×1024 → SDXL base 1.0** (`stabilityai/stable-diffusion-xl-base-1.0`). 3× larger UNet + a second text encoder (OpenCLIP-ViT/G + CLIP-ViT/L); beats SD 1.5 / 2.1 on user preference. Optionally add the **refiner** (`stable-diffusion-xl-refiner-1.0`) as a second-stage "ensemble of experts" (`denoising_end=0.8` on base, `denoising_start=0.8` on refiner). SDXL VAE is unstable in fp16 — use `madebyollin/sdxl-vae-fp16-fix`. See [[concepts/sdxl]].
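
The base-to-refiner hand-off can be sketched as follows. The function names and the 40-step default are illustrative assumptions, `split_steps` is only an approximation of how the step budget divides, and the generation function needs a CUDA GPU.

```python
def split_steps(num_inference_steps: int, handoff: float = 0.8) -> tuple[int, int]:
    """Approximate how many steps base and refiner each run for a hand-off fraction."""
    base_steps = round(num_inference_steps * handoff)
    return base_steps, num_inference_steps - base_steps


def generate_with_refiner(prompt: str, num_inference_steps: int = 40, handoff: float = 0.8):
    """Run SDXL base on the high-noise part, then the refiner on the remainder."""
    import torch
    from diffusers import DiffusionPipeline

    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    # Share the second text encoder and the VAE so they are loaded only once.
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")

    # The base stops early and hands latents (not a decoded image) to the refiner.
    latents = base(
        prompt=prompt, num_inference_steps=num_inference_steps,
        denoising_end=handoff, output_type="latent",
    ).images
    return refiner(
        prompt=prompt, num_inference_steps=num_inference_steps,
        denoising_start=handoff, image=latents,
    ).images[0]
```

With 40 steps and a hand-off of 0.8, the base covers roughly the first 32 steps and the refiner the last 8.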

**Prompt adherence / typography → SD 3.5 Large** (`stabilityai/stable-diffusion-3.5-large`). MMDiT with three text encoders (CLIP-G, CLIP-L, T5-xxl) + QK-normalization; SD3 combines a diffusion transformer with flow matching. Strong at spelling, multi-subject, complex prompts. Typical: `num_inference_steps=28`, `guidance_scale=3.5` (`StableDiffusion3Pipeline`, `torch.bfloat16`).

**Lightweight / broadest ecosystem → SD 1.5** (`stable-diffusion-v1-5/stable-diffusion-v1-5`). 512×512, smallest VRAM, most LoRAs/checkpoints, default for [[concepts/fine-tuning]] examples.

### Speed-optimized variants
- **SDXL Turbo** (`stabilityai/sdxl-turbo`): adversarial time-distilled SDXL that runs in **as little as 1 step**. Set `guidance_scale=0.0` (it was trained without guidance); best at 512×512; 2-4 steps improve quality.
- **LCM / LCM-LoRA**: predict the denoised image directly in **2-4 steps** instead of 20-30. Ideal `guidance_scale` is 3.0 to 13.0 for LCM (1.0, which disables guidance, also works) and 1.0 to 2.0 for LCM-LoRA. **Negative prompts don't work with LCM**. Available for SD 1.5, SDXL, and SSD-1B. See [[concepts/schedulers-and-samplers]] and [[concepts/loras-for-inference]].
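
A hedged sketch of both fast paths. The preset names and the exact values chosen inside each range (for example `guidance_scale=8.0` for LCM) are assumptions; the ranges themselves come from the list above.

```python
# Settings within the ranges quoted above, keyed by hypothetical preset names.
FAST_PRESETS = {
    "sdxl-turbo": {"num_inference_steps": 1, "guidance_scale": 0.0},
    "lcm": {"num_inference_steps": 4, "guidance_scale": 8.0},
    "lcm-lora": {"num_inference_steps": 4, "guidance_scale": 1.0},
}


def turbo_txt2img(prompt: str):
    """One-step text-to-image with SDXL Turbo (needs a CUDA GPU)."""
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    return pipe(prompt=prompt, **FAST_PRESETS["sdxl-turbo"]).images[0]


def lcm_lora_txt2img(prompt: str):
    """Few-step text-to-image by adding LCM-LoRA to the regular SDXL base."""
    import torch
    from diffusers import DiffusionPipeline, LCMScheduler

    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16",
    ).to("cuda")
    # LCM-LoRA expects the LCM scheduler in place of the default one.
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
    return pipe(prompt=prompt, **FAST_PRESETS["lcm-lora"]).images[0]
```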

### License (decisive for commercial use)
- **SDXL base/refiner**: CreativeML Open RAIL++-M; card says "intended for research purposes only."
- **SD 3.5 Large**: Stability Community License — free for research, non-commercial, and commercial use under **$1M total annual revenue**; above $1M needs an Enterprise License. Hub repo is gated.

### VRAM quick take
SD 1.5 lowest; SDXL heavier (1024² + larger UNet; may not run on a Tesla T4); SD 3.5 Large heaviest — fits smaller GPUs via 4-bit NF4 bitsandbytes + `enable_model_cpu_offload()`. SDXL Turbo / LCM trade steps for speed, not peak VRAM. Memory tactics: [[concepts/optimization-and-memory]], [[syntheses/troubleshooting-and-quality]].
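
The NF4 + offload combination for SD 3.5 Large can be sketched like this. `NF4_SETTINGS` and `load_sd35_large_nf4` are hypothetical names; the function needs `bitsandbytes` installed and access to the gated Hub repo.

```python
# 4-bit NF4 quantization settings for bitsandbytes.
NF4_SETTINGS = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
MODEL_ID = "stabilityai/stable-diffusion-3.5-large"  # gated: accept the license on the Hub first


def load_sd35_large_nf4():
    """Load SD 3.5 Large with an NF4-quantized transformer and model CPU offload."""
    import torch
    from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

    nf4_config = BitsAndBytesConfig(**NF4_SETTINGS, bnb_4bit_compute_dtype=torch.bfloat16)
    # Only the transformer (the largest component) is quantized.
    transformer = SD3Transformer2DModel.from_pretrained(
        MODEL_ID, subfolder="transformer",
        quantization_config=nf4_config, torch_dtype=torch.bfloat16,
    )
    pipe = StableDiffusion3Pipeline.from_pretrained(
        MODEL_ID, transformer=transformer, torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # no .to("cuda"): offload manages device placement
    return pipe
```

A call would then use the typical settings noted earlier, e.g. `pipe(prompt, num_inference_steps=28, guidance_scale=3.5)`.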

## Which task pipeline?

Use [[entities/diffusers-library]]'s `AutoPipeline` to map a task to the right subclass automatically:

| Task | Pipeline | Notes |
|---|---|---|
| **txt2img** | `AutoPipelineForText2Image` | the default; pass a prompt. See [[concepts/text-to-image]] |
| **img2img** | `AutoPipelineForImage2Image` | pass `image=` + `strength` (for Turbo, `num_inference_steps * strength >= 1`). See [[concepts/image-to-image-and-inpainting]] |
| **inpaint** | `AutoPipelineForInpainting` | pass `image=` + `mask_image=` |
| **controlled / conditioned** | ControlNet or T2I-Adapter pipelines | condition on canny/depth/pose; T2I-Adapter is lighter/faster than ControlNet but slightly worse. See [[concepts/controlnet-and-adapters]] |

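`DiffusionPipeline.from_pretrained` loads the pipeline class recorded in the checkpoint's `model_index.json` (for SDXL base, the text-to-image `StableDiffusionXLPipeline`), whereas AutoPipeline returns the task-specific class for the task you ask for. To switch tasks without reloading weights, build the other pipeline from the loaded one with `from_pipe()`. These pipelines accept LCM/LCM-LoRA acceleration and adapters.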
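
To make the table concrete, here is a small sketch that maps a task name to its AutoPipeline class and checks the Turbo img2img constraint. The task keys and helper names are hypothetical.

```python
import math

# Task keys are illustrative; the class names come from the table above.
TASK_TO_AUTOPIPELINE = {
    "txt2img": "AutoPipelineForText2Image",
    "img2img": "AutoPipelineForImage2Image",
    "inpaint": "AutoPipelineForInpainting",
}


def min_turbo_steps(strength: float) -> int:
    """Smallest num_inference_steps satisfying num_inference_steps * strength >= 1."""
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    return math.ceil(1.0 / strength)


def load_for_task(task: str, model_id: str):
    """Load the task-specific pipeline for a checkpoint (requires diffusers)."""
    import diffusers

    auto_cls = getattr(diffusers, TASK_TO_AUTOPIPELINE[task])
    return auto_cls.from_pretrained(model_id)
```

For SDXL Turbo img2img at `strength=0.5`, `min_turbo_steps` gives 2, so `num_inference_steps=2` is the lowest setting that still runs a denoising step.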

## Rule of thumb
- Fast/interactive → **SDXL Turbo** or **LCM-LoRA** on your existing base.
- Best single-image quality at high res → **SDXL base (+refiner)**.
- Text/typography or complex multi-subject prompts → **SD 3.5 Large** (mind the license).
- Max compatibility, lightest VRAM, or cheap fine-tuning → **SD 1.5**.
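
The rule of thumb can be written down as a tiny selector. `pick_model` is a hypothetical helper, and the priority order when several flags are set is an assumption, not something the sources specify.

```python
def pick_model(*, fast: bool = False, typography: bool = False, low_vram: bool = False) -> str:
    """Return a Hub model ID following the rule of thumb above."""
    if low_vram:
        # Lightest VRAM and broadest ecosystem wins when memory is the constraint.
        return "stable-diffusion-v1-5/stable-diffusion-v1-5"
    if fast:
        return "stabilityai/sdxl-turbo"
    if typography:
        return "stabilityai/stable-diffusion-3.5-large"  # mind the license
    # Default: best single-image quality at high resolution.
    return "stabilityai/stable-diffusion-xl-base-1.0"
```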


<!-- ===== stable-diffusion/wiki/syntheses/troubleshooting-and-quality.md ===== -->

---
title: "Troubleshooting and Quality Casebook"
type: synthesis
tags: [troubleshooting, oom, nan, fp16, vae, reproducibility, seeds, quality]
updated: 2026-06-23
confidence: medium
sources: [raw/web_community-troubleshooting-md.md, raw/web_community-seed-breaking-changes-md.md, raw/llms_txt_doc-reduce-memory-usage.md, raw/llms_txt_doc-reproducibility.md, raw/web_community-optimizations-md.md]
---
# Troubleshooting and Quality Casebook

A symptom → cause → fix casebook. Pick the fix matching your stack ([[entities/automatic1111-webui]] = web UI flags; [[entities/diffusers-library]] = Python calls).

## CUDA out of memory (OOM)

Cause: model + activations exceed VRAM (worse at high res / large batches; SDXL & SD3 are heavy).

Web UI (escalating, each costs speed): `--opt-sdp-no-mem-attention` or `--xformers` (cuts memory ~half) → `--medvram` (4GB, ~1.3× larger images) → `--lowvram --always-batch-cond-uncond` → `--disable-model-loading-ram-optimization` (OOM loading a full-weight model, v1.6.0+). Add `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512`. Needs ~16GB system RAM; with 8GB use a page file or `--lowram`.

Diffusers (see [[concepts/optimization-and-memory]]):
- `enable_model_cpu_offload()` — whole models GPU↔CPU; faster, moderate savings.
- `enable_sequential_cpu_offload()` — submodule-level; large savings but **extremely slow**. Don't `.to("cuda")` first.
- `enable_group_offload(offload_type="block_level"|"leaf_level")` — between the two; `use_stream=True` overlaps transfer/compute.
- `enable_vae_slicing()` (batch decode), `enable_vae_tiling()` (large images) cut VAE peak memory.
- `device_map="balanced"` (split across GPUs); sharded checkpoints (`max_shard_size`); layerwise fp8 casting; `torch.channels_last`.
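
One way to apply these tactics in escalating order is sketched below. `apply_memory_savers` and its level names are hypothetical; the method names are the ones listed above, so check them against your installed Diffusers version.

```python
def apply_memory_savers(pipe, level: str = "moderate"):
    """Enable memory savers on a Diffusers pipeline, cheapest first.

    Call this on a pipeline that has NOT been moved with .to("cuda"),
    since the offload hooks manage device placement themselves.
    """
    if level not in {"light", "moderate", "aggressive"}:
        raise ValueError(f"unknown level: {level!r}")

    # Cheap wins: lower VAE peak memory for batches and for large images.
    pipe.enable_vae_slicing()
    pipe.enable_vae_tiling()

    if level == "moderate":
        pipe.enable_model_cpu_offload()       # whole models GPU<->CPU
    elif level == "aggressive":
        pipe.enable_sequential_cpu_offload()  # submodule-level: large savings, very slow
    return pipe
```

Starting at `"light"` and escalating only after another OOM keeps the speed cost as low as possible.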

## Black / green screen or "Tensor with all NaNs in the VAE"

Cause: fp16 instability, often in the VAE; some GPUs lack half precision. SD 2.0/2.1 are especially fp16-sensitive (new cross-attention module); the SDXL VAE is unstable in fp16.

Web UI: green/black screen → `--upcast-sampling` (stacks with `--xformers`); else `--precision full --no-half` (high VRAM, may need `--medvram`). NaN-in-VAE → confirm with `--disable-nan-check`; Nvidia 16XX/10XX → `--upcast-sampling --xformers`, then `--no-half-vae` (VAE fp32), falling back to `--no-half`. AMD without fp16 → `--upcast-sampling --opt-sub-quad-attention` / `--opt-split-attention-v1`.

Diffusers: replace the SDXL VAE with `madebyollin/sdxl-vae-fp16-fix`, or keep the VAE in `float32` (`pipe.upcast_vae()` for SDXL Turbo). See [[concepts/sdxl]].
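
A minimal sketch of the VAE swap, assuming the SDXL base checkpoint; `load_sdxl_with_fixed_vae` is a hypothetical helper name.

```python
FP16_FIX_VAE = "madebyollin/sdxl-vae-fp16-fix"


def load_sdxl_with_fixed_vae(model_id: str = "stabilityai/stable-diffusion-xl-base-1.0"):
    """Load SDXL in fp16 with the fp16-safe VAE swapped in (requires diffusers)."""
    import torch
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline

    # Load the replacement VAE first, then hand it to the pipeline so the
    # stock (fp16-unstable) VAE is never used for decoding.
    vae = AutoencoderKL.from_pretrained(FP16_FIX_VAE, torch_dtype=torch.float16)
    return StableDiffusionXLPipeline.from_pretrained(
        model_id, vae=vae,
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    )
```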

## "Torch is not able to use GPU"

Cause: no NVIDIA GPU / too-old card, or a broken CUDA/driver setup (AMD & Mac hit this).
Fix: on AMD/Mac/CPU add `--skip-torch-cuda-test` (CPU needs `--use-cpu all --precision full --no-half --skip-torch-cuda-test`). If broken after an update, undo changes and delete `venv`. Diagnostics: `python -m torch.utils.collect_env`. xformers "CUDA error: no kernel image" → `--reinstall-xformers --xformers` (Pascal+ on Windows, Python 3.10), then remove the reinstall flag.

## Non-reproducible / drifting seeds

Diffusers: pipelines draw their initial noise with `torch.randn`, so an unseeded call gives a different image every time. A `Generator` carries a *random state* that advances each time it is consumed, so reusing one across calls in a loop drifts; CPU and GPU RNGs also produce different streams. For portable results, pass a fresh CPU `Generator` per call: `torch.Generator(device="cpu").manual_seed(0)`. For strict determinism use `enable_full_determinism()` (sets `CUBLAS_WORKSPACE_CONFIG=:16:8`, disables cuDNN benchmark and TF32; slower). Even an identical seed isn't *guaranteed* to reproduce across platforms.
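
The drift is easy to see with a stand-in. The sketch below uses Python's `random.Random` in place of `torch.Generator` (a deliberate substitution so it runs without PyTorch); the same pattern applies when you create `torch.Generator(device="cpu").manual_seed(seed)` inside the loop rather than outside it.

```python
import random


def draws(rng: random.Random, n: int = 3) -> list[float]:
    """Consume n values from the generator, advancing its internal state."""
    return [rng.random() for _ in range(n)]


# Reusing one generator: its state has advanced, so the second "call" differs.
shared = random.Random(0)
first = draws(shared)
second = draws(shared)

# A fresh generator per call with the same seed: identical results every time.
fresh_a = draws(random.Random(0))
fresh_b = draws(random.Random(0))

print(first == second)     # reused state drifts
print(fresh_a == fresh_b)  # fresh seeded generators agree
```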

Web UI: seed-breaking changes, by date or release: emphasis re-implementation (2022-09-29); DPM++ SDE made deterministic across batch sizes (2023-02-18); LoRA applied by altering layer weights (2023-03-26, differences amplified by hires fix); prompt-editing timeline split + fraction-vs-absolute step rules (1.6.0, `[red:green:0.25]` vs `[red:green:5]`); zero-terminal-SNR change (1.8.0, `alphas_cumprod` no longer fp16). Most ship a compatibility-page setting to restore old behavior. For GPU/CPU parity set "Random number generator source = CPU" (also required to match Stability's SDXL reference output).
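
The fraction-vs-absolute distinction can be illustrated with a simplified model. `switch_step` is a hypothetical helper that assumes a value below 1 is a fraction of the total steps and a value of 1 or more is an absolute step number; it ignores the separate hires-fix pass rules that 1.6.0 introduced.

```python
def switch_step(when: float, steps: int) -> int:
    """Step at which [from:to:when] switches prompts (simplified, single pass)."""
    if when <= 0:
        raise ValueError("when must be positive")
    # Below 1: treat as a fraction of the sampling steps; otherwise an absolute step.
    step = when * steps if when < 1 else when
    return min(steps, int(step))


# With 20 sampling steps, both spellings switch at step 5,
# but they diverge as soon as the step count changes.
print(switch_step(0.25, 20), switch_step(5, 20))
print(switch_step(0.25, 40), switch_step(5, 40))
```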

## Poor quality / low detail

Cause: too few steps, wrong guidance, a low-res model past its training resolution, or a distilled model used like a normal one.
Fix: match steps and guidance to the model: SDXL Turbo `guidance_scale=0.0` (1-4 steps), LCM 2-4 steps with `guidance_scale` in its trained range, SD 3.5 ~28 steps / `guidance_scale≈3.5`. SD 1.x/2.x degrade when pushed far above their native 512/768 px → use **Hires. fix** (low-res pass, upscale, second pass), optionally with `Extra noise multiplier` (keep it below the denoising strength). Use a negative prompt (except with LCM). Add the SDXL **refiner** for final detail. See [[concepts/prompting]] and [[concepts/text-to-image]].

## Slow generation

Cause: unoptimized attention, offloading overhead, or a slow scheduler/step count.
Web UI: `--xformers` (fastest + low memory per the RTX 3060 benchmark; `--opt-sdp-attention` can beat it but uses more VRAM). Avoid `--lowvram` unless forced. Disable browser hardware acceleration and GPU hardware scheduling.
Diffusers: compile the UNet (`torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)`); fuse LoRAs before compiling; prefer model offload over sequential; use a few-step model (SDXL Turbo / LCM). See [[concepts/optimization-and-memory]].
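
The fuse-then-compile order can be sketched as below. `prepare_for_speed` is a hypothetical helper; its `compile_fn` parameter exists only so the ordering can be exercised without PyTorch, and it defaults to `torch.compile`.

```python
def prepare_for_speed(pipe, has_lora: bool = False, compile_fn=None):
    """Fuse any loaded LoRA into the base weights, then compile the UNet."""
    if has_lora:
        pipe.fuse_lora()            # merge LoRA weights before compiling
        pipe.unload_lora_weights()  # drop the now-redundant adapter modules
    if compile_fn is None:
        import torch  # imported lazily: heavy dependency

        compile_fn = torch.compile
    pipe.unet = compile_fn(pipe.unet, mode="reduce-overhead", fullgraph=True)
    return pipe
```

The first call after compiling is slow while the graph is built; the speedup shows up on later calls with the same input shapes.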