Tuning & Troubleshooting

VRAM-smart defaults, practical knobs, and quick fixes for LoRA & DreamBooth.

Quick checklist

  • Clean data: one subject/style per folder; remove blurs/near-dupes.
  • Family match: the model family (SD 1.5 vs SDXL) must match both your base checkpoint and your generation plan.
  • System headroom: check Launcher → System for VRAM & disk before training.
  • A1111 only for auto-caption: run it for captioning, then stop to free VRAM.
[Screenshot] System panel — confirm VRAM/disk headroom before you hit Start.
[Screenshot] Family mismatch symptoms — muddy results or your model not appearing.

VRAM strategy ladder

  1. Lower resolution (long side): the first and biggest lever.
  2. Batch = 1; increase gradient accumulation to simulate a larger batch (see the sketch below).
  3. Keep precision FP16 from presets; avoid exotic half-modes unless you know them well.
  4. Close other GPU apps; only keep A1111 on for captioning.
  5. If still OOM: shorten steps for a test pass, then resume longer.
[Diagram] VRAM ladder — lower resolution → batch 1 → grad-accum is the safest path.
[Screenshot] Batch vs gradient accumulation — batch stays small; use accumulation to approximate a larger effective batch.
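
In plain PyTorch, the batch-1 plus accumulation pattern boils down to a few lines. A minimal sketch with a toy model and data; a real run uses whatever the preset builds, and accum_steps = 4 is just an example:

    import torch
    from torch import nn

    # Toy stand-ins; a real run uses the preset's model, optimizer, and loader.
    model = nn.Linear(8, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    data = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(16)]  # batch = 1

    accum_steps = 4  # effective batch = physical batch (1) * accum_steps

    optimizer.zero_grad()
    for step, (x, y) in enumerate(data):
        loss = nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()  # scale so the summed grads average out
        if (step + 1) % accum_steps == 0:
            optimizer.step()             # one weight update per accumulated batch
            optimizer.zero_grad()

Memory scales with the physical batch, so VRAM stays flat while the update statistics approach those of a batch of 4.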

Resolution & aspect

  • Portrait LoRA: start long side ~640–768 px.
  • Full-body DreamBooth: start with a taller canvas (e.g., in the 832×1216 class), but keep resolution moderate at first.
  • Normalize sizes to reduce gradient noise from extreme aspect-ratio shifts (see the resize sketch after this list).
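
To normalize sizes, it is usually enough to cap the long side before training. A small Pillow sketch, assuming a hypothetical dataset/portraits folder and overwriting in place (back up first):

    from pathlib import Path
    from PIL import Image

    def resize_long_side(path: Path, target: int = 768) -> Image.Image:
        """Scale so the long side equals `target`, preserving aspect ratio."""
        img = Image.open(path)
        w, h = img.size
        scale = target / max(w, h)
        if scale < 1:  # only downscale; upscaling adds no information
            img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
        return img

    for p in Path("dataset/portraits").glob("*.jpg"):  # hypothetical folder
        resize_long_side(p).save(p)                    # overwrites in place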

Optimizer & learning rate

Presets ship stable combos. If you must tweak:

  • If loss explodes or goes NaN: lower the LR slightly and keep the preset optimizer (see the watchdog sketch after this list).
  • If underfitting: extend steps after verifying data quality.
  • Avoid stacking many “advanced” toggles at once — change one thing and re-test.
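
To make the first bullet actionable, a tiny watchdog can catch the blow-up and cut the LR. A sketch assuming a standard PyTorch optimizer; the halving factor is illustrative:

    import math
    import torch

    optimizer = torch.optim.AdamW([torch.zeros(1, requires_grad=True)], lr=1e-4)

    def check_and_lower_lr(loss: torch.Tensor, factor: float = 0.5) -> bool:
        """If loss went NaN/inf, cut every param group's LR and report it."""
        if math.isfinite(loss.item()):
            return False
        for group in optimizer.param_groups:
            group["lr"] *= factor  # e.g. 1e-4 -> 5e-5; keep the preset optimizer
        return True

Call it each step; if it returns True, stop, resume from the last good checkpoint, and re-test with the lower LR.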

Steps, epochs & resume

  • Start modest; evaluate mid-run samples if enabled.
  • Use Resume from the last good step instead of restarting from scratch (a minimal sketch follows below).
  • Overtraining shows as oversaturated, waxy faces, or collapsed diversity — stop earlier next time.
[Screenshot] Resume dialog — training picks up from the last saved step.
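
Under the hood, resuming just restores model and optimizer state plus the step counter. A minimal sketch, with a hypothetical outputs/last_good_step.pt path:

    import torch

    CKPT = "outputs/last_good_step.pt"  # hypothetical checkpoint path

    def save_checkpoint(model, optimizer, step):
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)

    def resume(model, optimizer) -> int:
        """Restore weights and optimizer state; return the step to continue from."""
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]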

Dataset pitfalls

  • Mixed subjects in one folder → identity drift.
  • Heavy compression → artifacts the model will learn.
  • Caption noise → conflicting tokens; keep captions short and consistent.
  • Missing limbs in full-body sets → add clear, upright poses.
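
Near-dupes (and the identity drift they feed) are easy to screen automatically. A quick average-hash sketch using only Pillow; the folder and the Hamming threshold of 4 are illustrative, and it only compares neighbors in filename order, which catches burst shots:

    from pathlib import Path
    from PIL import Image

    def ahash(path: Path, size: int = 8) -> int:
        """Tiny average hash: near-duplicates land on nearly equal bit patterns."""
        g = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
        px = list(g.getdata())
        mean = sum(px) / len(px)
        bits = 0
        for p in px:
            bits = (bits << 1) | (p > mean)
        return bits

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    files = sorted(Path("dataset/mascot").glob("*.jpg"))  # hypothetical folder
    hashes = [(f, ahash(f)) for f in files]
    for (f1, h1), (f2, h2) in zip(hashes, hashes[1:]):
        if hamming(h1, h2) <= 4:
            print(f"near-dupe? {f1.name} ~ {f2.name}")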

Reproducibility & versioning

Training Center exports params.json with your job; keep it with your outputs (a quick diff sketch follows below).

  • Use semantic names: mascot_sdxl_fb_v1.safetensors, ..._v2 for later runs.
  • Keep a tiny “sanity” dataset to quickly validate new settings before full runs.
[Screenshot] params.json next to outputs — keep it with the artifact for exact repro.
[Screenshot] Versioned model naming — versioning saves you from “which one was good?”.
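
params.json also makes run-to-run comparison trivial. A sketch that diffs two runs' settings; the run folders are hypothetical, and the schema is simply whatever Training Center writes:

    import json
    from pathlib import Path

    def load_params(run_dir: str) -> dict:
        """Read the params.json exported next to a job's outputs."""
        return json.loads((Path(run_dir) / "params.json").read_text())

    old = load_params("outputs/mascot_sdxl_fb_v1")  # hypothetical run folders
    new = load_params("outputs/mascot_sdxl_fb_v2")
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            print(f"{key}: {old.get(key)!r} -> {new.get(key)!r}")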

Common issues & fixes

OOM (out of memory)

  • Lower resolution → set batch = 1 → use grad-accum.
  • Close A1111 and other GPU apps; keep FP16.
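
For unattended test passes, the ladder can be automated: catch the OOM and retry smaller. A sketch where train_once is a hypothetical entry point for your job; the exception type needs a recent PyTorch:

    import torch

    def train_down_the_ladder(train_once, resolutions=(768, 640, 512)):
        """Retry the same job at a smaller long side whenever it OOMs."""
        for res in resolutions:
            try:
                return train_once(resolution=res)  # hypothetical job entry point
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()           # release cached blocks first
                print(f"OOM at {res}px; stepping down")
        raise RuntimeError("OOM even at the smallest resolution; free more VRAM")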

Exploding loss / NaN

  • Lower learning rate; revert any recent “advanced” toggles.
  • Audit data for corrupted images or extreme outliers.
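
Corrupted files are cheap to find before they blow up a run. A Pillow sketch; the dataset root is hypothetical:

    from pathlib import Path
    from PIL import Image

    bad = []
    for p in Path("dataset/mascot").rglob("*"):  # hypothetical dataset root
        if p.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        try:
            with Image.open(p) as img:
                img.verify()  # integrity check; catches truncated/corrupt files
        except Exception as err:
            bad.append((p, err))

    for p, err in bad:
        print(f"corrupt: {p} ({err})")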

No improvement / muddy results

  • Confirm family and checkpoint match (1.5 vs XL).
  • Improve dataset: sharper images, consistent framing, better captions.
  • Train a bit longer, but watch for overbake symptoms.
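
If you are unsure which family a .safetensors checkpoint belongs to, peeking at its tensor names is a usable heuristic. Key layouts vary by exporter, so treat this as an assumption rather than a guarantee:

    from safetensors import safe_open

    def guess_family(ckpt_path: str) -> str:
        """Heuristic: SDXL ships a second text encoder; SD 1.5 does not."""
        with safe_open(ckpt_path, framework="pt") as f:
            keys = list(f.keys())
        if any(k.startswith("conditioner.embedders.1.") for k in keys):
            return "SDXL"
        if any(k.startswith("cond_stage_model.") for k in keys):
            return "SD 1.5"
        return "unknown"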

Overbake / “waxy” look

  • Stop earlier; reduce steps or add varied but on-style samples.
  • For LoRA inference, try weight 0.6–0.8 first.

Throughput too slow

  • Lower resolution slightly; keep batch at 1 with accumulation.
  • Ensure the dataset is on a fast local path with enough free disk.
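
A thirty-second read test tells you whether the dataset path is the bottleneck. A rough sketch; the folder and the 200-file sample size are illustrative:

    import time
    from pathlib import Path

    def read_speed_mb_s(folder: str, limit: int = 200) -> float:
        """Rough MB/s over the first `limit` images; very low numbers point
        at a slow network share or a dying disk."""
        files = sorted(Path(folder).rglob("*.jpg"))[:limit]
        start, total = time.perf_counter(), 0
        for f in files:
            total += len(f.read_bytes())
        return total / 1e6 / max(time.perf_counter() - start, 1e-9)

    print(f"{read_speed_mb_s('dataset/mascot'):.1f} MB/s")  # hypothetical path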

After training — A1111 usage

  • LoRA: refresh the LoRA list; insert <lora:NAME:0.7>, sweep weights 0.6–0.8, and iterate (a scripted sweep follows below).
  • DreamBooth: refresh the Stable Diffusion checkpoint list; keep family consistent.
  • ControlNet: layer it after selecting your checkpoint if needed.
[Screenshot] Refresh model lists in A1111 to pick up new outputs.
[Screenshot] Compare weights/steps side-by-side to find the sweet spot.
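
The weight sweep can be scripted against A1111's HTTP API, assuming you launched it with the --api flag; the prompt, LoRA name, and seed are placeholders:

    import base64
    import requests

    URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"  # A1111 started with --api

    for w in (0.6, 0.7, 0.8):
        payload = {
            "prompt": f"photo of mascot, <lora:mascot_sdxl_fb_v1:{w}>",
            "steps": 25,
            "seed": 42,  # fixed seed so only the LoRA weight changes
        }
        r = requests.post(URL, json=payload, timeout=300)
        r.raise_for_status()
        with open(f"sweep_w{w}.png", "wb") as fh:
            fh.write(base64.b64decode(r.json()["images"][0]))

Open the three PNGs side by side and pick the weight where identity holds without the waxy look.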

What’s next?

Change one knob. Test. Repeat.