Reproducible Diffusers LoRA inference pipelines for adapters trained with ostris/ai-toolkit.
This page is a practical checklist for the most common “my AI Toolkit samples look great, but inference looks different / broken” problems.
It’s written specifically for:
- ostris/ai-toolkit

If you want a known-good baseline, start by running the corresponding by-model pipeline from this repo, then change one variable at a time.
When outputs drift from training samples, it’s almost always one of these:
1) Wrong model id / wrong base checkpoint
2) Resolution rules don’t match — resolution_divisor (per model).
3) Scheduler / steps / guidance differ (e.g. true_cfg_scale).
4) Control image wiring differs
5) LoRA isn’t being applied the way you think — fuse_lora vs model-specific transformer merges.

| Symptom | Most likely cause | What to do in this repo |
|---|---|---|
| AI Toolkit samples look good, Diffusers/ComfyUI inference looks different | different base checkpoint / scheduler / resolution snapping | Start from the model page and copy its defaults; also check snapped dimensions and LoRA load mode. |
| LoRA “does nothing” (looks like base model) | wrong trigger word / scale 0 / incompatible LoRA keys / wrong pipeline family | Verify trigger_word, set loras[].network_multiplier=1.0, and use the exact model pipeline. For some families, LoRA is fused so scale is not dynamically adjustable. |
| API returns CONTROL_IMAGE_REQUIRED | edit/I2V model needs ctrl_img | Provide ctrl_img (URL or base64) in the prompt item. |
| API returns MOE_FORMAT_REQUIRED, SINGLE_MOE_CONFIG_ONLY, or SINGLE_LORA_ONLY | Wan 2.2 MoE format required (or multiple configs sent), or multiple LoRAs sent | Use loras with transformer: "low" / "high" for Wan 2.2 14B (single config); otherwise send exactly one LoRA item. |
| Wan2.2 motion is “too fast / weird” vs samples | frames/fps mismatch; I2V conditioning mismatch | Match num_frames + fps defaults from the model page; keep resolution within the checkpoint regime. |
| Download stalls at ~99% | HF transfer/Xet edge cases | See the Hugging Face download section below; this repo already applies mitigations. |
| OOM / CUDA out of memory | too-large res/frames, heavy pipeline, CPU offload disabled for that model | Reduce width/height/num_frames, try smaller model ids, enable CPU offload where supported. |
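To make the table concrete, here is a hedged sketch of assembling a request body from the fields this page discusses (ctrl_img, trigger_word, loras[].network_multiplier). The top-level key names such as "prompts" are illustrative assumptions; the HTTP API page is the authoritative schema.

```python
def build_request(prompt, trigger_word=None, ctrl_img=None, loras=None,
                  width=1024, height=1024):
    """Assemble an illustrative request body.

    ctrl_img, trigger_word, and loras[].network_multiplier come from this
    page; key names like "prompts" are assumptions, not the real schema.
    """
    item = {"prompt": prompt}
    if ctrl_img is not None:
        item["ctrl_img"] = ctrl_img  # required by edit / I2V models
    body = {"prompts": [item], "width": width, "height": height}
    if trigger_word is not None:
        body["trigger_word"] = trigger_word  # server substitutes [trigger]
    if loras is not None:
        body["loras"] = loras  # e.g. [{"path": "...", "network_multiplier": 1.0}]
    return body
```

Checking a request against this shape before sending it catches most of the 400-level guardrail errors described below.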
This is a very common report across model families (FLUX/HiDream/Qwen/Wan): AI Toolkit samples look right, but standalone inference drifts.

What usually fixes it:
- Snap width/height to the model’s divisor (width//divisor*divisor).

If you’re not sure which model you are actually using, call GET /v1/models (see HTTP API) and compare defaults.
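The snapping rule above can be sketched as a small helper. Divisor values are per model; 16 below is only an example:

```python
def snap(value: int, divisor: int) -> int:
    """Round a dimension down to the nearest multiple of the model's divisor."""
    return max(divisor, (value // divisor) * divisor)

# Example with a divisor of 16:
snap(1000, 16)  # -> 992
snap(1024, 16)  # -> 1024 (already aligned)
```

If your requested resolution changes after snapping, expect the output to differ from a run at the unsnapped size.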
This shows up as “the output looks exactly like the base model.” Common causes:
AI Toolkit commonly uses a trigger token that you must include in the inference prompt.
This server supports a convenience placeholder:
- Put [trigger] in your prompt, or
- send trigger_word: "..." at the request level.

The server replaces [trigger] in every prompt item (see executor code in src/tasks/executor.py).
- loras[].network_multiplier changes LoRA strength dynamically via set_lora_scale().
- For families that merge at load time (fuse_lora), changing loras[].network_multiplier requires a reload.

The API uses a single LoRA scale per request (per transformer for MoE). If you need multiple scales, send separate requests.
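In Diffusers terms, the two modes look roughly like this. This is a hedged sketch using the public load_lora_weights / set_adapters / fuse_lora APIs; some families in this repo use a model-specific transformer merge instead:

```python
def apply_lora(pipe, path: str, scale: float, fuse: bool = False):
    """Attach one LoRA either dynamically (scale adjustable) or fused (fixed).

    `pipe` is any Diffusers pipeline exposing the standard LoRA mixin methods.
    """
    pipe.load_lora_weights(path, adapter_name="user_lora")
    if fuse:
        # Merged into the base weights; changing the scale later means reloading.
        pipe.fuse_lora(lora_scale=scale)
    else:
        # Kept as an adapter; the scale can be adjusted between calls.
        pipe.set_adapters(["user_lora"], adapter_weights=[scale])
    return pipe
```

This is why, for fused families, the table above says the scale is "not dynamically adjustable".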
Some users report Qwen-Image LoRAs not working in other stacks due to weight key naming expectations.
Example report: “Qwen-Image LoRAs not working in ComfyUI” (ai-toolkit #372).
If your LoRA works in AI Toolkit samples but not elsewhere:
- Use this repo’s pipeline for that family (it sets true_cfg_scale and aligns prompt/image encoding with AI Toolkit).

Some model families require newer or customized Diffusers behavior for LoRA injection.
Example: LoRA not affecting FLUX inference due to compatibility issues was discussed in Diffusers (diffusers #9361).
This repo pins Diffusers to a specific revision in requirements-inference.txt to reduce “works in one environment but not another” failures.
Most 400s are intentional guardrails.
CONTROL_IMAGE_REQUIRED

Models that require a control image:
Fix:
- Provide ctrl_img (or ctrl_img_1..3 for multi-image models) inside each prompt item.

MOE_FORMAT_REQUIRED (Wan 2.2 14B)

Wan 2.2 14B uses a dual-transformer MoE setup. This server expects:
```json
"loras": [
  {"path": "low_noise.safetensors", "transformer": "low", "network_multiplier": 1.0},
  {"path": "high_noise.safetensors", "transformer": "high", "network_multiplier": 1.0}
]
```
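The MoE rule above (and the SINGLE_LORA_ONLY case from the table) can be expressed as a small validator. The error-code strings come from this page; the function itself is illustrative, not the server's actual implementation:

```python
def validate_loras(loras, is_wan22_moe: bool):
    """Return an error code string, or None if the loras list is acceptable."""
    if is_wan22_moe:
        transformers = sorted(l.get("transformer", "") for l in loras)
        if transformers != ["high", "low"]:
            return "MOE_FORMAT_REQUIRED"  # expect exactly one low + one high entry
        return None
    if len(loras) != 1:
        return "SINGLE_LORA_ONLY"  # non-MoE models take exactly one LoRA item
    return None
```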
Context: users frequently ask about the “low/high noise” split and when each matters (e.g. ai-toolkit #349).
Practical tips:

- Name your LoRA files so it’s obvious which transformer each targets (low or high).
- Send exactly one MoE config per request (otherwise you’ll hit SINGLE_MOE_CONFIG_ONLY).

Users have reported Wan2.2 I2V training/inference oddities, including motion artifacts that can correlate with frame/time settings (see ai-toolkit #421).
What to try:
- num_frames=41, fps=16, steps/guidance defaults.

Some preview models ship with custom or rapidly changing pipeline code.
Example: Flex.2 users discuss Diffusers incompatibilities and the need for custom pipelines (Flex.2 discussion).
Fix:
This is a real-world issue when downloading large model repos.
Example reports:
- hf_transfer download stuck at 99% (hf_transfer #30)
- huggingface_hub snapshot_download stalls (huggingface_hub #2197)

This repo includes mitigations in src/services/pipeline_manager.py (environment vars set before snapshot_download). If you still hit stalls:
- Toggle hf_transfer / HF_HUB_ENABLE_HF_TRANSFER
- Set HF_TOKEN / pass hf_token if the model is gated

Some users assume the training config must be present to run inference (see ai-toolkit #416).
This server does not read the AI Toolkit training config to infer parameters automatically. Instead:
If you can share:
- the output of GET /v1/models

…you can usually pinpoint the mismatch quickly.
Note: if your main blocker is environment drift (CUDA/PyTorch/Diffusers versions, or missing model-specific pipeline deps), it can help to run training + inference in a fixed runtime/container. RunComfy provides a managed runtime for AI Toolkit, but any reproducible GPU environment works — the reference behavior is still defined by the code in this repo.