AI Toolkit Inference

Reproducible Diffusers LoRA inference pipelines for adapters trained with ostris/ai-toolkit.


HTTP API Reference

The ai-toolkit-inference server exposes a small HTTP/REST API for running LoRAs trained with ostris/ai-toolkit on top of curated Diffusers pipelines.

For per-model behavior and defaults, cross-check the Model Catalog.

This API is intentionally simple.

Note (optional): if your main blocker is environment drift (CUDA/PyTorch/Diffusers versions, large model downloads, custom pipeline deps), it helps to run the server in a fixed runtime/container. RunComfy provides a managed runtime for AI Toolkit training + inference, but any reproducible GPU environment works — the reference behavior is still defined by this repo.


Base URL

The server uses BASE_URL (see src/config.py) to construct status_url and result_url. The examples in this document assume http://localhost:8000.
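As a sketch of how these URLs are typically derived: the variable and function names below are assumptions for illustration, not the repo's actual code in src/config.py.

```python
import os

# Illustrative only: the real logic lives in src/config.py.
BASE_URL = os.environ.get("BASE_URL", "http://localhost:8000")

def build_request_urls(request_id: str) -> dict:
    # status_url / result_url as returned by POST /v1/inference
    return {
        "status_url": f"{BASE_URL}/v1/requests/{request_id}/status",
        "result_url": f"{BASE_URL}/v1/requests/{request_id}/result",
    }
```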


Endpoints

Health

Start an inference (POST /v1/inference)

This creates an async task and returns URLs you can poll.

Poll status (GET /v1/requests/{request_id}/status)

Fetch result (GET /v1/requests/{request_id}/result)

List supported models + defaults

OpenAPI / interactive docs (served by FastAPI). Unless overridden, FastAPI serves Swagger UI at /docs and the raw schema at /openapi.json.

The OpenAPI schema is the canonical way to discover endpoint paths, request fields, and response shapes.


Async lifecycle

1) POST /v1/inference → returns request_id, status_url, result_url
2) Poll GET /v1/requests/{request_id}/status
3) Fetch GET /v1/requests/{request_id}/result

Notes:


Request schema

Source of truth: src/schemas/request.py

Top-level fields

{
  "model": "flux",
  "hf_token": "hf_*** (optional)",
  "trigger_word": "sks (optional)",
  "prompts": [
    {
      "prompt": "[trigger] a cinematic photo",
      "neg": "",
      "width": 1024,
      "height": 1024,
      "sample_steps": 25,
      "guidance_scale": 4.0,
      "seed": 42
    }
  ],
  "loras": [
    {
      "path": "my_job/my_job.safetensors",
      "network_multiplier": 1.0
    }
  ]
}
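As an orientation aid, the top-level fields above can be sketched as plain dataclasses. This is illustrative only: the authoritative definitions live in src/schemas/request.py, and the defaults and optionality shown here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoraItem:
    path: str                          # file name, relative path, URL, or absolute path
    network_multiplier: float = 1.0    # LoRA strength
    transformer: Optional[str] = None  # "low" / "high" for Wan 2.2 MoE models

@dataclass
class PromptItem:
    prompt: str
    neg: str = ""
    width: int = 1024
    height: int = 1024
    sample_steps: int = 25
    guidance_scale: float = 4.0
    seed: int = 42

@dataclass
class InferenceRequest:
    model: str
    prompts: list
    loras: list
    hf_token: Optional[str] = None
    trigger_word: Optional[str] = None
```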

Field details:

LoRA path resolution

The server resolves each LoRA path as:

{WORKFLOWS_BASE_PATH}/{loras[].path} (unless the path is a URL or absolute)

Implementation: src/api/v1/inference.py.
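The resolution rule above can be sketched as follows. The WORKFLOWS_BASE_PATH value is a placeholder for illustration; the real logic is in src/api/v1/inference.py.

```python
from pathlib import Path

WORKFLOWS_BASE_PATH = "/workflows"  # assumed value for illustration

def resolve_lora_path(path: str) -> str:
    # URLs and absolute paths pass through unchanged; everything else
    # is joined under WORKFLOWS_BASE_PATH, matching the rule above.
    if path.startswith(("http://", "https://")) or Path(path).is_absolute():
        return path
    return str(Path(WORKFLOWS_BASE_PATH) / path)
```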

Common layouts:

MoE LoRA format (Wan 2.2 14B)

For Wan 2.2 14B MoE models (for example wan22_14b_t2v), loras must include transformer-tagged items for the low-noise and high-noise transformers:

"loras": [
  {"path": "low_noise.safetensors", "transformer": "low", "network_multiplier": 1.0},
  {"path": "high_noise.safetensors", "transformer": "high", "network_multiplier": 1.0}
]

You may provide only one side if you trained only one LoRA.
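To illustrate the shape above, here is a small helper that buckets LoRA items by their transformer tag, where either bucket may legitimately be empty. This is purely illustrative; the server's own handling lives in src/api/v1/inference.py.

```python
def group_moe_loras(loras: list) -> dict:
    # Split LoRA items into low/high-noise buckets by their "transformer" tag.
    # Items without a recognized tag are ignored in this sketch.
    grouped = {"low": [], "high": []}
    for item in loras:
        tag = item.get("transformer")
        if tag in grouped:
            grouped[tag].append(item)
    return grouped
```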


Prompt item fields

Prompt items are objects inside prompts: [...].

Common fields:

Video-only fields (only respected by video pipelines):

Control image fields:


Control image formats

Source of truth for loading: src/libs/image_utils.py

The server accepts a string. If it starts with http, it’s treated as a URL; otherwise it’s treated as base64.

1) URL

"ctrl_img": "https://example.com/image.png"

2) Base64 (raw or data URL)

"ctrl_img": "iVBORw0KGgoAAAANSUhEUgAA..."
"ctrl_img": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."

Responses

Source of truth: src/schemas/response.py.

POST /v1/inference response

{
  "request_id": "b3e6e2e0-...",
  "status": "queued",
  "status_url": "http://localhost:8000/v1/requests/b3e6e2e0-.../status",
  "result_url": "http://localhost:8000/v1/requests/b3e6e2e0-.../result",
  "created_at": "2026-01-17T00:00:00Z"
}

Status response

{
  "request_id": "b3e6e2e0-...",
  "status": "in_queue|in_progress|succeeded|failed|canceled",
  "created_at": "...",
  "started_at": "...",
  "finished_at": "...",
  "error": null
}

Result response

{
  "request_id": "b3e6e2e0-...",
  "status": "succeeded",
  "outputs": {
    "images": [
      {"format": "jpeg", "width": 1024, "height": 1024, "file_path": "/tmp/inference_output/b3e6e2e0-..._output_0.jpg", "seed": 42}
    ],
    "videos": [
      {"format": "mp4", "width": 832, "height": 480, "num_frames": 41, "fps": 16, "file_path": "/tmp/inference_output/b3e6e2e0-..._output_0.mp4", "seed": 42}
    ]
  },
  "error": null,
  "created_at": "...",
  "finished_at": "...",
  "metadata": {"timings": {"load_pipeline": 12.3, "prompts": [...]}}
}
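Given the result shape above, a client typically just wants the output file paths. A minimal sketch, assuming the outputs structure shown in the example:

```python
def collect_output_paths(result: dict) -> list:
    # Gather file_path entries from both images and videos, if present.
    outputs = result.get("outputs") or {}
    paths = []
    for kind in ("images", "videos"):
        for item in outputs.get(kind) or []:
            paths.append(item["file_path"])
    return paths
```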

Notes:


Error codes (common)

Errors are returned with structured JSON. Common code values include:

Implementation: src/api/v1/inference.py.


Practical examples

1) Text-to-image (FLUX)

curl -X POST http://localhost:8000/v1/inference \
  -H 'content-type: application/json' \
  -d '{
    "model": "flux",
    "trigger_word": "sks",
    "loras": [{"path": "my_flux_job/my_flux_job.safetensors", "network_multiplier": 1.0}],
    "prompts": [{
      "prompt": "[trigger] a cinematic portrait photo",
      "width": 1024,
      "height": 1024,
      "sample_steps": 25,
      "guidance_scale": 4.0,
      "seed": 42
    }]
  }'

2) Control-image editing (FLUX Kontext)

curl -X POST http://localhost:8000/v1/inference \
  -H 'content-type: application/json' \
  -d '{
    "model": "flux_kontext",
    "loras": [{"path": "my_kontext_job/my_kontext_job.safetensors", "network_multiplier": 1.0}],
    "prompts": [{
      "prompt": "make it look like a watercolor",
      "width": 1024,
      "height": 1024,
      "ctrl_img": "https://example.com/input.png",
      "seed": 42
    }]
  }'

3) Wan 2.2 MoE video (T2V)

curl -X POST http://localhost:8000/v1/inference \
  -H 'content-type: application/json' \
  -d '{
    "model": "wan22_14b_t2v",
    "loras": [
      {"path": "my_wan_job/low_noise.safetensors", "transformer": "low", "network_multiplier": 1.0},
      {"path": "my_wan_job/high_noise.safetensors", "transformer": "high", "network_multiplier": 1.0}
    ],
    "prompts": [{
      "prompt": "a cinematic shot of a city at night",
      "width": 1280,
      "height": 720,
      "num_frames": 41,
      "fps": 16,
      "sample_steps": 25,
      "guidance_scale": 4.0,
      "seed": 42
    }]
  }'

4) Python (requests) + polling

import time

import requests


BASE_URL = "http://localhost:8000"

payload = {
    "model": "zimage_turbo",
    "trigger_word": "sks",
    "loras": [{"path": "my_lora_job/my_lora_job.safetensors", "network_multiplier": 1.0}],
    "prompts": [
        {
            "prompt": "[trigger] a photo of a person",
            "width": 1024,
            "height": 1024,
            "seed": 42,
            "sample_steps": 8,
            "guidance_scale": 1.0,
            "neg": "",
        }
    ],
}

# 1) Start request
r = requests.post(f"{BASE_URL}/v1/inference", json=payload, timeout=30)
r.raise_for_status()
job = r.json()
request_id = job["request_id"]

# 2) Poll status
while True:
    s = requests.get(f"{BASE_URL}/v1/requests/{request_id}/status", timeout=30)
    s.raise_for_status()
    status = s.json()["status"]
    if status in ("succeeded", "failed", "canceled"):
        break
    time.sleep(1)

# 3) Fetch result
result = requests.get(f"{BASE_URL}/v1/requests/{request_id}/result", timeout=30)
result.raise_for_status()
print(result.json())

Reminder: file_path values in results are local paths on the server (this repo does not ship object storage/static hosting by default).
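When the client runs on the same host (or shares a volume with) the server, the simplest way to collect outputs is to copy them off the server-local paths. A sketch under that assumption:

```python
import shutil
from pathlib import Path

def save_outputs_locally(result: dict, dest_dir: str) -> list:
    # Only valid when file_path values are reachable from this process,
    # i.e. client and server share a filesystem.
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for kind in ("images", "videos"):
        for item in (result.get("outputs") or {}).get(kind) or []:
            src = Path(item["file_path"])
            target = dest / src.name
            shutil.copy(src, target)
            saved.append(str(target))
    return saved
```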


Next: model-specific docs

Once you can hit the API successfully, the fastest way to get “preview matching” is to follow the per-model page in the Model Catalog.