An implementation of SVDiff: Compact Parameter Space for Diffusion Fine-Tuning, using d🧨ffusers.
My summary tweet can be found here.
Compared with LoRA, SVDiff has about 0.5M fewer trainable parameters, and the checkpoint file is only 1.2MB (LoRA: 3.1MB)!
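For intuition, SVDiff freezes the singular vectors of each pretrained weight and trains only a small shift of its singular values ("spectral shifts"). Here is a minimal sketch of that idea, simplified to a single 2-D weight (the actual library handles conv kernels and the full UNet/text encoder):

import torch

# a frozen pretrained weight (one linear layer, for illustration)
W = torch.randn(320, 320)
U, sigma, Vt = torch.linalg.svd(W, full_matrices=False)

# the only trainable parameters: one shift per singular value
delta = torch.zeros_like(sigma, requires_grad=True)

def shifted_weight():
    # ReLU keeps the updated singular values non-negative
    return U @ torch.diag(torch.relu(sigma + delta)) @ Vt

Because only delta (one scalar per singular value) is optimized, the checkpoint stays tiny for the whole model.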
- Released v0.2.0 (please see here for details). With this change, you get better results with fewer training steps than with the first release v0.1.1!
- Added Single Image Editing
"photo of apinkblue chair with black legs" (without DDIM Inversion)
$ pip install svdiff-pytorch
Or install manually:
$ git clone https://github.com/mkshing/svdiff-pytorch
$ pip install -r requirements.txt
"Single-Subject Generation" is a domain-tuning on a single object or concept (using 3-5 images). (See Section 4.1)
According to the paper, the learning rate for the spectral shifts needs to be about 1,000 times larger than the learning rate used for regular fine-tuning.
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="path-to-instance-images"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
accelerate launch train_svdiff.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="photo of a sks dog" \
--class_prompt="photo of a dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-3 \
--learning_rate_1d=1e-6 \
--train_text_encoder \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=500
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
from svdiff_pytorch import load_unet_for_svdiff, load_text_encoder_for_svdiff
pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
spectral_shifts_ckpt_dir = "ckpt-dir-path"
unet = load_unet_for_svdiff(pretrained_model_name_or_path, spectral_shifts_ckpt=spectral_shifts_ckpt_dir, subfolder="unet")
text_encoder = load_text_encoder_for_svdiff(pretrained_model_name_or_path, spectral_shifts_ckpt=spectral_shifts_ckpt_dir, subfolder="text_encoder")
# load pipe
pipe = StableDiffusionPipeline.from_pretrained(
pretrained_model_name_or_path,
unet=unet,
text_encoder=text_encoder,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
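The pipeline returns PIL images, so you can save the result directly (the filename here is just an example):

image.save("sks_dog_in_bucket.png")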
You can also use the following CLI. Once it finishes, you will find grid.png with the results.
python inference.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--spectral_shifts_ckpt="ckpt-dir-path" \
--prompt="A picture of a sks dog in a bucket" \
--scheduler_type="dpm_solver++" \
--num_inference_steps=25 \
--num_images_per_prompt=2
Endpoint: /generate-image
Method: POST
Description: Generates an image based on the provided prompt.
Request Body:
{
"prompt": "string",
"num_inference_steps": integer
}
prompt: The text prompt describing the desired image.
num_inference_steps (optional): The number of inference steps to perform. Default is 25.
Response Body:
{
"image": "string (base64-encoded image bytes)"
}
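For example, a hypothetical client call (assuming the server runs locally on port 8000; the output filename is illustrative):

import base64
import requests

# call the /generate-image endpoint described above
resp = requests.post(
    "http://localhost:8000/generate-image",
    json={"prompt": "A picture of a sks dog in a bucket", "num_inference_steps": 25},
)
resp.raise_for_status()
# decode the base64 payload defined by the response schema above
image_bytes = base64.b64decode(resp.json()["image"])
with open("result.bin", "wb") as f:  # how to interpret the bytes depends on the server's encoding
    f.write(image_bytes)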
- FastAPI: The code utilizes the FastAPI framework, which provides a high-performance API implementation with automatic request/response parsing and validation.
- CORS Middleware: The script includes middleware to handle Cross-Origin Resource Sharing (CORS) by allowing requests from any origin (*).
- Request and Response Models: The script defines Pydantic models (GenerateImageRequest and GenerateImageResponse) to validate the request payload and response data structure.
- Exception Handling: An exception handler is defined to catch HTTPException and return an appropriate JSON response with error details.
- Asynchronous Image Generation: The generate_image method in the ImageGenerator class is an asynchronous function (async) that uses asyncio.to_thread to run the image generation process in a separate thread. This allows the API to handle multiple requests concurrently without blocking the main event loop.
- Preloading Model: The ImageGenerator class initializes the pre-trained model and related components (unet and text_encoder) in its constructor, so the model is loaded only once at startup instead of on every image generation request.
- GPU Acceleration: The code leverages GPU acceleration by using the .to("cuda") method to move the model and associated tensors to the GPU for faster image generation.
- Error Handling: The generate_image method catches exceptions during image generation and raises an HTTPException with an appropriate error message and status code (HTTP_500_INTERNAL_SERVER_ERROR) to give API clients meaningful feedback.
- Efficient Image Conversion: Assuming the generated image is in PIL format, the code converts it to bytes using the tobytes() method, avoiding unnecessary data duplication when transmitting the generated image in the response.
- Multistep Scheduler: The ImageGenerator class uses the DPMSolverMultistepScheduler to schedule the diffusion process during image generation, which can improve the performance and quality of generated images over multiple inference steps.
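Putting these pieces together, here is a minimal sketch of such a server. The class, endpoint, and field names follow the description above; everything else is an assumption rather than the repo's exact code:

import asyncio
import base64

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from svdiff_pytorch import load_unet_for_svdiff, load_text_encoder_for_svdiff

MODEL = "runwayml/stable-diffusion-v1-5"
CKPT = "ckpt-dir-path"

class GenerateImageRequest(BaseModel):
    prompt: str
    num_inference_steps: int = 25

class GenerateImageResponse(BaseModel):
    image: str  # base64-encoded image bytes

class ImageGenerator:
    def __init__(self):
        # preload the model once at startup instead of per request
        unet = load_unet_for_svdiff(MODEL, spectral_shifts_ckpt=CKPT, subfolder="unet")
        text_encoder = load_text_encoder_for_svdiff(MODEL, spectral_shifts_ckpt=CKPT, subfolder="text_encoder")
        self.pipe = StableDiffusionPipeline.from_pretrained(MODEL, unet=unet, text_encoder=text_encoder)
        self.pipe.scheduler = DPMSolverMultistepScheduler.from_config(self.pipe.scheduler.config)
        self.pipe.to("cuda")

    async def generate_image(self, prompt: str, steps: int) -> bytes:
        # run the blocking diffusion call in a worker thread so the
        # event loop keeps serving other requests
        out = await asyncio.to_thread(self.pipe, prompt, num_inference_steps=steps)
        return out.images[0].tobytes()

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])
generator = ImageGenerator()

@app.post("/generate-image", response_model=GenerateImageResponse)
async def generate_image(req: GenerateImageRequest):
    try:
        raw = await generator.generate_image(req.prompt, req.num_inference_steps)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return GenerateImageResponse(image=base64.b64encode(raw).decode())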
docker build -t image-generator .
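Then start a container (assuming the Dockerfile exposes the API on port 8000):

docker run -p 8000:8000 image-generator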
The FastAPI application should now be running inside the Docker container. You can access it by opening a web browser and navigating to http://localhost:8000. If you're running Docker on a remote machine or using Docker Toolbox on Windows, replace localhost with the IP address of the Docker host.
In Single Image Editing, the instance prompt should simply describe your input image, without the identifier.
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="dir-path-to-input-image"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
accelerate launch train_svdiff.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="photo of a pink chair with black legs" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-3 \
--learning_rate_1d=1e-6 \
--train_text_encoder \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500
import torch
from PIL import Image
from diffusers import DDIMScheduler
from svdiff_pytorch import load_unet_for_svdiff, load_text_encoder_for_svdiff, StableDiffusionPipelineWithDDIMInversion
pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
spectral_shifts_ckpt_dir = "ckpt-dir-path"
image = "path-to-image"
source_prompt = "prompt-for-image"
target_prompt = "prompt-you-want-to-generate"
unet = load_unet_for_svdiff(pretrained_model_name_or_path, spectral_shifts_ckpt=spectral_shifts_ckpt_dir, subfolder="unet")
text_encoder = load_text_encoder_for_svdiff(pretrained_model_name_or_path, spectral_shifts_ckpt=spectral_shifts_ckpt_dir, subfolder="text_encoder")
# load pipe
pipe = StableDiffusionPipelineWithDDIMInversion.from_pretrained(
pretrained_model_name_or_path,
unet=unet,
text_encoder=text_encoder,
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
# (optional) DDIM inversion
# if you skip it, set inv_latents = None
image = Image.open(image).convert("RGB").resize((512, 512))
# in SVDiff, guidance_scale=1 is used for DDIM inversion
# the target_prompt is used in DDIM inversion for better results; see below for a comparison between source_prompt and target_prompt
inv_latents = pipe.invert(target_prompt, image=image, guidance_scale=1.0).latents
# a small guidance (cfg) scale is used for Single Image Editing
image = pipe(target_prompt, latents=inv_latents, guidance_scale=3, eta=0.5).images[0]
DDIM inversion with the target prompt (left) vs. the source prompt (right):
"photo of a grey ~~Beetle~~ Mustang car" (original image: https://unsplash.com/photos/YEPDV3T8Vi8)
To use slerp to add more stochasticity:
from svdiff_pytorch.utils import slerp_tensor
# prev steps omitted
inv_latents = pipe.invert(target_prompt, image=image, guidance_scale=1.0).latents
noise_latents = pipe.prepare_latents(inv_latents.shape[0], inv_latents.shape[1], 512, 512, dtype=inv_latents.dtype, device=pipe.device, generator=torch.Generator("cuda").manual_seed(0))
inv_latents = slerp_tensor(0.5, inv_latents, noise_latents)
image = pipe(target_prompt, latents=inv_latents).images[0]
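For reference, spherical linear interpolation between two latent tensors can be sketched as follows (an assumption about slerp_tensor's behavior, not its exact implementation):

import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # interpolate along the great circle between a and b (per batch element)
    a_flat, b_flat = a.flatten(1), b.flatten(1)
    cos = (a_flat * b_flat).sum(dim=1) / (a_flat.norm(dim=1) * b_flat.norm(dim=1))
    theta = torch.acos(cos.clamp(-1.0, 1.0))
    sin_theta = torch.sin(theta)
    w_a = (torch.sin((1.0 - t) * theta) / sin_theta).view(-1, 1, 1, 1)
    w_b = (torch.sin(t * theta) / sin_theta).view(-1, 1, 1, 1)
    # note: no special-casing of (near-)parallel inputs, where sin_theta -> 0
    return w_a * a + w_b * b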
You can also try SVDiff-pytorch in a UI with gradio. This demo supports both training and inference!
If you want to run it locally, run the following commands step by step.
$ git clone --recursive https://github.com/mkshing/svdiff-pytorch.git
$ cd svdiff-pytorch/scripts/gradio
$ pip install -r requirements.txt
$ export HF_TOKEN="YOUR_HF_TOKEN_HERE"
$ python app.py
You can adjust the strength of the weights with the --spectral_shifts_scale flag.
Here's a result for 0.8, 1.0, 1.2 (1.0 is the default).
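For example, with the inference CLI shown earlier (the 0.8 value is illustrative):

python inference.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --spectral_shifts_ckpt="ckpt-dir-path" \
  --prompt="A picture of a sks dog in a bucket" \
  --spectral_shifts_scale=0.8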
By using ToMe for SD, prior (class) image generation can be made faster!
$ pip install tomesd
Then add --enable_tome_merging to your training arguments!
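For example (keeping all other arguments from the full training command above):

accelerate launch train_svdiff.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --enable_tome_merging \
  --with_prior_preservation --prior_loss_weight=1.0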
@misc{han2023svdiff,
title = {SVDiff: Compact Parameter Space for Diffusion Fine-Tuning},
author = {Ligong Han and Yinxiao Li and Han Zhang and Peyman Milanfar and Dimitris Metaxas and Feng Yang},
year = {2023},
eprint = {2303.11305},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2303.11305}
}
@misc{hu2021lora,
title = {LoRA: Low-Rank Adaptation of Large Language Models},
author = {Hu, Edward and Shen, Yelong and Wallis, Phil and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Lu and Chen, Weizhu},
year = {2021},
eprint = {2106.09685},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
@article{bolya2023tomesd,
title = {Token Merging for Fast Stable Diffusion},
author = {Bolya, Daniel and Hoffman, Judy},
journal = {arXiv},
url = {https://arxiv.org/abs/2303.17604},
year = {2023}
}
- Training
- Inference
- Scaling spectral shifts
- Support Single Image Editing
- Support multiple spectral shifts (Section 3.2)
- Cut-Mix-Unmix (Section 3.3)
- SVDiff + LoRA