FLUX (by Black Forest Labs) has taken the world of AI image generation by storm in the last few months. Not only has it beat Stable Diffusion (the prior open-source king) on many benchmarks, it has also surpassed proprietary models like Dall-E or Midjourney in some metrics.
But how would you go about using FLUX on one of your apps? One might think of using serverless hosts like Replicate and others, but these can get very expensive very quickly, and may not provide the flexibility you need. That's where creating your own custom FLUX server comes in handy.
In this article, we'll walk you through creating your own FLUX server using Python. This server will allow you to generate images based on text prompts via a simple API. Whether you're running this server for personal use or deploying it as part of a production application, this guide will help you get started.
Before diving into the code, let's ensure you have the necessary tools and libraries set up:
You can install all the libraries by running the following command: pip install torch diffusers transformers sentencepiece protobuf accelerate fastapi uvicorn.
If you're using a Mac with an M1 or M2 chip, you should set up PyTorch with Metal for optimal performance. Follow the official PyTorch with Metal guide before proceeding.
You'll also need to make sure you have at least 12 GB of VRAM if you're planning on running FLUX on a GPU device. Or at least 12 GB of RAM for running on CPU/MPS (which will be slower).
Let's start the script by picking the right device to run inference based on the hardware we're using.
device = 'cuda' # can also be 'cpu' or 'mps' import os # MPS support in PyTorch is not yet fully implemented if device == 'mps': os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" import torch if device == 'mps' and not torch.backends.mps.is_available(): raise Exception("Device set to MPS, but MPS is not available") elif device == 'cuda' and not torch.cuda.is_available(): raise Exception("Device set to CUDA, but CUDA is not available")
You can specify cpu, cuda (for NVIDIA GPUs), or mps (for Apple's Metal Performance Shaders). The script then checks if the selected device is available and raises an exception if it's not.
Next, we load the FLUX model. We'll load the model in fp16 precision which will save us some memory without much loss in quality.
At this point, you may be asked to authenticate with HuggingFace, as the FLUX model is gated. In order to authenticate successfully, you'll need to create a HuggingFace account, go to the model page, accept the terms, and then create a HuggingFace token from your account settings and add it on your machine as the HF_TOKEN environment variable.
device = 'cuda' # can also be 'cpu' or 'mps' import os # MPS support in PyTorch is not yet fully implemented if device == 'mps': os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" import torch if device == 'mps' and not torch.backends.mps.is_available(): raise Exception("Device set to MPS, but MPS is not available") elif device == 'cuda' and not torch.cuda.is_available(): raise Exception("Device set to CUDA, but CUDA is not available")
Here, we're loading the FLUX model using the diffusers library. The model we're using is black-forest-labs/FLUX.1-dev, loaded in fp16 precision.
There is also a timestep-distilled model named FLUX Schnell which has faster inference, but outputs less detailed images, as well as a FLUX Pro model which is closed-source.
We'll use the Euler scheduler here, but you may experiment with this. You can read more on schedulers here.
Since image generation can be resource-intensive, it's crucial to optimize memory usage, especially when running on a CPU or a device with limited memory.
from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline import psutil model_name = "black-forest-labs/FLUX.1-dev" print(f"Loading {model_name} on {device}") pipeline = FluxPipeline.from_pretrained( model_name, # Diffusion models are generally trained on fp32, but fp16 # gets us 99% there in terms of quality, with just half the (V)RAM torch_dtype=torch.float16, # Ensure we don't load any dangerous binary code use_safetensors=True # We are using Euler here, but you can also use other samplers scheduler=FlowMatchEulerDiscreteScheduler() ).to(device)
This code checks the total available memory and enables attention slicing if the system has less than 64 GB of RAM. Attention slicing reduces memory usage during image generation, which is essential for devices with limited resources.
Next, we'll set up the FastAPI server, which will provide an API to generate images.
# Recommended if running on MPS or CPU with < 64 GB of RAM total_memory = psutil.virtual_memory().total total_memory_gb = total_memory / (1024 ** 3) if (device == 'cpu' or device == 'mps') and total_memory_gb < 64: print("Enabling attention slicing") pipeline.enable_attention_slicing()
FastAPI is a popular framework for building web APIs with Python. In this case, we're using it to create a server that can accept requests for image generation. We're also using GZip middleware to compress the response, which is particularly useful when sending images back in base64 format.
In a production environment, you might want to store the generated images in an S3 bucket or other cloud storage and return the URLs instead of the base64-encoded strings, to take advantage of a CDN and other optimizations.
We now need to define a model for the requests that our API will accept.
from fastapi import FastAPI, HTTPException from pydantic import BaseModel, Field, conint, confloat from fastapi.middleware.gzip import GZipMiddleware from io import BytesIO import base64 app = FastAPI() # We will be returning the image as a base64 encoded string # which we will want compressed app.add_middleware(GZipMiddleware, minimum_size=1000, compresslevel=7)
This GenerateRequest model defines the parameters required to generate an image. The prompt field is the text description of the image you want to create. Other fields include the image dimensions, the number of inference steps, and the batch size.
Now, let's create the endpoint that will handle image generation requests.
device = 'cuda' # can also be 'cpu' or 'mps' import os # MPS support in PyTorch is not yet fully implemented if device == 'mps': os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" import torch if device == 'mps' and not torch.backends.mps.is_available(): raise Exception("Device set to MPS, but MPS is not available") elif device == 'cuda' and not torch.cuda.is_available(): raise Exception("Device set to CUDA, but CUDA is not available")
This endpoint handles the image generation process. It first validates that the height and width are multiples of 8, as required by FLUX. It then generates images based on the provided prompt and returns them as base64-encoded strings.
Finally, let's add some code to start the server when the script is run.
from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline import psutil model_name = "black-forest-labs/FLUX.1-dev" print(f"Loading {model_name} on {device}") pipeline = FluxPipeline.from_pretrained( model_name, # Diffusion models are generally trained on fp32, but fp16 # gets us 99% there in terms of quality, with just half the (V)RAM torch_dtype=torch.float16, # Ensure we don't load any dangerous binary code use_safetensors=True # We are using Euler here, but you can also use other samplers scheduler=FlowMatchEulerDiscreteScheduler() ).to(device)
This code starts the FastAPI server on port 8000, making it accessible not only from http://localhost:8000 but also from other devices on the same network using the host machine’s IP address, thanks to the 0.0.0.0 binding.
Now that your FLUX server is up and running, it's time to test it. You can use curl, a command-line tool for making HTTP requests, to interact with your server:
# Recommended if running on MPS or CPU with < 64 GB of RAM total_memory = psutil.virtual_memory().total total_memory_gb = total_memory / (1024 ** 3) if (device == 'cpu' or device == 'mps') and total_memory_gb < 64: print("Enabling attention slicing") pipeline.enable_attention_slicing()
This command will only work on UNIX-based systems with the curl, jq and base64 utilities installed. It may also take up to a few minutes to complete depending on the hardware hosting the FLUX server.
Congratulations! You've successfully created your own FLUX server using Python. This setup allows you to generate images based on text prompts via a simple API. If you're not satisfied with the results of the base FLUX model, you might consider fine-tuning the model for even better performance on specific use cases.
You may find the full code used in this guide below:
from fastapi import FastAPI, HTTPException from pydantic import BaseModel, Field, conint, confloat from fastapi.middleware.gzip import GZipMiddleware from io import BytesIO import base64 app = FastAPI() # We will be returning the image as a base64 encoded string # which we will want compressed app.add_middleware(GZipMiddleware, minimum_size=1000, compresslevel=7)
The above is the detailed content of Creating an AI-powered Image Generation API Service with FLUX, Python, and Diffusers. For more information, please follow other related articles on the PHP Chinese website!