Building a GPU SaaS Platform

Our Driving Example

Nowadays, GPUs are widely used in many fields, such as deep learning, drawing, and video processing. Especially when OpenAI released ChatGPT, our daily life has been influenced by AI technology more than ever. However, the cost of GPU is still high, and it is not easy to get a GPU server. Therefore, we want to build a GPU SaaS platform to provide GPU services.

What should our GPU SaaS platform do?

Now, we have no ambition to build a platform like AWS or GCP. In this series of articles, we will focus on building a [Stable Diffusion WebUI] (https://github.com/AUTOMATIC1111/stable-diffusion-webui) Saas for designers. Of course, the platform we build can be used for other purposes.

So, what should our GPU SaaS platform do? In my opinion, it should have the following features for users:

Easy to use: Users can use the platform without any knowledge of Kubernetes, Docker, or other technologies.
Easy to manage: Users can manage their services easily.

For administrators, the platform should have the following features:

Easy to deploy: The platform should be easy to deploy and maintain.
Easy to scale: The platform should be easy to scale up or down.
Easy to monitor: The platform should be easy to monitor.
Easy to manage: The platform should be easy to manage.

We will discuss how to achieve these goals in the following articles. But before that, I would like to describe that how users use our platform.

Select a GPU specification. For example, 1 3080 GPU with 4 CPU cores and 16GB memory, or 1 4090 GPU with 8 CPU cores and 32GB memory.
Select the version of Stable Diffusion WebUI. For example, 1.6.0 or 1.7.0.
Start the service.
Save their data to some storage. For example, their local computer or Google Drive.
Stop the service.
Restart the service to continue their work.

For administrators, they should be able to:

Config the GPU specifications and stock. For example, adjust the number of GPUs, CPU cores, and memory.
Charge users for using the service. For example, set the price of each GPU specification per hour.
Monitor the status of the service. For example, observe the status of services.
Statistics the usage of the service. For example, how many users are using the service, how many GPUs are used, how many GPUs are idle, etc.

Coming Up Next

In the next series of articles, we will discuss the following topics:

write a requirements document and design document.
decide the architecture of the platform, and choose the technologies.
write the code of the platform, and test it.
observe the status of the platform, and statistics the usage of the platform.
improve the performance of the platform.