
Building a GPU SaaS Platform - The design spec



Depending on the application's requirements, container technology is a good fit for us: we could choose Docker or Kubernetes. For storage, we could use block, object, or file storage, such as OpenEBS, MinIO, or Ceph, or rely on a cloud provider such as AWS, Azure, or GCP. So, how do we choose the right technology for our GPU SaaS platform? Let's discuss it.

Technology Choices

Container Technology

Docker

  • Pros
    • Easy to use and easy to operate
    • Flexible; it can be adapted to almost any requirement
  • Cons
    • Not suitable for large-scale deployments on its own
    • We would need to implement our own orchestration: managing the container lifecycle, scheduling, and so on

Kubernetes

  • Pros
    • Scalable; it works for large-scale deployments
    • It has many built-in features, such as auto-scaling, self-healing, and so on
  • Cons
    • Complex; there is a steep learning curve

Given our workload and the engineering resources we have, I choose Kubernetes.

Storage Technology

|                       | MinIO    | OpenEBS (Jiva) | Rook (Ceph)         |
| --------------------- | -------- | -------------- | ------------------- |
| Community support     | No       | Yes            | Yes                 |
| Storage type          | Object   | Block          | Block, Object, File |
| Deployment            | Easy     | Easy           | Easy                |
| Operation             | Easy     | Complex        | Complex             |
| PVC auto-provisioning | DirectPV | Yes            | Yes                 |
| IOPS                  |          |                |                     |
| Big-data I/O          |          |                |                     |
| Scheduling cost       |          |                |                     |

For our scenario:

  • users need to store model data, which is large (around 10 GB)
  • there will be many small files, around 100,000 per user
  • each user will have multiple storage volumes

As you can see, some cells are empty. We need to run experiments to fill them in, and I will cover those experiments in another article. That said, I can already tell you that Ceph is the best choice for us.

Other Technology

There are many frameworks that fit our scenario, such as Kubeflow, JupyterHub, and so on. But in this article I want to focus on how to develop a GPU SaaS platform itself, so I will not discuss them.

Design Spec

Architecture

Functional Architecture

*(diagram: functional architecture)*

Technical Architecture

*(diagram: technical architecture)*

In this series, I will focus only on the runtime plane. I will not discuss control-plane concerns, such as user management.

Runtime Plane

I will use a Kubernetes Operator to manage the lifecycle of the GPU instances. For the MVP, I will only discuss a single cluster; later, when we need to scale out, I will cover how to manage multiple clusters. So, let's discuss the runtime plane.

*(diagram: runtime plane)*
  • Ingress: controls access to the GPU instances; each user gets a unique URL for each of their instances.
  • Reverse Proxy: manages access to the GPU instances, handling load balancing, authentication, and so on (a routing sketch follows this list).
  • GPU Controller: manages the lifecycle of the GPU instances: create, delete, and so on.
  • Storage Service: manages storage operations, such as upload, download, and so on.
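
To make the routing idea concrete, here is a minimal sketch of the reverse-proxy layer in Go. It assumes each instance is reachable under a per-user path like `/instances/{instanceID}/` and that a lookup table maps instance IDs to in-cluster addresses; the path scheme, the address format, and the omitted authentication are all assumptions, not the final design.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// backends maps instance IDs to in-cluster addresses. The entries here are
// hypothetical; in practice this table would be fed by the GPU Controller.
var backends = map[string]string{
	"inst-123": "http://gpu-inst-123.tenant-a.svc.cluster.local:8888",
}

func main() {
	http.HandleFunc("/instances/", func(w http.ResponseWriter, r *http.Request) {
		// Expected path shape: /instances/{instanceID}/rest-of-path
		parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/instances/"), "/", 2)
		target, ok := backends[parts[0]]
		if !ok {
			http.NotFound(w, r)
			return
		}
		u, err := url.Parse(target)
		if err != nil {
			http.Error(w, "bad backend", http.StatusBadGateway)
			return
		}
		// Strip the routing prefix so the instance sees a clean path.
		r.URL.Path = "/"
		if len(parts) == 2 {
			r.URL.Path += parts[1]
		}
		// Building a proxy per request keeps the sketch short; a real
		// implementation would cache proxies and add auth checks here.
		httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```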

GPU instance state

*(diagram: GPU instance state machine)*
  • init: the stock state; the instance reserves (locks) resources but is not yet assigned to a user
  • ready: the user has started the instance, but the readiness probe has not passed yet
  • running: the readiness probe passes and the user can access the instance (see the reconciler sketch below)
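
As a sketch of how the operator could derive this state, here is a minimal controller-runtime reconciler. It watches the Pods backing GPU instances and maps the pod phase and readiness condition onto init/ready/running. The annotation key, the one-pod-per-instance mapping, and the phase-to-state rules are illustrative assumptions, not the real operator.

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// stateAnnotation is an assumed annotation key for recording instance state.
const stateAnnotation = "gpu-saas/state"

type reconciler struct {
	client.Client
}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Map pod status onto the state machine described above:
	// init (stocked, not started) -> ready (started, probe pending) -> running.
	state := "init"
	if pod.Status.Phase == corev1.PodRunning {
		state = "ready"
		for _, c := range pod.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				state = "running"
			}
		}
	}

	// Record the derived state so other components (e.g. the reverse proxy)
	// can decide whether the instance is reachable.
	if pod.Annotations[stateAnnotation] != state {
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[stateAnnotation] = state
		if err := r.Update(ctx, &pod); err != nil {
			return ctrl.Result{}, err
		}
	}
	log.FromContext(ctx).Info("reconciled", "pod", req.NamespacedName, "state", state)
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(&reconciler{mgr.GetClient()}); err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The real operator would reconcile a GPUInstance custom resource instead of raw Pods, but the state derivation would look much the same.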

Create Stocks of GPU instances

*(sequence diagram: creating stocks of GPU instances)*

User starts the GPU instance with the storage

*(sequence diagram: starting a GPU instance with storage)*

User accesses the GPU instance

*(sequence diagram: accessing a GPU instance)*

Operator API

  • Create Stocks
POST /stocks

{
  "number":      int,
  "operationID": string,
  "specName":    string,
  "cpu":         resource.Quantity,
  "memory":      resource.Quantity,
  "gpuType":     string,
  "gpuNum":      int,
}
  • Delete Stocks
DELETE /stocks

{
  "number":       int,
  "operationID":  string,
  "specName":	    string,
}
  • Start GPU instance
POST /gpu-instances

{
  "instanceID":   string,
  "tenantID":     string,
  "tenantName":   string,
  "image":        string,
  "storageID":    string,
  "specName":     string,
  "template": {
    "ports": []{
      "port": int,
      "protocol": corev1.Protocol,
      "isPublic": bool,
      "isProbe": bool,
      "baseUrl": string,
    },
    "envs": []{
      "name": string,
      "value": string,
    },
    "cmd": string,
    "volumes": []{
      "name": string,
      "mountPath": string,
      "readOnly": bool,
    },
  }
}
  • Stop GPU instance
DELETE /gpu-instances/{instanceID}
  • GPU instance state
GET /gpu-instances/{instanceID}

response:
{
  "state": string,
}
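
For reference, the request bodies above translate directly into Go structs using the Kubernetes types the spec already references (`resource.Quantity`, `corev1.Protocol`). This is just a transcription of the listings; the struct and package names are mine.

```go
// Package api transcribes the Operator API request bodies into Go types.
package api

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// CreateStocksRequest is the body of POST /stocks.
type CreateStocksRequest struct {
	Number      int               `json:"number"`
	OperationID string            `json:"operationID"`
	SpecName    string            `json:"specName"`
	CPU         resource.Quantity `json:"cpu"`
	Memory      resource.Quantity `json:"memory"`
	GPUType     string            `json:"gpuType"`
	GPUNum      int               `json:"gpuNum"`
}

// DeleteStocksRequest is the body of DELETE /stocks.
type DeleteStocksRequest struct {
	Number      int    `json:"number"`
	OperationID string `json:"operationID"`
	SpecName    string `json:"specName"`
}

// StartInstanceRequest is the body of POST /gpu-instances.
type StartInstanceRequest struct {
	InstanceID string           `json:"instanceID"`
	TenantID   string           `json:"tenantID"`
	TenantName string           `json:"tenantName"`
	Image      string           `json:"image"`
	StorageID  string           `json:"storageID"`
	SpecName   string           `json:"specName"`
	Template   InstanceTemplate `json:"template"`
}

// InstanceTemplate describes how the instance container is run.
type InstanceTemplate struct {
	Ports   []Port   `json:"ports"`
	Envs    []EnvVar `json:"envs"`
	Cmd     string   `json:"cmd"`
	Volumes []Volume `json:"volumes"`
}

type Port struct {
	Port     int             `json:"port"`
	Protocol corev1.Protocol `json:"protocol"`
	IsPublic bool            `json:"isPublic"`
	IsProbe  bool            `json:"isProbe"`
	BaseURL  string          `json:"baseUrl"`
}

type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type Volume struct {
	Name      string `json:"name"`
	MountPath string `json:"mountPath"`
	ReadOnly  bool   `json:"readOnly"`
}
```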

Storage Service API

I will not discuss the design of the storage service in this article, but I will list its API below (with a minimal interface sketch after the list).

  • Create Storage
  • Delete Storage
  • Update Storage
  • Create Storage Accessor
  • Delete Storage Accessor
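
As a sketch only, the list above could map onto a Go interface like the following. The method signatures, parameters, and the `accessorID` handle are assumptions, since the storage service design is out of scope here.

```go
// Package storage sketches the storage service API surface.
package storage

import "context"

// Service mirrors the operation list above. An "accessor" grants a GPU
// instance access to a storage volume (hypothetical naming).
type Service interface {
	CreateStorage(ctx context.Context, tenantID string, sizeGiB int) (storageID string, err error)
	DeleteStorage(ctx context.Context, storageID string) error
	UpdateStorage(ctx context.Context, storageID string, sizeGiB int) error
	CreateStorageAccessor(ctx context.Context, storageID, instanceID string) (accessorID string, err error)
	DeleteStorageAccessor(ctx context.Context, accessorID string) error
}
```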