
Building a GPU SaaS Platform - The design spec



Depending on the application's requirements, container technology is a good fit for us: we could choose Docker or Kubernetes. For storage, we could use block, object, or file storage, such as OpenEBS, MinIO, or Ceph, or rely on a cloud provider such as AWS, Azure, or GCP. So, how do we choose the right technology for our GPU SaaS platform? Let's discuss it.

Technology Choices

Container Technology

Docker

  • Pros
    • Easy to use and easy to operate
    • Flexible; it can be adapted to almost any requirement
  • Cons
    • Not suitable for large-scale deployments on its own
    • We would need to implement our own orchestration: managing the container lifecycle, scheduling, and so on

Kubernetes

  • Pros
    • Scalable; it works for large-scale deployments
    • It has many built-in features, such as auto-scaling, self-healing, and so on
  • Cons
    • Complex; there is a steep learning curve

Given our workload and the engineering resources we have, I choose Kubernetes.

Storage Technology

|                       | MinIO    | OpenEBS (Jiva) | Rook (Ceph)         |
| --------------------- | -------- | -------------- | ------------------- |
| Community support     | No       | Yes            | Yes                 |
| Storage type          | Object   | Block          | Block, Object, File |
| Deployment            | Easy     | Easy           | Easy                |
| Operation             | Easy     | Complex        | Complex             |
| PVC auto-provisioning | DirectPV | Yes            | Yes                 |
| IOPS                  |          |                |                     |
| Big-data I/O          |          |                |                     |
| Scheduling cost       |          |                |                     |

For our scenario:

  • users need to store model data, which is large (around 10 GB)
  • there will be many small files, around 100,000 per user
  • each user will have multiple storage volumes

As you can see, some cells are empty. We need to run experiments to fill them in, and I will cover those experiments in another article. That said, I can already tell you that Ceph is the best choice for us.

Other Technology

There are many frameworks that fit our scenario, such as Kubeflow, JupyterHub, and so on. But in this article I want to focus on how to develop a GPU SaaS platform itself, so I will not discuss them.

Design Spec

Architecture

Functional Architecture

*(diagram: functional architecture)*

Technical Architecture

*(diagram: technical architecture)*

In this series, I will focus only on the runtime plane. I will not discuss control-plane concerns, such as user management.

Runtime Plane

I will use a Kubernetes Operator to manage the lifecycle of the GPU instances. For the MVP, I will only discuss a single cluster; later, when we need to scale out, I will cover how to manage multiple clusters. So, let's discuss the runtime plane.

*(diagram: runtime plane)*
  • Ingress: controls access to the GPU instances; each user gets a unique URL for each of their instances.
  • Reverse Proxy: manages access to the GPU instances, handling load balancing, authentication, and so on (a routing sketch follows this list).
  • GPU Controller: manages the lifecycle of the GPU instances: create, delete, and so on.
  • Storage Service: manages storage operations, such as upload, download, and so on.
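
To make the routing idea concrete, here is a minimal sketch of the reverse-proxy layer in Go. It assumes each instance is reachable under a per-user path like `/instances/{instanceID}/` and that a lookup table maps instance IDs to in-cluster addresses; the path scheme, the address format, and the omitted authentication are all assumptions, not the final design.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// backends maps instance IDs to in-cluster addresses. The entries here are
// hypothetical; in practice this table would be fed by the GPU Controller.
var backends = map[string]string{
	"inst-123": "http://gpu-inst-123.tenant-a.svc.cluster.local:8888",
}

func main() {
	http.HandleFunc("/instances/", func(w http.ResponseWriter, r *http.Request) {
		// Expected path shape: /instances/{instanceID}/rest-of-path
		parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/instances/"), "/", 2)
		target, ok := backends[parts[0]]
		if !ok {
			http.NotFound(w, r)
			return
		}
		u, err := url.Parse(target)
		if err != nil {
			http.Error(w, "bad backend", http.StatusBadGateway)
			return
		}
		// Strip the routing prefix so the instance sees a clean path.
		r.URL.Path = "/"
		if len(parts) == 2 {
			r.URL.Path += parts[1]
		}
		// Building a proxy per request keeps the sketch short; a real
		// implementation would cache proxies and add auth checks here.
		httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```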

GPU instance state

*(diagram: GPU instance state machine)*
  • init: the stock state; the instance reserves (locks) resources but is not yet assigned to a user
  • ready: the user has started the instance, but the readiness probe has not passed yet
  • running: the readiness probe passes and the user can access the instance (see the reconciler sketch below)
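
As a sketch of how the operator could derive this state, here is a minimal controller-runtime reconciler. It watches the Pods backing GPU instances and maps the pod phase and readiness condition onto init/ready/running. The annotation key, the one-pod-per-instance mapping, and the phase-to-state rules are illustrative assumptions, not the real operator.

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// stateAnnotation is an assumed annotation key for recording instance state.
const stateAnnotation = "gpu-saas/state"

type reconciler struct {
	client.Client
}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Map pod status onto the state machine described above:
	// init (stocked, not started) -> ready (started, probe pending) -> running.
	state := "init"
	if pod.Status.Phase == corev1.PodRunning {
		state = "ready"
		for _, c := range pod.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				state = "running"
			}
		}
	}

	// Record the derived state so other components (e.g. the reverse proxy)
	// can decide whether the instance is reachable.
	if pod.Annotations[stateAnnotation] != state {
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[stateAnnotation] = state
		if err := r.Update(ctx, &pod); err != nil {
			return ctrl.Result{}, err
		}
	}
	log.FromContext(ctx).Info("reconciled", "pod", req.NamespacedName, "state", state)
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(&reconciler{mgr.GetClient()}); err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The real operator would reconcile a GPUInstance custom resource instead of raw Pods, but the state derivation would look much the same.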

Create Stocks of GPU instances

*(sequence diagram: creating stocks of GPU instances)*

User starts the GPU instance with the storage

*(sequence diagram: starting a GPU instance with storage)*

User accesses the GPU instance

*(sequence diagram: accessing a GPU instance)*

Operator API

  • Create Stocks
POST /stocks

{
  "number":      int,
  "operationID": string,
  "specName":    string,
  "cpu":         resource.Quantity,
  "memory":      resource.Quantity,
  "gpuType":     string,
  "gpuNum":      int,
}
  • Delete Stocks
DELETE /stocks

{
  "number":       int,
  "operationID":  string,
  "specName":	    string,
}
  • Start GPU instance
POST /gpu-instances

{
  "instanceID":   string,
  "tenantID":     string,
  "tenantName":   string,
  "image":        string,
  "storageID":    string,
  "specName":     string,
  "template": {
    "ports": []{
      "port": int,
      "protocol": corev1.Protocol,
      "isPublic": bool,
      "isProbe": bool,
      "baseUrl": string,
    },
    "envs": []{
      "name": string,
      "value": string,
    },
    "cmd": string,
    "volumes": []{
      "name": string,
      "mountPath": string,
      "readOnly": bool,
    },
  }
}
  • Stop GPU instance
DELETE /gpu-instances/{instanceID}
  • GPU instance state
GET /gpu-instances/{instanceID}

response:
{
  "state": string,
}
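
For reference, the request bodies above translate directly into Go structs using the Kubernetes types the spec already references (`resource.Quantity`, `corev1.Protocol`). This is just a transcription of the listings; the struct and package names are mine.

```go
// Package api transcribes the Operator API request bodies into Go types.
package api

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// CreateStocksRequest is the body of POST /stocks.
type CreateStocksRequest struct {
	Number      int               `json:"number"`
	OperationID string            `json:"operationID"`
	SpecName    string            `json:"specName"`
	CPU         resource.Quantity `json:"cpu"`
	Memory      resource.Quantity `json:"memory"`
	GPUType     string            `json:"gpuType"`
	GPUNum      int               `json:"gpuNum"`
}

// DeleteStocksRequest is the body of DELETE /stocks.
type DeleteStocksRequest struct {
	Number      int    `json:"number"`
	OperationID string `json:"operationID"`
	SpecName    string `json:"specName"`
}

// StartInstanceRequest is the body of POST /gpu-instances.
type StartInstanceRequest struct {
	InstanceID string           `json:"instanceID"`
	TenantID   string           `json:"tenantID"`
	TenantName string           `json:"tenantName"`
	Image      string           `json:"image"`
	StorageID  string           `json:"storageID"`
	SpecName   string           `json:"specName"`
	Template   InstanceTemplate `json:"template"`
}

// InstanceTemplate describes how the instance container is run.
type InstanceTemplate struct {
	Ports   []Port   `json:"ports"`
	Envs    []EnvVar `json:"envs"`
	Cmd     string   `json:"cmd"`
	Volumes []Volume `json:"volumes"`
}

type Port struct {
	Port     int             `json:"port"`
	Protocol corev1.Protocol `json:"protocol"`
	IsPublic bool            `json:"isPublic"`
	IsProbe  bool            `json:"isProbe"`
	BaseURL  string          `json:"baseUrl"`
}

type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type Volume struct {
	Name      string `json:"name"`
	MountPath string `json:"mountPath"`
	ReadOnly  bool   `json:"readOnly"`
}
```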

Storage Service API

I will not discuss the design of the storage service in this article, but I will list its API below (with a minimal interface sketch after the list).

  • Create Storage
  • Delete Storage
  • Update Storage
  • Create Storage Accessor
  • Delete Storage Accessor
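
As a sketch only, the list above could map onto a Go interface like the following. The method signatures, parameters, and the `accessorID` handle are assumptions, since the storage service design is out of scope here.

```go
// Package storage sketches the storage service API surface.
package storage

import "context"

// Service mirrors the operation list above. An "accessor" grants a GPU
// instance access to a storage volume (hypothetical naming).
type Service interface {
	CreateStorage(ctx context.Context, tenantID string, sizeGiB int) (storageID string, err error)
	DeleteStorage(ctx context.Context, storageID string) error
	UpdateStorage(ctx context.Context, storageID string, sizeGiB int) error
	CreateStorageAccessor(ctx context.Context, storageID, instanceID string) (accessorID string, err error)
	DeleteStorageAccessor(ctx context.Context, accessorID string) error
}
```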