An operator for managing ephemeral clusters in GKE

Test Cluster Operator for GKE

This operator provides API-driven cluster provisioning for integration and performance testing of software that integrates deeply with Kubernetes APIs, cloud provider APIs and other elements of Kubernetes cluster infrastructure.

This project was developed at Isovalent for very specific needs and later open-sourced as-is. At present, a few code changes will be needed to remove assumptions about how it's deployed, and deployment docs are also needed. If you wish to use this operator and contribute to the project, please join the #testing channel on Cilium & eBPF Slack.

Motivation

NB: The current implementation focuses on GKE (and GCP); most of the ideas described below would apply to any managed Kubernetes provider, and some are even more general.

It is relatively easy to test a workload on Kubernetes, whether it's just an application comprised of multiple components, or a basic operator that glues together a handful of Kubernetes APIs.

Cluster-sharing is one option: the application under test is deployed into one or more namespaces in a large shared cluster. Another option is to set up small test clusters using something like kind or k3s/k3d within a CI environment.

If the application under test depends on non-namespaced resources, cluster-sharing is still possible with VirtualCluster. That way instances of the application under test can be isolated from one another, but only if Kubernetes API boundaries are fully respected. It implies that a large underlying cluster will still be used, but virtually divided into small "pretend-clusters". However, that will only work if the application doesn't make assumptions about cloud provider APIs and doesn't attempt non-trivial modes of access to the underlying host OS or the network infrastructure.

When the application under test interacts with multiple Kubernetes APIs, presumes cluster-wide access, or even attempts to interact with the underlying host OS or the network infrastructure, any kind of cluster-sharing setup may prove challenging to use. It may also be deemed unrepresentative of the clusters that end-users run. Additionally, testing integrations with cloud provider APIs may have other implications. Applications that enable core functionality of a Kubernetes cluster often fall into this category, e.g. CNI implementations, storage controllers, service meshes, ingress controllers, etc. Cluster-sharing is simply not viable for some of these use-cases, and something like kind or k3d is of very limited use.

All of the above applies to testing of applications by developers who are directly responsible for the application itself. End-users may also need to test the off-the-shelf applications they deploy. Quite commonly in a large organisation, an operations team will assemble a bundle of Kubernetes addons that defines a platform their organisation relies on. The operations team may not be able to make direct changes to the source code of some of the components in order to improve testability for cluster-sharing, or they simply won't have the confidence to test those components in a shared cluster. Even if one version was easily testable in a shared cluster, that may change in the future. While testing on kind or k3s remains an option, it may be undesirable because cloud provider integration needs to be tested as well, and it could simply be unrepresentative of the deployment target. Therefore, the operations team may have a strong preference to test in a cluster that is provisioned in exactly the same way as the deployment target and has mostly identical or comparable properties.

These are just some of the use-cases that illustrate the need for a dedicated cluster for running integration or performance tests, one that matches the deployment target as closely as possible.

What does it take to obtain a cluster in GKE? Technically, it's possible to simply write a script that calls gcloud commands, relies on something like Terraform, or uses an API client to provision a cluster. This approach inevitably adds a lot of complexity to the CI job: it inherits all the different failure modes of the provisioning and destruction processes, it needs to carry any additional infrastructure configuration (e.g. metric & log gathering), it widens access scopes, etc. All of these steps take time and are hard to optimise; a pool of pre-built clusters could help, but it would make the script even more complex. Complex scripts of this kind are hard to maintain long-term, as by nature scripts don't offer a clear contract (especially shell scripts). The lack of contract makes it too easy for anyone to tweak a shell script for an ad-hoc use-case without adding any tests. Over time, script evolution is hard to unwind, especially in a context where many developers contribute to the project. In contrast, an API offers many advantages: it's a contract, and the implementation can be optimised more easily.

Architectural goals of this project

  • Test Cluster API
    • enables developer and CI jobs to request clusters for running tests in a consistent and well-defined manner
    • provider abstraction that will enable future optimisations, e.g. pooling of pre-built clusters
  • Asynchronous workflow
    • avoid heavy-lifting logic in CI jobs that doesn't directly relate to building binaries or executing tests
    • avoid polling for status
      • once a cluster is ready, launch a test runner job inside the management cluster and report the results back to GitHub
  • Enable support for multiple test cluster templates
    • do not assume there is only one type of test cluster configuration that's going to be used for all purposes
    • allow for pooling pre-built clusters based on commonly used templates
  • Include a common set of components in each test cluster
    • Prometheus
    • Log exporter for CI

You may ask...

How is this different from something like Cluster API?

The Test Cluster API aims to be much more high-level and shouldn't need to expose as many parameters as Cluster API does; in fact, it could be implemented on top of Cluster API. The initial implementation targets GKE and relies on Config Connector, which is similar to Cluster API in spirit.

What about other providers?

This is something the authors of this project are planning to explore, although it may not be done as part of the same project to begin with. One of the ideas is to create a generic provider based on either Terraform or Cluster API, possibly both.

How it works

There is a management cluster that runs on GKE; it has Config Connector, Cert Manager and Contour installed, along with the GKE Test Cluster Operator ("the operator" from here onwards).

The user creates a CR similar to this:

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: testme-1
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      image: cilium/cilium-test:8cfdbfe
      command:
      - /usr/local/bin/test-gke.sh
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  location: europe-west2-b
  region: europe-west2

The operator renders various objects for Config Connector and other APIs as defined in the basic template, substituting the given parameters (machineType, nodes, etc.), and then creates all of these objects and monitors the cluster until it's ready.
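
For illustration, the objects rendered from the basic template for the CR above might look roughly like the following pair of Config Connector resources. The ContainerCluster and ContainerNodePool kinds are the ones that show up in the status examples below; the apiVersion and field layout follow Config Connector's public API, while the generated name and the exact shape of the rendered objects are assumptions rather than the template's actual output:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: testme-1-abc12   # generated name (hypothetical)
  namespace: test-clusters
spec:
  location: europe-west2-b
  initialNodeCount: 1    # illustrative; the real template may manage node pools differently
  releaseChannel:
    channel: REGULAR
---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: testme-1-abc12   # generated name (hypothetical)
  namespace: test-clusters
spec:
  location: europe-west2-b
  clusterRef:
    name: testme-1-abc12
  nodeCount: 2
  nodeConfig:
    machineType: n1-standard-4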

Once the test cluster is ready, the operator deploys a job using the given image and command, and ensures the job is authenticated to run against the test cluster. The job runs inside the management cluster. The test cluster is deleted upon job completion.
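
Conceptually, the runner is an ordinary Kubernetes Job in the management cluster built from spec.jobSpec.runner. A minimal sketch of such a Job is shown below; the object name, container names and the assumption that the init container (initImage) prepares credentials for the test cluster are illustrative, not the operator's exact output:

apiVersion: batch/v1
kind: Job
metadata:
  name: testme-1-runner   # hypothetical generated name
  namespace: test-clusters
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: init
        # assumed to prepare kubeconfig/credentials for the test cluster
        image: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
      containers:
      - name: runner
        image: cilium/cilium-test:8cfdbfe
        command:
        - /usr/local/bin/test-gke.sh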

The template is defined using CUE and can define any Kubernetes objects, such as Config Connector objects that define additional GCP resources, or other objects in the management cluster to support test execution. That being said, the implementation currently expects to find exactly one ContainerCluster as part of the template, and it's not fully generalised.

As part of test cluster provisioning, Prometheus is deployed in the test cluster and its metrics are federated to the Prometheus server in the management cluster, so metrics from all test runs can be accessed centrally. In the future, other components can be added as needed.
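
Prometheus federation is typically set up by having the upstream server scrape the /federate endpoint of the downstream one. A minimal sketch of the kind of scrape job the management cluster's Prometheus could use is shown below; the job name, match expression and target address are illustrative and not taken from the operator's actual configuration:

scrape_configs:
- job_name: test-cluster-federation   # illustrative name
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job!=""}'
  static_configs:
  - targets:
    - prom.test-cluster.example.internal:9090   # hypothetical address of the test cluster's Prometheus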

Example 2

Here is what a TestClusterGKE object may look like with additional fields and status.

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: test-c6v87
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      command:
      - /usr/local/bin/run_in_test_cluster.sh
      - --prom-name=prom
      - --prom-ns=prom
      - --duration=30m
      configMap: test-c6v87-user
      image: cilium/hubble-perf-test:8cfdbfe
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
  location: europe-west2-b
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  region: europe-west2

status:
  clusterName: test-c6v87-fn86p
  conditions:
  - lastTransitionTime: "2020-11-17T09:29:33Z"
    message: All 2 dependencies are ready
    reason: AllDependenciesReady
    status: "True"
    type: Ready
  dependencyConditions:
    ContainerCluster:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:22Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready
    ContainerNodePool:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:33Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready

Using Test Cluster Requester

There is a simple Go program that serves as a client to the GKE Test Cluster Operator.

It can be used by CI jobs as well as by developers.

Developer Usage

To run this program outside CI, you must ensure that Google Cloud SDK application default credentials are set up correctly; to do so, run:

gcloud auth application-default login

Run:

go run ./requester --namespace=test-clusters-dev --description=""

CI Usage

This program supports the traditional GOOGLE_APPLICATION_CREDENTIALS environment variable, but for convenience it also supports GCP_SERVICE_ACCOUNT_KEY, which is expected to contain a base64-encoded JSON service account key (i.e. there is no need to write the data to a file).

For GitHub Actions, it's recommended to use the official image:

      - name: Request GKE test cluster
        uses: docker://quay.io/isovalent/gke-test-cluster-requester:ad06d7c2151d012901fc2ddc92406044f2ffba2d
        env:
          GCP_SERVICE_ACCOUNT_KEY: ${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: --namespace=... --image=...
Comments
  • GKE API changes - automatic upgrades and repairs

    Looks like there were API changes in GKE, and automatic upgrades and repairs are now required in the REGULAR release channel...

    Here's an example:

    apiVersion: clusters.ci.cilium.io/v1alpha2
    kind: TestClusterGKE
    metadata:
      creationTimestamp: "2021-02-03T02:17:46Z"
      generation: 1
      name: test-66gx7
      namespace: test-clusters
      resourceVersion: "155013016"
      selfLink: /apis/clusters.ci.cilium.io/v1alpha2/namespaces/test-clusters/testclustersgke/test-66gx7
      uid: acd5f665-2fd6-4bc2-b401-417227e5ce91
    spec:
      configTemplate: basic
      jobSpec:
        runner:
          command:
          - /usr/local/bin/cilium-test-gke.sh
          - quay.io/cilium/cilium:latest
          - quay.io/cilium/operator-generic:latest
          - quay.io/cilium/hubble-relay:latest
          - NightlyPolicyStress
          image: cilium/cilium-test-dev:7cdf8024e
          initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
      location: europe-west2-b
      machineType: n1-standard-4
      nodes: 2
      project: cilium-ci
      region: europe-west2
    status:
      clusterName: test-66gx7-c5lsf
      conditions:
      - lastTransitionTime: "2021-02-03T16:52:07Z"
        message: Some dependencies are not ready yet
        reason: DependenciesNotReady
        status: "False"
        type: Ready
      dependencyConditions:
        ContainerCluster:test-clusters/test-66gx7-c5lsf:
        - lastTransitionTime: "2021-02-03T16:52:07Z"
          message: The resource is up to date
          reason: UpToDate
          status: "True"
          type: Ready
        ContainerNodePool:test-clusters/test-66gx7-c5lsf:
        - lastTransitionTime: "2021-02-03T02:17:46Z"
          message: 'Update call failed: error applying desired state: summary: error creating
            NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false
            when release_channel REGULAR is set., badRequest, detail: '
          reason: UpdateFailed
          status: "False"
          type: Ready
    

    Originally this was intentional, as for testing purposes it's best to have these features disabled.

  • Use defaults for `autoRepair` and `autoUpgrade`

    This is to fix #11.

    These two features had been deliberately disabled because they were deemed to interfere with tests. The GKE API no longer allows disabling these features when the REGULAR release channel is used.

    It may be possible to disable them again once the cluster version can be set statically, but that's not supported by the operator yet.

    It's important to note that both of these features concern the node pool, not the control plane.

    With regards to auto-upgrades, it may be viable to define a maintenance window that falls outside of the expected test duration, but that's not a trivial solution to what is currently only a hypothetical problem.

    With regards to auto-repairs, there is also no known practical issue at present.
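
    For reference, in Config Connector these settings live on the node pool object; here is a minimal sketch of the relevant block (assuming the container.cnrm.cloud.google.com/v1beta1 ContainerNodePool API) that satisfies the REGULAR release channel requirement:

    apiVersion: container.cnrm.cloud.google.com/v1beta1
    kind: ContainerNodePool
    spec:
      management:
        autoRepair: true    # cannot be false when the REGULAR release channel is set
        autoUpgrade: true   # cannot be false when the REGULAR release channel is set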

  • Use artifacts instead of cache

    The setup can be simplified: the cost-saving workaround of using cache is no longer needed since the repo is public, and we can use artifacts as much as we like. Eventually the images should be pushed to ghcr.io, but for now this works.

  • retry creating cluster in different zone when one is out of resources

    This is related to #18, but is actually a separate issue.

    Sometimes a zone is short of resources, and GKE yields:

      Warning  UpdateFailed        12m (x4 over 23m)   containercluster-controller  Update call failed: error applying desired state: summary: Error waiting for creating GKE cluster: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: europe-west2-b., detail:
    

    One of the purposes of this operator was exactly to cater for this type of error and retry.

  • detect unhealthy objects over prolonged period of time

    There should be alerting in place when there are continuous CNRM errors over a relatively long period of time, e.g. when a cluster didn't get created after 20 minutes (see e.g. #11).

  • logview should handle error states better

    Right now an init container error (and probably other errors) results in "cannot get log stream"; it should probably display the log of e.g. the init container instead.

  • profile and reduce memory usage

    bd733f1bdc81cbaf0096a6386a231f7ff7375c65 increased memory requests and limits due to an outage. 800M is a lot of memory; there is likely a leak.
