An operator for managing ephemeral clusters in GKE

Test Cluster Operator for GKE

This operator provides API-driven cluster provisioning for integration and performance testing of software that integrates deeply with Kubernetes APIs, cloud provider APIs and other elements of Kubernetes cluster infrastructure.

This project was developed at Isovalent for very specific needs and later open-sourced as-is. At present, a few code changes will be needed to remove assumptions about how it's deployed, and deployment docs are also needed. If you wish to use this operator and contribute to the project, please join the #testing channel on Cilium & eBPF Slack.

Motivation

NB: The current implementation focuses on GKE (and GCP); most of the ideas described below would apply to any managed Kubernetes provider, and some are even more general.

It is relatively easy to test a workload on Kubernetes, whether it's just an application comprised of multiple components, or a basic operator that glues together a handful of Kubernetes APIs.

Cluster-sharing is one option: the application under test is deployed into one or more namespaces in a large shared cluster. Another option is to set up small test clusters using something like kind or k3s/k3d within a CI environment.

If the application under test depends on non-namespaced resources, cluster-sharing is still possible with VirtualCluster. That way instances of the application under test can be isolated from one another, but only if Kubernetes API boundaries are fully respected. It implies that a large underlying cluster will still be used, but virtually divided into small "pretend-clusters". However, that will only work if the application doesn't make assumptions about cloud provider APIs and doesn't attempt non-trivial modes of access to the underlying host OS or the network infrastructure.

When the application under test interacts with multiple Kubernetes APIs, presumes cluster-wide access, or even attempts to interact with the underlying host OS or the network infrastructure, any kind of cluster-sharing setup may prove challenging to use. It may also be deemed unrepresentative of the clusters that end-users run. Additionally, testing integrations with cloud provider APIs may have other implications. Applications that enable core functionality of a Kubernetes cluster often fall into this category, e.g. CNI implementations, storage controllers, service meshes, ingress controllers, etc. Cluster-sharing is simply not viable for some of these use-cases, and something like kind or k3d is of very limited use.

All of the above applies to testing of applications by developers who are directly responsible for the application itself. End-users may also need to test the off-the-shelf applications they deploy. Quite commonly in a large organisation, an operations team will assemble a bundle of Kubernetes addons that defines a platform their organisation relies on. The operations team may not be able to make direct changes to the source code of some of the components in order to improve testability for cluster-sharing, or they simply won't have the confidence to test those components in a shared cluster. Even if one version was easily testable in a shared cluster, that may change in the future. While testing on kind or k3s remains an option, it may be undesirable because cloud provider integration needs to be tested as well, and it could simply be unrepresentative of the deployment target. Therefore, the operations team may have a strong preference to test in a cluster that is provisioned in exactly the same way as the deployment target and has mostly identical or comparable properties.

These are just some of the use-cases that illustrate the need for a dedicated cluster for running integration or performance tests, one that matches the deployment target as closely as possible.

What does it take to obtain a cluster in GKE? Technically, it's possible to simply write a script that calls gcloud commands, relies on something like Terraform, or uses an API client to provision a cluster. This approach inevitably adds a lot of complexity to the CI job: it inherits all the different failure modes of the provisioning and destruction processes, it needs to carry any additional infrastructure configuration (e.g. metric & log gathering), it widens access scopes, etc. All of these steps take time and are hard to optimise; a pool of pre-built clusters could help, but it would make the script even more complex. Complex scripts of this kind are hard to maintain long-term, as by nature scripts don't offer a clear contract (especially shell scripts). The lack of contract makes it too easy for anyone to tweak a shell script for an ad-hoc use-case without adding any tests. Over time, script evolution is hard to unwind, especially in a context where many developers contribute to the project. In contrast, an API offers many advantages: it's a contract, and the implementation can be optimised more easily.

Architectural goals of this project

  • Test Cluster API
    • enables developer and CI jobs to request clusters for running tests in a consistent and well-defined manner
    • provider abstraction that will enable future optimisations, e.g. pooling of pre-built clusters
  • Asynchronous workflow
    • avoid heavy-lifting logic in CI jobs that doesn't directly relate to building binaries or executing tests
    • avoid polling for status
      • once a cluster is ready, launch a test runner job inside the management cluster and report the results back to GitHub
  • Enable support for multiple test cluster templates
    • do not assume there is only one type of test cluster configuration that's going to be used for all purposes
    • allow for pooling pre-built clusters based on commonly used templates
  • Include a common set of components in each test cluster
    • Prometheus
    • Log exporter for CI

You may ask...

How is this different from something like Cluster API?

The Test Cluster API aims to be much more high-level and shouldn't need to expose as many parameters as Cluster API does; in fact, it could be implemented on top of Cluster API. The initial implementation targets GKE and relies on Config Connector, which is similar to Cluster API in spirit.

What about other providers?

This is something the authors of this project are planning to explore, although it may not be done as part of the same project to begin with. One of the ideas is to create a generic provider based on either Terraform or Cluster API, possibly both.

How it works

There is a management cluster that runs on GKE; it has Config Connector, Cert Manager and Contour installed, along with the GKE Test Cluster Operator ("the operator" from here onwards).

The user creates a CR similar to this:

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: testme-1
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      image: cilium/cilium-test:8cfdbfe
      command:
      - /usr/local/bin/test-gke.sh
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  location: europe-west2-b
  region: europe-west2

The operator renders various objects for Config Connector and other APIs as defined in the basic template, substituting the given parameters (machineType, nodes, etc.), and then creates all of these objects and monitors the cluster until it's ready.
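
For illustration, the objects rendered from the basic template for the CR above might look roughly like the following pair of Config Connector resources. The ContainerCluster and ContainerNodePool kinds are the ones that show up in the status examples below; the apiVersion and field layout follow Config Connector's public API, while the generated name and the exact shape of the rendered objects are assumptions rather than the template's actual output:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: testme-1-abc12   # generated name (hypothetical)
  namespace: test-clusters
spec:
  location: europe-west2-b
  initialNodeCount: 1    # illustrative; the real template may manage node pools differently
  releaseChannel:
    channel: REGULAR
---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: testme-1-abc12   # generated name (hypothetical)
  namespace: test-clusters
spec:
  location: europe-west2-b
  clusterRef:
    name: testme-1-abc12
  nodeCount: 2
  nodeConfig:
    machineType: n1-standard-4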

Once the test cluster is ready, the operator deploys a job using the given image and command, and ensures the job is authenticated to run against the test cluster. The job runs inside the management cluster. The test cluster is deleted upon job completion.
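
Conceptually, the runner is an ordinary Kubernetes Job in the management cluster built from spec.jobSpec.runner. A minimal sketch of such a Job is shown below; the object name, container names and the assumption that the init container (initImage) prepares credentials for the test cluster are illustrative, not the operator's exact output:

apiVersion: batch/v1
kind: Job
metadata:
  name: testme-1-runner   # hypothetical generated name
  namespace: test-clusters
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: init
        # assumed to prepare kubeconfig/credentials for the test cluster
        image: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
      containers:
      - name: runner
        image: cilium/cilium-test:8cfdbfe
        command:
        - /usr/local/bin/test-gke.sh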

The template is defined using CUE and can define any Kubernetes objects, such as Config Connector objects that define additional GCP resources, or other objects in the management cluster to support test execution. That being said, the implementation currently expects to find exactly one ContainerCluster as part of the template, and it's not fully generalised.

As part of test cluster provisioning, Prometheus is deployed in the test cluster and its metrics are federated to the Prometheus server in the management cluster, so metrics from all test runs can be accessed centrally. In the future, other components can be added as needed.
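
Prometheus federation is typically set up by having the upstream server scrape the /federate endpoint of the downstream one. A minimal sketch of the kind of scrape job the management cluster's Prometheus could use is shown below; the job name, match expression and target address are illustrative and not taken from the operator's actual configuration:

scrape_configs:
- job_name: test-cluster-federation   # illustrative name
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job!=""}'
  static_configs:
  - targets:
    - prom.test-cluster.example.internal:9090   # hypothetical address of the test cluster's Prometheus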

Example 2

Here is what a TestClusterGKE object may look like with additional fields and status.

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: test-c6v87
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      command:
      - /usr/local/bin/run_in_test_cluster.sh
      - --prom-name=prom
      - --prom-ns=prom
      - --duration=30m
      configMap: test-c6v87-user
      image: cilium/hubble-perf-test:8cfdbfe
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
  location: europe-west2-b
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  region: europe-west2

status:
  clusterName: test-c6v87-fn86p
  conditions:
  - lastTransitionTime: "2020-11-17T09:29:33Z"
    message: All 2 dependencies are ready
    reason: AllDependenciesReady
    status: "True"
    type: Ready
  dependencyConditions:
    ContainerCluster:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:22Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready
    ContainerNodePool:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:33Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready

Using Test Cluster Requester

There is a simple Go program that serves as a client to the GKE Test Cluster Operator.

It can be used by CI jobs as well as by developers.

Developer Usage

To run this program outside CI, you must ensure that Google Cloud SDK application default credentials are set up correctly; to do so, run:

gcloud auth application-default login

Run:

go run ./requester --namespace=test-clusters-dev --description=""

CI Usage

This program supports the traditional GOOGLE_APPLICATION_CREDENTIALS environment variable, but for convenience it also supports GCP_SERVICE_ACCOUNT_KEY, which is expected to contain a base64-encoded JSON service account key (i.e. there is no need to write the data to a file).

For GitHub Actions, it's recommended to use the official image:

      - name: Request GKE test cluster
        uses: docker://quay.io/isovalent/gke-test-cluster-requester:ad06d7c2151d012901fc2ddc92406044f2ffba2d
        env:
          GCP_SERVICE_ACCOUNT_KEY: ${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: --namespace=... --image=...
Comments
  • GKE API changes - automatic upgrades and repairs

    Looks like there were API changes in GKE, and automatic upgrades and repairs are now required in the REGULAR release channel...

    Here's an example:

    apiVersion: clusters.ci.cilium.io/v1alpha2
    kind: TestClusterGKE
    metadata:
      creationTimestamp: "2021-02-03T02:17:46Z"
      generation: 1
      name: test-66gx7
      namespace: test-clusters
      resourceVersion: "155013016"
      selfLink: /apis/clusters.ci.cilium.io/v1alpha2/namespaces/test-clusters/testclustersgke/test-66gx7
      uid: acd5f665-2fd6-4bc2-b401-417227e5ce91
    spec:
      configTemplate: basic
      jobSpec:
        runner:
          command:
          - /usr/local/bin/cilium-test-gke.sh
          - quay.io/cilium/cilium:latest
          - quay.io/cilium/operator-generic:latest
          - quay.io/cilium/hubble-relay:latest
          - NightlyPolicyStress
          image: cilium/cilium-test-dev:7cdf8024e
          initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
      location: europe-west2-b
      machineType: n1-standard-4
      nodes: 2
      project: cilium-ci
      region: europe-west2
    status:
      clusterName: test-66gx7-c5lsf
      conditions:
      - lastTransitionTime: "2021-02-03T16:52:07Z"
        message: Some dependencies are not ready yet
        reason: DependenciesNotReady
        status: "False"
        type: Ready
      dependencyConditions:
        ContainerCluster:test-clusters/test-66gx7-c5lsf:
        - lastTransitionTime: "2021-02-03T16:52:07Z"
          message: The resource is up to date
          reason: UpToDate
          status: "True"
          type: Ready
        ContainerNodePool:test-clusters/test-66gx7-c5lsf:
        - lastTransitionTime: "2021-02-03T02:17:46Z"
          message: 'Update call failed: error applying desired state: summary: error creating
            NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false
            when release_channel REGULAR is set., badRequest, detail: '
          reason: UpdateFailed
          status: "False"
          type: Ready
    

    Originally this was intentional, as for testing purposes it's best to have these features disabled.

  • Use defaults for `autoRepair` and `autoUpgrade`

    This is to fix #11.

    These two features had been deliberately disabled because they were deemed to interfere with tests. The GKE API no longer allows disabling these features when the REGULAR release channel is used.

    It may be possible to disable them again once the cluster version can be set statically, but that's not supported by the operator yet.

    It's important to note that both of these features concern the node pool, not the control plane.

    With regards to auto-upgrades, it may be viable to define a maintenance window that falls outside of the expected test duration, but that's not a trivial solution to what is currently only a hypothetical problem.

    With regards to auto-repairs, there is also no known practical issue at present.
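
    For reference, in Config Connector these settings live on the node pool object; here is a minimal sketch of the relevant block (assuming the container.cnrm.cloud.google.com/v1beta1 ContainerNodePool API) that satisfies the REGULAR release channel requirement:

    apiVersion: container.cnrm.cloud.google.com/v1beta1
    kind: ContainerNodePool
    spec:
      management:
        autoRepair: true    # cannot be false when the REGULAR release channel is set
        autoUpgrade: true   # cannot be false when the REGULAR release channel is set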

  • Use artifacts instead of cache

    The setup can be simplified: the cost-saving workaround of using cache is no longer needed since the repo is public, and we can use artifacts as much as we like. Eventually the images should be pushed to ghcr.io, but for now this works.

  • retry creating cluster in different zone when one is out of resources

    This is related to #18, but is actually a separate issue.

    Sometimes a zone is short of resources, and GKE yields:

      Warning  UpdateFailed        12m (x4 over 23m)   containercluster-controller  Update call failed: error applying desired state: summary: Error waiting for creating GKE cluster: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: europe-west2-b., detail:
    

    One of the purposes of this operator was exactly to cater for this type of error and retry.

  • detect unhealthy objects over prolonged period of time

    There should be alerting in place when there are continuous CNRM errors over a relatively long period of time, e.g. when a cluster didn't get created after 20 minutes (see e.g. #11).

  • logview should handle error states better

    Right now an init container error (and probably other errors) results in "cannot get log stream"; it should probably display the log of e.g. the init container instead.

  • profile and reduce memory usage

    bd733f1bdc81cbaf0096a6386a231f7ff7375c65 increased memory requests and limits due to an outage. 800M is a lot of memory; there is likely a leak.
