Frisbee is a Kubernetes-native platform for exploring, testing, and benchmarking distributed applications.

Why Frisbee?

Frisbee is a next-generation platform designed to unify chaos testing and performance benchmarking.

We address the key pain points that developers and QA engineers face when testing cloud-native applications in the early stages of the software lifecycle.

We make it possible to:

  • Write tests: stress complex topologies under dynamic operating conditions.
  • Run tests: scale seamlessly from a single workstation to hundreds of machines.
  • Debug tests: inspect runs through extensive monitoring and comprehensive dashboards.

Our platform consists of a set of Kubernetes controllers designed to run performance benchmarks, inject failure conditions into a running system, monitor site-wide health metrics, and notify systems with status updates throughout the testing procedure.

Frisbee provides a flexible, YAML-based configuration syntax and is trivially extensible with additional functionality.

Frisbee in a nutshell

The easiest way to get started is to have a look at the examples directory. It consists of two sub-directories:

  • Templates: libraries of frequently used specifications that are reusable throughout the testing plan (see the sketch below).
  • Testplans: lists of actions that define what will happen throughout the test.
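
For illustration only, a template might look roughly like the following. This is a hypothetical sketch: the exact schema is defined by the Frisbee CRDs, and the kind, field names, and parameter mechanism shown here are assumptions inferred from how templates are used in the plan below.

# Hypothetical template sketch (not the authoritative Frisbee schema)
apiVersion: frisbee.io/v1alpha1
kind: Template            # assumed kind; check the installed CRDs
metadata:
  name: redis.master      # referenced from test plans by name, e.g. redis/master
spec:
  inputs:
    parameters:
      port: "6379"        # defaults that a plan can override via `inputs'
  service: {}             # container/pod specification omitted in this sketch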

We will use the examples/testplans/3.failover.yml as a reference.

This plan uses the following templates:

  • examples/templates/core/sysmon.yml
  • examples/templates/redis/redis.cluster.yml
  • examples/templates/ycsb/redis.client.yml

Because these templates are deployed as Kubernetes resources, they are referenced by name rather than by relative path.

This is why we need to install them before running the experiment (for installation instructions, check here).
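
Assuming the templates are plain Kubernetes manifests and the dedicated frisbee namespace from the run section below, installing them might look like this (a sketch; follow the linked instructions for the supported method):

# Install the templates used by the failover plan (sketch)
>> kubectl -n frisbee apply -f examples/templates/core/sysmon.yml
>> kubectl -n frisbee apply -f examples/templates/redis/redis.cluster.yml
>> kubectl -n frisbee apply -f examples/templates/ycsb/redis.client.yml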

# Standard Kubernetes boilerplate
apiVersion: frisbee.io/v1alpha1
kind: Workflow
metadata:
  name: redis-failover
spec:

  # Here we specify the workflow as a directed-acyclic graph (DAG) by specifying the dependencies of each action.
  actions:
    # Service creates an instance of a Redis Master
    # To create the instance, we use the redis/master template with the default parameters.
    - action: Service
      name: master
      service:
        fromTemplate:
          templateRef: redis/master

    # This action is the same as before, with two additions.
    # 1. The `depends' keyword ensures that the action will be executed only after the `master' action
    # has reached a Running state.
    # 2. The `inputs' keyword initializes the instance with custom parameters.
    - action: Service
      name: slave
      depends: { running: [ master ] }
      service:
        fromTemplate:
          templateRef: redis/slave
          inputs:
            - { master: .service.master.any }

    # The sentinel is the Redis failover manager. Notice that we can have multiple dependencies.
    - action: Service
      name: sentinel
      depends: { running: [ master, slave ] }
      service:
        fromTemplate:
          templateRef: redis/sentinel
          inputs:
            - { master: .service.master.any }

    # Cluster creates a list of services that run within a shared context.
    # In this case, we create a cluster of YCSB loaders to populate the master with keys.
    - action: Cluster
      name: "loaders"
      depends: { running: [ master ] }
      cluster:
        templateRef: ycsb-redis/loader
        inputs:
          - { server: .service.master.any, recordcount: "100000000", offset: "0" }
          - { server: .service.master.any, recordcount: "100000000", offset: "100000000" }
          - { server: .service.master.any, recordcount: "100000000", offset: "200000000" }

    # While the loaders are running, we inject a network partition fault into the master node.
    # The "after" dependency adds a delay so that some keys are loaded before the fault is injected.
    # The fault is automatically retracted after 2 minutes. 
    - action: Chaos
      name: partition0
      depends: { running: [ loaders ], after: "3m" }
      chaos:
        type: partition
        partition:
          selector:
            macro: .service.master.any
          duration: "2m"

    # Here we repeat the partition, a few minutes after the previous fault has been recovered.
    - action: Chaos
      name: partition1
      depends: { running: [ master, slave ], success: [ partition0 ], after: "6m" }
      chaos:
        type: partition
        partition:
          selector: { macro: .service.master.any }
          duration: "1m"

  # Here we declare the Grafana dashboards that the Workflow will make use of.
  withTelemetry:
    importMonitors: [ "sysmon/container", "ycsbmon/client",  "redismon/server" ]
    ingress:
      host: localhost
      useAmbassador: true

  # Now, the experiment is over ... or not?
  # The loaders are complete, the partitions are retracted, but the Redis nodes are still running.
  # Hence, how do we know if the test has passed or failed?
  # This task is left to the oracle. 
  withTestOracle:
    pass: >-
      {{.IsSuccessful "partition1"}} == true          
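
Since the test oracle is an ordinary Go-template expression, it should be possible to combine several checks. For example, a sketch that requires both partitions to have completed successfully (assuming the expression language supports boolean operators):

  withTestOracle:
    pass: >-
      {{.IsSuccessful "partition0"}} == true && {{.IsSuccessful "partition1"}} == true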

Run the experiment

First, you'll need a Kubernetes deployment and kubectl set up.

  • For a single-node deployment click here.

  • For a multi-node deployment click here.

In this walk-through, we assume you have followed the instructions for the single-node deployment.

In one terminal, run the Frisbee controller.

If you want to run the webhooks locally, you'll have to generate certificates for serving the webhooks, and place them in the right directory (/tmp/k8s-webhook-server/serving-certs/tls.{crt,key}, by default).
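
For reference, a self-signed pair can be generated with openssl; this is only a sketch, so adjust the subject and validity to your needs:

# Generate self-signed certificates for the local webhook server (sketch)
>> mkdir -p /tmp/k8s-webhook-server/serving-certs
>> openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
     -subj "/CN=localhost" \
     -keyout /tmp/k8s-webhook-server/serving-certs/tls.key \
     -out /tmp/k8s-webhook-server/serving-certs/tls.crt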

If you're not running a local API server, you'll also need to figure out how to proxy traffic from the remote cluster to your local webhook server. For this reason, we generally recommend disabling webhooks when doing your local code-run-test cycle, as we do below.

# Run the Frisbee controller
>>  make run ENABLE_WEBHOOKS=false

We can use the controller's output to reason about the experiment's transitions.

In the other terminal, you can issue requests.

# Create a dedicated Frisbee namespace
>> kubectl create namespace frisbee

# Run a testplan (from Frisbee directory)
>> kubectl -n frisbee apply -f examples/testplans/3.failover.yml 
workflow.frisbee.io/redis-failover created

# Confirm that the workflow is running.
>> kubectl -n frisbee get pods
NAME         READY   STATUS    RESTARTS   AGE
prometheus   1/1     Running   0          12m
grafana      1/1     Running   0          12m
master       3/3     Running   0          12m
loaders-0    3/3     Running   0          11m
slave        3/3     Running   0          11m
sentinel     1/1     Running   0          11m


# Wait until the test oracle is triggered.
>> kubectl -n frisbee wait --for=condition=oracle workflows.frisbee.io/redis-failover
...

How can I understand what happened?

One way is to access the workflow's description:

>> kubectl -n frisbee describe workflows.frisbee.io/redis-failover
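
Standard kubectl commands also help when an action seems stuck; for example, listing the recent events in the namespace (nothing Frisbee-specific here):

>> kubectl -n frisbee get events --sort-by=.lastTimestamp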

But why bother if you can access Grafana directly?

Click Here

If everything went smoothly, you should see a similar dashboard. Humans and controllers can examine these dashboards to check things like completion, health, and SLA compliance.

Client-View (YCSB-Dashboard)


Client-View (Redis-Dashboard)

Bugs, Feedback, and Contributions

The original intention of our open-source project is to lower the barrier to testing distributed systems, so we highly value the use of the project in enterprises and in academia.

For bug reports, questions, and discussions, please submit GitHub Issues.

We also welcome every contribution, even if it is just punctuation. See CONTRIBUTING for details.

For more information, you can contact us via:

License

Frisbee is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 894204 (Ether, H2020-MSCA-IF-2019).

Owner
Computer Architecture and VLSI Systems (CARV) Laboratory
Comments
  • Grafana does not report I/O for NVME devices

    Is your feature request related to a problem? Please describe. Grafana plots the I/O usage by checking the device pattern device=~"^/dev/[sv]d[a-z][1-9]$"

    This pattern however does not account for ssd-capable machines, whose path is typically /dev/nvm*

    Describe the solution you'd like Fix the regex.

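    A possible fix (an untested sketch) would be to widen the pattern so that it also matches NVMe devices, e.g. device=~"^/dev/(nvme[0-9]+n[0-9]+(p[0-9]+)?|[sv]d[a-z][0-9]*)$".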

  • kubectl-frisbee command

    description

    kubectl-frisbee command doesn't see the configuration file used to connect to the server (/var/snap/microk8s/3883/credentials/client.config). I think that's the reason for the error shown in the attached image.

  • Exactly-once creation per Alert

    Take as an example a cluster that generates new clients when a metrics-driven condition is met.

    Currently, the cluster will keep creating clients for as long as the alert remains active.

    Is this the desired behaviour? Or should we create one client per alert?

  • Add unified Stopper

    Is your feature request related to a problem? Please describe. In many cases, we need to gracefully stop a service, or a cluster of services. For example, when we want to run a full set of a YCSB Workflow.

    Describe the solution you'd like Implement stopper.

    As a first step, this stopper should gracefully stop services and clusters. This means that the stopped objects will terminate gracefully, returning Success. This contrasts with a Chaos killing action, where killed objects return Fail.

    As a second step, the stopper should also be able to stop a Chaos action.

  • Erroneously Report on I/O pressure

    Describe the bug The reported I/O pressure in Grafana is wrong.

    According to the cadvisor documentation:

    • container_fs_reads_total: Cumulative count of reads completed
    • container_fs_writes_total: Cumulative count of writes completed

    This makes clear that the reported values are the number of requests, not the transferred bytes as erroneously reported in Grafana.
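
    A possible fix (a sketch, not verified against the current dashboards) is to switch the panels to the byte counters that cAdvisor also exposes, e.g. rate(container_fs_reads_bytes_total[1m]) and rate(container_fs_writes_bytes_total[1m]).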

  • Network and I/O throttling.

    Is your feature request related to a problem? Please describe. By default Kubernetes supports throttling for Memory and CPU.

    It does not support throttling for Network and I/O.

    Describe the solution you'd like Use Chaos-Mesh capabilities to emulate network and I/O throttling.

    Additional context A known limitation is that Network/IO throttling can only be used as "limits". There is no notion for "reservation".

  • Auto-discover monitoring packages in the Workflow

    Is your feature request related to a problem? Please describe. For the monitoring to work, we need both telemetry Agents and Dashboards.

    The agents are deployed by the services. The dashboards must be installed by the Workflow. For the moment, declaring the dashboards used in the workflow is done manually.

    Describe the solution you'd like When the workflow starts, go through all the used templates and find the respective dashboards.

  • Add support for Kubernetes e2e testing

    Is your feature request related to a problem? Please describe. At its core, Frisbee remains a distributed system that requires end-to-end testing.

    Describe the solution you'd like Integrate Kubernetes e2e to test the various Frisbee components.

    Additional context Source: https://kubernetes.io/blog/2019/03/22/kubernetes-end-to-end-testing-for-everyone/

  • Create paper-ready graphs from Grafana

    Is your feature request related to a problem? Please describe. Although we can export Grafana's visualizations, as described in the README.md, they are not paper-friendly.

    Describe the solution you'd like Funnel Grafana outputs to a script that will provide paper-friendly plots.

  • Domain

    What problem does this PR solve?

    Issue Number: close #xxx

    Problem Summary: What is changed and how it works?

    Proposal: xxx

    What's Changed: Related changes

    Need to update Frisbee Dashboard component, related issue: Need to cherry-pick to the release branch

    Checklist

    Tests

    Unit test E2E test Manual test (add detailed scripts or steps below) No code

    Side effects

    Breaking backward compatibility

    Release note

    Please add a release note. If you don't think this PR needs a release note then fill it with None.

  • Add support for block devices

    We currently support only mountpoints.

    But cadvisor does not report stats for NFS.

    To do so, we must provide abstractions for mounting volumes as raw block devices (/dev/xvda)

    Naming conventions:

    /dev/sd* (SCSI Disk) are set for Boot devices
    /dev/xvd* (XEN Virtual Device) are set for Extension devices
    

    Based on the AWS Docs, the following apply:

    "/dev/sda1" is reserved for ROOT Volume on both Windows and Linux.
    "xvd*" is recommended for EBS and Instance Store in Windows.
    "/dev/sd*" is recommended for EBS and Instance Store in Linux.
    
  • Rename compute-lifecycle to MapStates

    For the management of lifecycle, this solution seems to be cleaner than the current one.

    https://github.com/ohsu-comp-bio/funnel/blob/master/compute/hpc_backend.go

  • Fix Running Dependencies

    For the moment, we consider a frisbee.service running when the pod becomes running.

    This, however, can cause issues for processes that take a long time to init. Instead, use the pod.container[app].status.started field as the basis for declaring a frisbee.service as running.
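
    For reference, that flag can be inspected with plain kubectl; this is a sketch, assuming the application container is named app as implied above:

    >> kubectl -n frisbee get pod master -o jsonpath='{.status.containerStatuses[?(@.name=="app")].started}'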

  • Validate that a reference callable exists

    This is a bit tricky and cannot be done at the admission level, because it requires access to the referenced template.

    Instead, it must be done at the scenario level, when all the templates are loaded.

  • Fix install to download the charts of a specific version

    Because the controller is downloaded from the "release", whereas the scenarios are accessed from the repo, there can be significant drift between the two.

    One solution is to put all the changes into a branch and merge them into the main branch only when making a release.

    The other solution is to put changes into the main branch and fix install.sh so that it fetches the examples from the release.

  • Add support for coverage-driven events

    Such events can be used to:

    1. Abort the experiment if executed
    2. Abort the experiment if not executed
    3. Serve as Grafana annotations that facilitate understanding of multi-stage workflows.