Frisbee is a Kubernetes-native platform for exploring, testing, and benchmarking distributed applications.

Why Frisbee?

Frisbee is a next-generation platform designed to unify chaos testing and performance benchmarking.

We address the key pain points that developers and QA engineers face when testing cloud-native applications in the early stages of the software lifecycle.

We make it possible to:

  • Write tests: stress complex topologies under dynamic operating conditions.
  • Run tests: scale seamlessly from a single workstation to hundreds of machines.
  • Debug tests: inspect runs through extensive monitoring and comprehensive dashboards.

Our platform consists of a set of Kubernetes controllers designed to run performance benchmarks, inject failure conditions into a running system, monitor site-wide health metrics, and notify systems with status updates throughout the testing procedure.

Frisbee provides a flexible, YAML-based configuration syntax and is trivially extensible with additional functionality.

Frisbee in a nutshell

The easiest way to get started is to have a look at the examples directory. It consists of two sub-directories:

  • Templates: libraries of frequently used specifications that are reusable throughout the testing plan (see the sketch below).
  • Testplans: lists of actions that define what will happen throughout the test.
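
For illustration only, a template might look roughly like the following. This is a hypothetical sketch: the exact schema is defined by the Frisbee CRDs, and the kind, field names, and parameter mechanism shown here are assumptions inferred from how templates are used in the plan below.

# Hypothetical template sketch (not the authoritative Frisbee schema)
apiVersion: frisbee.io/v1alpha1
kind: Template            # assumed kind; check the installed CRDs
metadata:
  name: redis.master      # referenced from test plans by name, e.g. redis/master
spec:
  inputs:
    parameters:
      port: "6379"        # defaults that a plan can override via `inputs'
  service: {}             # container/pod specification omitted in this sketch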

We will use the examples/testplans/3.failover.yml as a reference.

This plan uses the following templates:

  • examples/templates/core/sysmon.yml
  • examples/templates/redis/redis.cluster.yml
  • examples/templates/ycsb/redis.client.yml

Because these templates are deployed as Kubernetes resources, they are referenced by name rather than by relative path.

This is why we need to install them before running the experiment (for installation instructions, check here).
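
Assuming the templates are plain Kubernetes manifests and the dedicated frisbee namespace from the run section below, installing them might look like this (a sketch; follow the linked instructions for the supported method):

# Install the templates used by the failover plan (sketch)
>> kubectl -n frisbee apply -f examples/templates/core/sysmon.yml
>> kubectl -n frisbee apply -f examples/templates/redis/redis.cluster.yml
>> kubectl -n frisbee apply -f examples/templates/ycsb/redis.client.yml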

# Standard Kubernetes boilerplate
apiVersion: frisbee.io/v1alpha1
kind: Workflow
metadata:
  name: redis-failover
spec:

  # Here we specify the workflow as a directed-acyclic graph (DAG) by specifying the dependencies of each action.
  actions:
    # Service creates an instance of a Redis Master
    # To create the instance, we use the redis/master template with the default parameters.
    - action: Service
      name: master
      service:
        fromTemplate:
          templateRef: redis/master

    # This action is the same as before, with two additions.
    # 1. The `depends' keyword ensures that the action will be executed only after the `master' action
    # has reached a Running state.
    # 2. The `inputs' keyword initializes the instance with custom parameters.
    - action: Service
      name: slave
      depends: { running: [ master ] }
      service:
        fromTemplate:
          templateRef: redis/slave
          inputs:
            - { master: .service.master.any }

    # The sentinel is the Redis failover manager. Notice that we can have multiple dependencies.
    - action: Service
      name: sentinel
      depends: { running: [ master, slave ] }
      service:
        fromTemplate:
          templateRef: redis/sentinel
          inputs:
            - { master: .service.master.any }

    # Cluster creates a list of services that run within a shared context.
    # In this case, we create a cluster of YCSB loaders to populate the master with keys.
    - action: Cluster
      name: "loaders"
      depends: { running: [ master ] }
      cluster:
        templateRef: ycsb-redis/loader
        inputs:
          - { server: .service.master.any, recordcount: "100000000", offset: "0" }
          - { server: .service.master.any, recordcount: "100000000", offset: "100000000" }
          - { server: .service.master.any, recordcount: "100000000", offset: "200000000" }

    # While the loaders are running, we inject a network partition fault into the master node.
    # The "after" dependency adds a delay so that some keys are loaded before the fault is injected.
    # The fault is automatically retracted after 2 minutes. 
    - action: Chaos
      name: partition0
      depends: { running: [ loaders ], after: "3m" }
      chaos:
        type: partition
        partition:
          selector:
            macro: .service.master.any
          duration: "2m"

    # Here we repeat the partition, a few minutes after the previous fault has been recovered.
    - action: Chaos
      name: partition1
      depends: { running: [ master, slave ], success: [ partition0 ], after: "6m" }
      chaos:
        type: partition
        partition:
          selector: { macro: .service.master.any }
          duration: "1m"

  # Here we declare the Grafana dashboards that the Workflow will make use of.
  withTelemetry:
    importMonitors: [ "sysmon/container", "ycsbmon/client",  "redismon/server" ]
    ingress:
      host: localhost
      useAmbassador: true

  # Now, the experiment is over ... or not?
  # The loaders are complete, the partitions are retracted, but the Redis nodes are still running.
  # Hence, how do we know if the test has passed or failed?
  # This task is left to the oracle. 
  withTestOracle:
    pass: >-
      {{.IsSuccessful "partition1"}} == true          
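
Since the test oracle is an ordinary Go-template expression, it should be possible to combine several checks. For example, a sketch that requires both partitions to have completed successfully (assuming the expression language supports boolean operators):

  withTestOracle:
    pass: >-
      {{.IsSuccessful "partition0"}} == true && {{.IsSuccessful "partition1"}} == true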

Run the experiment

First, you'll need a Kubernetes deployment and kubectl set up.

  • For a single-node deployment click here.

  • For a multi-node deployment click here.

In this walk-through, we assume you have followed the instructions for the single-node deployment.

In one terminal, run the Frisbee controller.

If you want to run the webhooks locally, you'll have to generate certificates for serving the webhooks, and place them in the right directory (/tmp/k8s-webhook-server/serving-certs/tls.{crt,key}, by default).
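
For reference, a self-signed pair can be generated with openssl; this is only a sketch, so adjust the subject and validity to your needs:

# Generate self-signed certificates for the local webhook server (sketch)
>> mkdir -p /tmp/k8s-webhook-server/serving-certs
>> openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
     -subj "/CN=localhost" \
     -keyout /tmp/k8s-webhook-server/serving-certs/tls.key \
     -out /tmp/k8s-webhook-server/serving-certs/tls.crt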

If you're not running a local API server, you'll also need to figure out how to proxy traffic from the remote cluster to your local webhook server. For this reason, we generally recommend disabling webhooks when doing your local code-run-test cycle, as we do below.

# Run the Frisbee controller
>>  make run ENABLE_WEBHOOKS=false

We can use the controller's output to reason about the experiment's transitions.

In the other terminal, you can issue requests.

# Create a dedicated Frisbee namespace
>> kubectl create namespace frisbee

# Run a testplan (from Frisbee directory)
>> kubectl -n frisbee apply -f examples/testplans/3.failover.yml 
workflow.frisbee.io/redis-failover created

# Confirm that the workflow is running.
>> kubectl -n frisbee get pods
NAME         READY   STATUS    RESTARTS   AGE
prometheus   1/1     Running   0          12m
grafana      1/1     Running   0          12m
master       3/3     Running   0          12m
loaders-0    3/3     Running   0          11m
slave        3/3     Running   0          11m
sentinel     1/1     Running   0          11m


# Wait until the test oracle is triggered.
>> kubectl -n frisbee wait --for=condition=oracle workflows.frisbee.io/redis-failover
...

How can I understand what happened?

One way is to access the workflow's description:

>> kubectl -n frisbee describe workflows.frisbee.io/redis-failover
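
Standard kubectl commands also help when an action seems stuck; for example, listing the recent events in the namespace (nothing Frisbee-specific here):

>> kubectl -n frisbee get events --sort-by=.lastTimestamp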

But why bother if you can access Grafana directly?

Click Here

If everything went smoothly, you should see a similar dashboard. Humans and controllers can examine these dashboards to check things like completion, health, and SLA compliance.

Client-View (YCSB-Dashboard)


Client-View (Redis-Dashboard)

Bugs, Feedback, and Contributions

The original intention of our open-source project is to lower the barrier to testing distributed systems, so we highly value the use of the project in enterprises and in academia.

For bug reports, questions, and discussions, please submit GitHub Issues.

We also welcome every contribution, even if it is just punctuation. See CONTRIBUTING for details.

For more information, you can contact us via:

License

Frisbee is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 894204 (Ether, H2020-MSCA-IF-2019).

Owner
Computer Architecture and VLSI Systems (CARV) Laboratory
Comments
  • Grafana does not report I/O for NVME devices

    Is your feature request related to a problem? Please describe. Grafana plots the I/O usage by checking the device pattern device=~"^/dev/[sv]d[a-z][1-9]$"

    This pattern however does not account for ssd-capable machines, whose path is typically /dev/nvm*

    Describe the solution you'd like Fix the regex.

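    A possible fix (an untested sketch) would be to widen the pattern so that it also matches NVMe devices, e.g. device=~"^/dev/(nvme[0-9]+n[0-9]+(p[0-9]+)?|[sv]d[a-z][0-9]*)$".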

  • kubectl-frisbee command

    description

    kubectl-frisbee command doesn't see the configuration file used to connect to the server (/var/snap/microk8s/3883/credentials/client.config). I think that's the reason for the error shown in the attached image.

  • Exactly-once creation per Alert

    Take as an example a cluster that generates new clients when a metrics-driven condition is met.

    Currently, the cluster will keep creating clients for as long as the alert remains active.

    Is this the desired behaviour? Or should we create one client per alert?

  • Add unified Stopper

    Is your feature request related to a problem? Please describe. In many cases, we need to gracefully stop a service, or a cluster of services. For example, when we want to run a full set of a YCSB Workflow.

    Describe the solution you'd like Implement stopper.

    As a first step, this stopper should gracefully stop services and clusters. This means that the stopped objects will terminate gracefully, returning Success. This contrasts with a Chaos killing action, where killed objects return Fail.

    As a second step, the stopper should also be able to stop a Chaos action.

  • Erroneously Report on I/O pressure

    Describe the bug The reported I/O pressure in Grafana is wrong.

    According to the cadvisor documentation:

    • container_fs_reads_total: Cumulative count of reads completed
    • container_fs_writes_total: Cumulative count of writes completed

    This makes clear that the reported values are the number of requests, not the transferred bytes as erroneously reported in Grafana.
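
    A possible fix (a sketch, not verified against the current dashboards) is to switch the panels to the byte counters that cAdvisor also exposes, e.g. rate(container_fs_reads_bytes_total[1m]) and rate(container_fs_writes_bytes_total[1m]).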

  • Network and I/O throttling.

    Is your feature request related to a problem? Please describe. By default Kubernetes supports throttling for Memory and CPU.

    It does not support throttling for Network and I/O.

    Describe the solution you'd like Use Chaos-Mesh capabilities to emulate network and I/O throttling.

    Additional context A known limitation is that Network/IO throttling can only be used as "limits". There is no notion for "reservation".

  • Auto-discover monitoring packages in the Workflow

    Is your feature request related to a problem? Please describe. For the monitoring to work, we need both telemetry Agents and Dashboards.

    The agents are deployed by the services. The dashboards must be installed by the Workflow. For the moment, declaring the dashboards used in the workflow is done manually.

    Describe the solution you'd like When the workflow starts, go through all the used templates and find the respective dashboards.

  • Add support for Kubernetes e2e testing

    Is your feature request related to a problem? Please describe. At its core, Frisbee remains a distributed system that requires end-to-end testing.

    Describe the solution you'd like Integrate Kubernetes e2e to test the various Frisbee components.

    Additional context Source: https://kubernetes.io/blog/2019/03/22/kubernetes-end-to-end-testing-for-everyone/

  • Create paper-ready graphs from Grafana

    Is your feature request related to a problem? Please describe. Although we can export Grafana's visualizations, as described in the README.md, they are not paper-friendly.

    Describe the solution you'd like Funnel Grafana outputs to a script that will provide paper-friendly plots.

  • Domain

    What problem does this PR solve?

    Issue Number: close #xxx

    Problem Summary: What is changed and how it works?

    Proposal: xxx

    What's Changed: Related changes

    Need to update Frisbee Dashboard component, related issue: Need to cherry-pick to the release branch

    Checklist

    Tests

    Unit test E2E test Manual test (add detailed scripts or steps below) No code

    Side effects

    Breaking backward compatibility

    Release note

    Please add a release note. If you don't think this PR needs a release note then fill it with None.

  • Add support for block devices

    We currently support only mountpoints.

    But cadvisor does not report stats for NFS.

    To do so, we must provide abstractions for mounting volumes as raw block devices (/dev/xvda)

    Naming conventions:

    /dev/sd* (SCSI Disk) are set for Boot devices
    /dev/xvd* (XEN Virtual Device) are set for Extension devices
    

    Based on the AWS Docs, the following apply:

    "/dev/sda1" is reserved for ROOT Volume on both Windows and Linux.
    "xvd*" is recommended for EBS and Instance Store in Windows.
    "/dev/sd*" is recommended for EBS and Instance Store in Linux.
    
  • Rename compute-lifecycle to MapStates

    For the management of lifecycle, this solution seems to be cleaner than the current one.

    https://github.com/ohsu-comp-bio/funnel/blob/master/compute/hpc_backend.go

  • Fix Running Dependencies

    For the moment, we consider a frisbee.service running when the pod becomes running.

    This, however, can cause issues for processes that take a long time to init. Instead, use the pod.container[app].status.started field as the basis for declaring a frisbee.service as running.
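
    For reference, that flag can be inspected with plain kubectl; this is a sketch, assuming the application container is named app as implied above:

    >> kubectl -n frisbee get pod master -o jsonpath='{.status.containerStatuses[?(@.name=="app")].started}'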

  • Validate that a reference callable exists

    This is a bit tricky and cannot be done at the admission level, because it requires access to the referenced template.

    Instead, it must be done at the scenario level, when all the templates are loaded.

  • Fix install to download the charts of a specific version

    Because the controller is downloaded from the "release", whereas the scenarios are accessed from the repo, there can be significant drift between the two.

    One solution is to put all the changes into a branch and merge them into the main branch only when making a release.

    The other solution is to put changes into the main branch and fix install.sh so that it fetches the examples from the release.

  • Add support for coverage-driven events

    Such events can be used to:

    1. Abort the experiment if executed
    2. Abort the experiment if not executed
    3. Serve as Grafana annotations that facilitate understanding of multi-stage workflows.