Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)

flagger


Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics and running conformance tests.

[Flagger overview diagram]

Flagger implements several deployment strategies (Canary releases, A/B testing, Blue/Green and Blue/Green traffic mirroring) using a service mesh (App Mesh, Istio, Linkerd) or an ingress controller (Contour, Gloo, NGINX, Skipper, Traefik) for traffic routing. For release analysis, Flagger can query Prometheus, Datadog, New Relic or CloudWatch, and for alerting it uses Slack, MS Teams, Discord and Rocket.Chat.

Flagger is a Cloud Native Computing Foundation project and part of the Flux family of GitOps tools.

Documentation

Flagger documentation can be found at docs.flagger.app.

Who is using Flagger

List of organizations using Flagger:

If you are using Flagger, please submit a PR to add your organization to the list!

Canary CRD

Flagger takes a Kubernetes deployment and optionally a horizontal pod autoscaler (HPA), then creates a series of objects (Kubernetes deployments, ClusterIP services, service mesh or ingress routes). These objects expose the application on the mesh and drive the canary analysis and promotion.

Flagger keeps track of ConfigMaps and Secrets referenced by a Kubernetes Deployment and triggers a canary analysis if any of those objects change. When promoting a workload in production, both code (container images) and configuration (ConfigMaps and Secrets) are synchronised.

For a deployment named podinfo, a canary promotion can be defined using Flagger's custom resource:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # service mesh provider (optional)
  # can be: kubernetes, istio, linkerd, appmesh, nginx, skipper, contour, gloo, supergloo, traefik
  provider: istio
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # service name (defaults to targetRef.name)
    name: podinfo
    # ClusterIP port number
    port: 9898
    # container port name or number (optional)
    targetPort: 9898
    # port name can be http or grpc (default http)
    portName: http
    # add all the other container ports
    # to the ClusterIP services (default false)
    portDiscovery: true
    # HTTP match conditions (optional)
    match:
      - uri:
          prefix: /
    # HTTP rewrite (optional)
    rewrite:
      uri: /
    # request timeout (optional)
    timeout: 5s
  # promote the canary without analysing it (default false)
  skipAnalysis: false
  # define the canary analysis timing and KPIs
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # validation (optional)
    metrics:
    - name: request-success-rate
      # builtin Prometheus check
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # builtin Prometheus check
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    - name: "database connections"
      # custom metric check
      templateRef:
        name: db-connections
      thresholdRange:
        min: 2
        max: 100
      interval: 1m
    # testing (optional)
    webhooks:
      - name: "conformance test"
        type: pre-rollout
        url: http://flagger-helmtester.test/
        timeout: 5m
        metadata:
          type: "helmv3"
          cmd: "test run podinfo -n test"
      - name: "load test"
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"
    # alerting (optional)
    alerts:
      - name: "dev team Slack"
        severity: error
        providerRef:
          name: dev-slack
          namespace: flagger
      - name: "qa team Discord"
        severity: warn
        providerRef:
          name: qa-discord
      - name: "on-call MS Teams"
        severity: info
        providerRef:
          name: on-call-msteams

For more details on how the canary analysis and promotion work, please read the docs.
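
As a concrete illustration of the analysis settings above: with interval: 1m, stepWeight: 5 and maxWeight: 50, Flagger increases the traffic routed to the canary by 5% roughly every minute (5% → 10% → … → 50%), evaluating the metric checks and running the rollout webhooks at each step. If the checks fail more than threshold: 10 times, the canary is scaled to zero and traffic is routed back to the primary; if the canary reaches maxWeight with the checks still passing, its spec is promoted to the primary and all traffic is switched back to the primary.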

Features

Service Mesh

| Feature                                     | App Mesh | Istio | Linkerd | Kubernetes CNI |
| ------------------------------------------- | -------- | ----- | ------- | -------------- |
| Canary deployments (weighted traffic)       | ✔️       | ✔️    | ✔️      |                |
| A/B testing (headers and cookies routing)   | ✔️       | ✔️    |         |                |
| Blue/Green deployments (traffic switch)     | ✔️       | ✔️    | ✔️      | ✔️             |
| Blue/Green deployments (traffic mirroring)  |          | ✔️    |         |                |
| Webhooks (acceptance/load testing)          | ✔️       | ✔️    | ✔️      | ✔️             |
| Manual gating (approve/pause/resume)        | ✔️       | ✔️    | ✔️      | ✔️             |
| Request success rate check (L7 metric)      | ✔️       | ✔️    | ✔️      |                |
| Request duration check (L7 metric)          | ✔️       | ✔️    | ✔️      |                |
| Custom metric checks                        | ✔️       | ✔️    | ✔️      | ✔️             |

Ingress

| Feature                                     | Contour | Gloo | NGINX | Skipper | Traefik |
| ------------------------------------------- | ------- | ---- | ----- | ------- | ------- |
| Canary deployments (weighted traffic)       | ✔️      | ✔️   | ✔️    | ✔️      | ✔️      |
| A/B testing (headers and cookies routing)   | ✔️      | ✔️   | ✔️    |         |         |
| Blue/Green deployments (traffic switch)     | ✔️      | ✔️   | ✔️    | ✔️      | ✔️      |
| Webhooks (acceptance/load testing)          | ✔️      | ✔️   | ✔️    | ✔️      | ✔️      |
| Manual gating (approve/pause/resume)        | ✔️      | ✔️   | ✔️    | ✔️      | ✔️      |
| Request success rate check (L7 metric)      | ✔️      |      | ✔️    | ✔️      | ✔️      |
| Request duration check (L7 metric)          | ✔️      |      | ✔️    | ✔️      | ✔️      |
| Custom metric checks                        | ✔️      | ✔️   | ✔️    | ✔️      | ✔️      |

Roadmap

GitOps Toolkit compatibility

  • Migrate Flagger to Kubernetes controller-runtime and kubebuilder
  • Make the Canary status compatible with kstatus
  • Make Flagger emit Kubernetes events compatible with Flux v2 notification API
  • Integrate Flagger into Flux v2 as the progressive delivery component

Integrations

  • Add support for Kubernetes Ingress v2
  • Add support for SMI compatible service mesh solutions like Open Service Mesh and Consul Connect
  • Add support for ingress controllers like HAProxy and ALB
  • Add support for metrics providers like InfluxDB, Stackdriver, SignalFX

Contributing

Flagger is Apache 2.0 licensed and accepts contributions via GitHub pull requests. To start contributing please read the development guide.

When submitting bug reports please include as much detail as possible:

  • which Flagger version
  • which Flagger CRD version
  • which Kubernetes version
  • what configuration (canary, ingress and workloads definitions)
  • what happened (Flagger and Proxy logs)

Getting Help

If you have any questions about Flagger and progressive delivery, your feedback is always welcome!

Owner
Flux project: Open and extensible continuous delivery solution for Kubernetes
Comments
  • Specifying multiple HTTP match uri in Istio Canary deployment via Flagger

    I am going to use automated Canary deployments, so I tried to follow the process via Flagger. Here is my VirtualService file for routing:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: {{ .Values.project }}
      namespace: {{ .Values.service.namespace }}
    spec:
      hosts:
        - {{ .Values.subdomain }}
      gateways:
        - mygateway.istio-system.svc.cluster.local
      http:
        {{- range $key, $value := .Values.routing.http }}
        - name: {{ $key }}
    {{ toYaml $value | indent 6 }}
        {{- end }}
    

    The routing part looks like this:

    http:
        r1:
          match:
            - uri:
                prefix: /myservice/monitor
          route:
            - destination:
                host: myservice
                port:
                  number: 9090
        r2:
          match:
            - uri:
                prefix: /myservice
          route:
            - destination:
                host: myservice
                port:
                  number: 8080
          corsPolicy:
            allowCredentials: false
            allowHeaders:
            - X-Tenant-Identifier
            - Content-Type
            - Authorization
            allowMethods:
            - GET
            - POST
            - PATCH
            allowOrigin:
            - "*"
        maxAge: 24h
    

    However, as I found that Flagger overwrites the VirtualService, I removed this file and modified the canary.yaml file based on my requirements, but I get a YAML error:

    {{- if .Values.canary.enabled }}
    apiVersion: flagger.app/v1alpha3
    kind: Canary
    metadata:
      name: {{ .Values.project }}
      namespace: {{ .Values.service.namespace }}
      labels:
        app: {{ .Values.project }}
        chart: {{ template "myservice-chart.chart" . }}
        release: {{ .Release.Name }}
        heritage: {{ .Release.Service }}
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name:  {{ .Values.project }}
      progressDeadlineSeconds: 60
      autoscalerRef:
        apiVersion: autoscaling/v2beta1
        kind: HorizontalPodAutoscaler
        name:  {{ .Values.project }}    
      service:
        port: 8080
        portDiscovery: true
        {{- if .Values.canary.istioIngress.enabled }}
        gateways:
        -  {{ .Values.canary.istioIngress.gateway }}
        hosts:
        - {{ .Values.canary.istioIngress.host }}
        {{- end }}
        trafficPolicy:
          tls:
            # use ISTIO_MUTUAL when mTLS is enabled
            mode: DISABLE
        # HTTP match conditions (optional)
        match:
          - uri:
              prefix: /myservice
        # cross-origin resource sharing policy (optional)
          corsPolicy:
            allowOrigin:
              - "*"
            allowMethods:
              - GET
              - POST
              - PATCH
            allowCredentials: false
            allowHeaders:
              - X-Tenant-Identifier
              - Content-Type
              - Authorization
            maxAge: 24h
          - uri:
              prefix: /myservice/monitor
      canaryAnalysis:
        interval: {{ .Values.canary.analysis.interval }}
        threshold: {{ .Values.canary.analysis.threshold }}
        maxWeight: {{ .Values.canary.analysis.maxWeight }}
        stepWeight: {{ .Values.canary.analysis.stepWeight }}
        metrics:
        - name: request-success-rate
          threshold: {{ .Values.canary.thresholds.successRate }}
          interval: 1m
        - name: request-duration
          threshold: {{ .Values.canary.thresholds.latency }}
          interval: 1m
        webhooks:
          {{- if .Values.canary.loadtest.enabled }}
          - name: load-test-get
            url: {{ .Values.canary.loadtest.url }}
            timeout: 5s
            metadata:
              cmd: "hey -z 1m -q 5 -c 2 http://myservice.default:8080"
          - name: load-test-post
            url: {{ .Values.canary.loadtest.url }}
            timeout: 5s
            metadata:
              cmd: "hey -z 1m -q 5 -c 2 -m POST -d '{\"test\": true}' http://myservice.default:8080/echo"
          {{- end }}  
    {{- end }}
    

    Can anyone help with this issue?
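
    For reference, in Flagger's Canary spec the match, rewrite and corsPolicy fields are siblings under spec.service rather than nested inside a match entry, which is a likely cause of the YAML error above. A minimal sketch (prefixes and CORS values taken from the question):

    service:
      port: 8080
      portDiscovery: true
      # multiple HTTP match conditions are allowed (OR semantics)
      match:
        - uri:
            prefix: /myservice/monitor
        - uri:
            prefix: /myservice
      # CORS settings are defined once at the service level
      corsPolicy:
        allowOrigin:
          - "*"
        allowMethods:
          - GET
          - POST
          - PATCH
        allowCredentials: false
        allowHeaders:
          - X-Tenant-Identifier
          - Content-Type
          - Authorization
        maxAge: 24h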

  • Add canary finalizers

    @stefanprodan This is a work-in-progress PR looking for acceptance on the approach and feedback. This PR provides the opt-in capability for users to revert Flagger mutations on deletion of a canary. If users opt in, finalizers will be used to revert the mutated resources before the canary and owned resources are handed over for finalizing.

    Changes:
    • Add evaluations for finalizers (controller/controller)
    • Add finalizers source (controller/finalizer)
    • Add interface method on deployment and daemonset controllers
    • Add interface method on routers
    • Add e2e tests

    Work to be done: Cover mesh and ingress outside of Istio

    Fix: #388 Fix: #488
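
    For context, the opt-in described above surfaces as a single field on the Canary spec; a minimal sketch, assuming the revertOnDeletion field name this capability shipped with:

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: podinfo
    spec:
      # when true, Flagger reverts its mutations (scales the target back up,
      # restores routing) before the canary and owned resources are finalized
      revertOnDeletion: true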

  • Gloo Canary Release Docs Discrepancy

    I am trying to get a simple POC working with Gloo and Flagger, however the example docs don't work out-of-the-box.

    I also noticed the example virtual-service is different in the docs compared to what's in the repo?

    The specifics regarding mapping a virtual-service to an upstream seem to be different in both and I just want to know what I should follow to get this working.

    I would open an issue on Gloo's repository, however I'm unsure if my error stems from Gloo or from me following the wrong docs.

  • Only unique values for domains are permitted error with Istio 1.1.0 RC1

    Right now, due to Istio limitations, it is not possible to have more than one VirtualService that binds the same host name to the mesh and a gateway. For example:

    if I have:

    ...
    gateways:
    - www.myapp.com
    - mesh
    http:
      - match:
        - uri:
            prefix: /api
        route:
        - destination:
            host: api.default.svc.cluster.local
            port:
              number: 80
    

    and

    ...
    gateways:
    - www.myapp.com
    - mesh
    http:
      - match:
        - uri:
            prefix: /internal
        route:
        - destination:
            host: internal.default.svc.cluster.local
            port:
              number: 80
    

    Istio will throw an error

    Only unique values for domains are permitted. Duplicate entry of domain www.myapp.com
    

    The two ways of fixing this I see is for flagger to either:

    1. Create a separate virtualservice and maintain the canary settings for each one correlated to the particular service deployed
    2. Compile all virtualservices together into a singular virtualservice

    Let me know what you think!

  • Unable to perform Istio-A/B testing

    Hey guys, I have configured Istio as the service mesh in my Kubernetes cluster and wanted to try the A/B testing deployment strategy with Flagger.

    I followed this documentation to set up Flagger: 1) https://docs.flagger.app/usage/ab-testing 2) https://docs.flagger.app/how-it-works#a-b-testing

    When I check my Kiali dashboard it shows a VirtualService error: virtualservice:publisher-d8t-v1 Weight sum should be 100.

    On describing the canary, the canary fails because no traffic was generated, although I made a POST call to my service and a response status of 200 was returned.

    Can you please help me fix this error?

    I have attached screenshots of the VirtualService error in Kiali, the canary status, and the traffic generation status.

    Can you please help me resolve this issue?

    Also, according to the Istio documentation, to connect a VirtualService with a DestinationRule we need to use subsets, but I see no subsets being created. How are you able to achieve traffic routing without a subset? I did read a note about keeping a label of app: <deployment name>; is that what makes this work?

    Thanks in advance :)
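
    For comparison, Flagger's A/B testing mode replaces traffic weights with a fixed number of iterations plus HTTP match conditions on the analysis; a minimal sketch along the lines of the docs (header and cookie values are placeholders):

    analysis:
      interval: 1m
      threshold: 5
      # number of analysis iterations before promotion (no weight shifting in A/B testing)
      iterations: 10
      # only requests matching these conditions are routed to the canary
      match:
        - headers:
            x-canary:
              exact: "insider"
        - headers:
            cookie:
              regex: "^(.*?;)?(canary=always)(;.*)?$"
      metrics:
        - name: request-success-rate
          thresholdRange:
            min: 99
          interval: 1m

    Note also that Flagger does not use DestinationRule subsets: the generated VirtualService routes by destination host to the <name>-primary and <name>-canary ClusterIP services, which is why no subsets show up.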

  • istio no values found for metric request-success-rate

    Given the following:

          metrics:
          - interval: 1m
            name: request-success-rate
            threshold: 99
          - interval: 30s
            name: request-duration
            threshold: 500
          stepWeight: 10
          threshold: 5
          webhooks:
          - metadata:
              cmd: hey -z 10m -q 10 -c 2 http://conf-day-demo-rest.conf-day-demo:8080/greeting
            name: conf-day-demo-loadtest
            timeout: 5s
            url: http://loadtester.loadtester/
    

    Canary promotion fails with "Halt advancement no values found for metric request-success-rate probably conf-day-demo-rest.conf-day-demo is not receiving traffic".

    Querying the metrics manually I see metrics for conf-day-demo-rest-primary, but Flagger queries with

    destination_workload=~"{{ .Name }}"

    which returns no data.
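
    One way to work around this is to replace the builtin check with a custom MetricTemplate whose query matches the workload explicitly; a minimal sketch for Istio, assuming the mesh Prometheus is reachable at the address below and that istio_requests_total carries the destination_workload labels:

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: success-rate
      namespace: conf-day-demo
    spec:
      provider:
        type: prometheus
        address: http://prometheus.istio-system:9090
      query: |
        sum(
          rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload=~"{{ target }}",
              response_code!~"5.*"
            }[{{ interval }}]
          )
        )
        /
        sum(
          rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload=~"{{ target }}"
            }[{{ interval }}]
          )
        ) * 100

    The template is then referenced from analysis.metrics via templateRef, with a thresholdRange of min: 99 as in the builtin check.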
    
  • Canary ingress nginx prevent update of primary ingress due to admission webhook

    Hi all,

    We have a problem with the ingress admission webhook. Using podinfo as an example, we did a canary deployment.

    Flagger created a second ingress, and after the rollout was done it switched the "canary" annotation from "true" to "false":

    apiVersion:  networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      annotations:
        kubernetes.io/ingress.class: nginx-v2
        nginx.ingress.kubernetes.io/canary: "false"
    

    I added "test" annotation to main Ingress to trigger update:

    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: podinfo
      labels:
        app: podinfo
      annotations:
        kubernetes.io/ingress.class: "nginx-v2"
        test: "test"
    ...
    

    Now when I try to apply the main Ingress file I get an admission webhook error:

    Error from server (BadRequest): error when creating "podinfo.yaml": 
    admission webhook "validate.nginx.ingress.kubernetes.io" 
    denied the request: host "example.com" and 
    path "/" is already defined in ingress develop/podinfo-canary
    

    podinfo Ingress

    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: podinfo
      labels:
        app: podinfo
      annotations:
        kubernetes.io/ingress.class: "nginx-v2"
    spec:
      rules:
        - host: example.com
          http:
            paths:
              - backend:
                  serviceName: podinfo
                  servicePort: 80
      tls:
      - hosts:
        - example.com
        secretName: example.com.wildcard
    

    Flagger version: 1.6.1, ingress-nginx version: 0.43

  • Blue/Green deployment - ELB collides with ClusterIP Flagger services.

    Hey everybody. I wanted to give you some feedback from my learning process using Flagger and ask a couple of questions about how to fix an issue I've been having with my current use case.

    Here it is: I have an EKS cluster with two namespaces, one for testing (called staging) and another for production. I've been trying to add Flagger to the staging namespace in order to enable Blue/Green deployments from my GitLab pipeline.

    How do I do that? Well, I've set up a GitLab job that basically runs a kubectl command and applies the files I've added below. This is a very basic application, which means I've been trying to implement Blue/Green style deployments with Kubernetes L4 networking.

    Here is the order of how files get applied:

    1. namespace
    2. canary
    3. deployment
    4. service

    I've also created a drawing to help illustrate the situation a little bit better.

    The problem with this approach is that as soon as I apply the load balancer manifest I get this error:

     The Service "my-app" is invalid: spec.ports[2].name: Duplicate value: "http"
    

    I've tried applying the same configuration in the production environment and it did work. My guess here is that Flagger's ClusterIP services are somehow conflicting with my load balancer, leading to a collision between them.

    I hope that you can help me with this issue, I'll keep you posted if I find a solution.

    namespace.yaml

    apiVersion: v1
    kind: Namespace
    metadata:
      name: staging
    

    deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
      namespace: staging
      labels:
        app: my-app
        environment: staging
    spec:
      replicas: 1
      strategy:
        type: Recreate
      selector:
        matchLabels:
          app: my-app
          environment: staging
      template:
        metadata:
          labels:
            app: my-app
            environment: staging
          annotations:
            configHash: " "
        spec:
          containers:
            - name: my-app
              image: marcoshuck/my-app
              imagePullPolicy: Always
              ports:
                - containerPort: 8001
              envFrom:
                - configMapRef:
                    name: my-app-config
          nodeSelector:
            server: "true"
    

    load-balancer.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
      namespace: staging
      annotations:
        # Use HTTP to talk to the backend.
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
        # AWS Certificate Manager (ACM) certificate ARN
        service.beta.kubernetes.io/aws-load-balancer-ssl-cert: XXXXXXXXXXXXXXXXXXXXXXXXX
        # Only run SSL on the port named "tls" below.
        service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    spec:
      type: LoadBalancer
      ports:
      - name: http
        port: 80
        targetPort: 8001
      - name: https
        port: 443
        targetPort: 8001
      selector:
        app: my-app
        environment: staging
    

    canary.yaml

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: my-app
      namespace: staging
    spec:
      provider: kubernetes
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      progressDeadlineSeconds: 60
      service:
        port: 8001
        portDiscovery: true
      analysis:
        interval: 30s
        threshold: 3
        iterations: 10
        metrics:
          - name: request-success-rate
            thresholdRange:
              min: 99
            interval: 1m
          - name: request-duration
            thresholdRange:
              max: 500
            interval: 30s
        webhooks:
          - name: load-test
            url: http://flagger-loadtester.test/
            timeout: 5s
            metadata:
              type: cmd
              cmd: "hey -z 1m -q 10 -c 2 http://my-app-canary.test:8001/"
    
  • progressDeadlineSeconds not working while waiting for rollout to finish

    Hi, in my deployment I use progressDeadlineSeconds: 1200, and in the canary definition I use a canary deployment with the builtin Prometheus checks. The canary app crashed, so the deployment should be rolled back, but it seems it isn't.

    my-app-deployment-58b7ffb786-7dk4h                     1/2     CrashLoopBackOff   109        9h
    my-app-deployment-primary-84f69c75c4-9d7x7             2/2     Running            0          16h
    

    And the Flagger logs keep showing the following message in an infinite loop.

    {"level":"info","ts":"2020-03-26T01:38:32.078Z","caller":"controller/events.go:27","msg":"canary deployment my-app-deployment.test not ready with retryable true: waiting for
    rollout to finish: 0 of 1 updated replicas are available","canary":"my-app-canary.test"}
    

    The canary I use:

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: my-app-canary
      namespace: test
    spec:
      # deployment reference
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app-deployment
      # the maximum time in seconds for the canary deployment
      # to make progress before it is rollback (default 600s)
      progressDeadlineSeconds: 1200
      # HPA reference (optional)
      autoscalerRef:
        apiVersion: autoscaling/v2beta1
        kind: HorizontalPodAutoscaler
        name: my-app-hpa
      service:
        # ClusterIP port number
        port: 80
        # container port name or number (optional)
        targetPort: 8080
        # Istio virtual service host names (optional)
        trafficPolicy:
          tls:
            mode: ISTIO_MUTUAL
      analysis:
        # schedule interval (default 60s)
        interval: 1m
        # max number of failed iterations before rollback
        threshold: 5
        # max traffic percentage routed to canary
        # percentage (0-100)
        maxWeight: 50
        # canary increment step
        # percentage (0-100)
        stepWeight: 10
        metrics:
          - name: request-success-rate
            # builtin Prometheus check
            # minimum req success rate (non 5xx responses)
            # percentage (0-100)
            thresholdRange:
              min: 99
            interval: 1m
          - name: request-duration
            # builtin Prometheus check
            # maximum req duration P99
            # milliseconds
            thresholdRange:
              max: 500
            interval: 30s
        webhooks:
          - name: acceptance-test
            type: pre-rollout
            url: http://blueprint-test-loadtester.blueprint-test/
            timeout: 30s
            metadata:
              type: bash
              cmd: "curl http://my-app-deployment-canary.test"
          - name: load-test
            type: rollout
            url: http://blueprint-test-loadtester.blueprint-test/
            timeout: 5s
            metadata:
              cmd: "hey -z 1m -q 10 -c 2 http://my-app-deployment-canary.test"
    
  • Add HTTP match conditions to Canary service spec

    Could you show an example of how to use this with the Istio ingress? I can't seem to figure out how to point to the correct service!

    More specifically, is it possible to tell the istio ingress to route based on certain criteria (i.e. a uri prefix, etc?)
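
    For illustration, when exposing the canary through the Istio ingress gateway, the gateway, external host and HTTP match conditions all live under spec.service; a minimal sketch (gateway and host names are placeholders):

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: podinfo
      namespace: test
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: podinfo
      service:
        port: 9898
        # bind the generated VirtualService to the public gateway and to the mesh
        gateways:
          - public-gateway.istio-system.svc.cluster.local
          - mesh
        hosts:
          - app.example.com
        # route only requests with this prefix to the podinfo services
        match:
          - uri:
              prefix: /podinfo
        rewrite:
          uri: /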

  • Flagger omits `TrafficSplit` backend service weight if weight is 0 due to `omitempty` option

    Describe the bug

    Since OSM is supported (SMI support added in #896), I did the following to create a canary deploy using OSM and Flagger. As recommended in #896, I used the MetricsTemplate CRDs to create the required Prometheus custom metrics (request-success-rate and request-duration).

    I then created a canary custom resource for the podinfo deployment; however, it does not succeed. It says that the canary custom resource cannot create a TrafficSplit resource for the canary deployment.

    Output excerpt of kubectl describe -f ./podinfo-canary.yaml:

    Status:
      Canary Weight:  0
      Conditions:
        Last Transition Time:  2021-06-07T22:28:21Z
        Last Update Time:      2021-06-07T22:28:21Z
        Message:               New Deployment detected, starting initialization.
        Reason:                Initializing
        Status:                Unknown
        Type:                  Promoted
      Failed Checks:           0
      Iterations:              0
      Last Transition Time:    2021-06-07T22:28:21Z
      Phase:                   Initializing
    Events:
      Type     Reason  Age                  From     Message
      ----     ------  ----                 ----     -------
      Warning  Synced  5m38s                flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
      Normal   Synced  8s (x12 over 5m38s)  flagger  all the metrics providers are available!
      Warning  Synced  8s (x11 over 5m8s)   flagger  TrafficSplit podinfo.test create error: the server could not find the requested resource (post trafficsplits.split.smi-spec.io)
    


    To Reproduce

    ./kustomize/osm/kustomization.yaml:

    namespace: osm-system
    bases:
      - ../base/flagger/
    patchesStrategicMerge:
      - patch.yaml
    

    ./kustomize/osm/patch.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: flagger
    spec:
      template:
        spec:
          containers:
            - name: flagger
              args:
                - -log-level=info
                - -include-label-prefix=app.kubernetes.io
                - -mesh-provider=smi:v1alpha3
                - -metrics-server=http://osm-prometheus.osm-system.svc:7070
    
    ---
    
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: flagger
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: flagger
    subjects:
      - kind: ServiceAccount
        name: flagger
        namespace: osm-system
    

    Used MetricTemplate CRD to implement required custom metric (recommended in #896) - request-success-rate.yaml:

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: request-success-rate
      namespace: osm-system
    spec:
      provider:
        type: prometheus
        address: http://osm-prometheus.osm-system.svc:7070
      query: |
        sum(
            rate(
                osm_request_total{
                  destination_namespace="{{ namespace }}",
                  destination_name="{{ target }}",
                  response_code!="404"
                }[{{ interval }}]
            )
        )
        /
        sum(
            rate(
                osm_request_total{
                  destination_namespace="{{ namespace }}",
                  destination_name="{{ target }}"
                }[{{ interval }}]
            )
        ) * 100
    

    Used MetricTemplate CRD to implement required custom metric (recommended in #896) - request-duration.yaml:

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: request-duration
      namespace: osm-system
    spec:
      provider:
        type: prometheus
        address: http://osm-prometheus.osm-system.svc:7070
      query: |
        histogram_quantile(
          0.99,
          sum(
            rate(
              osm_request_duration_ms{
                destination_namespace="{{ namespace }}",
                destination_name=~"{{ target }}"
              }[{{ interval }}]
            )
          ) by (le)
        )
    

    podinfo-canary.yaml:

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: podinfo
      namespace: test
    spec:
      provider: "smi:v1alpha3"
      # deployment reference
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: podinfo
      # HPA reference (optional)
      autoscalerRef:
        apiVersion: autoscaling/v2beta2
        kind: HorizontalPodAutoscaler
        name: podinfo
      # the maximum time in seconds for the canary deployment
      # to make progress before it is rollback (default 600s)
      progressDeadlineSeconds: 60
      service:
        # ClusterIP port number
        port: 9898
        # container port number or name (optional)
        targetPort: 9898
      analysis:
        # schedule interval (default 60s)
        interval: 30s
        # max number of failed metric checks before rollback
        threshold: 5
        # max traffic percentage routed to canary
        # percentage (0-100)
        maxWeight: 50
        # canary increment step
        # percentage (0-100)
        stepWeight: 5
        # Prometheus checks
        metrics:
        - name: request-success-rate
          # minimum req success rate (non 5xx responses)
          # percentage (0-100)
          thresholdRange:
            min: 99
          interval: 1m
        - name: request-duration
          # maximum req duration P99
          # milliseconds
          thresholdRange:
            max: 500
          interval: 30s
        # testing (optional)
        webhooks:
          - name: acceptance-test
            type: pre-rollout
            url: http://flagger-loadtester.test/
            timeout: 30s
            metadata:
              type: bash
              cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
          - name: load-test
            type: rollout
            url: http://flagger-loadtester.test/
            metadata:
              cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"
    

    Output excerpt of kubectl describe -f ./podinfo-canary.yaml:

    Status:
      Canary Weight:  0
      Conditions:
        Last Transition Time:  2021-06-07T22:28:21Z
        Last Update Time:      2021-06-07T22:28:21Z
        Message:               New Deployment detected, starting initialization.
        Reason:                Initializing
        Status:                Unknown
        Type:                  Promoted
      Failed Checks:           0
      Iterations:              0
      Last Transition Time:    2021-06-07T22:28:21Z
      Phase:                   Initializing
    Events:
      Type     Reason  Age                  From     Message
      ----     ------  ----                 ----     -------
      Warning  Synced  5m38s                flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
      Normal   Synced  8s (x12 over 5m38s)  flagger  all the metrics providers are available!
      Warning  Synced  8s (x11 over 5m8s)   flagger  TrafficSplit podinfo.test create error: the server could not find the requested resource (post trafficsplits.split.smi-spec.io)
    

    Full output of kubectl describe -f ./podinfo-canary.yaml: https://pastebin.ubuntu.com/p/kB9qtPxZvr/



    Expected behavior

    A clear and concise description of what you expected to happen.

    Additional context

    • Flagger version: 1.11.0
    • Kubernetes version: 1.19.11
    • Service Mesh provider: smi (through osm)
    • Ingress provider: N/A.
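
    A likely cause of the TrafficSplit create error above is a mismatch between the split.smi-spec.io version Flagger posts and the version the cluster actually serves (kubectl api-versions | grep split shows the latter). A sketch of the corresponding patch, with the provider value being illustrative:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: flagger
    spec:
      template:
        spec:
          containers:
            - name: flagger
              args:
                - -log-level=info
                - -include-label-prefix=app.kubernetes.io
                # align with the served API version, e.g. smi:v1alpha2
                # if the cluster serves split.smi-spec.io/v1alpha2
                - -mesh-provider=smi:v1alpha2
                - -metrics-server=http://osm-prometheus.osm-system.svc:7070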
  • Ability to exclude annotations

    Describe the feature

    This might already exist, please point me in the direction if it does.

    I'd like to be able to exclude (or include) specific annotation prefixes from being copied over to the primary deployments.

    I noticed similar functionality exists for labels with --include-label-prefix but nothing exists for annotations.

    For context, we're looking to use stakater/Reloader which allows us to reload pods when configmap and/or secrets change. This works well for us at the moment, and flagger also handles this gracefully too.

    However, since Reloader is annotation based, the Reloader annotation config that we include in the original Deployment resource is then copied over to the Primary Deployment resource, meaning that when a configmap and/or secret changes, Reloader will patch both the Original and Primary Deployments at the same time as both will now include the annotation, whereas we would only want the Original Deployment to be patched (which would then trigger a Flagger rollout naturally)

    I'd like to specify that Flagger should not include Reloader annotations, or similar to --include-label-prefix specify which annotations should be included.

    Proposed solution

    Replicate the behaviour of --include-label-prefix with a new argument, --include-annotation-prefix.

    Happy to PR if you see value in this.
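
    To visualize the proposal, the new argument would sit next to the existing label flag on the Flagger container (note that -include-annotation-prefix is the proposed flag, not an existing one):

    args:
      - -log-level=info
      # existing: only labels with these prefixes are copied to the primary workload
      - -include-label-prefix=app.kubernetes.io
      # proposed: only annotations with these prefixes would be copied to the primary,
      # so reloader.stakater.com annotations could be left off the primary deployment
      - -include-annotation-prefix=app.kubernetes.io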

  • Flagger with StackDriver Metric Template: Request was missing field name

    We are facing issues with MQL while trying to integrate Stackdriver with Flagger to perform canary analysis. We have a GKE cluster set up and Workload Identity configured for the service account.

    During the analysis, events reported are as below:

    test             0s          Normal    Synced                    canary/ankit                                                 Starting canary analysis for podinfo.test
    test             0s          Normal    Synced                    canary/ankit                                                 Pre-rollout check acceptance-test passed
    test             0s          Normal    Synced                    canary/ankit                                                 Advance ankit.test canary weight 10
    test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
    test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
    test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
    test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
    test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
    

    I have a metric template as below that uses a query to fetch a sample metric, which in this case is limit utilization.

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: error-rate
      namespace: test
    spec:
      provider:
        type: stackdriver
      query: |
        fetch k8s_container
        | metric 'kubernetes.io/container/cpu/limit_utilization'
        | filter (resource.namespace_name == 'flagger-system')
        | align delta(1m)
        | every 1m
        | group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]
    

    Could you please provide suggestions on where we might be going wrong?
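
    One thing worth checking is whether the template points Flagger at a Google Cloud project at all: the Stackdriver provider reads the project ID from a referenced secret, and a missing project is consistent with a "Request was missing field name" error from the Monitoring API. A sketch under that assumption, following the secret layout from the Flagger docs (a key named project holding the GCP project ID):

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: error-rate
      namespace: test
    spec:
      provider:
        type: stackdriver
        # e.g. kubectl -n test create secret generic gcloud-sa --from-literal=project=<project-id>
        secretRef:
          name: gcloud-sa
      query: |
        fetch k8s_container
        | metric 'kubernetes.io/container/cpu/limit_utilization'
        | filter (resource.namespace_name == 'flagger-system')
        | align delta(1m)
        | every 1m
        | group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]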

  • build(deps): bump actions/cache from 3.0.11 to 3.2.2

    Bumps actions/cache from 3.0.11 to 3.2.2.

    Release notes

    Sourced from actions/cache's releases.

    v3.2.2

    What's Changed

    New Contributors

    Full Changelog: https://github.com/actions/cache/compare/v3.2.1...v3.2.2

    v3.2.1

    What's Changed

    Full Changelog: https://github.com/actions/cache/compare/v3.2.0...v3.2.1

    v3.2.0

    What's Changed

    New Contributors

    ... (truncated)

    Changelog

    Sourced from actions/cache's changelog.

    3.0.11

    • Update toolkit version to 3.0.5 to include @actions/core@^1.10.0
    • Update @actions/cache to use updated saveState and setOutput functions from @actions/core@^1.10.0

    3.1.0-beta.1

    • Update @actions/cache on windows to use gnu tar and zstd by default and fallback to bsdtar and zstd if gnu tar is not available. (issue)

    3.1.0-beta.2

    • Added support for fallback to gzip to restore old caches on windows.

    3.1.0-beta.3

    • Bug fixes for bsdtar fallback if gnutar not available and gzip fallback if cache saved using old cache action on windows.

    3.2.0-beta.1

    • Added two new actions - restore and save for granular control on cache.

    3.2.0

    • Released the two new actions - restore and save for granular control on cache

    3.2.1

    • Update @actions/cache on windows to use gnu tar and zstd by default and fallback to bsdtar and zstd if gnu tar is not available. (issue)
    • Added support for fallback to gzip to restore old caches on windows.
    • Added logs for cache version in case of a cache miss.

    3.2.2

    • Reverted the changes made in 3.2.1 to use gnu tar and zstd by default on windows.
    Commits
    • 4723a57 Revert compression changes related to windows but keep version logging (#1049)
    • d1507cc Merge pull request #1042 from me-and/correct-readme-re-windows
    • 3337563 Merge branch 'main' into correct-readme-re-windows
    • 60c7666 save/README.md: Fix typo in example (#1040)
    • b053f2b Fix formatting error in restore/README.md (#1044)
    • 501277c README.md: remove outdated Windows cache tip link
    • c1a5de8 Upgrade codeql to v2 (#1023)
    • 9b0be58 Release compression related changes for windows (#1039)
    • c17f4bf GA for granular cache (#1035)
    • ac25611 docs: fix an invalid link in workarounds.md (#929)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • Progressive rollouts via pod readiness gate

    Currently Flagger supports advanced deployment strategies mainly via a service mesh or an ingress. These advanced methods are great, but they also add extra complexity.

    I suggest adding a new deployment method via pod readiness gate.

    How it works:

    1. A pod readiness gate is added to the deployment spec, for example:
      readinessGates:
        - conditionType: "flagger.io/progress"
    
    2. Once a rollout takes place and new pods are launched, the deployment progress will stop until Flagger updates the readiness gate
    3. Flagger performs an analysis
    4. If the analysis passes, Flagger updates the readiness gate field in the new pods
    5. The deployment progresses according to the rollingUpdate strategy and new pods are launched
    6. Repeat

    Advantages:

    1. Native deployment object can be used, no need to create new deployments and shift traffic between them
    2. No need for special considerations for HPA and configmaps
    3. Can work with daemonsets and statefulsets as well

    I believe this feature will make Flagger much more approachable to a wider audience that is not using service meshes, and will allow for super simple onboarding while using existing production Deployment/HPA resources with no need for migration.
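
    To make the idea concrete, the only change on the workload side would be a readiness gate plus a rolling update strategy on an ordinary Deployment; a sketch (the flagger.io/progress condition type is the one proposed above, not something Flagger implements today):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: podinfo
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: podinfo
      strategy:
        type: RollingUpdate
        rollingUpdate:
          # new pods come up in small batches, each batch gated on the analysis
          maxSurge: 1
          maxUnavailable: 0
      template:
        metadata:
          labels:
            app: podinfo
        spec:
          readinessGates:
            # pods stay "not ready" until the controller sets this condition to True
            - conditionType: "flagger.io/progress"
          containers:
            - name: podinfo
              image: ghcr.io/stefanprodan/podinfo:6.0.0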

  • Virtual Service gateway and host not update after Delegation enabled in canary

    Describe the bug

    Currently there is a feature toggle .Values.wbx3.canary.delegate in our canary resource definition that switches the Flagger virtual service between being used as a delegate (called by another virtual service) and taking gateway traffic directly. It looks like the virtual service gateway and hosts are not updated after delegation is enabled in the canary.

    To Reproduce

    canary.yaml

      service:
        name: hello-world-flagger
        port: 8080
        portDiscovery: true
        {{ $delegateEnabled := lower $.Values.wbx3.canary.delegate}}
        {{- if eq $delegateEnabled "true" -}}
        delegation: true
        {{- else -}}
        gateways:
         - hello-world-web
        hosts:
         - hello-world.xxxx.xxxx
        {{- end }}
    
    • Set .Values.wbx3.canary.delegate to false, then trigger helm deploy.
    $kubectl get vs
    NAME                   GATEWAYS                   HOSTS                                             AGE                                                                                                          
    hello-world-flagger    ["hello-world-web"]        ["hello-world.xxxx.xxxx","hello-world-flagger"]   48m
    
    • Set .Values.wbx3.canary.delegate to true, then trigger helm deploy.
    $kubectl get vs
     NAME                  GATEWAYS                   HOSTS                                            AGE   
    hello-world-flagger    ["hello-world-web"]        ["hello-world.xxxx.xxxx","hello-world-flagger"]   48m
    

    Expected behavior

    The gateway and hosts should be automatically removed when delegation is enabled in the canary resource.

    Additional context

    • Flagger version: 1.20.4
    • Kubernetes version: v1.21.5
    • Service Mesh provider: Istio
    • Ingress provider:
  • Using thresholds in datadog metrics

    I have seen that during canary analysis, if we are using Datadog custom metrics, we can only push data (metrics) for the canary instance into Datadog. Is there any way we can enable the same logic as Datadog monitors? For example, when using a MetricTemplate for not-found-percentage, is there any way we can tell Flagger to stop the rollout if the provided query 100 - ( sum:istio.mesh.request.count{ reporter:destination, destination_workload_namespace:{{ namespace }}, destination_workload:{{ target }}, !response_code:404 }.as_count() / sum:istio.mesh.request.count{ reporter:destination, destination_workload_namespace:{{ namespace }}, destination_workload:{{ target }} }.as_count() ) * 100

    is above a certain threshold that we set! Thanks.
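
    For what it's worth, the threshold logic lives on the Canary analysis rather than in Datadog itself: the MetricTemplate only defines the query, and the analysis entry that references it sets the acceptable range, halting the rollout when the value falls outside it. A sketch using the query from the question (the Datadog credentials secret name is a placeholder):

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: not-found-percentage
      namespace: istio-system
    spec:
      provider:
        type: datadog
        address: https://api.datadoghq.com
        # secret holding the Datadog API and application keys
        secretRef:
          name: datadog
      query: |
        100 - (
          sum:istio.mesh.request.count{
            reporter:destination,
            destination_workload_namespace:{{ namespace }},
            destination_workload:{{ target }},
            !response_code:404
          }.as_count()
          /
          sum:istio.mesh.request.count{
            reporter:destination,
            destination_workload_namespace:{{ namespace }},
            destination_workload:{{ target }}
          }.as_count()
        ) * 100

    And in the canary analysis:

    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system
        # Flagger fails the check (and eventually rolls back) when the value exceeds max
        thresholdRange:
          max: 5
        interval: 1m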
