Zeebe.io - Workflow Engine for Microservices Orchestration

Zeebe provides visibility into and control over business processes that span multiple microservices.

Why Zeebe?

  • Define workflows visually in BPMN 2.0
  • Choose your programming language
  • Deploy with Docker and Kubernetes
  • Build workflows that react to messages from Kafka and other message queues
  • Scale horizontally to handle very high throughput
  • Fault tolerance (no relational database required)
  • Export workflow data for monitoring and analysis
  • Engage with an active community

Learn more at zeebe.io
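
For a first impression of how an application talks to Zeebe, here is a minimal, hedged sketch using the Java client. It assumes a recent zeebe-client-java (8.x package names), an insecure local gateway on localhost:26500, and a process.bpmn on the classpath with process id order-process; none of these specifics come from this README.

    import io.camunda.zeebe.client.ZeebeClient;

    public final class GettingStarted {
      public static void main(final String[] args) {
        // Assumption: a local broker with embedded gateway, no TLS.
        try (final ZeebeClient client =
            ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {

          // Deploy a BPMN model and start one instance of it.
          client.newDeployResourceCommand().addResourceFromClasspath("process.bpmn").send().join();
          client.newCreateInstanceCommand().bpmnProcessId("order-process").latestVersion().send().join();

          // Open a worker for service tasks of type "payment".
          // In a real application the client and worker stay open; they are closed here for brevity.
          client.newWorker()
              .jobType("payment")
              .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send().join())
              .open();
        }
      }
    }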

Status

Starting with Zeebe 0.20.0, the "developer preview" label was removed and the first production-ready version was released.

To learn more about what we're currently working on, please visit the roadmap.

Helpful Links

Recommended Docs Entries for New Users

Contributing

Read the Contributions Guide.

Code of Conduct

This project adheres to the Camunda Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior as soon as possible.

License

Zeebe source files are made available under the Zeebe Community License Version 1.0 except for the parts listed below, which are made available under the Apache License, Version 2.0. See individual source files for details.

Available under the Apache License, Version 2.0:

Clarification on gRPC Code Generation

The Zeebe Gateway Protocol (API) as published in the gateway-protocol is licensed under the Zeebe Community License 1.0. Using gRPC tooling to generate stubs for the protocol does not constitute creating a derivative work under the Zeebe Community License 1.0 and no licensing restrictions are imposed on the resulting stub code by the Zeebe Community License 1.0.
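
As an illustration of using such generated stubs (not taken from the project docs; the package and class names below assume the usual grpc-java code generation for gateway.proto into io.camunda.zeebe.gateway.protocol, and a gateway listening on localhost:26500):

    import io.camunda.zeebe.gateway.protocol.GatewayGrpc;
    import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.TopologyRequest;
    import io.camunda.zeebe.gateway.protocol.GatewayOuterClass.TopologyResponse;
    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    public final class StubExample {
      public static void main(final String[] args) {
        // Plaintext channel to a local gateway; TLS would be configured here instead.
        final ManagedChannel channel =
            ManagedChannelBuilder.forAddress("localhost", 26500).usePlaintext().build();
        try {
          final GatewayGrpc.GatewayBlockingStub gateway = GatewayGrpc.newBlockingStub(channel);
          // Query the cluster topology via the generated stub.
          final TopologyResponse topology = gateway.topology(TopologyRequest.newBuilder().build());
          System.out.println("Brokers in the cluster: " + topology.getBrokersCount());
        } finally {
          channel.shutdownNow();
        }
      }
    }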

Comments
  • feat(client): Allow disabling environment variable override in Java client

    This would fix https://github.com/camunda/zeebe/issues/9401, which is needed to reliably run test cases using zeebe-process-test or spring-zeebe.
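
    For context, a rough sketch of how a test could use such an option, assuming the builder ends up exposing a flag along the lines of applyEnvironmentVariableOverrides (the flag name and the surrounding code are illustrative assumptions, not the final API):

        import io.camunda.zeebe.client.ZeebeClient;

        public final class TestClientFactory {
          /**
           * Builds a client whose configuration is fully controlled by the test,
           * ignoring any ZEEBE_* environment variables set on the build agent.
           */
          public static ZeebeClient createIsolatedClient(final String gatewayAddress) {
            return ZeebeClient.newClientBuilder()
                .applyEnvironmentVariableOverrides(false) // assumed flag, see lead-in
                .gatewayAddress(gatewayAddress)
                .usePlaintext()
                .build();
          }
        }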

    Definition of Done

    Not all items need to be done depending on the issue and the pull request.

    Code changes:

    • [x] The changes are backwards compatible with previous versions
    • [ ] If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create backports manually.

    Testing:

    • [x] There are unit/integration tests that verify all acceptance criteria of the issue
    • [ ] New tests are written to ensure backwards compatibility with further versions
    • [x] The behavior is tested manually
    • [ ] The change has been verified by a QA run
    • [ ] The impact of the changes is verified by a benchmark

    Documentation:

    • [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
    • [ ] New content is added to the release announcement
    • [ ] If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

    Please refer to our review guidelines.

  • Worker request ActivateJobs activate job in Zeebe but get an empty response

    Describe the bug: Randomly, and not often, a worker requesting ActivateJobs on Zeebe gets an empty response, but the job is activated in Zeebe, so it will never be completed.

    To Reproduce: Make an integration test that loops over this sequence:

    • Deploy workflow with one service task.
    • Create workflow instance.
    • Wait for the job to be completed by the worker with a timeout.
    • Repeat...

    Request parameters:

            activate_jobs_response = stub.ActivateJobs(
                gateway_pb2.ActivateJobsRequest(
                    # the job type, as defined in the BPMN process
                    type=agent_job_type,
                    # the name of the agent activating the jobs, mostly used for logging purposes
                    worker=agent_job_type,
                    # a job returned after this call will not be activated by another call until the
                    # timeout( in ms) has been reached
                    timeout=43200000, # 12h
                    # the maximum jobs to activate by this request
                    maxJobsToActivate=5,
                    # a list of variables to fetch as the job variables; if empty, all visible variables at the time
                    # of activation for the scope of the job will be returned
                    # fetchVariable=[]
                    # The request will be completed when at least one job is activated or after the
                    # requestTimeout (in ms). if the requestTimeout = 0, a default timeout is used. if the
                    # requestTimeout < 0, long polling is disabled and the request is completed immediately,
                    # even when no job is activated.
                    requestTimeout=5
                )
            )
    

    Expected behavior: If the worker's ActivateJobs request activates jobs in Zeebe, they must be returned to the worker in the response to that request.

    Log/Stacktrace

    You'll find in the attached file:

    • ES exported events (to help see the timing of events)
    • TCPDUMP of API request/response between worker and Zeebe
    • Zeebe logs at level=ALL

    Zeebe ActivateJobs Bug.txt

    Environment:

    • OS: Ubuntu Mate 18.04 (worker), Ubuntu Mate 16.04 (Zeebe)
    • Zeebe Version: 0.25.1
    • Configuration:

    Python client:

    Package           Version
    ----------------- ---------
    certifi           2020.11.8
    chardet           3.0.4
    docker            4.4.0
    elasticsearch     6.8.1
    elasticsearch-dsl 6.4.0
    grpcio            1.33.2
    idna              2.10
    pip               20.2.4
    protobuf          3.13.0
    python-dateutil   2.8.1
    PyYAML            5.3.1
    requests          2.25.0
    setuptools        50.3.2
    six               1.15.0
    urllib3           1.26.2
    websocket-client  0.57.0
    wheel             0.35.1
    zeebe-grpc        0.25.1.0
    

    Zeebe config:

    --- # ----------------------------------------------------
    
    # Zeebe Standalone Broker configuration file (with embedded gateway)
    
    # This file is based on broker.standalone.yaml.template but stripped down to contain only a limited
    # set of configuration options. These are a good starting point to get to know Zeebe.
    # For advanced configuration options, have a look at the templates in this folder.
    
    # !!! Note that this configuration is not suitable for running a standalone gateway. !!!
    # If you want to run a standalone gateway node, please have a look at gateway.yaml.template
    
    # ----------------------------------------------------
    # Byte sizes
    # For buffers and others must be specified as strings and follow the following
    # format: "10U" where U (unit) must be replaced with KB = Kilobytes, MB = Megabytes or GB = Gigabytes.
    # If unit is omitted then the default unit is simply bytes.
    # Example:
    # sendBufferSize = "16MB" (creates a buffer of 16 Megabytes)
    #
    # Time units
    # Timeouts, intervals, and the likes, must be specified either in the standard ISO-8601 format used
    # by java.time.Duration, or as strings with the following format: "VU", where:
    #   - V is a numerical value (e.g. 1, 5, 10, etc.)
    #   - U is the unit, one of: ms = Millis, s = Seconds, m = Minutes, or h = Hours
    #
    # Paths:
    # Relative paths are resolved relative to the installation directory of the broker.
    zeebe:
      broker:
        gateway:
          # Enable the embedded gateway to start on broker startup.
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_GATEWAY_ENABLE.
          enable: true
    
          network:
            # Sets the port the embedded gateway binds to.
            # This setting can also be overridden using the environment variable ZEEBE_BROKER_GATEWAY_NETWORK_PORT.
            port: 26500
          security:
            # Enables TLS authentication between clients and the gateway
            # This setting can also be overridden using the environment variable ZEEBE_BROKER_GATEWAY_SECURITY_ENABLED.
            enabled: false
    
        network:
          # Controls the default host the broker should bind to. Can be overwritten on a
          # per binding basis for client, management and replication
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_NETWORK_HOST.
          host: 0.0.0.0
    
        data:
          # Specify a list of directories in which data is stored.
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_DATA_DIRECTORIES.
          directories: [ "{{ zeebe_data_directories }}" ]
          # The size of data log segment files.
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_DATA_LOGSEGMENTSIZE.
          logSegmentSize: 512MB
          # How often we take snapshots of streams (time unit)
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_DATA_SNAPSHOTPERIOD.
          snapshotPeriod: 15m
    
        cluster:
          # Specifies the Zeebe cluster size.
          # This can also be overridden using the environment variable ZEEBE_BROKER_CLUSTER_CLUSTERSIZE.
          clusterSize: 1
          # Controls the replication factor, which defines the count of replicas per partition.
          # This can also be overridden using the environment variable ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR.
          replicationFactor: 1
          # Controls the number of partitions, which should exist in the cluster.
          # This can also be overridden using the environment variable ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT.
          partitionsCount: 1
    
        threads:
          # Controls the number of non-blocking CPU threads to be used.
          # WARNING: You should never specify a value that is larger than the number of physical cores
          # available. Good practice is to leave 1-2 cores for ioThreads and the operating
          # system (it has to run somewhere). For example, when running Zeebe on a machine
          # which has 4 cores, a good value would be 2.
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_THREADS_CPUTHREADCOUNT
          cpuThreadCount: 2
          # Controls the number of io threads to be used.
          # This setting can also be overridden using the environment variable ZEEBE_BROKER_THREADS_IOTHREADCOUNT
          ioThreadCount: 2
        # Elasticsearch Exporter ----------
        # An example configuration for the elasticsearch exporter:
        #
        # These setting can also be overridden using the environment variables "ZEEBE_BROKER_EXPORTERS_ELASTICSEARCH_..."
        #
        exporters:
          elasticsearch:
            className: io.zeebe.exporter.ElasticsearchExporter
    
            args:
              url: http://elasticsearch:9200
    
              bulk:
                delay: 5
                size: 1000
    
              # authentication:
              #   username: elastic
              #   password: changeme
    
              index:
                prefix: zeebe-record
                createTemplate: true
    
                command: false
                event: true
                rejection: false
    
                deployment: true
                error: true
                incident: true
                job: true
                jobBatch: false
                message: false
                messageSubscription: false
                variable: true
                variableDocument: true
                workflowInstance: true
                workflowInstanceCreation: false
                workflowInstanceSubscription: false
    
                ignoreVariablesAbove: 32677
    
    
  • ZeebeWorker request ActivatedJob in zeebe but get no response

    Describe the bug

    Randomly, and not often, a worker requesting ActivateJobs on Zeebe gets an empty response, but the job is activated in Zeebe, so it will never be completed. I mentioned this bug in https://github.com/camunda/zeebe/issues/8759 and found that it had been resolved in zeebe-1.3.6: https://github.com/camunda/zeebe/issues?q=label%3Arelease%2F1.3.6+is%3Aclosed. @npepinpe @romansmirnov

    image

    Unfortunately it also appears in this version. After analyzing the fix in depth, I suspect that 1.3.6 only resolves the case where "When the broker activates jobs and before sending them back to the gateway, the gateway potentially could close its connection (e.g., the request timed out, network interruption, gateway died, etc.)", and that bug fix does make sense. However, as far as I know, the parameter named "zeebe.client.request-timeout", which affects the interaction between the job worker client and the zeebe-gateway, may lead to the issue mentioned above. After a quick code review, I found that the zeebe-client-java code has not changed from 1.3.5 to 1.3.6. So I think 1.3.6 cannot solve the problem where the job worker client misses an activated job because it fails to receive it even though the gateway has already received the activated job from the Zeebe broker.

    Looking forward to your reply.
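
    For reference, the client-side knobs discussed above correspond roughly to the following Java client setup (a sketch only; defaultRequestTimeout and the worker-level requestTimeout are assumed to be the relevant builder methods, and the job type and 10s values are placeholders mirroring the environment section below):

        import io.camunda.zeebe.client.ZeebeClient;
        import java.time.Duration;

        public final class WorkerSetup {
          public static ZeebeClient buildClient() {
            return ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500")
                .usePlaintext()
                // Client-wide request timeout (presumably what the report's zeebe.client.request-timeout controls).
                .defaultRequestTimeout(Duration.ofSeconds(10))
                .build();
          }

          public static void openWorker(final ZeebeClient client) {
            client.newWorker()
                .jobType("my-job-type")
                .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send().join())
                // Long-polling window of the underlying ActivateJobs requests.
                .requestTimeout(Duration.ofSeconds(10))
                .open();
          }
        }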

    To Reproduce

    image

    1. Deploy the BPMN process image

    2. Create instances constantly at different QPS

    Expected behavior: No job should miss execution.

    Environment: zeebe 1.3.7 java-client sdk

    PARTITIONSCOUNT=54 BROKER_CLUSTER_CLUSTERSIZE=18 request_time_out=10s

  • Unexpected drop of performance with 1.2

    Describe the bug

    Running a benchmark with 1.2 (one partition, one worker and one starter), we observed a drop in throughput, similar to what is described in the chaos day report https://zeebe-io.github.io/zeebe-chaos/2021/10/05/recovery-time#performance.

    general

    It seems that at some point the backpressure spikes and a lot of incoming requests are blocked or rejected. It is currently unclear why; nothing can be observed in the logs, no errors, nothing.
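
    (Not part of the original report, but for readers hitting the same symptom: when backpressure rejects a request, clients see a gRPC RESOURCE_EXHAUSTED status and are expected to retry with backoff. A minimal, generic sketch at the raw gRPC level:)

        import io.grpc.Status;
        import io.grpc.StatusRuntimeException;
        import java.util.function.Supplier;

        public final class BackpressureRetry {
          /** Retries the given call with a simple linear backoff while the broker applies backpressure. */
          public static <T> T withRetry(final Supplier<T> call) throws InterruptedException {
            for (int attempt = 1; ; attempt++) {
              try {
                return call.get();
              } catch (final StatusRuntimeException e) {
                if (e.getStatus().getCode() != Status.Code.RESOURCE_EXHAUSTED || attempt >= 5) {
                  throw e; // not backpressure, or retries exhausted
                }
                Thread.sleep(100L * attempt); // back off before retrying
              }
            }
          }
        }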

    To Reproduce

    I ran the same benchmark as in the chaos day, and it happened again. I will attach the benchmark details as a zip.

    Expected behavior

    Running smoothly, like the other benchmarks.

    Log/Stacktrace

    Full Stacktrace

    <STACKTRACE>
    

    Environment:

  • Excessive number of log segments generated on one partition

    Describe the bug: We have come across an issue where we are seeing an excessive number of log segments getting generated for one of the partitions (cluster: 3 brokers, 3 partitions, standalone gateway, no exporters). The number of segments generated for that partition has reached around 300, at 512MB (logSegmentSize) each. That is almost 150GB of data in our PVC (per broker), and we have started seeing some other issues as well.

    For the last day we have been seeing continuous Failed to obtain a pending snapshot directory for position x errors. We can also see in Grafana that the Snapshot Operation has stopped since yesterday. Screenshot 2020-07-27 at 10 41 24 PM Screenshot 2020-07-27 at 10 41 39 PM

    We have also seen one other error coming in the broker logs:

    2020-07-26 17:44:44.667 [Broker-1-StreamProcessor-3] [Broker-1-zb-actors-1] ERROR io.zeebe.broker.workflow - Expected to find job with key 6755399445248597, but no job found
    

    To Reproduce: On investigating the log segments, we found these errors in most of the segments:

    Expected to time out activated job with key '6755399445248597', but it must be activated firstŽ¨deadlineÏs‹éH‡¦worker±logistics§retries¤type±logisticscustomHeaders€©variablesÄ€¬errorMessageÚÅjavax.persistence.PersistenceException: org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
    

    After seeing these errors, we cancelled their corresponding workflow instances, expecting the compaction to get triggered. But the compaction did not get triggered and we still have the same number of raft-partition*.log files (around 300) present for that partition. At least after cancelling these instances, the data growth has reduced significantly, but still growing gradually (and without creating new segments! Talk about all the weird issues coming together! 😅 ).

    Expected behavior: As per the docs, log segment deletion will begin as soon as the max number of snapshots has been reached. But the segments have not been deleted and the data is continuously growing.

    Log/Stacktrace

    Full Stacktrace

    2020-07-27 15:03:20.979 [Broker-0-SnapshotDirector-2] [Broker-0-zb-fs-workers-1] WARN  io.zeebe.logstreams.snapshot - Failed to obtain a pending snapshot directory for position 11050954632720
    2020-07-27 15:11:51.468 [] [raft-server-0-raft-partition-partition-1] WARN  io.atomix.raft.roles.LeaderAppender - RaftServer{raft-partition-partition-1} - AppendRequest{term=182, leader=0, prevLogIndex=54210760, prevLogTerm=182, entries=0, commitIndex=54722997} to 2 failed: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request type raft-partition-partition-1-append timed out in 5000 milliseconds
    java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request type raft-partition-partition-1-append timed out in 5000 milliseconds
    	at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$17(NettyMessagingService.java:482) ~[atomix-cluster-0.23.4.jar:0.23.4]
    	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[guava-28.2-jre.jar:?]
    	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$18(NettyMessagingService.java:482) ~[atomix-cluster-0.23.4.jar:0.23.4]
    	at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    	at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.23.4.jar:0.23.4]
    	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
    	at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
    	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    	at java.lang.Thread.run(Unknown Source) [?:?]
    Caused by: java.util.concurrent.TimeoutException: Request type raft-partition-partition-1-append timed out in 5000 milliseconds
    	... 7 more
    2020-07-27 15:13:27.022 [Broker-0-SnapshotDirector-1] [Broker-0-zb-fs-workers-0] WARN  io.zeebe.logstreams.snapshot - Failed to obtain a pending snapshot directory for position 11987254449144
    2020-07-27 15:18:21.011 [Broker-0-SnapshotDirector-2] [Broker-0-zb-fs-workers-1] WARN  io.zeebe.logstreams.snapshot - Failed to obtain a pending snapshot directory for position 11050956241496
    

    Environment:

    • Zeebe Version: 0.23.4
    Full Stacktrace

    Starting broker 0 with configuration {
      "network" : {
        "host" : "0.0.0.0",
        "portOffset" : 0,
        "maxMessageSize" : "4MB",
        "advertisedHost" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local",
        "commandApi" : {
          "host" : "0.0.0.0",
          "port" : 26501,
          "advertisedHost" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local",
          "advertisedPort" : 26501,
          "address" : "0.0.0.0:26501",
          "advertisedAddress" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local:26501"
        },
        "internalApi" : {
          "host" : "0.0.0.0",
          "port" : 26502,
          "advertisedHost" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local",
          "advertisedPort" : 26502,
          "address" : "0.0.0.0:26502",
          "advertisedAddress" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local:26502"
        },
        "monitoringApi" : {
          "host" : "0.0.0.0",
          "port" : 9600,
          "advertisedHost" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local",
          "advertisedPort" : 9600,
          "address" : "0.0.0.0:9600",
          "advertisedAddress" : "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local:9600"
        },
        "maxMessageSizeInBytes" : 4194304
      },
      "cluster" : {
        "initialContactPoints" : [ "prod-zeebe-zeebe-broker-0.prod-zeebe-zeebe-broker.prod.svc.cluster.local:26502", "prod-zeebe-zeebe-broker-1.prod-zeebe-zeebe-broker.prod.svc.cluster.local:26502", "prod-zeebe-zeebe-broker-2.prod-zeebe-zeebe-broker.prod.svc.cluster.local:26502" ],
        "partitionIds" : [ 1, 2, 3 ],
        "nodeId" : 0,
        "partitionsCount" : 3,
        "replicationFactor" : 3,
        "clusterSize" : 3,
        "clusterName" : "zeebe-cluster",
        "gossipFailureTimeout" : 10000,
        "gossipInterval" : 250,
        "gossipProbeInterval" : 1000
      },
      "threads" : {
        "cpuThreadCount" : 2,
        "ioThreadCount" : 2
      },
      "data" : {
        "directories" : [ "/usr/local/zeebe/data" ],
        "logSegmentSize" : "512MB",
        "snapshotPeriod" : "PT15M",
        "logIndexDensity" : 100,
        "logSegmentSizeInBytes" : 536870912,
        "atomixStorageLevel" : "DISK"
      },
      "exporters" : { },
      "gateway" : {
        "network" : {
          "host" : "0.0.0.0",
          "port" : 26500,
          "minKeepAliveInterval" : "PT30S"
        },
        "cluster" : {
          "contactPoint" : "0.0.0.0:26502",
          "requestTimeout" : "PT15S",
          "clusterName" : "zeebe-cluster",
          "memberId" : "gateway",
          "host" : "0.0.0.0",
          "port" : 26502
        },
        "threads" : {
          "managementThreads" : 1
        },
        "monitoring" : {
          "enabled" : false,
          "host" : "0.0.0.0",
          "port" : 9600
        },
        "security" : {
          "enabled" : false,
          "certificateChainPath" : null,
          "privateKeyPath" : null
        },
        "enable" : false
      },
      "backpressure" : {
        "enabled" : true,
        "algorithm" : "VEGAS"
      },
      "stepTimeout" : "PT5M"
    }
    
    

  • [Similar] Long Polling is blocked even though jobs are available on the broker

    Hi, I have a similar problem. I noticed it 2 days ago when I had Zeebe version 0.23.0; after that I updated the cluster to 0.24.4 and this version fixed all my problems, but today the problem has returned and I don't know what happened. I have a 0.24.4 Zeebe broker and Zeebe gateway, and spring-zeebe.

    Originally posted by @jooble in https://github.com/zeebe-io/zeebe/issues/4396#issuecomment-711688057

  • 4396 long polling blocked bug - target 0.23.0

    Description

    This PR proposes a fix to the long polling blocked bug. I don't expect it to be approved right away. Instead, I hope for some critical comments and an engaging discussion.

    Please read the issue and the analysis therein to understand what was fixed.

    Talking about the fixes:

    • For the backpressure case, the ActivateJobsHandler now tracks whether some brokers returned a resource-exhausted error and, if so, tells the long polling handler about it. I think this change is solid, albeit not very elegant (a rough sketch of the idea follows below).
    • For the time window in which the long polling handler did not know that a request is currently running: here the main idea is to register any active request right away. I think the approach is solid, but the implementation is somewhat lacking. It is quite big for a simple bugfix, and rather small for a major refactoring. Also, besides the intended change in behavior, there are some unintended side effects: the scheduling order of long polling requests will change slightly, and how often the broker is queried also changes. The existing code had some downsides, in my opinion, so I took some liberty in deviating from the existing implementation. Having said that, it is hard to know which specific behavior was intended and which was arbitrary.

    Happy to discuss the details and answer any questions. Just want to create the review early given that it is a critical bug and I did the best I could.
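
    Purely to illustrate the first bullet above (all names are hypothetical and not the actual gateway classes): the activation round remembers whether any broker answered with resource exhausted, so the long polling handler does not treat the round as "no jobs anywhere":

        /** Hypothetical sketch of the "track resource exhaustion" idea; not actual Zeebe gateway code. */
        final class ActivationRound {
          private boolean jobsActivated;
          private boolean resourceExhausted;

          void onBrokerResponse(final int activatedJobs) {
            jobsActivated |= activatedJobs > 0;
          }

          void onBrokerError(final boolean isResourceExhausted) {
            resourceExhausted |= isResourceExhausted;
          }

          /** The long polling handler should only park the request if truly no jobs were available. */
          boolean shouldBlockRequest() {
            // A RESOURCE_EXHAUSTED answer means we don't know whether jobs exist,
            // so the request must not be parked until the next job is created.
            return !jobsActivated && !resourceExhausted;
          }
        }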

    Related issues

    closes #4396

    Pull Request Checklist

    • [X] All commit messages match our commit message guidelines
    • [X] The submitting code follows our code style
    • [X] If submitting code, please run mvn clean install -DskipTests locally before committing
  • Multiple OOM encountered on benchmark cluster

    Describe the bug

    The benchmark cluster for branch release-1.3.0 experienced multiple Out Of Memory (OOM) errors.

    This is a potential regression, although it is likely that this issue has existed for longer. Note that the resources for the benchmark project were reduced recently; see #8268.

    Occurrences

    zeebe-2 @ 2021-12-27 ~11:21:45

    Only a small dip in processing throughput Screen Shot 2022-01-03 at 16 29 54

    GC shortly spiked and then dropped Screen Shot 2022-01-03 at 16 22 41

    Simultaneously JVM memory usage increased from max ~200MB to spikes above 500MB and direct buffer pool memory usage doubled in this short window from ~400MB to ~860MB. Screen Shot 2022-01-03 at 16 23 51

    During this time, RocksDB memory usage was similar to before ~500MB per partition. Screen Shot 2022-01-03 at 16 27 37

    Install requests were frequently sent 🤔 Screen Shot 2022-01-03 at 16 33 56

    It had just transitioned to INACTIVE and closed the database when it started to transition to FOLLOWER. Soon after it opened the database, it stopped.

    2021-12-27 11:21:22.684 CET "Transition to INACTIVE on term 12 completed" 
    2021-12-27 11:21:22.734 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'." 
    2021-12-27 11:21:22.784 CET "Committed new snapshot 289647968-12-833412599-833412170" 
    2021-12-27 11:21:22.785 CET "Deleting previous snapshot 289590585-12-833268476-833246498" 
    2021-12-27 11:21:22.787 CET "Scheduling log compaction up to index 289647968" 
    2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}" 
    2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '289645353') and replace with received snapshot (index '289647968')" 
    2021-12-27 11:21:22.816 CET "Transition to FOLLOWER on term 12 requested." 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ExporterDirector" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing SnapshotDirector" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing StreamProcessor" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing QueryService" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ZeebeDb" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStream" 
    2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStorage" 
    2021-12-27 11:21:22.817 CET "Preparing transition from INACTIVE on term 12 completed" 
    2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 starting" 
    2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 - transitioning LogStorage" 
    2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning LogStream" 
    2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-2-LogStream-2{status=HEALTHY}]" 
    2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning ZeebeDb" 
    2021-12-27 11:21:22.818 CET "Partition-2 recovered, marking it as healthy" 
    2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-2-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]" 
    2021-12-27 11:21:22.818 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}" 
    2021-12-27 11:21:22.915 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'." 
    2021-12-27 11:21:22.915 CET "Transition to FOLLOWER on term 12 - transitioning QueryService" 
    2021-12-27 11:21:22.916 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@649caa67)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2afb9843, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@6c21e905, configuration: Configuration(false)]" 
    2021-12-27 11:21:22.916 CET "Transition to FOLLOWER on term 12 - transitioning StreamProcessor"  
    2021-12-27 11:21:22.965 CET "request [POST http://elasticsearch-master:9200/_bulk] returned 1 warnings: [299 Elasticsearch-7.16.2-2b937c44140b6559905130a8650c64dbd0879cfb "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html to enable security."]"  
    2021-12-27 11:21:25.069 CET  ++ hostname -f
    

    zeebe-2 @ 2021-12-28 ~09:19:45, followed by zeebe-1 @ 2021-12-28 ~09:25:15

    Just before the OOM, the starter and worker restarted, which might explain the loss of processing throughput. Screen Shot 2022-01-03 at 17 00 08

    Zeebe-2 restarted at ~09:19:45, so the OOM should have happened just before that. Screen Shot 2022-01-03 at 17 01 13

    Zeebe 2: If we filter on that pod alone, we see that it was actually briefly processing as leader just before the OOM. Screen Shot 2022-01-03 at 17 04 40

    GC is much quieter here before the OOM. JVM memory usage is about 600MB and direct buffer pool memory has just increased to ~860MB again (just like before). RocksDB is still stable at 500MB per partition; no screenshot added. Screen Shot 2022-01-03 at 17 08 31

    Zeebe 2 did not produce any interesting logs, as far as I could tell.

    Zeebe 1: Zeebe-1 also does some processing as leader shortly before its OOM, ~5 min after zeebe-2 crashed. Screen Shot 2022-01-03 at 17 27 56

    Zeebe-1 looks a lot like zeebe-2 when we look at the memory decomposition. Note the increase in direct buffer pool memory just before the OOM, like in the other cases. Screen Shot 2022-01-03 at 17 30 13

    Partitions fully recovered, but about 1m30s after a snapshot was committed, an actor appears blocked. This means that the health tick is no longer updated. Directly after this, the pod dies.

    2021-12-28 09:22:16.229 CET "Partition-2 recovered, marking it as healthy"
    2021-12-28 09:22:16.229 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
    2021-12-28 09:22:16.667 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=HEALTHY}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
    2021-12-28 09:22:16.668 CET "Partition-1 recovered, marking it as healthy"
    2021-12-28 09:22:16.668 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Broker-1-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
    2021-12-28 09:22:21.703 CET "Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654."
    2021-12-28 09:22:21.907 CET "Finished taking temporary snapshot, need to wait until last written event position 1034640293 is committed, current commit position is 1034640235. After that snapshot will be committed."
    2021-12-28 09:22:21.933 CET "Current commit position 1034640293 >= 1034640293, committing snapshot FileBasedTransientSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654, checksum=890364779, metadata=FileBasedSnapshotMetadata{index=359974668, term=18, processedPosition=1034639769, exporterPosition=1034638654}}."
    2021-12-28 09:22:21.941 CET "Committed new snapshot 359974668-18-1034639769-1034638654"
    2021-12-28 09:22:21.942 CET "Deleting previous snapshot 359646996-17-1033697878-1033694056"
    2021-12-28 09:22:21.947 CET "Scheduling log compaction up to index 359974668"
    2021-12-28 09:22:21.951 CET "raft-partition-partition-3 - Deleting log up from 359633252 up to 359947572 (removing 21 segments)"
    2021-12-28 09:22:32.628 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
    2021-12-28 09:23:41.408 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
    2021-12-28 09:23:41.848 CET "Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
    2021-12-28 09:23:41.852 CET "Partition-1 failed, marking it as unhealthy: Broker-1{status=HEALTHY}"
    2021-12-28 09:23:41.852 CET "Detected 'UNHEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
    2021-12-28 09:23:41.861 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, Broker-1-Exporter-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-1-LogStream-2{status=HEALTHY}, Broker-1-SnapshotDirector-2{status=HEALTHY}, Broker-1-StreamProcessor-2{status=HEALTHY}]"
    2021-12-28 09:23:41.861 CET "Partition-2 recovered, marking it as healthy"
    2021-12-28 09:23:41.861 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
    2021-12-28 09:24:11.884 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-StreamProcessor-3{status=UNHEALTHY, issue='actor appears blocked'}, ZeebePartition-3{status=HEALTHY}, Broker-1-Exporter-3{status=HEALTHY}, raft-partition-partition-3{status=HEALTHY}, Broker-1-LogStream-3{status=HEALTHY}, Broker-1-SnapshotDirector-3{status=HEALTHY}]"
    2021-12-28 09:24:39.044 CET ++ hostname -f
    

    zeebe-2 @ 2021-12-28 ~22:50:00

    Again only a small dip in processing throughput (nice and quick failover 🚀 ) Screen Shot 2022-01-03 at 18 17 51

    Zeebe-2 was leader and processing before OOM Screen Shot 2022-01-03 at 18 16 52

    Interestingly, the logs just before the restart of zeebe-2 at this time are practically identical to the logs of zeebe-2 on the first OOM (the day before, on the 27th).

    Zeebe-2 had just transitioned to INACTIVE and closed the database. It was transitioning to FOLLOWER again, and just after it opened the database it was transitioning the StreamProcessor, which is the same transition it OOM-ed at the day before.

    2021-12-28 22:49:37.461 CET "Transition to INACTIVE on term 16 completed"
    2021-12-28 22:49:37.537 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
    2021-12-28 22:49:37.624 CET "Committed new snapshot 397383357-16-1142860979-1142860461"
    2021-12-28 22:49:37.625 CET "Deleting previous snapshot 397252741-16-1142453760-1142669843"
    2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
    2021-12-28 22:49:37.631 CET "Scheduling log compaction up to index 397383357"
    2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Delete existing log (lastIndex '397286333') and replace with received snapshot (index '397383357')"
    2021-12-28 22:49:37.670 CET "Transition to FOLLOWER on term 16 requested."
    2021-12-28 22:49:37.670 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ExporterDirector"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing SnapshotDirector"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing StreamProcessor"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing QueryService"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ZeebeDb"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStream"
    2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStorage"
    2021-12-28 22:49:37.671 CET "Preparing transition from INACTIVE on term 16 completed"
    2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 starting"
    2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 - transitioning LogStorage"
    2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning LogStream"
    2021-12-28 22:49:37.672 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-2-LogStream-1{status=HEALTHY}]"
    2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning ZeebeDb"
    2021-12-28 22:49:37.672 CET "Partition-1 recovered, marking it as healthy"
    2021-12-28 22:49:37.673 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Broker-2-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
    2021-12-28 22:49:37.673 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
    2021-12-28 22:49:37.837 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
    2021-12-28 22:49:37.838 CET "Transition to FOLLOWER on term 16 - transitioning QueryService"
    2021-12-28 22:49:37.840 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@2465e772)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2495802c, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@263ffbf0, configuration: Configuration(false)]"
    2021-12-28 22:49:37.841 CET "Transition to FOLLOWER on term 16 - transitioning StreamProcessor"
    2021-12-28 22:49:39.701 CET ++ hostname -f
    

    If you look at the logs from before that time, for a long period (at least multiple hours) it keeps transitioning from FOLLOWER to INACTIVE and back again. It's in a loop:

    2021-12-28 16:36:14.030 CET partition-3 "Transition to LEADER on term 21 requested."
    2021-12-28 16:36:14.127 CET partition-3 "Transition to LEADER on term 21 completed"
    2021-12-28 16:36:19.476 CET partition-2 "Transition to LEADER on term 19 requested."
    2021-12-28 16:36:19.590 CET partition-2 "Transition to LEADER on term 19 completed"
    2021-12-28 16:44:13.078 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 16:44:13.084 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 16:44:13.301 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 16:44:13.514 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 16:54:13.701 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 16:54:13.705 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 16:54:13.987 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 16:54:14.206 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 16:59:14.028 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 16:59:14.032 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 16:59:14.294 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 16:59:14.541 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:04:14.683 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:04:14.687 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:04:15.346 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:04:15.545 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:09:15.002 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:09:15.006 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:09:15.233 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:09:15.492 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:14:15.248 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:14:15.253 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:14:15.631 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:14:15.891 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:19:15.953 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:19:15.956 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:19:16.219 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:19:16.428 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:24:15.936 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:24:15.940 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:24:16.216 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:24:16.425 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:29:17.013 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:29:17.016 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:29:17.265 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:29:17.482 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:34:17.042 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:34:17.046 CET partition-1 "Transition to INACTIVE on term 16 completed"
    2021-12-28 17:34:17.380 CET partition-1 "Transition to FOLLOWER on term 16 requested."
    2021-12-28 17:34:17.585 CET partition-1 "Transition to FOLLOWER on term 16 completed"
    2021-12-28 17:39:17.481 CET partition-1 "Transition to INACTIVE on term 16 requested."
    2021-12-28 17:39:17.484 CET partition-1 "Transition to INACTIVE on term 16 completed"
    .... and so on, until 22:50:00
    

    This also happened the day before: https://cloudlogging.app.goo.gl/7qpb4Rammh11eqYh6

    Hypothesis: Looking at the above cases, it seems that a partition gets stuck in a transition loop between FOLLOWER and INACTIVE. Perhaps we have a memory leak in transitions.

  • fix(broker): fix OOM on restart

    Description

    Backport of the bug fix for #4871

    Related issues

    closes #4927

    Definition of Done

    Code changes:

    • [X] The changes are backwards compatible with previous versions
    • [X] If it fixes a bug then PRs are created to backport the fix to the last two minor versions

    Testing:

    • [X] There are unit/integration tests that verify all acceptance criteria of the issue
    • [ ] New tests are written to ensure backwards compatibility with further versions
    • [ ] The behavior is tested manually
    • [ ] The impact of the changes is verified by a benchmark

    Documentation:

    • [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
    • [ ] New content is added to the release announcement
  • chore(dist): suppress Kryo illegal access warnings

    Description

    • opens Java 9 modules for reflection access required by Kryo, suppressing the illegal access warnings at startup

    Related issues

    closes #3711

    Pull Request Checklist

    • [x] All commit messages match our commit message guidelines
    • [x] The submitting code follows our code style
    • [x] If submitting code, please run mvn clean install -DskipTests locally before committing
  • [Backport 0.24] Fix resource managment in ZeebePartition

    Description

    Backports:

    • #4857
    • https://github.com/zeebe-io/zeebe/pull/4913
    • https://github.com/zeebe-io/zeebe/pull/4915

    Related issues

    closes #4810 closes #4844 closes #4847 closes #4686 closes https://github.com/zeebe-io/zeebe/issues/4936 closes #4910 closes #4909 closes #4907 closes #4912 closes https://github.com/zeebe-io/zeebe/issues/4936

    Definition of Done

    Not all items need to be done depending on the issue and the pull request.

    Code changes:

    • [ ] The changes are backwards compatible with previous versions
    • [ ] If it fixes a bug then PRs are created to backport the fix to the last two minor versions

    Testing:

    • [ ] There are unit/integration tests that verify all acceptance criteria of the issue
    • [ ] New tests are written to ensure backwards compatibility with further versions
    • [ ] The behavior is tested manually
    • [ ] The impact of the changes is verified by a benchmark

    Documentation:

    • [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
    • [ ] New content is added to the release announcement
  • Advise OS on mmap usage

    Description

    I ran a test which showed a clear performance improvement when advising the OS on how we would use the memory mapped segment files using madvise.

    Note: There's also a more portable POSIX variant called posix_madvise. Note, however, that it has fewer flags/options.

    In my test, I used only MADV_SEQUENTIAL. I would recommend we implement the following:

    • [ ] Add native LibC bindings to madvise via jnr-ffi (feel free to use something else if it makes sense)
    • [ ] When we load a segment (new or existing), advise the OS of the MADV_SEQUENTIAL. This yields a notable performance improvement. This is likely due to reducing read I/O (via more aggressive read-ahead), thereby reducing read/write I/O contention.

    If possible, evaluate the following as well:

    • [ ] When a reader/writer is pointing to a segment, advise the OS about the MADV_WILLNEED flag to trigger read-ahead immediately.
    • [ ] When no readers/writers are pointing to a segment, advise the OS about the MADV_DONTNEED (or on Linux, MADV_PAGEOUT/MADV_COLD). This will advise the OS to free those pages first (I believe MADV_PAGEOUT is to free immediately).

    The latter two can be done as a follow up depending on how much time is spent in the first two.
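
    As a rough illustration of the first two checklist items (the class and method names are made up, the Linux advice constant is an assumption, and obtaining the mapped address is left out), a jnr-ffi binding could look like this:

        import jnr.ffi.LibraryLoader;
        import jnr.ffi.Pointer;
        import jnr.ffi.Runtime;

        public final class SegmentAdvisor {
          /** Hypothetical minimal libc binding; only madvise is declared. */
          public interface LibC {
            int madvise(Pointer addr, long length, int advice);
          }

          private static final int MADV_SEQUENTIAL = 2; // Linux value, assumed; not portable
          private static final LibC LIBC = LibraryLoader.create(LibC.class).load("c");

          /**
           * Advises the kernel that the mapped region [address, address + length) will be read
           * sequentially. The address must be the page-aligned start of the mapping; getting it
           * from a MappedByteBuffer needs an internal accessor, which is out of scope here.
           */
          public static void adviseSequential(final long address, final long length) {
            // Best effort: a non-zero return only means the kernel rejected the hint.
            LIBC.madvise(Pointer.wrap(Runtime.getSystemRuntime(), address), length, MADV_SEQUENTIAL);
          }
        }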

  • ci(jenkins): removal of Jenkins ci and jobs

    Description

    Removes all Jenkins-related setup except the GitHub repository scan of the zeebe-io org, as there are projects that still make use of Jenkins. With the release job migrated to GHA via https://github.com/camunda/zeebe/issues/10992, and the 8.2.0-alpha3 as well as the 8.0 and 8.1 patches successfully published, this also removes the deprecated release job on Jenkins.

    I already seeded the Jenkins CI from this branch; the outcome is as expected: image

    Given that the camunda org job setup is gone after Jenkins seeding from main, I see no need for a backport, as Jenkins won't be building Zeebe at all anymore.

    Related issues

    closes #9138

  • Add sequencer/log storage metrics to dashboard

    Description

    We've added sequencer and log storage metrics, but the dashboard was not updated to show them. Please add the following visualization:

    • [ ] Time series (line graph) of the sequencer queue size
    • [ ] Distribution of entry count per sequenced batch
    • [ ] Distribution of the size, in bytes, per sequenced batch

    Distributions should be shown at least as a histogram, and possibly with an additional graph showing the quantiles.

  • Additional log storage metrics

    Description

    It would be helpful when optimizing our log stream to have log storage metrics, namely:

    • [ ] Size of the BufferWriter or ByteBuffer we append
    • [ ] Number of entries in a batch per append

    We already have the last one as part of the sequencer; however, it may be interesting to measure it downstream in case we ever end up batching several sequencer entries together.

    One idea as a solution is to wrap the given LogStorage in the log stream builder with an instrumented version (which just delegates the real logic to the given storage). Another option is to simply add the metric in the LogStorageAppender (though we will likely get rid of it, so perhaps instrumenting the storage is a better, future-proof option).
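
    To make the wrapping idea concrete (the interface shape and metric name below are hypothetical; Zeebe's real LogStorage API is not reproduced here, and only the byte-size metric is shown), an instrumented decorator could look roughly like this with the Prometheus Java client:

        import io.prometheus.client.Histogram;
        import java.nio.ByteBuffer;

        public final class InstrumentedLogStorage {
          /** Hypothetical minimal shape of the storage being decorated. */
          public interface LogStorage {
            void append(long lowestPosition, long highestPosition, ByteBuffer block);
          }

          private static final Histogram APPEND_BYTES =
              Histogram.build()
                  .namespace("zeebe")
                  .name("log_storage_append_bytes")
                  .help("Size in bytes of blocks appended to the log storage")
                  .exponentialBuckets(1024, 2, 16)
                  .register();

          private final LogStorage delegate;

          public InstrumentedLogStorage(final LogStorage delegate) {
            this.delegate = delegate;
          }

          public void append(final long lowestPosition, final long highestPosition, final ByteBuffer block) {
            APPEND_BYTES.observe(block.remaining()); // record the appended block size
            delegate.append(lowestPosition, highestPosition, block); // delegate the real work
          }
        }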

  • [POC]: Batch processing

    Description

    Situation:

    In Zeebe we use something called stream processing to drive the process execution. This means we have divided our process execution into several commands that our engine can understand; the commands are requests to change a certain state (e.g. of a process instance). When these commands are processed, the resulting state changes are reflected as events. In order to drive the process execution further, follow-up commands are also produced.

    Problem:

    Right now the stream processing is very fine-grained. This means that for each element and state in a BPMN process model we have a command to change the state. If we start a process instance, we first create the "ground construct" and produce follow-up events to drive the execution further later on. This of course allows us to run many instances in "parallel", since we can continue each of them in small steps. This allows a higher throughput at the cost of process execution latency, which might be problematic in some cases.

    To be more specific, each small batch (of commands and events) has a certain overhead, since we have to wait for replication and commit before we can process the next command. Right now the commit latency has a high impact on the process execution time.

    Idea:

    Instead of producing small batches of commands and events every time we process a command, we want to process a process instance until we reach a wait state. A wait state can be defined as a point at which no further commands are produced to drive the execution. This means we process all follow-up commands directly, collect all results, and write them as one batch to the log.

    With direct processing the commit latency impact will be reduced as well.
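
    A very rough sketch of the loop described above (all types are hypothetical placeholders, not engine classes): keep processing follow-up commands in memory until a wait state is reached, then write the accumulated records as a single batch, so only one commit round-trip is paid per wait state:

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.List;

        final class BatchProcessor {
          record RecordEntry(boolean isCommand, Object value) {}
          interface Engine { List<RecordEntry> process(RecordEntry command); }
          interface LogWriter { void writeBatch(List<RecordEntry> records); }

          private final Engine engine;
          private final LogWriter writer;

          BatchProcessor(final Engine engine, final LogWriter writer) {
            this.engine = engine;
            this.writer = writer;
          }

          void onCommand(final RecordEntry command) {
            final var batch = new ArrayList<RecordEntry>();
            final var pending = new ArrayDeque<RecordEntry>();
            pending.add(command);
            // Wait state: no further follow-up commands are produced.
            while (!pending.isEmpty()) {
              final List<RecordEntry> results = engine.process(pending.poll());
              batch.addAll(results);
              results.stream().filter(RecordEntry::isCommand).forEach(pending::add);
            }
            // One write (and thus one replication/commit round) for the whole run to the wait state.
            writer.writeBatch(batch);
          }
        }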

    Related issues

    closes #

    Definition of Done

    Not all items need to be done depending on the issue and the pull request.

    Code changes:

    • [ ] The changes are backwards compatible with previous versions
    • [ ] If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create backports manually.

    Testing:

    • [ ] There are unit/integration tests that verify all acceptance criteria of the issue
    • [ ] New tests are written to ensure backwards compatibility with further versions
    • [ ] The behavior is tested manually
    • [ ] The change has been verified by a QA run
    • [ ] The impact of the changes is verified by a benchmark

    Documentation:

    • [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
    • [ ] If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

    Other teams: If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

    Please refer to our review guidelines.

  • Investigate impact of Messaging metrics

    Description

    With https://github.com/camunda/zeebe/pull/11353 we added new metrics to the messaging service. Unfortunately, some of them have high cardinality, since they use the IP address as a label. This was intended to detect problems in IP resolving and whether IPs are reused for a long time, which was found in #11307.

    When implementing #11353, we started with adding configuration options, but it was not a small thing to do, since the Gateway, for example, doesn't provide ExperimentConfiguration etc. We postponed it after the discussion here https://github.com/camunda/zeebe/pull/11353#discussion_r1061591262; the backup of the branch can be found at https://github.com/camunda/zeebe/tree/zell-metrics-messaging-bk

    It would be good to evaluate whether this is really an issue, and if it is, to reintroduce a feature flag to be able to disable the metrics.

    Note: This should happen before the next minor version
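
    If a flag is reintroduced, its shape could be as simple as guarding metric registration (a sketch with made-up names; the real configuration class and property are not defined here):

        import io.prometheus.client.Counter;

        public final class MessagingMetrics {
          private final Counter requestsPerAddress; // null when the high-cardinality metrics are disabled

          public MessagingMetrics(final boolean highCardinalityMetricsEnabled) {
            // Only register the per-IP metric when explicitly enabled, to keep cardinality bounded.
            requestsPerAddress =
                highCardinalityMetricsEnabled
                    ? Counter.build()
                        .namespace("zeebe")
                        .name("messaging_requests_per_address")
                        .help("Requests grouped by remote IP address")
                        .labelNames("address")
                        .register()
                    : null;
          }

          public void countRequest(final String remoteAddress) {
            if (requestsPerAddress != null) {
              requestsPerAddress.labels(remoteAddress).inc();
            }
          }
        }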
