Language-agnostic persistent background job server

Faktory

At a high level, Faktory is a work server. It is the repository for background jobs within your application. Jobs have a type and a set of arguments and are placed into queues for workers to fetch and execute.

You can use this server to distribute jobs to one or hundreds of machines. Jobs can be executed with any language by clients using the Faktory API to fetch a job from a queue.
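
To make the API concrete, here is a minimal sketch using the official Go client (github.com/contribsys/faktory/client): it pushes a job, fetches it back from a queue, and ACKs it. The job type "SomeJob" and the queue name are placeholders, and the server address is assumed to come from FAKTORY_URL or the localhost default.

```go
package main

import (
	"fmt"
	"log"

	faktory "github.com/contribsys/faktory/client"
)

func main() {
	// Open() reads FAKTORY_URL / FAKTORY_PROVIDER, falling back to localhost:7419.
	cl, err := faktory.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	// A job is a JSON hash: jid, jobtype, args, queue, retry policy, etc.
	job := faktory.NewJob("SomeJob", 1, 2, "hello")
	job.Queue = "default"
	if err := cl.Push(job); err != nil {
		log.Fatal(err)
	}

	// A worker fetches from one or more queues, then ACKs on success or FAILs on error.
	fetched, err := cl.Fetch("default")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("fetched", fetched.Jid, fetched.Type, fetched.Args)
	if err := cl.Ack(fetched.Jid); err != nil {
		log.Fatal(err)
	}
}
```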

[Screenshot: Faktory Web UI]

Basic Features

  • Jobs are represented as JSON hashes.
  • Jobs are pushed to and fetched from queues.
  • Jobs are reserved with a timeout, 30 min by default.
  • Jobs FAIL'd or not ACK'd within the reservation timeout are requeued.
  • FAIL'd jobs trigger a retry workflow with exponential backoff.
  • Contains a comprehensive Web UI for management and monitoring.

Installation

See the Installation wiki page for current installation methods. Here's more info on installation with Docker and AWS ECS.

Documentation

Please see the Faktory wiki for full documentation.

Support

You can find help in the contribsys/faktory chat channel. Stop by and say hi!

Author

Mike Perham, @getajobmike, mike @ contribsys.com

Owner

Contributed Systems, the company behind the Sidekiq and Faktory job systems.

Comments

  • Cron not running

    Cron jobs have suddenly stopped running. There's nothing in the Faktory server logs.

    The jobs don't get created at all. In fact, I made sure the queue they go into does not exist, and Faktory doesn't even create the queue.

    I don't know what to give you here to debug, since the issue is that nothing happens at all. The countdown happens on the cron UI, but then it just moves on to the next run.

    About to restart the Faktory server.

  • Worker redeploy not sending jobs back to the queue + process showing as active when it's actually been killed

    We had a job running on our 'dashboard-faktory' process, and I stopped that container (TSTP, TERM, sleep(40), KILL); however, the job was not returned to the queue.

    The worker also stayed on the Busy page for a good few minutes after it was killed.

    [Screenshot: Busy page showing the job still listed as running]

    As you can see here, the job is shown as running and the process is shown as active, even though the actual server process was killed (TSTP, TERM, sleep(40), KILL) well over a minute before this.

    Eventually, the process disappeared, but now the job is stuck as busy. This will cause huge issues at production scale, as our concurrency limits will be eaten up by jobs that are not actually running.

    [Screenshot: the job stuck as busy after the process disappeared]

  • Kubernetes support?

    What should Faktory look like in a world full of Kubernetes? My understanding is that Kubernetes could be very useful in scaling worker processes as queues grow. How can Faktory make this easy?

  • Build Docker images

    Build a Docker image with the following commands:

    git clone https://github.com/contribsys/faktory.git
    cd faktory && git checkout v0.5.0
    GOLANG_VERSION=1.9.1 ROCKSDB_VERSION=5.7.3 TAG=0.5.0 docker-compose build
    

    That assumes there is a branch or tag v0.5.0.

    That will build an image: contribsys/faktory:0.5.0

    Notice that the image tag is set with an env var regardless of what version of Faktory you have checked out. So take care that you set a tag that properly corresponds to what you're building.

    If you then do a docker-compose push it will push the image to Docker Hub for anyone to download and use (assuming you're the owner of contribsys on Docker Hub).

    Run a container in the foreground like so:

    docker run --rm -it -p 7419:7419 -p 7420:7420 contribsys/faktory:0.5.0 -b :7419 -no-tls
    

    Addresses #13.

  • Workers not receiving any new jobs

    Over the past week or so, we have had a large percentage of our workers not receiving any jobs. The only way we can fix this is by killing the workers and letting them restart (I normally use Stop All in the UI).

    There are no errors in the worker logs or the server logs. I need some help debugging this. I am not able to replicate it either, other than when it randomly happens.

  • Add client/pool to provide a thread-safe pool of clients

    client/pool/pool.go provides a pool of clients and manages their lifecycle appropriately.

    This is the initial implementation and may not be exactly what you want, @mperham, but I'd love to go back and forth to arrive at a final implementation.
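
    For illustration only, a hedged sketch of one way a thread-safe client pool could be built on a buffered channel; this is an assumption about the shape of the feature, not the contents of client/pool/pool.go:

    ```go
    package pool

    import faktory "github.com/contribsys/faktory/client"

    // Pool hands out Faktory clients; the buffered channel makes Get/Put safe
    // to call from many goroutines. Hypothetical sketch, not the PR's code.
    type Pool struct {
    	clients chan *faktory.Client
    }

    // New dials `capacity` connections up front.
    func New(capacity int) (*Pool, error) {
    	p := &Pool{clients: make(chan *faktory.Client, capacity)}
    	for i := 0; i < capacity; i++ {
    		cl, err := faktory.Open()
    		if err != nil {
    			return nil, err
    		}
    		p.clients <- cl
    	}
    	return p, nil
    }

    // Get blocks until a client is available.
    func (p *Pool) Get() *faktory.Client { return <-p.clients }

    // Put returns a client to the pool.
    func (p *Pool) Put(cl *faktory.Client) { p.clients <- cl }
    ```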

  • Home tab becomes unresponsive after the app runs for a while

    We're running Faktory 0.7.0 inside of Docker with both Node.js and Go workers.

    After restarting the Faktory service, every tab (Home, Busy, Queues, etc.) is snappy; however, after a short while the Home tab begins to load slower and slower until it doesn't load at all.

    Jobs continue to process just fine. The only issue we've noticed is with the Home tab.

    Anything we should be checking on our end? Thanks

  • Closed connections and timeouts on PUSH and FETCH

    We're experiencing sporadic timeouts and connection resets on both PUSH and FETCH, from Python and Node.

    • Which Faktory package and version?

      • Pro 1.4.0
    • Which Faktory worker package and version?

      • faktory-worker-node, 3.3.7 and @next
      • faktory-worker-python, 0.4
    • Please include any relevant worker configuration

      • Workers are running on Heroku, Faktory is running on AWS EKS. Traffic thus goes over the internet from worker <-> server. The python worker connects over tcp+tls. The node worker connects to a local stunnel instance which then maintains the connection to the server (same setup as connecting to Redis over TLS for heroku). See the stunnel conf here.
    • Please include any relevant error messages or stacktraces

      • Node:
        • Closed connections:

          Error: Connection closed
            at Connection.onClose (/usr/src/api/node_modules/faktory-worker/lib/connection.js:120:41)
            at /usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:54:19
            at Scope._activate (/usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/async_hooks.js:51:14)
            at Scope.activate (/usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
            at Socket.bound (/usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:53:20)
            at Socket.emit (events.js:310:20)
            at Socket.EventEmitter.emit (domain.js:482:12)
            at TCP.<anonymous> (net.js:672:12)
          
        • Resets:

          at TCP.onStreamRead (internal/stream_base_commons.js:200:27)
          Error: read ECONNRESET
          
        • Timeouts:

          at processTimers (internal/timers.js:475:7)
          at listOnTimeout (internal/timers.js:531:17)
          at Timeout.bound (/app/node_modules/generic-pool/lib/ResourceRequest.js:8:15)
          at ResourceRequest._fireTimeout (/app/node_modules/generic-pool/lib/ResourceRequest.js:62:17)
          TimeoutError: ResourceRequest timed out
          
        • stunnel:

          2020.08.06 04:44:23 LOG5[7333]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
          2020.08.06 04:44:23 LOG3[7333]: No more addresses to connect
          2020.08.06 04:44:23 LOG3[7333]: s_connect: s_poll_wait 174.129.155.89:7419: TIMEOUTconnect exceeded
          
      • Python:
        timeout: timed out
        File "inventory/common/faktory/worker.py", line 32, in __call__
          return self._task.func(*args)
        [...snip...]
        File "inventory/worker/client.py", line 28, in queue
          return _client.queue(*args, **kwargs)
        File "inventory/common/faktory/client.py", line 66, in queue
          with faktory.connection() as client:
        File "contextlib.py", line 112, in __enter__
          return next(self.gen)
        File "__init__.py", line 18, in connection
          c.connect()
        File "faktory/client.py", line 21, in connect
          self.is_connected = self.faktory.connect()
        File "faktory/_proto.py", line 64, in connect
          self.socket.connect((self.host, self.port))
        File "ssl.py", line 1172, in connect
          self._real_connect(addr, False)
        File "ssl.py", line 1159, in _real_connect
          super().connect(addr)
        
      • Faktory server: Seeing a bunch of "Bad connection: read tcp 10.0.78.94:7419->10.0.70.174:52255: i/o timeout"
    • Tests:

      • I performed a load test using Locust, running locally. The job was a simple ping job with a payload of random length which just prints "pong" and then schedules another ping job (to test sending a job within a job; see the sketch after this list). The Faktory server was running in the same k8s cluster as the one above with the following requests/limits: 550m/1500m CPU and 250Mi/300Mi Memory. The worker ran locally on my machine (to simulate the traffic over the internet). Results:
        • 50 concurrent users: 33 failures (connection reset by peer) out of 11268 pushes (0.3%). Average throughput: 23
        • 150 concurrent users: 41 failures out of 13665 pushes. Average throughput: 60/s
    • Are you using an old version?

      • No
    • Have you checked the changelogs to see if your issue has been fixed in a later version?

      • Yes
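
    For reference, here is a hedged Go sketch of the kind of ping handler described in the load test above (the actual workers in this report were Node and Python); it assumes a recent faktory_worker_go, and the job type "ping", the queue, and the re-enqueue via a fresh client connection are illustrative assumptions:

    ```go
    package main

    import (
    	"context"
    	"fmt"

    	faktory "github.com/contribsys/faktory/client"
    	worker "github.com/contribsys/faktory_worker_go"
    )

    // ping prints "pong" and schedules another ping job, so every job also
    // exercises PUSH from inside a worker (a job within a job).
    func ping(ctx context.Context, args ...interface{}) error {
    	fmt.Println("pong")
    	cl, err := faktory.Open()
    	if err != nil {
    		return err
    	}
    	defer cl.Close()
    	return cl.Push(faktory.NewJob("ping", args...))
    }

    func main() {
    	mgr := worker.NewManager()
    	mgr.Register("ping", ping)
    	mgr.Concurrency = 10
    	mgr.ProcessStrictPriorityQueues("default")
    	mgr.Run()
    }
    ```
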
  • Evaluate Badger

    RocksDB is a pretty painful dependency due to C++: it forces a more complex build, makes finding memory leaks much harder, and is a black box when it comes to profiling and tuning.

    Badger is a new storage layer written in Go which explicitly bills itself as a competitor to RocksDB. Right now the load test (make load) reports about 5,000 jobs/sec. If we can stay near that or better, I'd seriously consider moving.

    https://github.com/dgraph-io/badger
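
    To give a sense of the API being evaluated, a hedged sketch of writing and reading a serialized job through Badger; the v3 module path and the key layout are assumptions for illustration, not how Faktory would actually store queues:

    ```go
    package main

    import (
    	"fmt"
    	"log"

    	badger "github.com/dgraph-io/badger/v3"
    )

    func main() {
    	db, err := badger.Open(badger.DefaultOptions("/tmp/faktory-badger"))
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	key := []byte("q:default:00000001") // illustrative queue key
    	payload := []byte(`{"jid":"abc","jobtype":"SomeJob","args":[1]}`)

    	// Enqueue: one write transaction per job (could be batched).
    	if err := db.Update(func(txn *badger.Txn) error {
    		return txn.Set(key, payload)
    	}); err != nil {
    		log.Fatal(err)
    	}

    	// Dequeue side: read the payload back.
    	err = db.View(func(txn *badger.Txn) error {
    		item, err := txn.Get(key)
    		if err != nil {
    			return err
    		}
    		val, err := item.ValueCopy(nil)
    		if err != nil {
    			return err
    		}
    		fmt.Println(string(val))
    		return nil
    	})
    	if err != nil {
    		log.Fatal(err)
    	}
    }
    ```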

  • Add .travis.yml

    So, there are some dragons in building this on Travis:

    1. As mentioned here: https://github.com/contribsys/faktory/issues/23, RocksDB appears to try to build in compression support if any compression headers are found (so we remove the zlib and libbz2 dev packages)
    2. Travis places the checkout for (forked) branches based on the repo it pulls from, which breaks dependencies unless you move it to $GOPATH/src/github.com/contribsys/faktory
  • Job Priority

    So, I ended up finding an implementation of Brodal queues that seems to work nicely here; I refactored the interface a bit but included the original license in the source.

    I wrote a simple Go program for loading data into Faktory so I could compare load times and memory profiles. The queue priority structures add pretty much no memory overhead and, with priority levels capped at 10, enqueuing and processing perform pretty much the same as without prioritization.

    Here are the runtimes for faktory without job prioritization using the script in test/load/main.go:

    ➜  load git:(priority) ✗ ./load push 100000
    Enqueued 100000 jobs in 6.870994 seconds, rate: 14553.935072 jobs/s
    ➜  load git:(priority) ✗ ./load pop 100000
    Processed 100000 jobs in 17.172457 seconds, rate: 5823.278658 jobs/s
    

    Here are the runtimes for faktory with job prioritization:

    ➜  load git:(priority) ✗ ./load push 100000
    Enqueued 100000 jobs in 6.994321 seconds, rate: 14297.312592 jobs/s
    ➜  load git:(priority) ✗ ./load pop 100000
    Processed 100000 jobs in 16.859516 seconds, rate: 5931.368221 jobs/s
    

    Here's a screenshot of the debug page with 100k jobs enqueued with faktory sans prioritization:

    [Screenshot: debug page with 100k jobs enqueued, without prioritization]

    And here's with prioritization for 100k:

    [Screenshot: debug page with 100k jobs enqueued, with prioritization]

    Finally, here's with the backpressure bumped up to 1 mil jobs in the queue:

    [Screenshot: debug page with 1 million jobs enqueued, with prioritization]

    I also started to clean up a bit of the typing around signed vs. unsigned integers for things like queue size (since you should never have a negative queue size). There's still more work to be done, but I figure that can come in another PR. Also, it makes key generation a lot simpler to reason about, since you can use binary.BigEndian.PutUint64 instead of doing the bit shifting manually; see the sketch below.
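
    For illustration, a hedged sketch of the binary.BigEndian.PutUint64 approach to key generation; the layout here (queue name, inverted priority byte, big-endian timestamp) is hypothetical and may not match the PR:

    ```go
    package main

    import (
    	"bytes"
    	"encoding/binary"
    	"fmt"
    	"time"
    )

    // queueKey builds a byte-sortable key: queue name, then an inverted priority
    // byte so higher priorities sort first, then a big-endian timestamp for FIFO
    // ordering within a priority. Hypothetical layout, not the PR's exact scheme.
    func queueKey(queue string, priority uint8, enqueued time.Time) []byte {
    	var b bytes.Buffer
    	b.WriteString(queue)
    	b.WriteByte(0)              // separator
    	b.WriteByte(255 - priority) // invert: priority 10 sorts before priority 1
    	var ts [8]byte
    	binary.BigEndian.PutUint64(ts[:], uint64(enqueued.UnixNano()))
    	b.Write(ts[:])
    	return b.Bytes()
    }

    func main() {
    	hi := queueKey("default", 10, time.Now())
    	lo := queueKey("default", 1, time.Now())
    	fmt.Println(bytes.Compare(hi, lo) < 0) // true: higher priority sorts first
    }
    ```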

    Let me know what you think.

  • Can't filter/search retrying jobs

    • Which Faktory package and version? faktory-ent 1.6.1

    When there are many retry jobs, it's impossible to selectively delete jobs of a certain type (e.g., by searching or filtering), and the pagination UI does not make it easy either. This puts us in a very difficult position when thousands of failing jobs that should be killed are mixed in with jobs that should not be.

    Sidekiq Enterprise allows filtering, which makes it easier. Also the plugin model enables one to do more.

    How can this use case be handled with Faktory?

  • Create benchmark suite

    We need a set of benchmarks which can exercise various features and provide some basic performance numbers.

    • Push 100,000 jobs
    • Bulk push 100,000 jobs
    • Process 200,000 no-op jobs using fwg
    • Process a batch of 100 child batches with 2,000 jobs each
    • etc.
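
    As a starting point for the first item, a hedged sketch that times 100,000 pushes with the Go client; the job type "NoopJob" and the "bench" queue are placeholders:

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"time"

    	faktory "github.com/contribsys/faktory/client"
    )

    func main() {
    	cl, err := faktory.Open() // honors FAKTORY_URL, defaults to localhost:7419
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer cl.Close()

    	const n = 100_000
    	start := time.Now()
    	for i := 0; i < n; i++ {
    		job := faktory.NewJob("NoopJob", i) // "NoopJob" is a placeholder job type
    		job.Queue = "bench"
    		if err := cl.Push(job); err != nil {
    			log.Fatal(err)
    		}
    	}
    	elapsed := time.Since(start)
    	fmt.Printf("Enqueued %d jobs in %.2f seconds, rate: %.0f jobs/s\n",
    		n, elapsed.Seconds(), float64(n)/elapsed.Seconds())
    }
    ```
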
  • Silently dropping Jobs with transaction discarded warnings

    • Which Faktory package and version?

    docker.contribsys.com/contribsys/faktory-ent:1.4.0

    • Which Faktory worker package and version?

    https://hackage.haskell.org/package/faktory-1.1.1.0

    • Please include any relevant worker configuration
    • Please include any relevant error messages or stacktraces

    We had our staging Faktory instance silently dropping Jobs, while spamming the following warnings:

    2021-09-07T17:12:35.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:35.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:30.244Z,Error running task Busy: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:30.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:30.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:25.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:25.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:20.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:20.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    

    As this was staging and we don't pay close attention to background jobs in that system, we didn't notice and the Faktory instance remained like this for almost all of September.

    I see nothing else in the logs at all (except for the web client if/when we happen to be poking around), no "previous error" I can point at that triggered this state. Clients were receiving no errors on enqueue and in fact getting back Job Ids. Restarting the Faktory instance fixed it.

    Have you seen this before?

    I wouldn't be surprised if this isn't worth investigating, but I wonder if this message should be elevated to ERROR if it represents the instance silently not processing work.

  • Support batch death

    Sidekiq added the notion of batch :death callback to capture the state where one or more batch jobs die and the batch will never succeed. Port this notion to Faktory Enterprise.
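
    For context, the Go client already exposes success and complete callbacks on batches; a death callback would presumably sit alongside them. A hedged sketch of today's API, with the hypothetical death callback shown only as a comment:

    ```go
    package main

    import (
    	"log"

    	faktory "github.com/contribsys/faktory/client"
    )

    func main() {
    	cl, err := faktory.Open()
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer cl.Close()

    	b := faktory.NewBatch(cl)
    	b.Description = "nightly import"
    	b.Success = faktory.NewJob("ImportSucceeded", "nightly")
    	b.Complete = faktory.NewJob("ImportFinished", "nightly")
    	// b.Death = faktory.NewJob("ImportDead", "nightly") // hypothetical: fires when a batch job dies

    	err = b.Jobs(func() error {
    		if err := b.Push(faktory.NewJob("ImportChunk", 1)); err != nil {
    			return err
    		}
    		return b.Push(faktory.NewJob("ImportChunk", 2))
    	})
    	if err != nil {
    		log.Fatal(err)
    	}
    }
    ```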

  • Provide remote configuration API

    Faktory's conf.d filesystem configuration is looking increasingly dated as more users and customers use Docker and containers where the local filesystem is not easy to provide or change. Instead, provide a Web API where TOML files can be POST'd via curl by an application deployment. Something like:

    for x in config/faktory/*.toml; do
      curl -T $x http://localhost:7420/config/upload/
    done && curl -X POST http://localhost:7420/config/reload
    

    (Basic auth elided for readability)

    That curl command will upload foo.toml to /config/upload/foo.toml. After all of the TOML files are uploaded, we call reload to have Faktory atomically reload its config.

    Configuration will be stored in Redis so it persists across Faktory boots. The config will track last modification time and SHAs for each config item in order to minimize unnecessary changes or reloads.

    1. Uploads will return an error if the TOML is malformed.
    2. Uploads will stage into a temporary Redis key. Calling reload will swap with the live config, just like calling HUP today. Otherwise the temp key should expire after N minutes.
    3. For backwards compatibility, an existing Faktory at boot will load conf.d files into Redis if any exist and their modification time is > the last config update modtime.
    4. Partial config update is not allowed. Every TOML file must be POST'd before reload. This is to prevent renaming a TOML and having the old filename hang around in Redis. For maximum reliability, POST one big TOML.

    For sanity's sake, it is strongly recommended to use either conf.d or /config/upload exclusively. Don't mix the two. The Web UI will not provide a manual upload page as this feature is intended for deployment automation.
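
    A hedged sketch of what the upload endpoint's server side could look like: validate the TOML, stage it under a temporary Redis key with a TTL, and let a later reload promote staged keys to live. The Redis address, the key names, the ten-minute TTL, and the go-redis and BurntSushi/toml packages are all assumptions, not the planned implementation:

    ```go
    package main

    import (
    	"context"
    	"io"
    	"net/http"
    	"path"
    	"time"

    	"github.com/BurntSushi/toml"
    	"github.com/redis/go-redis/v9"
    )

    var rdb = redis.NewClient(&redis.Options{Addr: "localhost:7421"}) // assumed Redis address

    // uploadHandler validates the uploaded TOML and stages it in Redis with a TTL.
    func uploadHandler(w http.ResponseWriter, r *http.Request) {
    	body, err := io.ReadAll(r.Body)
    	if err != nil {
    		http.Error(w, err.Error(), http.StatusBadRequest)
    		return
    	}
    	var parsed map[string]interface{}
    	if err := toml.Unmarshal(body, &parsed); err != nil { // reject malformed TOML
    		http.Error(w, "invalid TOML: "+err.Error(), http.StatusUnprocessableEntity)
    		return
    	}
    	name := path.Base(r.URL.Path) // e.g. foo.toml
    	key := "config:staging:" + name
    	if err := rdb.Set(context.Background(), key, body, 10*time.Minute).Err(); err != nil {
    		http.Error(w, err.Error(), http.StatusInternalServerError)
    		return
    	}
    	w.WriteHeader(http.StatusOK)
    }

    func main() {
    	http.HandleFunc("/config/upload/", uploadHandler)
    	// A /config/reload handler would copy config:staging:* to config:live:* atomically.
    	http.ListenAndServe(":7420", nil)
    }
    ```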
