Language-agnostic persistent background job server

Faktory

At a high level, Faktory is a work server. It is the repository for background jobs within your application. Jobs have a type and a set of arguments and are placed into queues for workers to fetch and execute.

You can use this server to distribute jobs to one or hundreds of machines. Jobs can be executed with any language by clients using the Faktory API to fetch a job from a queue.
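
To make the API concrete, here is a minimal sketch using the official Go client (github.com/contribsys/faktory/client): it pushes a job, fetches it back from a queue, and ACKs it. The job type "SomeJob" and the queue name are placeholders, and the server address is assumed to come from FAKTORY_URL or the localhost default.

```go
package main

import (
	"fmt"
	"log"

	faktory "github.com/contribsys/faktory/client"
)

func main() {
	// Open() reads FAKTORY_URL / FAKTORY_PROVIDER, falling back to localhost:7419.
	cl, err := faktory.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	// A job is a JSON hash: jid, jobtype, args, queue, retry policy, etc.
	job := faktory.NewJob("SomeJob", 1, 2, "hello")
	job.Queue = "default"
	if err := cl.Push(job); err != nil {
		log.Fatal(err)
	}

	// A worker fetches from one or more queues, then ACKs on success or FAILs on error.
	fetched, err := cl.Fetch("default")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("fetched", fetched.Jid, fetched.Type, fetched.Args)
	if err := cl.Ack(fetched.Jid); err != nil {
		log.Fatal(err)
	}
}
```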

[Screenshot: Faktory Web UI]

Basic Features

  • Jobs are represented as JSON hashes.
  • Jobs are pushed to and fetched from queues.
  • Jobs are reserved with a timeout, 30 min by default.
  • Jobs FAIL'd or not ACK'd within the reservation timeout are requeued.
  • FAIL'd jobs trigger a retry workflow with exponential backoff.
  • Contains a comprehensive Web UI for management and monitoring.

Installation

See the Installation wiki page for current installation methods. Here's more info on installation with Docker and AWS ECS.

Documentation

Please see the Faktory wiki for full documentation.

Support

You can find help in the contribsys/faktory chat channel. Stop by and say hi!

Author

Mike Perham, @getajobmike, mike @ contribsys.com

Owner

Contributed Systems, the company behind the Sidekiq and Faktory job systems.

Comments

  • Cron not running

    Cron jobs have suddenly stopped running. There's nothing in the Faktory server logs.

    The jobs don't get created at all. In fact, I made sure the queue they go into does not exist, and Faktory doesn't even create the queue.

    I don't know what to give you here to debug, since the issue is that nothing happens at all. The countdown happens on the cron UI, but then it just moves on to the next run.

    About to restart the Faktory server.

  • Worker redeploy not sending jobs back to the queue + process showing as active when it's actually been killed

    We had a job running on our 'dashboard-faktory' process, and I stopped that container (TSTP, TERM, sleep(40), KILL); however, the job was not returned to the queue.

    The worker also stayed on the Busy page for a good few minutes after it was killed.

    [Screenshot: Busy page showing the job still listed as running]

    As you can see here, the job is shown as running and the process is shown as active, even though the actual server process was killed (TSTP, TERM, sleep(40), KILL) well over a minute before this.

    Eventually, the process disappeared, but now the job is stuck as busy. This will cause huge issues at production scale, as our concurrency limits will be eaten up by jobs that are not actually running.

    [Screenshot: the job stuck as busy after the process disappeared]

  • Kubernetes support?

    What should Faktory look like in a world full of Kubernetes? My understanding is that Kubernetes could be very useful in scaling worker processes as queues grow. How can Faktory make this easy?

  • Build Docker images

    Build a Docker image with the following commands:

    git clone https://github.com/contribsys/faktory.git
    cd faktory && git checkout v0.5.0
    GOLANG_VERSION=1.9.1 ROCKSDB_VERSION=5.7.3 TAG=0.5.0 docker-compose build
    

    That assumes there is a branch or tag v0.5.0.

    That will build an image: contribsys/faktory:0.5.0

    Notice that the image tag is set with an env var regardless of what version of Faktory you have checked out. So take care that you set a tag that properly corresponds to what you're building.

    If you then do a docker-compose push it will push the image to Docker Hub for anyone to download and use (assuming you're the owner of contribsys on Docker Hub).

    Run a container in the foreground like so:

    docker run --rm -it -p 7419:7419 -p 7420:7420 contribsys/faktory:0.5.0 -b :7419 -no-tls
    

    Addresses #13.

  • Workers not receiving any new jobs

    Over the past week or so, we have had a large percentage of our workers not receiving any jobs. The only way we can fix this is by killing the workers and letting them restart (I normally use Stop All in the UI).

    There are no errors in the worker logs or the server logs. I need some help debugging this. I am not able to replicate it either, other than when it randomly happens.

  • Add client/pool to provide a thread-safe pool of clients

    client/pool/pool.go provides a pool of clients and manages their lifecycle appropriately.

    This is the initial implementation and may not be exactly what you want, @mperham, but I'd love to go back and forth to arrive at a final implementation.
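
    For illustration only, a hedged sketch of one way a thread-safe client pool could be built on a buffered channel; this is an assumption about the shape of the feature, not the contents of client/pool/pool.go:

    ```go
    package pool

    import faktory "github.com/contribsys/faktory/client"

    // Pool hands out Faktory clients; the buffered channel makes Get/Put safe
    // to call from many goroutines. Hypothetical sketch, not the PR's code.
    type Pool struct {
    	clients chan *faktory.Client
    }

    // New dials `capacity` connections up front.
    func New(capacity int) (*Pool, error) {
    	p := &Pool{clients: make(chan *faktory.Client, capacity)}
    	for i := 0; i < capacity; i++ {
    		cl, err := faktory.Open()
    		if err != nil {
    			return nil, err
    		}
    		p.clients <- cl
    	}
    	return p, nil
    }

    // Get blocks until a client is available.
    func (p *Pool) Get() *faktory.Client { return <-p.clients }

    // Put returns a client to the pool.
    func (p *Pool) Put(cl *faktory.Client) { p.clients <- cl }
    ```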

  • Home tab becomes unresponsive after the app runs for a while

    We're running Faktory 0.7.0 inside of Docker with both Node.js and Go workers.

    After restarting the Faktory service, every tab (Home, Busy, Queues, etc.) is snappy; however, after a short while the Home tab begins to load slower and slower until it doesn't load at all.

    Jobs continue to process just fine. The only issue we've noticed is with the Home tab.

    Anything we should be checking on our end? Thanks

  • Closed connections and timeouts on PUSH and FETCH

    We're experiencing sporadic timeouts and connection resets on both PUSH and FETCH, from Python and Node.

    • Which Faktory package and version?

      • Pro 1.4.0
    • Which Faktory worker package and version?

      • faktory-worker-node, 3.3.7 and @next
      • faktory-worker-python, 0.4
    • Please include any relevant worker configuration

      • Workers are running on Heroku, Faktory is running on AWS EKS. Traffic thus goes over the internet from worker <-> server. The python worker connects over tcp+tls. The node worker connects to a local stunnel instance which then maintains the connection to the server (same setup as connecting to Redis over TLS for heroku). See the stunnel conf here.
    • Please include any relevant error messages or stacktraces

      • Node:
        • Closed connections:

          Error: Connection closed
            at Connection.onClose (/usr/src/api/node_modules/faktory-worker/lib/connection.js:120:41)
            at /usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:54:19
            at Scope._activate (/usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/async_hooks.js:51:14)
            at Scope.activate (/usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:12:19)
            at Socket.bound (/usr/src/api/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:53:20)
            at Socket.emit (events.js:310:20)
            at Socket.EventEmitter.emit (domain.js:482:12)
            at TCP.<anonymous> (net.js:672:12)
          
        • Resets:

          at TCP.onStreamRead (internal/stream_base_commons.js:200:27)
          Error: read ECONNRESET
          
        • Timeouts:

          at processTimers (internal/timers.js:475:7)
          at listOnTimeout (internal/timers.js:531:17)
          at Timeout.bound (/app/node_modules/generic-pool/lib/ResourceRequest.js:8:15)
          at ResourceRequest._fireTimeout (/app/node_modules/generic-pool/lib/ResourceRequest.js:62:17)
          TimeoutError: ResourceRequest timed out
          
        • stunnel:

          2020.08.06 04:44:23 LOG5[7333]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
          2020.08.06 04:44:23 LOG3[7333]: No more addresses to connect
          2020.08.06 04:44:23 LOG3[7333]: s_connect: s_poll_wait 174.129.155.89:7419: TIMEOUTconnect exceeded
          
      • Python:
        timeout: timed out
        File "inventory/common/faktory/worker.py", line 32, in __call__
          return self._task.func(*args)
        [...snip...]
        File "inventory/worker/client.py", line 28, in queue
          return _client.queue(*args, **kwargs)
        File "inventory/common/faktory/client.py", line 66, in queue
          with faktory.connection() as client:
        File "contextlib.py", line 112, in __enter__
          return next(self.gen)
        File "__init__.py", line 18, in connection
          c.connect()
        File "faktory/client.py", line 21, in connect
          self.is_connected = self.faktory.connect()
        File "faktory/_proto.py", line 64, in connect
          self.socket.connect((self.host, self.port))
        File "ssl.py", line 1172, in connect
          self._real_connect(addr, False)
        File "ssl.py", line 1159, in _real_connect
          super().connect(addr)
        
      • Faktory server: Seeing a bunch of "Bad connection: read tcp 10.0.78.94:7419->10.0.70.174:52255: i/o timeout"
    • Tests:

      • I performed a load test using Locust, running locally. The job was a simple ping job with a payload of random length which just prints "pong" and then schedules another ping job (to test sending a job within a job; see the sketch after this list). The Faktory server was running in the same k8s cluster as the one above with the following requests/limits: 550m/1500m CPU and 250Mi/300Mi Memory. The worker ran locally on my machine (to simulate the traffic over the internet). Results:
        • 50 concurrent users: 33 failures (connection reset by peer) out of 11268 pushes (0.3%). Average throughput: 23
        • 150 concurrent users: 41 failures out of 13665 pushes. Average throughput: 60/s
    • Are you using an old version?

      • No
    • Have you checked the changelogs to see if your issue has been fixed in a later version?

      • Yes
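
    For reference, here is a hedged Go sketch of the kind of ping handler described in the load test above (the actual workers in this report were Node and Python); it assumes a recent faktory_worker_go, and the job type "ping", the queue, and the re-enqueue via a fresh client connection are illustrative assumptions:

    ```go
    package main

    import (
    	"context"
    	"fmt"

    	faktory "github.com/contribsys/faktory/client"
    	worker "github.com/contribsys/faktory_worker_go"
    )

    // ping prints "pong" and schedules another ping job, so every job also
    // exercises PUSH from inside a worker (a job within a job).
    func ping(ctx context.Context, args ...interface{}) error {
    	fmt.Println("pong")
    	cl, err := faktory.Open()
    	if err != nil {
    		return err
    	}
    	defer cl.Close()
    	return cl.Push(faktory.NewJob("ping", args...))
    }

    func main() {
    	mgr := worker.NewManager()
    	mgr.Register("ping", ping)
    	mgr.Concurrency = 10
    	mgr.ProcessStrictPriorityQueues("default")
    	mgr.Run()
    }
    ```
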
  • Evaluate Badger

    RocksDB is a pretty painful dependency due to C++: it forces a more complex build, makes finding memory leaks much harder, and is a black box when it comes to profiling and tuning.

    Badger is a new storage layer written in Go which explicitly bills itself as a competitor to RocksDB. Right now the load test (make load) reports about 5,000 jobs/sec. If we can stay near that or better, I'd seriously consider moving.

    https://github.com/dgraph-io/badger
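
    To give a sense of the API being evaluated, a hedged sketch of writing and reading a serialized job through Badger; the v3 module path and the key layout are assumptions for illustration, not how Faktory would actually store queues:

    ```go
    package main

    import (
    	"fmt"
    	"log"

    	badger "github.com/dgraph-io/badger/v3"
    )

    func main() {
    	db, err := badger.Open(badger.DefaultOptions("/tmp/faktory-badger"))
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	key := []byte("q:default:00000001") // illustrative queue key
    	payload := []byte(`{"jid":"abc","jobtype":"SomeJob","args":[1]}`)

    	// Enqueue: one write transaction per job (could be batched).
    	if err := db.Update(func(txn *badger.Txn) error {
    		return txn.Set(key, payload)
    	}); err != nil {
    		log.Fatal(err)
    	}

    	// Dequeue side: read the payload back.
    	err = db.View(func(txn *badger.Txn) error {
    		item, err := txn.Get(key)
    		if err != nil {
    			return err
    		}
    		val, err := item.ValueCopy(nil)
    		if err != nil {
    			return err
    		}
    		fmt.Println(string(val))
    		return nil
    	})
    	if err != nil {
    		log.Fatal(err)
    	}
    }
    ```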

  • Add .travis.yml

    So, there are some dragons in building this on Travis:

    1. As mentioned here: https://github.com/contribsys/faktory/issues/23, RocksDB appears to try to build in compression support if any compression headers are found (so we remove the zlib and libbz2 dev packages)
    2. Travis places the checkout for (forked) branches based on the repo it pulls from, which breaks dependencies unless you move it to $GOPATH/src/github.com/contribsys/faktory
  • Job Priority

    So, I ended up finding an implementation of Brodal queues that seems to work nicely here; I refactored the interface a bit but included the original license in the source.

    I wrote a simple Go program for loading data into Faktory so I could compare load times and memory profiles. The queue priority structures add pretty much no memory overhead and, with priority levels capped at 10, enqueuing and processing perform pretty much the same as without prioritization.

    Here are the runtimes for faktory without job prioritization using the script in test/load/main.go:

    ➜  load git:(priority) ✗ ./load push 100000
    Enqueued 100000 jobs in 6.870994 seconds, rate: 14553.935072 jobs/s
    ➜  load git:(priority) ✗ ./load pop 100000
    Processed 100000 jobs in 17.172457 seconds, rate: 5823.278658 jobs/s
    

    Here are the runtimes for faktory with job prioritization:

    ➜  load git:(priority) ✗ ./load push 100000
    Enqueued 100000 jobs in 6.994321 seconds, rate: 14297.312592 jobs/s
    ➜  load git:(priority) ✗ ./load pop 100000
    Processed 100000 jobs in 16.859516 seconds, rate: 5931.368221 jobs/s
    

    Here's a screenshot of the debug page with 100k jobs enqueued with faktory sans prioritization:

    [Screenshot: debug page with 100k jobs enqueued, without prioritization]

    And here's with prioritization for 100k:

    [Screenshot: debug page with 100k jobs enqueued, with prioritization]

    Finally, here's with the backpressure bumped up to 1 mil jobs in the queue:

    [Screenshot: debug page with 1 million jobs enqueued, with prioritization]

    I also started to clean up a bit of the typing around signed vs. unsigned integers for things like queue size (since you should never have a negative queue size). There's still more work to be done, but I figure that can come in another PR. Also, it makes key generation a lot simpler to reason about, since you can use binary.BigEndian.PutUint64 instead of doing the bit shifting manually; see the sketch below.
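
    For illustration, a hedged sketch of the binary.BigEndian.PutUint64 approach to key generation; the layout here (queue name, inverted priority byte, big-endian timestamp) is hypothetical and may not match the PR:

    ```go
    package main

    import (
    	"bytes"
    	"encoding/binary"
    	"fmt"
    	"time"
    )

    // queueKey builds a byte-sortable key: queue name, then an inverted priority
    // byte so higher priorities sort first, then a big-endian timestamp for FIFO
    // ordering within a priority. Hypothetical layout, not the PR's exact scheme.
    func queueKey(queue string, priority uint8, enqueued time.Time) []byte {
    	var b bytes.Buffer
    	b.WriteString(queue)
    	b.WriteByte(0)              // separator
    	b.WriteByte(255 - priority) // invert: priority 10 sorts before priority 1
    	var ts [8]byte
    	binary.BigEndian.PutUint64(ts[:], uint64(enqueued.UnixNano()))
    	b.Write(ts[:])
    	return b.Bytes()
    }

    func main() {
    	hi := queueKey("default", 10, time.Now())
    	lo := queueKey("default", 1, time.Now())
    	fmt.Println(bytes.Compare(hi, lo) < 0) // true: higher priority sorts first
    }
    ```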

    Let me know what you think.

  • Can't filter/search retrying jobs

    • Which Faktory package and version? faktory-ent 1.6.1

    When there are many retry jobs, it's impossible to selectively delete jobs of a certain type (e.g., by searching or filtering), and the pagination UI does not make it easy either. This puts us in a very difficult position when thousands of failing jobs that should be killed are mixed in with jobs that should not be.

    Sidekiq Enterprise allows filtering, which makes it easier. Also the plugin model enables one to do more.

    How can this use case be handled with Faktory?

  • Create benchmark suite

    We need a set of benchmarks which can exercise various features and provide some basic performance numbers.

    • Push 100,000 jobs
    • Bulk push 100,000 jobs
    • Process 200,000 no-op jobs using fwg
    • Process a batch of 100 child batches with 2,000 jobs each
    • etc.
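
    As a starting point for the first item, a hedged sketch that times 100,000 pushes with the Go client; the job type "NoopJob" and the "bench" queue are placeholders:

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"time"

    	faktory "github.com/contribsys/faktory/client"
    )

    func main() {
    	cl, err := faktory.Open() // honors FAKTORY_URL, defaults to localhost:7419
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer cl.Close()

    	const n = 100_000
    	start := time.Now()
    	for i := 0; i < n; i++ {
    		job := faktory.NewJob("NoopJob", i) // "NoopJob" is a placeholder job type
    		job.Queue = "bench"
    		if err := cl.Push(job); err != nil {
    			log.Fatal(err)
    		}
    	}
    	elapsed := time.Since(start)
    	fmt.Printf("Enqueued %d jobs in %.2f seconds, rate: %.0f jobs/s\n",
    		n, elapsed.Seconds(), float64(n)/elapsed.Seconds())
    }
    ```
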
  • Silently dropping Jobs with transaction discarded warnings

    • Which Faktory package and version?

    docker.contribsys.com/contribsys/faktory-ent:1.4.0

    • Which Faktory worker package and version?

    https://hackage.haskell.org/package/faktory-1.1.1.0

    • Please include any relevant worker configuration
    • Please include any relevant error messages or stacktraces

    We had our staging Faktory instance silently dropping Jobs, while spamming the following warnings:

    2021-09-07T17:12:35.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:35.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:30.244Z,Error running task Busy: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:30.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:30.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:25.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:25.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:20.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
    2021-09-07T17:12:20.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
    

    As this was staging and we don't pay close attention to background jobs in that system, we didn't notice and the Faktory instance remained like this for almost all of September.

    I see nothing else in the logs at all (except for the web client if/when we happen to be poking around), no "previous error" I can point at that triggered this state. Clients were receiving no errors on enqueue and in fact getting back Job Ids. Restarting the Faktory instance fixed it.

    Have you seen this before?

    I wouldn't be surprised if this isn't worth investigating, but I wonder if this message should be elevated to ERROR if it represents the instance silently not processing work.

  • Support batch death

    Sidekiq added the notion of batch :death callback to capture the state where one or more batch jobs die and the batch will never succeed. Port this notion to Faktory Enterprise.
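
    For context, the Go client already exposes success and complete callbacks on batches; a death callback would presumably sit alongside them. A hedged sketch of today's API, with the hypothetical death callback shown only as a comment:

    ```go
    package main

    import (
    	"log"

    	faktory "github.com/contribsys/faktory/client"
    )

    func main() {
    	cl, err := faktory.Open()
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer cl.Close()

    	b := faktory.NewBatch(cl)
    	b.Description = "nightly import"
    	b.Success = faktory.NewJob("ImportSucceeded", "nightly")
    	b.Complete = faktory.NewJob("ImportFinished", "nightly")
    	// b.Death = faktory.NewJob("ImportDead", "nightly") // hypothetical: fires when a batch job dies

    	err = b.Jobs(func() error {
    		if err := b.Push(faktory.NewJob("ImportChunk", 1)); err != nil {
    			return err
    		}
    		return b.Push(faktory.NewJob("ImportChunk", 2))
    	})
    	if err != nil {
    		log.Fatal(err)
    	}
    }
    ```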

  • Provide remote configuration API

    Faktory's conf.d filesystem configuration is looking increasingly dated as more users and customers use Docker and containers where the local filesystem is not easy to provide or change. Instead, provide a Web API where TOML files can be POST'd via curl by an application deployment. Something like:

    for x in config/faktory/*.toml; do
      curl -T $x http://localhost:7420/config/upload/
    done && curl -X POST http://localhost:7420/config/reload
    

    (Basic auth elided for readability)

    That curl command will upload foo.toml to /config/upload/foo.toml. After all of the TOML files are uploaded, we call reload to have Faktory atomically reload its config.

    Configuration will be stored in Redis so it persists across Faktory boots. The config will track last modification time and SHAs for each config item in order to minimize unnecessary changes or reloads.

    1. Uploads will return an error if the TOML is malformed.
    2. Uploads will stage into a temporary Redis key. Calling reload will swap with the live config, just like calling HUP today. Otherwise the temp key should expire after N minutes.
    3. For backwards compatibility, an existing Faktory at boot will load conf.d files into Redis if any exist and their modification time is > the last config update modtime.
    4. Partial config update is not allowed. Every TOML file must be POST'd before reload. This is to prevent renaming a TOML and having the old filename hang around in Redis. For maximum reliability, POST one big TOML.

    For sanity's sake, it is strongly recommended to use either conf.d or /config/upload exclusively. Don't mix the two. The Web UI will not provide a manual upload page as this feature is intended for deployment automation.
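
    A hedged sketch of what the upload endpoint's server side could look like: validate the TOML, stage it under a temporary Redis key with a TTL, and let a later reload promote staged keys to live. The Redis address, the key names, the ten-minute TTL, and the go-redis and BurntSushi/toml packages are all assumptions, not the planned implementation:

    ```go
    package main

    import (
    	"context"
    	"io"
    	"net/http"
    	"path"
    	"time"

    	"github.com/BurntSushi/toml"
    	"github.com/redis/go-redis/v9"
    )

    var rdb = redis.NewClient(&redis.Options{Addr: "localhost:7421"}) // assumed Redis address

    // uploadHandler validates the uploaded TOML and stages it in Redis with a TTL.
    func uploadHandler(w http.ResponseWriter, r *http.Request) {
    	body, err := io.ReadAll(r.Body)
    	if err != nil {
    		http.Error(w, err.Error(), http.StatusBadRequest)
    		return
    	}
    	var parsed map[string]interface{}
    	if err := toml.Unmarshal(body, &parsed); err != nil { // reject malformed TOML
    		http.Error(w, "invalid TOML: "+err.Error(), http.StatusUnprocessableEntity)
    		return
    	}
    	name := path.Base(r.URL.Path) // e.g. foo.toml
    	key := "config:staging:" + name
    	if err := rdb.Set(context.Background(), key, body, 10*time.Minute).Err(); err != nil {
    		http.Error(w, err.Error(), http.StatusInternalServerError)
    		return
    	}
    	w.WriteHeader(http.StatusOK)
    }

    func main() {
    	http.HandleFunc("/config/upload/", uploadHandler)
    	// A /config/reload handler would copy config:staging:* to config:live:* atomically.
    	http.ListenAndServe(":7420", nil)
    }
    ```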
