fleet ties together systemd and etcd into a distributed init system

⚠️ Deprecation warning ⚠️

fleet is no longer developed or maintained by CoreOS. After February 1, 2018, a fleet container image will continue to be available from the CoreOS Quay registry, but will not be shipped as part of Container Linux. CoreOS instead recommends Kubernetes for all clustering needs.

The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact [email protected]

fleet - a distributed init system

fleet ties together systemd and etcd into a simple distributed init system. Think of it as an extension of systemd that operates at the cluster level instead of the machine level.

This project is quite low-level, and is designed as a foundation for higher-order orchestration. fleet is a cluster-wide elaboration on systemd units, and is not a container manager or orchestration system. fleet supports basic scheduling of systemd units across nodes in a cluster. Those with more complex scheduling requirements, or looking for a first-class container orchestration system, should check out Kubernetes. The fleet and Kubernetes comparison table has more information about the two systems.

Current status

The fleet project is no longer maintained.

As of v1.0.0, fleet has seen production use for some time and is largely considered stable. However, there are various known and unresolved issues, including scalability limitations with its architecture. As such, it is not recommended to run fleet clusters larger than 100 nodes or with more than 1000 services.

Using fleet

Launching a unit with fleet is as simple as running fleetctl start:

$ fleetctl start examples/hello.service
Unit hello.service launched on 113f16a7.../172.17.8.103

The fleetctl start command waits for the unit to get scheduled and actually start somewhere in the cluster. fleetctl list-unit-files tells you the desired state of your units and where they are currently scheduled:

$ fleetctl list-unit-files
UNIT            HASH     DSTATE    STATE     TMACHINE
hello.service   e55c0ae  launched  launched  113f16a7.../172.17.8.103

fleetctl list-units exposes the systemd state for each unit in your fleet cluster:

$ fleetctl list-units
UNIT            MACHINE                    ACTIVE   SUB
hello.service   113f16a7.../172.17.8.103   active   running

Supported Deployment Patterns

fleet is not intended to be an all-purpose orchestration system, and as such supports only a few simple deployment patterns:

  • Deploy a single unit anywhere on the cluster
  • Deploy a unit globally everywhere in the cluster
  • Automatic rescheduling of units on machine failure
  • Ensure that units are deployed together on the same machine
  • Forbid specific units from colocation on the same machine (anti-affinity)
  • Deploy units to machines only with specific metadata

These patterns are all defined using custom systemd unit options in an [X-Fleet] section of the unit file, as sketched below.
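For example, a unit can restrict itself to machines carrying particular metadata and refuse to share a machine with similar units. A minimal sketch (the unit body and metadata values are illustrative):

[Unit]
Description=Hello World

[Service]
ExecStart=/usr/bin/bash -c "while true; do echo 'Hello, world'; sleep 1; done"

[X-Fleet]
# Only schedule onto machines whose fleet metadata includes region=us-east-1
MachineMetadata=region=us-east-1
# Anti-affinity: never land on a machine already running a matching unit
Conflicts=hello*.service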

Getting Started

Before you can deploy units, fleet must be deployed and configured on each host in your cluster. (If you are running CoreOS, fleet is already installed.)

After you have machines configured (check fleetctl list-machines), get to work with the client.
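If everything is up, the output looks something like this (the machine shown mirrors the examples above):

$ fleetctl list-machines
MACHINE         IP              METADATA
113f16a7...     172.17.8.103    -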

Building

fleet must be built with Go 1.5+ on a Linux machine. Simply run ./build and then copy the binaries out of the bin/ directory onto each of your machines. The tests can similarly be run by invoking ./test.

If you're on a machine without Go 1.5+ but you have Docker installed, run ./build-docker to compile the binaries instead.
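A typical build-test-deploy loop might look like the following sketch (the destination host and path are placeholders):

./build
./test
scp bin/fleetd bin/fleetctl core@203.0.113.10:/opt/bin/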

Project Details

API

The fleet API uses JSON over HTTP to manage units in a fleet cluster. See the API documentation for more information.
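For instance, listing units looks roughly like this (assuming the API is served on its default unix socket; the path layout follows the fleet API docs):

curl --unix-socket /var/run/fleet.sock http://localhost/fleet/v1/units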

Release Notes

See the releases tab for more information on each release.

License

fleet is released under the Apache 2.0 license. See the LICENSE file for details.

Specific components of fleet use code derivative from software distributed under other licenses; in those cases the appropriate licenses are stipulated alongside the code.

Owner
CoreOS (Key components to secure, simplify and automate your container infrastructure)
Comments
  • Something wrong with fleet 0.3.1 in CoreOS master 315.0.0+2014-05-13-2126

    core@core-3 ~ $ fleetctl list-units
    UNIT                      STATE     LOAD    ACTIVE  SUB      DESC                  MACHINE
    etcd-amb-redis.service    launched  loaded  active  running  Ambassador on A       d8695a82.../192.168.65.4
    etcd-amb-redis2.service   launched  loaded  active  running  Ambassador on B       7382eb69.../192.168.65.2
    redis-demo.service        inactive  -       -       -        Redis on A            -
    redis-docker-reg.service  launched  loaded  active  running  Register on A         d8695a82.../192.168.65.4
    redis-dyn-amb.service     launched  loaded  active  running  Etcd Ambassador on B  7382eb69.../192.168.65.2
    core@core-3 ~ $ systemctl status redis-demo.service -l
    ● redis-demo.service - Redis on A
       Loaded: loaded (/run/fleet/units/redis-demo.service; linked-runtime)
       Active: active (running) since Tue 2014-05-13 23:01:36 UTC; 13min ago
      Process: 13338 ExecStartPre=/usr/bin/docker pull crosbymichael/redis (code=exited, status=0/SUCCESS)
     Main PID: 13610 (docker)
       CGroup: /system.slice/redis-demo.service
               └─13610 /usr/bin/docker run --rm --name redis-demo.service -p 192.168.65.4::6379 crosbymichael/redis
    
    May 13 23:01:36 core-3 docker[13610]: [Redis ASCII-art startup banner]
    May 13 23:01:36 core-3 docker[13610]: [1] 13 May 23:01:36.592 # Server started, Redis version 2.8.8
    May 13 23:01:36 core-3 docker[13610]: [1] 13 May 23:01:36.592 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
    May 13 23:01:36 core-3 docker[13610]: [1] 13 May 23:01:36.593 * The server is now ready to accept connections on port 6379
    
  • WIP: feat(engine): scheduling according to current cluster load

    Introduces a job control object with the ability to schedule jobs according to current load in the cluster. It considers memory, cores, and local-disk requirements for jobs, and the same dimensions for load. It has clearly defined dependencies for how to hook it into fleet and operate it (the hookup hasn't been done in this PR).

    Follow-up commits in this PR will flesh out the unit testing.

  • Dynamic Metadata

    Updated 2/28/2015

    • Added a new PATCH method to the machines collection endpoint to allow changing a machine's metadata via the HTTP API.
    • The API uses the jsonpatch format. Multiple machines can be modified at once, including machines that are not currently part of the cluster. "add", "replace", and "remove" operations are available.
    • Metadata modified via the API is retained even if the machine leaves and rejoins the cluster.
    • Dynamic metadata (metadata set by the user via the API) is merged with machine metadata (metadata defined by the fleet config, env variables, or flags) using the following rules:
      • Any key that exists in only one of the two collections is added to the final collection as-is.
      • For any key that exists in both, the value from the dynamic metadata is added to the final collection.
      • Any key whose dynamic-metadata value is the string zero-value is considered deleted and is not included in the final collection. This lets a user persistently delete a value set at configuration time, even if the machine leaves and rejoins the cluster.
    • The reconciler will reschedule any units on machines that no longer meet their metadata requirements (pre-existing functionality). A sketch of such a PATCH request follows.
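    A sketch of such a request, based on the description above (the machine-ID placeholder, metadata keys, and exact path layout are illustrative, not confirmed against the final API):

    PATCH /fleet/v1/machines
    [
      {"op": "add", "path": "/<machine-id>/metadata/region", "value": "us-east-1"},
      {"op": "remove", "path": "/<machine-id>/metadata/role"}
    ]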

    Known Issues:

    • Docs not complete
  • fleet: add replace unit support

    This PR allows units to be replaced with "submit", "load" and "start" commands. Just add the new "--replace" switch.

    The previous discussion, about overwriting, is in this PR: https://github.com/coreos/fleet/pull/1295

    This PR tries to fix: https://github.com/coreos/fleet/issues/760
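    For example (unit name illustrative):

    $ fleetctl start --replace examples/hello.service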

  • Use gRPC to communicate between the engine and agents

    This PR aims to provide a new communication mechanism to improve the performance, data transmission and unit state sharing between the fleet engine and agents in a fleet cluster.

    Motivation: In our infrastructure, we have experienced some issues with fleet in terms of scalability, performance and fault-tolerance. Therefore we'd like to present our ideas to help improve those areas.

    We use gRPC/HTTP2 as the framework to expose all the required operations (schedule unit, destroy unit, save state, ...) needed to coordinate the engine (the fleet node elected as leader) with the agents. In this implementation, we provide a new registry that stores all the information in-memory. Nevertheless, it's also possible to use the etcd registry.

    Generally, this implementation provides the two solutions mentioned above. You can use etcd if that fits your architecture or requirements better, or you can use the in-memory registry to reduce the dependency on etcd (though not to avoid etcd entirely). Along those lines, we found in our infrastructure that a high workload on etcd induces poor or incorrect behavior in fleet. We also believe that using etcd to provide inter-process communication for the agents could become a bottleneck, and that it hurts fleet's fault tolerance.

    Additional information and plots about the motivation for this PR can be found at: https://github.com/coreos/fleet/pull/1426#issuecomment-181778260

    This PR has been labeled WIP; we are still working on improvements, fault tolerance, bug fixes, etc. :)

    NOTE: If you want to try this PR, you need to rebuild the Go dependencies, and preferably use Go v1.5.1. This PR was so big that we were forced to exclude our new Go dependencies.

  • Add Consul support

    Consul comes with health checks, multi-datacenter support and other very nice features. Many people prefer it over etcd, so it would be great if fleet had Consul support.

  • Fleet is restarting on heavy load

    When I run fleet under heavy load (I start two services every second), it restarts really often (more than once per minute).

    What I see is that it kills all the running services and restarts them once again. The services also disappear from fleetctl list-units while it is restarting.

    I have installed CoreOS alpha on 3 bare-metal servers. One of them has a ping time of 2 ms and the others 0.2 ms.

    etcd has the default settings, and the cloud-init config file is quite standard.

    These are the logs from etcd:

    Sep 26 21:49:39 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:39.462 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:39 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:39.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m20.399085725s ago
    Sep 26 21:49:40 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:40.095 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:40 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:40.466 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:49:40 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:40.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m21.399033659s ago
    Sep 26 21:49:41 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:41.237 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:41 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:41.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m22.399033018s ago
    Sep 26 21:49:42 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:42.310 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:42 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:42.625 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=2 backoff="4s"
    Sep 26 21:49:42 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:42.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m23.399102714s ago
    Sep 26 21:49:43 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:43.519 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:43 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:43.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m24.399055692s ago
    Sep 26 21:49:44 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:44.153 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:44 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:44.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m25.399057728s ago
    Sep 26 21:49:45 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:45.361 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:45 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:45.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m26.399044122s ago
    Sep 26 21:49:46 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:46.143 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:46 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:46.974 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=4 backoff="8s"
    Sep 26 21:49:46 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:46.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m27.398971006s ago
    Sep 26 21:49:47 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:47.360 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:47 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:47.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m28.399056164s ago
    Sep 26 21:49:48 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:48.235 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:48 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:48.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m29.399041092s ago
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.136 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.136 INFO      | core_188_165_248_226: removing node: ; last activity 43m30.556753513s ago
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.385 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m31.399012358s ago
    Sep 26 21:49:51 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:51.118 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:51 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:51.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m32.399010584s ago
    Sep 26 21:49:52 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:52.294 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:52 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:52.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m33.399094304s ago
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.121 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.121 INFO      | core_188_165_248_226: removing node: ; last activity 43m34.540921885s ago
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.377 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.980 INFO      | core_188_165_248_226: removing node: ; last activity 43m35.400533653s ago
    Sep 26 21:49:55 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:55.101 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:55 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:55.580 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:49:55 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:55.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m36.399080664s ago
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.623 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.623 INFO      | core_188_165_248_226: removing node: ; last activity 43m38.043724209s ago
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.818 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m38.399247326s ago
    Sep 26 21:49:58 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:58.202 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:58 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:58.515 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:49:58 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:58.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m39.398997676s ago
    Sep 26 21:49:59 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:59.445 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:59 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:59.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m40.399139946s ago
    Sep 26 21:50:00 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:00.218 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:00 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:00.600 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=3 backoff="4s"
    Sep 26 21:50:00 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:00.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m41.399177445s ago
    Sep 26 21:50:01 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:01.409 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:01 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:01.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m42.399013462s ago
    Sep 26 21:50:02 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:02.280 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:02 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:02.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m43.399042518s ago
    Sep 26 21:50:03 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:03.101 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:03 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:03.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m44.399032916s ago
    Sep 26 21:50:04 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:04.699 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:04 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:04.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m45.399022199s ago
    Sep 26 21:50:05 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:05.147 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:50:05 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:05.360 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:05 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:05.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m46.399110631s ago
    Sep 26 21:50:06 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:06.251 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:06 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:06.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m47.399128338s ago
    Sep 26 21:50:07 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:07.531 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:50:09 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:09.632 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=6 backoff="4s"
    Sep 26 21:50:13 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:13.836 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=17 backoff="8s"
    Sep 26 21:50:14 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:14.116 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:14 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:14.116 INFO      | core_188_165_248_226: removing node: ; last activity 43m54.536329299s ago
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.247 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.247 INFO      | core_188_165_248_226: removing node: ; last activity 43m56.666983986s ago
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.565 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m57.399109068s ago
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.251 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.251 INFO      | core_188_165_248_226: removing node: ; last activity 44m1.671675901s ago
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.651 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.979 INFO      | core_188_165_248_226: removing node: ; last activity 44m2.398967504s ago
    Sep 26 21:50:22 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:22.325 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:22 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:22.979 INFO      | core_188_165_248_226: removing node: ; last activity 44m3.399087729s ago
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.626 INFO      | core_188_165_248_226: snapshot of 10109 events at index 29152983 completed
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.629 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.637 INFO      | core_188_165_248_226: state changed from 'leader' to 'follower'.
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.637 INFO      | core_188_165_248_226: term #4216 started.
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.637 INFO      | core_188_165_248_226: leader changed from 'core_188_165_248_226' to ''.
    Sep 26 21:53:53 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:53:53.056 INFO      | core_188_165_248_226: snapshot of 10001 events at index 29162984 completed
    Sep 26 21:57:53 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:57:53.385 INFO      | core_188_165_248_226: snapshot of 10126 events at index 29173110 completed
    

    And these are the logs from fleet:

    Sep 26 21:49:00 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:00 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:14 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:14 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:16 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:16 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:20 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:20 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    

    So the issue seems to be caused by etcd becoming unreachable. Should I raise the etcd timeouts?
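    If timeouts are the knob, two places to experiment (a sketch; the values are illustrative, not tuned recommendations) are fleet's etcd request timeout in fleet.conf and etcd 0.4's peer election timeout, set via a systemd drop-in like the one shown in a later comment:

    # /etc/fleet/fleet.conf
    etcd_request_timeout=3.0

    # systemd drop-in for etcd.service
    [Service]
    Environment=ETCD_PEER_ELECTION_TIMEOUT=1200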

  • improve experience working with template units

    It should not be possible for template units to be scheduled to a system. Right now the experience is not great: the template will be scheduled, but then cause chronic issues with the agent on that machine, e.g.:

    Oct 14 17:26:41 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name [email protected] is not valid.
    Oct 14 17:26:42 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name [email protected] is not valid.
    Oct 14 17:26:43 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name [email protected] is not valid.
    

    (Really, any unit with a bad name should never be scheduled; but fleetctl should now block all bad names except for template units).

    Related: #541
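    For reference, the distinction in play (unit names hypothetical): a template unit has an empty instance name and cannot run by itself; only its instances can:

    $ fleetctl submit foo@.service    # template: submitting is fine
    $ fleetctl start foo@1.service    # instantiates the template and starts it
    $ fleetctl start foo@.service     # should be rejected, not scheduled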

  • Cannot destroy and resubmit units on same host

    When trying to destroy and resubmit units on the same host, fleet gets confused and units wind up in a not-found and failed state.

    Note the session below is using fleet 0.3-rc1, but similar behavior has existed in 0.2:

    $ fleetctl --version
    fleetctl version 0.3.0-rc.1
    $ fleetctl destroy deis-cache.service
    Destroyed Job deis-cache.service
    $ fleetctl list-units
    UNIT    STATE   LOAD    ACTIVE  SUB DESC    MACHINE
    $ fleetctl submit cache/systemd/deis-cache.service
    $ fleetctl list-units
    UNIT                STATE     LOAD       ACTIVE  SUB     DESC        MACHINE
    deis-cache.service  inactive  not-found  failed  failed  deis-cache  951a306d.../172.17.8.100
    $ fleetctl status deis-cache.service
    ● deis-cache.service
       Loaded: not-found (Reason: No such file or directory)
       Active: failed (Result: exit-code) since Wed 2014-05-07 17:43:28 UTC; 21min ago
     Main PID: 3506 (code=exited, status=1/FAILURE)
    
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.133 # User requested shutdown...
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.133 * Saving the final RDB snapshot before exiting.
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.135 * DB saved on disk
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.135 # Redis is now ready to exit, bye bye...
    May 07 17:43:28 deis-1 systemd[1]: deis-cache.service: main process exited, code=exited, status=1/FAILURE
    May 07 17:43:28 deis-1 docker[16081]: deis-cache
    May 07 17:43:28 deis-1 systemd[1]: Stopped deis-cache.
    May 07 17:43:28 deis-1 systemd[1]: Unit deis-cache.service entered failed state.
    May 07 17:55:36 deis-1 systemd[1]: Stopped deis-cache.service.
    May 07 18:02:59 deis-1 systemd[1]: Stopped deis-cache.service.
    $ fleetctl cat deis-cache.service
    [Unit]
    Description=deis-cache
    
    [Service]
    EnvironmentFile=/etc/environment
    TimeoutStartSec=20m
    ExecStartPre=/bin/sh -c "/usr/bin/docker history deis/cache >/dev/null || /usr/bin/docker pull deis/cache"
    ExecStartPre=/bin/sh -c "/usr/bin/docker inspect deis-cache >/dev/null && /usr/bin/docker rm -f deis-cache || true"
    ExecStart=/bin/sh -c "docker run --name deis-cache -p 6379:6379 -e PUBLISH=6379 -e HOST=$COREOS_PRIVATE_IPV4 deis/cache"
    ExecStop=/usr/bin/docker rm -f deis-cache
    
    [Install]
    WantedBy=multi-user.target
    
  • systemd hides the fact that there exists a maximum unit file line length

    Hello,

    I think ExecStart in unit files has a limit on command length. The command gets cut off, and systemd reports "Missing '='".

    Does this limit really exist?

    This is the error:

    Oct 21 22:01:44 ip-172-31-20-195.ec2.internal systemd[1]: [/run/fleet/units/e-88747910-a986-4ed0-9b10-3af9.minion.service:9] Missing '='.
    Oct 21 22:01:45 ip-172-31-20-195.ec2.internal systemd[1]: [/run/fleet/units/e-88747910-a986-4ed0-9b10-3af9.minion.service:8] String is not UTF-8 clean, ignoring assignment: /bin/sh -c "while true; do ; /usr/bin/sleep 3; public_ipv4=$COREOS_PUBLIC_IPV4;  e88747910a9864ed09b103af9_8181=$(docker port e-88747910-a986-4ed0-9b10-3af9 8181 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_8101=$(docker port e-88747910-a986-4ed0-9b10-3af9 8101 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_4444=$(docker port e-88747910-a986-4ed0-9b10-3af9 4444 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_5555=$(docker port e-88747910-a986-4ed0-9b10-3af9 5555 | cut -d ':' -f 2 ); etcdctl set /_esb/user/john/instance/e-88747910-a986-4ed0-9b10-3af9 \"{\\\"ID\\\":\\\"e-88747910-a986-4ed0-9b10-3af9\\\",\\\"Labels\\\":{\\\"owner\\\":\\\"john\\\"},\\\"Created\\\":\\\"Tuesday, 21-Oct-14 21:53:59 UTC\\\",\\\"State\\\":\\\"running\\\",\\\"IP\\\":\\\"$public_ipv4\\\",\\\"Configuration\\\":{\\\"CPU_SHARED\\\":\\\"\\\",\\\"ESB\\\":\\\"servicemix\\\",\\\"Features\\\":[\\\"camel-sql\\\"],\\\"ID\\\":\\\"i-124691ed-f613-4ea5-ab6b-05c7\\\",\\\"JBI_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"JBI_corePoolSize\\\":\\\"4\\\",\\\"JBI_keepAliveTime\\\":\\\"60000\\\",\\\"JBI_maximumPoolSize\\\":\\\"-1\\\",\\\"JBI_queueSize\\\":\\\"1024\\\",\\\"JBI_shutdownTimeout\\\":\\\"0\\\",\\\"JVM_MAX_MEM\\\":\\\"\\\",\\\"JVM_MAX_PERM_MEM\\\":\\\"\\\",\\\"JVM_MIN_MEM\\\":\\\"\\\",\\\"JVM_PERM_MEM\\\":\\\"\\\",\\\"MAX_MEM\\\":\\\"\\\",\\\"NMR_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"NMR_corePoolSize\\\":\\\"4\\\",\\\"NMR_keepAliveTime\\\":\\\"60000\\\",\\\"NMR_maximumPoolSize\\\":\\\"-1\\\",\\\"NMR_queueSize\\\":\\\"1024\\\",\\\"NMR_shutdownTimeout\\\":\\\"0\\\",\\\"Name\\\":\\\"\\\",\\\"Ports\\\":{\\\"endpoint1\\\":4444,\\\"endpoint2\\\":5555},\\\"SERVICEMIX_PASSWORD\\\":\\\"smx\\\",\\\"SERVICEMIX_USER\\\":\\\"smx\\\"},\\\"PortBindings\\\":{\\\"4444\\\":\\\"$e88747910a9864ed09b103af9_4444\\\",\\\"5555\\\":\\\"$e88747910a9864ed09b103af9_5555\\\",\\\"8101\\\":\\\"$e88747910a9864ed09b103af9_8101\\\",\\\"8181\\\":\\\"$e88747910a9864ed09b103af9_8181\\\"},\\\"UserAccess\\\":{\\\"Karaf Console\\\":\\\"$public_ipv4$e88747910a9864ed09b103af
    

    And this is the unit file:

    [Unit]
    Description=Info docker e-88747910-a986-4ed0-9b10-3af9 ESB Instance
    After=e-88747910-a986-4ed0-9b10-3af9.service
    Requires=e-88747910-a986-4ed0-9b10-3af9.service
    
    [Service]
    EnvironmentFile=/etc/environment
    ExecStart=/bin/sh -c "while true; do ; /usr/bin/sleep 3; public_ipv4=$COREOS_PUBLIC_IPV4;  e88747910a9864ed09b103af9_8181=$(docker port e-88747910-a986-4ed0-9b10-3af9 8181 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_8101=$(docker port e-88747910-a986-4ed0-9b10-3af9 8101 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_4444=$(docker port e-88747910-a986-4ed0-9b10-3af9 4444 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_5555=$(docker port e-88747910-a986-4ed0-9b10-3af9 5555 | cut -d ':' -f 2 ); etcdctl set /_esb/user/john/instance/e-88747910-a986-4ed0-9b10-3af9 \"{\\\"ID\\\":\\\"e-88747910-a986-4ed0-9b10-3af9\\\",\\\"Labels\\\":{\\\"owner\\\":\\\"john\\\"},\\\"Created\\\":\\\"Tuesday, 21-Oct-14 21:53:59 UTC\\\",\\\"State\\\":\\\"running\\\",\\\"IP\\\":\\\"$public_ipv4\\\",\\\"Configuration\\\":{\\\"CPU_SHARED\\\":\\\"\\\",\\\"ESB\\\":\\\"servicemix\\\",\\\"Features\\\":[\\\"camel-sql\\\"],\\\"ID\\\":\\\"i-124691ed-f613-4ea5-ab6b-05c7\\\",\\\"JBI_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"JBI_corePoolSize\\\":\\\"4\\\",\\\"JBI_keepAliveTime\\\":\\\"60000\\\",\\\"JBI_maximumPoolSize\\\":\\\"-1\\\",\\\"JBI_queueSize\\\":\\\"1024\\\",\\\"JBI_shutdownTimeout\\\":\\\"0\\\",\\\"JVM_MAX_MEM\\\":\\\"\\\",\\\"JVM_MAX_PERM_MEM\\\":\\\"\\\",\\\"JVM_MIN_MEM\\\":\\\"\\\",\\\"JVM_PERM_MEM\\\":\\\"\\\",\\\"MAX_MEM\\\":\\\"\\\",\\\"NMR_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"NMR_corePoolSize\\\":\\\"4\\\",\\\"NMR_keepAliveTime\\\":\\\"60000\\\",\\\"NMR_maximumPoolSize\\\":\\\"-1\\\",\\\"NMR_queueSize\\\":\\\"1024\\\",\\\"NMR_shutdownTimeout\\\":\\\"0\\\",\\\"Name\\\":\\\"\\\",\\\"Ports\\\":{\\\"endpoint1\\\":4444,\\\"endpoint2\\\":5555},\\\"SERVICEMIX_PASSWORD\\\":\\\"smx\\\",\\\"SERVICEMIX_USER\\\":\\\"smx\\\"},\\\"PortBindings\\\":{\\\"4444\\\":\\\"$e88747910a9864ed09b103af9_4444\\\",\\\"5555\\\":\\\"$e88747910a9864ed09b103af9_5555\\\",\\\"8101\\\":\\\"$e88747910a9864ed09b103af9_8101\\\",\\\"8181\\\":\\\"$e88747910a9864ed09b103af9_8181\\\"},\\\"UserAccess\\\":{\\\"Karaf Console\\\":\\\"$public_ipv4$e88747910a9864ed09b103af9_8101\\\",\\\"Web Console\\\":\\\"http://$public_ipv4:$e88747910a9864ed09b103af9_8181/system/console\\\"},\\\"Error\\\":\\\"\\\"}\" --ttl 90; /usr/bin/sleep 30; done"
    ExecStop=/usr/bin/echo stopped
    
    [X-Fleet]
    MachineOf=e-88747910-a986-4ed0-9b10-3af9.service
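    A common workaround (a sketch; the script path is illustrative) is to move the long command body into a script so that no single unit-file line approaches the limit:

    [Service]
    EnvironmentFile=/etc/environment
    # register-instance.sh contains the while/etcdctl loop from the ExecStart above
    ExecStart=/opt/bin/register-instance.sh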
    
  • Efficient resource utilisation, re-balancing mandate for Engine

    Efficient resource utilisation, re-balancing mandate for Engine

    I'm looking for a way to improve Engine scheduling so that its decisions can be based on machine metrics (available CPU, memory) rather than the number of running units. For the longer term, I would love to see the ability to plug in different scheduling heuristics (first globally, but maybe later per unit). As a first step, an MVP if you will: if each machine could report a 'score' on itself, the engine could just pick the machine with the highest score. This would be a huge improvement over today's scheduling. Even longer term, I'm looking to have a re-balancing feature, whereby the scheduler would have a mandate to re-assign units to other machines. That is probably a separate development thread, but it could build on the previous one I mentioned.

    So to give a couple of examples:

    • My cluster is composed of two single-core computers with 2 GB of memory, and one dual-core with 4 GB of memory. fleet would run twice as many units on the stronger computer.
    • My cluster is running at near capacity. I add a node to the cluster. fleet re-balances the load by moving some units around
    • Some of my units consume a lot of CPU and memory, and some consume very little. Fleet would schedule units based on their resource utilisation and free capacity.

    Note: even a partial implementation of this approach would bring huge advantages, such as being able to efficiently use machines of different capacities within the same cluster.

    This is related to #555, I think #555 is a required step to achieve what I have outlined above.

  • Fix error formatting based on best practices from Code Review Comments

  • Here $SSHTIMEOUT does not display the proper value

    [Unit]
    Description=SSH Per-Connection Server
    After=syslog.target

    [Service]
    EnvironmentFile=-/etc/default/dropbear
    #ExecStartPre=/ect/scripts/ssh_timeout.sh
    ExecStartPre=/bin/sh -c 'SSHTIMEOUT=$(/bin/cat /etc/atom/defaults/syscfg_baseline.db | /bin/grep ssh_timeout | /bin/cut -d "=" -f2)'
    ExecStart=/usr/sbin/dropbear -i -I $SSHTIMEOUT -r /var/tmp/dropbear_rsa_host_key -p 22 $DROPBEAR_EXTRA_ARGS
    ExecReload=/bin/kill -HUP $MAINPID
    StandardInput=socket
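    Each Exec* line runs in its own process, so a shell variable set in ExecStartPre is never visible to ExecStart. A sketch of one fix, doing the lookup and the exec in the same shell (the $$ escapes keep systemd from expanding the variables before the shell sees them):

    ExecStart=/bin/sh -c 'SSHTIMEOUT=$$(/bin/grep ssh_timeout /etc/atom/defaults/syscfg_baseline.db | /bin/cut -d "=" -f2); exec /usr/sbin/dropbear -i -I $$SSHTIMEOUT -r /var/tmp/dropbear_rsa_host_key -p 22 $DROPBEAR_EXTRA_ARGS'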

  • CoreOS cluster restarted all containers due to fleet or etcd errors

    Hello. We just saw a pretty severe issue on our production CoreOS setup. Details are:

    • 3 CoreOS nodes running in AWS EC2 us-east-1
    • m3.2xlarge instance types
    • CoreOS nodes - 2 are DISTRIB_RELEASE=1068.2.0 and 1 is at DISTRIB_RELEASE=1081.5.0
    • etcd version 0.4.9
    • we have auto-update disabled on CoreOS
    • Around 21:56 UTC on Jan 17 we saw all our containers go down and the logs seemed to suggest an issue with etcd

    Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
    Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
    Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
    Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
    Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
    Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
    Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
    Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded

    • We checked the CPU and disk IO for all 3 instances, there is NO indication of any CPU spike per AWS Cloudwatch
    • etcd config is as below:

    core@ip-10-26-33-251 ~ $ sudo systemctl cat etcd

    # /usr/lib64/systemd/system/etcd.service
    [Unit]
    Description=etcd
    Conflicts=etcd2.service

    [Service]
    User=etcd
    PermissionsStartOnly=true
    Environment=ETCD_DATA_DIR=/var/lib/etcd
    Environment=ETCD_NAME=%m
    ExecStart=/usr/bin/etcd
    Restart=always
    RestartSec=10s
    LimitNOFILE=40000

    # /run/systemd/system/etcd.service.d/10-oem.conf
    [Service]
    Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

    # /run/systemd/system/etcd.service.d/20-cloudinit.conf
    [Service]
    Environment="ETCD_ADDR=10.26.33.251:4001"
    Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
    Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
    Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
    Environment="ETCD_PEER_ADDR=10.26.33.251:7001"

    • Attached the fleet & etcd logs from all nodes

    etcd-10-26-31-100.txt etcd-10-26-32-94.txt etcd-10-26-33-251.txt fleet-10-26-31-100.txt fleet-10-26-32-94.txt fleet-10-26-33-251.txt

    • AWS status dashboard does not show any errors or issues on their end

    Appreciate if someone can take a look at the above and give us any pointers on what to look at and what we can do to mitigate this.

    I opened a ticket - https://github.com/coreos/etcd/issues/7177 - against etcd and was redirected here.

    Thx, Maulik

  • After reboots, timers sometimes broken due to missing service files

    I'm running CoreOS Stable, 1122.3.0 on Google Compute Engine. (Thus: fleet 0.11.7.)

    Sometimes, after a reboot, fleet-controlled timers try to start before their associated fleet-controlled services have been loaded, resulting in timer failures. I'd expect the fleet launcher to wait until all the parts of a timer are loaded before starting. (Or maybe just load everything on a rebooted node before starting anything.)

    A stripped log shows the sequence:

    -- Reboot --
    systemd[1]: Started fleet daemon.
    fleetd[1221]: INFO fleetd.go:64: Starting fleetd version 0.11.7
    fleetd[1221]: INFO manager.go:246: Writing systemd unit cd-pipeline-run.timer (118b)
    fleetd[1221]: INFO manager.go:182: Instructing systemd to reload units
    systemd[1]: cd-pipeline-run.timer: Refusing to start, unit to trigger not loaded.
    systemd[1]: Failed to start Run the Classifier Data Pipeline.
    fleetd[1221]: INFO manager.go:127: Triggered systemd unit cd-pipeline-run.timer start: job=1432
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=cd-pipeline-run.timer reason="unit scheduled here but not loaded"
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=cd-pipeline-run.timer reason="unit currently loaded but desired state is launched"
    fleetd[1221]: INFO manager.go:246: Writing systemd unit cd-pipeline-run.service (2267b)
    fleetd[1221]: INFO manager.go:182: Instructing systemd to reload units
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=cd-pipeline-run.service reason="unit scheduled here but not loaded"
    

    (The more complete log is in a Gist, here.)

    I've yet to find the fleet option that tells it that a pair (or more) of unit files need to be handled together...
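    fleet's colocation option at least keeps the pair on the same machine (a sketch using the unit names from the log); start ordering on that machine would still come from the units' own After=/Requires= directives:

    # cd-pipeline-run.timer
    [X-Fleet]
    MachineOf=cd-pipeline-run.service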
