fleet ties together systemd and etcd into a distributed init system

⚠️ Deprecation warning ⚠️

fleet is no longer developed or maintained by CoreOS. After February 1, 2018, a fleet container image will continue to be available from the CoreOS Quay registry, but will not be shipped as part of Container Linux. CoreOS instead recommends Kubernetes for all clustering needs.

The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact [email protected]

fleet - a distributed init system

fleet ties together systemd and etcd into a simple distributed init system. Think of it as an extension of systemd that operates at the cluster level instead of the machine level.

This project is quite low-level, and is designed as a foundation for higher-order orchestration. fleet is a cluster-wide elaboration on systemd units, and is not a container manager or orchestration system. fleet supports basic scheduling of systemd units across nodes in a cluster. Those with more complex scheduling requirements, or looking for a first-class container orchestration system, should check out Kubernetes. The fleet and Kubernetes comparison table has more information about the two systems.

Current status

The fleet project is no longer maintained.

As of v1.0.0, fleet has seen production use for some time and is largely considered stable. However, there are various known and unresolved issues, including scalability limitations with its architecture. As such, it is not recommended to run fleet clusters larger than 100 nodes or with more than 1000 services.

Using fleet

Launching a unit with fleet is as simple as running fleetctl start:

$ fleetctl start examples/hello.service
Unit hello.service launched on 113f16a7.../172.17.8.103

The fleetctl start command waits for the unit to get scheduled and actually start somewhere in the cluster. fleetctl list-unit-files tells you the desired state of your units and where they are currently scheduled:

$ fleetctl list-unit-files
UNIT            HASH     DSTATE    STATE     TMACHINE
hello.service   e55c0ae  launched  launched  113f16a7.../172.17.8.103

fleetctl list-units exposes the systemd state for each unit in your fleet cluster:

$ fleetctl list-units
UNIT            MACHINE                    ACTIVE   SUB
hello.service   113f16a7.../172.17.8.103   active   running

Supported Deployment Patterns

fleet is not intended to be an all-purpose orchestration system, and as such supports only a few simple deployment patterns:

  • Deploy a single unit anywhere on the cluster
  • Deploy a unit globally everywhere in the cluster
  • Automatic rescheduling of units on machine failure
  • Ensure that units are deployed together on the same machine
  • Forbid specific units from colocation on the same machine (anti-affinity)
  • Deploy units to machines only with specific metadata

These patterns are all defined using custom systemd unit options in an [X-Fleet] section of the unit file, as sketched below.
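For example, a unit can restrict itself to machines carrying particular metadata and refuse to share a machine with similar units. A minimal sketch (the unit body and metadata values are illustrative):

[Unit]
Description=Hello World

[Service]
ExecStart=/usr/bin/bash -c "while true; do echo 'Hello, world'; sleep 1; done"

[X-Fleet]
# Only schedule onto machines whose fleet metadata includes region=us-east-1
MachineMetadata=region=us-east-1
# Anti-affinity: never land on a machine already running a matching unit
Conflicts=hello*.service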

Getting Started

Before you can deploy units, fleet must be deployed and configured on each host in your cluster. (If you are running CoreOS, fleet is already installed.)

After you have machines configured (check fleetctl list-machines), get to work with the client.
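If everything is up, the output looks something like this (the machine shown mirrors the examples above):

$ fleetctl list-machines
MACHINE         IP              METADATA
113f16a7...     172.17.8.103    -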

Building

fleet must be built with Go 1.5+ on a Linux machine. Simply run ./build and then copy the binaries out of the bin/ directory onto each of your machines. The tests can similarly be run by invoking ./test.

If you're on a machine without Go 1.5+ but you have Docker installed, run ./build-docker to compile the binaries instead.
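A typical build-test-deploy loop might look like the following sketch (the destination host and path are placeholders):

./build
./test
scp bin/fleetd bin/fleetctl core@203.0.113.10:/opt/bin/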

Project Details

API

The fleet API uses JSON over HTTP to manage units in a fleet cluster. See the API documentation for more information.
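For instance, listing units looks roughly like this (assuming the API is served on its default unix socket; the path layout follows the fleet API docs):

curl --unix-socket /var/run/fleet.sock http://localhost/fleet/v1/units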

Release Notes

See the releases tab for more information on each release.

License

fleet is released under the Apache 2.0 license. See the LICENSE file for details.

Specific components of fleet use code derivative from software distributed under other licenses; in those cases the appropriate licenses are stipulated alongside the code.

Owner
CoreOS (Key components to secure, simplify and automate your container infrastructure)
Comments
  • Something wrong with fleet 0.3.1 in CoreOS master 315.0.0+2014-05-13-2126

    core@core-3 ~ $ fleetctl list-units
    UNIT                      STATE     LOAD    ACTIVE  SUB      DESC                  MACHINE
    etcd-amb-redis.service    launched  loaded  active  running  Ambassador on A       d8695a82.../192.168.65.4
    etcd-amb-redis2.service   launched  loaded  active  running  Ambassador on B       7382eb69.../192.168.65.2
    redis-demo.service        inactive  -       -       -        Redis on A            -
    redis-docker-reg.service  launched  loaded  active  running  Register on A         d8695a82.../192.168.65.4
    redis-dyn-amb.service     launched  loaded  active  running  Etcd Ambassador on B  7382eb69.../192.168.65.2
    core@core-3 ~ $ systemctl status redis-demo.service -l
    ● redis-demo.service - Redis on A
       Loaded: loaded (/run/fleet/units/redis-demo.service; linked-runtime)
       Active: active (running) since Tue 2014-05-13 23:01:36 UTC; 13min ago
      Process: 13338 ExecStartPre=/usr/bin/docker pull crosbymichael/redis (code=exited, status=0/SUCCESS)
     Main PID: 13610 (docker)
       CGroup: /system.slice/redis-demo.service
               └─13610 /usr/bin/docker run --rm --name redis-demo.service -p 192.168.65.4::6379 crosbymichael/redis
    
    May 13 23:01:36 core-3 docker[13610]: [Redis ASCII-art startup banner]
    May 13 23:01:36 core-3 docker[13610]: [1] 13 May 23:01:36.592 # Server started, Redis version 2.8.8
    May 13 23:01:36 core-3 docker[13610]: [1] 13 May 23:01:36.592 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
    May 13 23:01:36 core-3 docker[13610]: [1] 13 May 23:01:36.593 * The server is now ready to accept connections on port 6379
    
  • WIP: feat(engine): scheduling according to current cluster load

    Introduces a job control object with the ability to schedule jobs according to current load in the cluster. It considers memory, cores, and local-disk requirements for jobs, and the same dimensions for load. It has clearly defined dependencies for how to hook it into fleet and operate it (the hookup hasn't been done in this PR).

    Follow-up commits in this PR will flesh out the unit testing.

  • Dynamic Metadata

    Updated 2/28/2015

    • Added a new PATCH method to the machines collection endpoint to allow changing a machine's metadata via the HTTP API.
    • The API uses the jsonpatch format. Multiple machines can be modified at once, including machines that are not currently part of the cluster. "add", "replace", and "remove" operations are available.
    • Metadata modified via the API is retained even if the machine leaves and rejoins the cluster.
    • Dynamic metadata (metadata set by the user via the API) is merged with machine metadata (metadata defined by the fleet config, env variables, or flags) using the following rules:
      • Any key that exists in only one of the two collections is added to the final collection as-is.
      • For any key that exists in both, the value from the dynamic metadata is added to the final collection.
      • Any key whose dynamic-metadata value is the string zero-value is considered deleted and is not included in the final collection. This lets a user persistently delete a value set at configuration time, even if the machine leaves and rejoins the cluster.
    • The reconciler will reschedule any units on machines that no longer meet their metadata requirements (pre-existing functionality). A sketch of such a PATCH request follows.
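    A sketch of such a request, based on the description above (the machine-ID placeholder, metadata keys, and exact path layout are illustrative, not confirmed against the final API):

    PATCH /fleet/v1/machines
    [
      {"op": "add", "path": "/<machine-id>/metadata/region", "value": "us-east-1"},
      {"op": "remove", "path": "/<machine-id>/metadata/role"}
    ]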

    Known Issues:

    • Docs not complete
  • fleet: add replace unit support

    This PR allows units to be replaced with "submit", "load" and "start" commands. Just add the new "--replace" switch.

    The previous discussion, about overwriting, is in this PR: https://github.com/coreos/fleet/pull/1295

    This PR tries to fix: https://github.com/coreos/fleet/issues/760
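    For example (unit name illustrative):

    $ fleetctl start --replace examples/hello.service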

  • Use gRPC to communicate between the engine and agents

    This PR aims to provide a new communication mechanism to improve the performance, data transmission and unit state sharing between the fleet engine and agents in a fleet cluster.

    Motivation: In our infrastructure, we have experienced some issues with fleet in terms of scalability, performance and fault-tolerance. Therefore we'd like to present our ideas to help improve those areas.

    We use gRPC/HTTP2 as the framework to expose all the required operations (schedule unit, destroy unit, save state, ...) needed to coordinate the engine (the fleet node elected as leader) with the agents. In this implementation, we provide a new registry that stores all the information in-memory. Nevertheless, it's also possible to use the etcd registry.

    Generally, this implementation provides the two solutions mentioned above. You can use etcd if that fits your architecture or requirements better, or you can use the in-memory registry to reduce the dependency on etcd (though not to avoid etcd entirely). Along those lines, we found in our infrastructure that a high workload on etcd induces poor or incorrect behavior in fleet. We also believe that using etcd to provide inter-process communication for the agents could become a bottleneck, and that it hurts fleet's fault tolerance.

    Additional information and plots about the motivation for this PR can be found at: https://github.com/coreos/fleet/pull/1426#issuecomment-181778260

    This PR has been labeled WIP; we are still working on improvements, fault tolerance, bug fixes, etc. :)

    NOTE: If you want to try this PR, you need to rebuild the Go dependencies, and preferably use Go v1.5.1. This PR was so big that we were forced to exclude our new Go dependencies.

  • Add Consul support

    Consul comes with health checks, multi-datacenter support and other very nice features. Many people prefer it over etcd, so it would be great if fleet had Consul support.

  • Fleet is restarting on heavy load

    When I run fleet under heavy load (I start two services every second), it restarts really often (more than once per minute).

    What I see is that it kills all the running services and restarts them once again. The services also disappear from fleetctl list-units while it is restarting.

    I have installed CoreOS alpha on 3 bare-metal servers. One of them has a ping time of 2 ms and the others 0.2 ms.

    etcd has the default settings, and the cloud-init config file is quite standard.

    These are the logs from etcd:

    Sep 26 21:49:39 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:39.462 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:39 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:39.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m20.399085725s ago
    Sep 26 21:49:40 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:40.095 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:40 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:40.466 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:49:40 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:40.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m21.399033659s ago
    Sep 26 21:49:41 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:41.237 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:41 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:41.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m22.399033018s ago
    Sep 26 21:49:42 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:42.310 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:42 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:42.625 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=2 backoff="4s"
    Sep 26 21:49:42 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:42.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m23.399102714s ago
    Sep 26 21:49:43 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:43.519 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:43 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:43.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m24.399055692s ago
    Sep 26 21:49:44 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:44.153 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:44 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:44.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m25.399057728s ago
    Sep 26 21:49:45 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:45.361 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:45 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:45.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m26.399044122s ago
    Sep 26 21:49:46 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:46.143 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:46 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:46.974 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=4 backoff="8s"
    Sep 26 21:49:46 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:46.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m27.398971006s ago
    Sep 26 21:49:47 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:47.360 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:47 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:47.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m28.399056164s ago
    Sep 26 21:49:48 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:48.235 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:48 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:48.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m29.399041092s ago
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.136 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.136 INFO      | core_188_165_248_226: removing node: ; last activity 43m30.556753513s ago
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.385 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:50 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:50.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m31.399012358s ago
    Sep 26 21:49:51 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:51.118 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:51 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:51.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m32.399010584s ago
    Sep 26 21:49:52 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:52.294 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:52 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:52.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m33.399094304s ago
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.121 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.121 INFO      | core_188_165_248_226: removing node: ; last activity 43m34.540921885s ago
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.377 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:54 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:54.980 INFO      | core_188_165_248_226: removing node: ; last activity 43m35.400533653s ago
    Sep 26 21:49:55 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:55.101 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:55 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:55.580 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:49:55 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:55.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m36.399080664s ago
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.623 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.623 INFO      | core_188_165_248_226: removing node: ; last activity 43m38.043724209s ago
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.818 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:57 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:57.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m38.399247326s ago
    Sep 26 21:49:58 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:58.202 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:58 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:58.515 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:49:58 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:58.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m39.398997676s ago
    Sep 26 21:49:59 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:59.445 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:49:59 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:49:59.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m40.399139946s ago
    Sep 26 21:50:00 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:00.218 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:00 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:00.600 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=3 backoff="4s"
    Sep 26 21:50:00 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:00.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m41.399177445s ago
    Sep 26 21:50:01 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:01.409 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:01 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:01.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m42.399013462s ago
    Sep 26 21:50:02 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:02.280 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:02 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:02.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m43.399042518s ago
    Sep 26 21:50:03 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:03.101 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:03 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:03.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m44.399032916s ago
    Sep 26 21:50:04 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:04.699 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:04 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:04.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m45.399022199s ago
    Sep 26 21:50:05 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:05.147 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:50:05 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:05.360 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:05 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:05.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m46.399110631s ago
    Sep 26 21:50:06 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:06.251 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:06 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:06.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m47.399128338s ago
    Sep 26 21:50:07 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:07.531 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=1 backoff="2s"
    Sep 26 21:50:09 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:09.632 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=6 backoff="4s"
    Sep 26 21:50:13 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:13.836 INFO      | core_188_165_248_226: warning: heartbeat time out peer="core_188_165_201_181" missed=17 backoff="8s"
    Sep 26 21:50:14 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:14.116 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:14 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:14.116 INFO      | core_188_165_248_226: removing node: ; last activity 43m54.536329299s ago
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.247 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.247 INFO      | core_188_165_248_226: removing node: ; last activity 43m56.666983986s ago
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.565 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:16 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:16.979 INFO      | core_188_165_248_226: removing node: ; last activity 43m57.399109068s ago
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.251 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.251 INFO      | core_188_165_248_226: removing node: ; last activity 44m1.671675901s ago
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.651 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:21 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:21.979 INFO      | core_188_165_248_226: removing node: ; last activity 44m2.398967504s ago
    Sep 26 21:50:22 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:22.325 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:22 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:22.979 INFO      | core_188_165_248_226: removing node: ; last activity 44m3.399087729s ago
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.626 INFO      | core_188_165_248_226: snapshot of 10109 events at index 29152983 completed
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.629 INFO      | core_188_165_248_226: warning: autodemotion error: Not a file (/_etcd/machines)
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.637 INFO      | core_188_165_248_226: state changed from 'leader' to 'follower'.
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.637 INFO      | core_188_165_248_226: term #4216 started.
    Sep 26 21:50:23 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:50:23.637 INFO      | core_188_165_248_226: leader changed from 'core_188_165_248_226' to ''.
    Sep 26 21:53:53 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:53:53.056 INFO      | core_188_165_248_226: snapshot of 10001 events at index 29162984 completed
    Sep 26 21:57:53 core_188_165_248_226 etcd[654]: [etcd] Sep 26 21:57:53.385 INFO      | core_188_165_248_226: snapshot of 10126 events at index 29173110 completed
    

    And these are the logs from fleet:

    Sep 26 21:49:00 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:00 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:08 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:09 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:14 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:14 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:16 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:16 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:20 core_188_165_248_226 fleetd[655]: ERROR engine.go:105: Engine leadership acquisition failed: timeout reached
    Sep 26 21:49:20 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    Sep 26 21:49:21 core_188_165_248_226 fleetd[655]: INFO client.go:278: Failed getting response from http://localhost:4001/: cancelled
    

    So the issue seems to be caused by etcd becoming unreachable. Should I raise the etcd timeouts?
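    If timeouts are the knob, two places to experiment (a sketch; the values are illustrative, not tuned recommendations) are fleet's etcd request timeout in fleet.conf and etcd 0.4's peer election timeout, set via a systemd drop-in like the one shown in a later comment:

    # /etc/fleet/fleet.conf
    etcd_request_timeout=3.0

    # systemd drop-in for etcd.service
    [Service]
    Environment=ETCD_PEER_ELECTION_TIMEOUT=1200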

  • improve experience working with template units

    It should not be possible for template units to be scheduled to a system. Right now the experience is not great: the template will be scheduled, but then cause chronic issues with the agent on that machine, e.g.:

    Oct 14 17:26:41 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name [email protected] is not valid.
    Oct 14 17:26:42 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name [email protected] is not valid.
    Oct 14 17:26:43 core-01 fleetd[557]: ERROR generator.go:51: Failed fetching current unit states: Unit name [email protected] is not valid.
    

    (Really, any unit with a bad name should never be scheduled; but fleetctl should now block all bad names except for template units).

    Related: #541
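    For reference, the distinction in play (unit names hypothetical): a template unit has an empty instance name and cannot run by itself; only its instances can:

    $ fleetctl submit foo@.service    # template: submitting is fine
    $ fleetctl start foo@1.service    # instantiates the template and starts it
    $ fleetctl start foo@.service     # should be rejected, not scheduled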

  • Cannot destroy and resubmit units on same host

    When trying to destroy and resubmit units on the same host, fleet gets confused and units wind up in a not-found and failed state.

    Note the session below is using fleet 0.3-rc1, but similar behavior has existed in 0.2:

    $ fleetctl --version
    fleetctl version 0.3.0-rc.1
    $ fleetctl destroy deis-cache.service
    Destroyed Job deis-cache.service
    $ fleetctl list-units
    UNIT    STATE   LOAD    ACTIVE  SUB DESC    MACHINE
    $ fleetctl submit cache/systemd/deis-cache.service
    $ fleetctl list-units
    UNIT                STATE     LOAD       ACTIVE  SUB     DESC        MACHINE
    deis-cache.service  inactive  not-found  failed  failed  deis-cache  951a306d.../172.17.8.100
    $ fleetctl status deis-cache.service
    ● deis-cache.service
       Loaded: not-found (Reason: No such file or directory)
       Active: failed (Result: exit-code) since Wed 2014-05-07 17:43:28 UTC; 21min ago
     Main PID: 3506 (code=exited, status=1/FAILURE)
    
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.133 # User requested shutdown...
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.133 * Saving the final RDB snapshot before exiting.
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.135 * DB saved on disk
    May 07 17:43:28 deis-1 sh[3506]: [35] 07 May 17:43:28.135 # Redis is now ready to exit, bye bye...
    May 07 17:43:28 deis-1 systemd[1]: deis-cache.service: main process exited, code=exited, status=1/FAILURE
    May 07 17:43:28 deis-1 docker[16081]: deis-cache
    May 07 17:43:28 deis-1 systemd[1]: Stopped deis-cache.
    May 07 17:43:28 deis-1 systemd[1]: Unit deis-cache.service entered failed state.
    May 07 17:55:36 deis-1 systemd[1]: Stopped deis-cache.service.
    May 07 18:02:59 deis-1 systemd[1]: Stopped deis-cache.service.
    $ fleetctl cat deis-cache.service
    [Unit]
    Description=deis-cache
    
    [Service]
    EnvironmentFile=/etc/environment
    TimeoutStartSec=20m
    ExecStartPre=/bin/sh -c "/usr/bin/docker history deis/cache >/dev/null || /usr/bin/docker pull deis/cache"
    ExecStartPre=/bin/sh -c "/usr/bin/docker inspect deis-cache >/dev/null && /usr/bin/docker rm -f deis-cache || true"
    ExecStart=/bin/sh -c "docker run --name deis-cache -p 6379:6379 -e PUBLISH=6379 -e HOST=$COREOS_PRIVATE_IPV4 deis/cache"
    ExecStop=/usr/bin/docker rm -f deis-cache
    
    [Install]
    WantedBy=multi-user.target
    
  • systemd hides the fact that there exists a maximum unit file line length

    Hello,

    I think ExecStart in unit files has a limit on command length. The command gets cut off, and systemd reports "Missing '='".

    Does this limit really exist?

    This is the error:

    Oct 21 22:01:44 ip-172-31-20-195.ec2.internal systemd[1]: [/run/fleet/units/e-88747910-a986-4ed0-9b10-3af9.minion.service:9] Missing '='.
    Oct 21 22:01:45 ip-172-31-20-195.ec2.internal systemd[1]: [/run/fleet/units/e-88747910-a986-4ed0-9b10-3af9.minion.service:8] String is not UTF-8 clean, ignoring assignment: /bin/sh -c "while true; do ; /usr/bin/sleep 3; public_ipv4=$COREOS_PUBLIC_IPV4;  e88747910a9864ed09b103af9_8181=$(docker port e-88747910-a986-4ed0-9b10-3af9 8181 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_8101=$(docker port e-88747910-a986-4ed0-9b10-3af9 8101 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_4444=$(docker port e-88747910-a986-4ed0-9b10-3af9 4444 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_5555=$(docker port e-88747910-a986-4ed0-9b10-3af9 5555 | cut -d ':' -f 2 ); etcdctl set /_esb/user/john/instance/e-88747910-a986-4ed0-9b10-3af9 \"{\\\"ID\\\":\\\"e-88747910-a986-4ed0-9b10-3af9\\\",\\\"Labels\\\":{\\\"owner\\\":\\\"john\\\"},\\\"Created\\\":\\\"Tuesday, 21-Oct-14 21:53:59 UTC\\\",\\\"State\\\":\\\"running\\\",\\\"IP\\\":\\\"$public_ipv4\\\",\\\"Configuration\\\":{\\\"CPU_SHARED\\\":\\\"\\\",\\\"ESB\\\":\\\"servicemix\\\",\\\"Features\\\":[\\\"camel-sql\\\"],\\\"ID\\\":\\\"i-124691ed-f613-4ea5-ab6b-05c7\\\",\\\"JBI_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"JBI_corePoolSize\\\":\\\"4\\\",\\\"JBI_keepAliveTime\\\":\\\"60000\\\",\\\"JBI_maximumPoolSize\\\":\\\"-1\\\",\\\"JBI_queueSize\\\":\\\"1024\\\",\\\"JBI_shutdownTimeout\\\":\\\"0\\\",\\\"JVM_MAX_MEM\\\":\\\"\\\",\\\"JVM_MAX_PERM_MEM\\\":\\\"\\\",\\\"JVM_MIN_MEM\\\":\\\"\\\",\\\"JVM_PERM_MEM\\\":\\\"\\\",\\\"MAX_MEM\\\":\\\"\\\",\\\"NMR_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"NMR_corePoolSize\\\":\\\"4\\\",\\\"NMR_keepAliveTime\\\":\\\"60000\\\",\\\"NMR_maximumPoolSize\\\":\\\"-1\\\",\\\"NMR_queueSize\\\":\\\"1024\\\",\\\"NMR_shutdownTimeout\\\":\\\"0\\\",\\\"Name\\\":\\\"\\\",\\\"Ports\\\":{\\\"endpoint1\\\":4444,\\\"endpoint2\\\":5555},\\\"SERVICEMIX_PASSWORD\\\":\\\"smx\\\",\\\"SERVICEMIX_USER\\\":\\\"smx\\\"},\\\"PortBindings\\\":{\\\"4444\\\":\\\"$e88747910a9864ed09b103af9_4444\\\",\\\"5555\\\":\\\"$e88747910a9864ed09b103af9_5555\\\",\\\"8101\\\":\\\"$e88747910a9864ed09b103af9_8101\\\",\\\"8181\\\":\\\"$e88747910a9864ed09b103af9_8181\\\"},\\\"UserAccess\\\":{\\\"Karaf Console\\\":\\\"$public_ipv4$e88747910a9864ed09b103af
    

    And this is the unit file:

    [Unit]
    Description=Info docker e-88747910-a986-4ed0-9b10-3af9 ESB Instance
    After=e-88747910-a986-4ed0-9b10-3af9.service
    Requires=e-88747910-a986-4ed0-9b10-3af9.service
    
    [Service]
    EnvironmentFile=/etc/environment
    ExecStart=/bin/sh -c "while true; do ; /usr/bin/sleep 3; public_ipv4=$COREOS_PUBLIC_IPV4;  e88747910a9864ed09b103af9_8181=$(docker port e-88747910-a986-4ed0-9b10-3af9 8181 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_8101=$(docker port e-88747910-a986-4ed0-9b10-3af9 8101 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_4444=$(docker port e-88747910-a986-4ed0-9b10-3af9 4444 | cut -d ':' -f 2 ); e88747910a9864ed09b103af9_5555=$(docker port e-88747910-a986-4ed0-9b10-3af9 5555 | cut -d ':' -f 2 ); etcdctl set /_esb/user/john/instance/e-88747910-a986-4ed0-9b10-3af9 \"{\\\"ID\\\":\\\"e-88747910-a986-4ed0-9b10-3af9\\\",\\\"Labels\\\":{\\\"owner\\\":\\\"john\\\"},\\\"Created\\\":\\\"Tuesday, 21-Oct-14 21:53:59 UTC\\\",\\\"State\\\":\\\"running\\\",\\\"IP\\\":\\\"$public_ipv4\\\",\\\"Configuration\\\":{\\\"CPU_SHARED\\\":\\\"\\\",\\\"ESB\\\":\\\"servicemix\\\",\\\"Features\\\":[\\\"camel-sql\\\"],\\\"ID\\\":\\\"i-124691ed-f613-4ea5-ab6b-05c7\\\",\\\"JBI_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"JBI_corePoolSize\\\":\\\"4\\\",\\\"JBI_keepAliveTime\\\":\\\"60000\\\",\\\"JBI_maximumPoolSize\\\":\\\"-1\\\",\\\"JBI_queueSize\\\":\\\"1024\\\",\\\"JBI_shutdownTimeout\\\":\\\"0\\\",\\\"JVM_MAX_MEM\\\":\\\"\\\",\\\"JVM_MAX_PERM_MEM\\\":\\\"\\\",\\\"JVM_MIN_MEM\\\":\\\"\\\",\\\"JVM_PERM_MEM\\\":\\\"\\\",\\\"MAX_MEM\\\":\\\"\\\",\\\"NMR_allowCoreThreadTimeOut\\\":\\\"true\\\",\\\"NMR_corePoolSize\\\":\\\"4\\\",\\\"NMR_keepAliveTime\\\":\\\"60000\\\",\\\"NMR_maximumPoolSize\\\":\\\"-1\\\",\\\"NMR_queueSize\\\":\\\"1024\\\",\\\"NMR_shutdownTimeout\\\":\\\"0\\\",\\\"Name\\\":\\\"\\\",\\\"Ports\\\":{\\\"endpoint1\\\":4444,\\\"endpoint2\\\":5555},\\\"SERVICEMIX_PASSWORD\\\":\\\"smx\\\",\\\"SERVICEMIX_USER\\\":\\\"smx\\\"},\\\"PortBindings\\\":{\\\"4444\\\":\\\"$e88747910a9864ed09b103af9_4444\\\",\\\"5555\\\":\\\"$e88747910a9864ed09b103af9_5555\\\",\\\"8101\\\":\\\"$e88747910a9864ed09b103af9_8101\\\",\\\"8181\\\":\\\"$e88747910a9864ed09b103af9_8181\\\"},\\\"UserAccess\\\":{\\\"Karaf Console\\\":\\\"$public_ipv4$e88747910a9864ed09b103af9_8101\\\",\\\"Web Console\\\":\\\"http://$public_ipv4:$e88747910a9864ed09b103af9_8181/system/console\\\"},\\\"Error\\\":\\\"\\\"}\" --ttl 90; /usr/bin/sleep 30; done"
    ExecStop=/usr/bin/echo stopped
    
    [X-Fleet]
    MachineOf=e-88747910-a986-4ed0-9b10-3af9.service
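    A common workaround (a sketch; the script path is illustrative) is to move the long command body into a script so that no single unit-file line approaches the limit:

    [Service]
    EnvironmentFile=/etc/environment
    # register-instance.sh contains the while/etcdctl loop from the ExecStart above
    ExecStart=/opt/bin/register-instance.sh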
    
  • Efficient resource utilisation, re-balancing mandate for Engine

    Efficient resource utilisation, re-balancing mandate for Engine

    I'm looking for a way to improve Engine scheduling so that its decisions can be based on machine metrics (available CPU, memory) rather than the number of running units. For the longer term, I would love to see the ability to plug in different scheduling heuristics (first globally, but maybe later per unit). As a first step, an MVP if you will: if each machine could report a 'score' on itself, the engine could just pick the machine with the highest score. This would be a huge improvement over today's scheduling. Even longer term, I'm looking to have a re-balancing feature, whereby the scheduler would have a mandate to re-assign units to other machines. That is probably a separate development thread, but it could build on the previous one I mentioned.

    So to give a couple of examples:

    • My cluster is composed of two single-core computers with 2 GB of memory, and one dual-core with 4 GB of memory. fleet would run twice as many units on the stronger computer.
    • My cluster is running at near capacity. I add a node to the cluster. fleet re-balances the load by moving some units around
    • Some of my units consume a lot of CPU and memory, and some consume very little. Fleet would schedule units based on their resource utilisation and free capacity.

    Note: even a partial implementation of this approach would bring huge advantages, such as being able to efficiently use machines of different capacities within the same cluster.

    This is related to #555, I think #555 is a required step to achieve what I have outlined above.

  • Fix error formatting based on best practices from Code Review Comments

  • Here $SSHTIMEOUT does not display the proper value

    [Unit]
    Description=SSH Per-Connection Server
    After=syslog.target

    [Service]
    EnvironmentFile=-/etc/default/dropbear
    #ExecStartPre=/ect/scripts/ssh_timeout.sh
    ExecStartPre=/bin/sh -c 'SSHTIMEOUT=$(/bin/cat /etc/atom/defaults/syscfg_baseline.db | /bin/grep ssh_timeout | /bin/cut -d "=" -f2)'
    ExecStart=/usr/sbin/dropbear -i -I $SSHTIMEOUT -r /var/tmp/dropbear_rsa_host_key -p 22 $DROPBEAR_EXTRA_ARGS
    ExecReload=/bin/kill -HUP $MAINPID
    StandardInput=socket
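    Each Exec* line runs in its own process, so a shell variable set in ExecStartPre is never visible to ExecStart. A sketch of one fix, doing the lookup and the exec in the same shell (the $$ escapes keep systemd from expanding the variables before the shell sees them):

    ExecStart=/bin/sh -c 'SSHTIMEOUT=$$(/bin/grep ssh_timeout /etc/atom/defaults/syscfg_baseline.db | /bin/cut -d "=" -f2); exec /usr/sbin/dropbear -i -I $$SSHTIMEOUT -r /var/tmp/dropbear_rsa_host_key -p 22 $DROPBEAR_EXTRA_ARGS'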

  • CoreOS cluster restarted all containers due to fleet or etcd errors

    Hello. We just saw a pretty severe issue on our production CoreOS setup. Details are:

    • 3 CoreOS nodes running in AWS EC2 us-east-1
    • m3.2xlarge instance types
    • CoreOS nodes - 2 are DISTRIB_RELEASE=1068.2.0 and 1 is at DISTRIB_RELEASE=1081.5.0
    • etcd version 0.4.9
    • we have auto-update disabled on CoreOS
    • Around 21:56 UTC on Jan 17 we saw all our containers go down and the logs seemed to suggest an issue with etcd

    Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
    Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
    Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
    Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
    Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
    Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
    Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
    Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded

    • We checked the CPU and disk IO for all 3 instances, there is NO indication of any CPU spike per AWS Cloudwatch
    • etcd config is as below:

    core@ip-10-26-33-251 ~ $ sudo systemctl cat etcd

    # /usr/lib64/systemd/system/etcd.service
    [Unit]
    Description=etcd
    Conflicts=etcd2.service

    [Service]
    User=etcd
    PermissionsStartOnly=true
    Environment=ETCD_DATA_DIR=/var/lib/etcd
    Environment=ETCD_NAME=%m
    ExecStart=/usr/bin/etcd
    Restart=always
    RestartSec=10s
    LimitNOFILE=40000

    # /run/systemd/system/etcd.service.d/10-oem.conf
    [Service]
    Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

    # /run/systemd/system/etcd.service.d/20-cloudinit.conf
    [Service]
    Environment="ETCD_ADDR=10.26.33.251:4001"
    Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
    Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
    Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
    Environment="ETCD_PEER_ADDR=10.26.33.251:7001"

    • Attached the fleet & etcd logs from all nodes

    etcd-10-26-31-100.txt etcd-10-26-32-94.txt etcd-10-26-33-251.txt fleet-10-26-31-100.txt fleet-10-26-32-94.txt fleet-10-26-33-251.txt

    • AWS status dashboard does not show any errors or issues on their end

    Appreciate if someone can take a look at the above and give us any pointers on what to look at and what we can do to mitigate this.

    I opened a ticket - https://github.com/coreos/etcd/issues/7177 - against etcd and was redirected here.

    Thx, Maulik

  • After reboots, timers sometimes broken due to missing service files

    I'm running CoreOS Stable, 1122.3.0 on Google Compute Engine. (Thus: fleet 0.11.7.)

    Sometimes, after a reboot, fleet-controlled timers try to start before their associated fleet-controlled services have been loaded, resulting in timer failures. I'd expect the fleet launcher to wait until all the parts of a timer are loaded before starting. (Or maybe just load everything on a rebooted node before starting anything.)

    A stripped log shows the sequence:

    -- Reboot --
    systemd[1]: Started fleet daemon.
    fleetd[1221]: INFO fleetd.go:64: Starting fleetd version 0.11.7
    fleetd[1221]: INFO manager.go:246: Writing systemd unit cd-pipeline-run.timer (118b)
    fleetd[1221]: INFO manager.go:182: Instructing systemd to reload units
    systemd[1]: cd-pipeline-run.timer: Refusing to start, unit to trigger not loaded.
    systemd[1]: Failed to start Run the Classifier Data Pipeline.
    fleetd[1221]: INFO manager.go:127: Triggered systemd unit cd-pipeline-run.timer start: job=1432
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=cd-pipeline-run.timer reason="unit scheduled here but not loaded"
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=cd-pipeline-run.timer reason="unit currently loaded but desired state is launched"
    fleetd[1221]: INFO manager.go:246: Writing systemd unit cd-pipeline-run.service (2267b)
    fleetd[1221]: INFO manager.go:182: Instructing systemd to reload units
    fleetd[1221]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=cd-pipeline-run.service reason="unit scheduled here but not loaded"
    

    (The more complete log is in a Gist, here.)

    I've yet to find the fleet option that tells it that a pair (or more) of unit files need to be handled together...
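    fleet's colocation option at least keeps the pair on the same machine (a sketch using the unit names from the log); start ordering on that machine would still come from the units' own After=/Requires= directives:

    # cd-pipeline-run.timer
    [X-Fleet]
    MachineOf=cd-pipeline-run.service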
