Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications.

HashiCorp Nomad logo

Nomad is a simple and flexible workload orchestrator to deploy and manage containers (docker, podman), non-containerized applications (executable, Java), and virtual machines (qemu) across on-prem and clouds at scale.

Nomad is supported on Linux, Windows, and macOS. A commercial version of Nomad, Nomad Enterprise, is also available.

Nomad provides several key features:

  • Deploy Containers and Legacy Applications: Nomad's flexibility as an orchestrator enables an organization to run containers, legacy, and batch applications together on the same infrastructure. Through pluggable task drivers, Nomad brings core orchestration benefits to legacy applications without needing to containerize them.

  • Simple & Reliable: Nomad runs as a single binary and is entirely self-contained, combining resource management and scheduling into a single system. Nomad does not require any external services for storage or coordination. Nomad automatically handles application, node, and driver failures. Nomad is distributed and resilient, using leader election and state replication to provide high availability in the event of failures.

  • Device Plugins & GPU Support: Nomad offers built-in support for GPU workloads such as machine learning (ML) and artificial intelligence (AI). Nomad uses device plugins to automatically detect and utilize resources from hardware devices such as GPUs, FPGAs, and TPUs.

  • Federation for Multi-Region, Multi-Cloud: Nomad was designed to support infrastructure at a global scale. Nomad supports federation out-of-the-box and can deploy applications across multiple regions and clouds.

  • Proven Scalability: Nomad is optimistically concurrent, which increases throughput and reduces latency for workloads. Nomad has been proven to scale to clusters of 10K+ nodes in real-world production environments.

  • HashiCorp Ecosystem: Nomad integrates seamlessly with Terraform, Consul, and Vault for provisioning, service discovery, and secrets management.

Quick Start

Testing

See Learn: Getting Started for instructions on setting up a local Nomad cluster for non-production use.

Optionally, find Terraform manifests for bringing up a development Nomad cluster on a public cloud in the terraform directory.

Production

See Learn: Nomad Reference Architecture for recommended practices and a reference architecture for production deployments.

Documentation

Full, comprehensive documentation is available on the Nomad website: https://www.nomadproject.io/docs

Guides are available on HashiCorp Learn.

Contributing

See the contributing directory for more developer documentation.

Owner
HashiCorp
Consistent workflows to provision, secure, connect, and run any infrastructure for any application.
Comments
  • Persistent data on nodes

    Nomad should have some way for tasks to acquire persistent storage on nodes. In a lot of cases, we might want to run our own HDFS or Ceph cluster on Nomad.

    That means things like HDFS datanodes need to be able to reserve persistent storage on the node they are launched on. If the whole cluster goes down, then once it's brought back up, the appropriate tasks should be launched on their original nodes (where possible), so that they can regain access to data they have previously written.
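
    For reference, the closest building block in current Nomad for this kind of node-local persistence is host volumes: the client advertises a directory on the node, and a group that claims it is only placed on nodes exposing that volume. A minimal sketch under that assumption (names, paths, and images are illustrative, not from this issue):

    # Client configuration: expose a node-local directory as a host volume.
    client {
      host_volume "hdfs-datanode" {
        path      = "/srv/hdfs"   # illustrative path on the node
        read_only = false
      }
    }

    # Jobspec: claiming the host volume pins the group to nodes that expose it.
    group "datanode" {
      volume "data" {
        type      = "host"
        source    = "hdfs-datanode"
        read_only = false
      }

      task "datanode" {
        driver = "docker"   # illustrative driver

        config {
          image = "apache/hadoop:3"   # illustrative image
        }

        volume_mount {
          volume      = "data"
          destination = "/hadoop/dfs/data"
        }
      }
    }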

  • Specify logging driver and options for docker driver

    Please correct me if I am wrong, but I couldn't find in the documentation how to pass the log-driver and log-opt arguments to containers when running them as Nomad tasks, e.g.: --log-driver=awslogs --log-opt awslogs-region=us-east-1 --log-opt awslogs-group=myLogGroup --log-opt awslogs-stream=myLogStream

    I know I can configure the docker daemon with these arguments, but then I can't specify different log streams for each container. If this is currently not possible, I would like to request it as a feature. Thank you!
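
    For what it's worth, the Docker task driver exposes per-container log driver settings through a logging block in the task's config, which maps onto --log-driver/--log-opt. A minimal sketch of the awslogs example above (the image name is illustrative; the option keys simply mirror Docker's own option names):

    task "app" {
      driver = "docker"

      config {
        image = "my-app:latest"   # illustrative image

        # Per-container log driver and options, instead of setting them daemon-wide.
        logging {
          type = "awslogs"
          config {
            awslogs-region = "us-east-1"
            awslogs-group  = "myLogGroup"
            awslogs-stream = "myLogStream"
          }
        }
      }
    }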

  • Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter

    Nomad version

    Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

    Operating system and Environment details

    Ubuntu 20.04 LTS

    Issue

    Cannot re-plan jobs due to CSI volumes being claimed. I have seen many variations of this issue, and I don't know how to debug it. I use the ceph-csi plugin to deploy a system job on my two Nomad nodes. This results in two controllers and two ceph-csi nodes. I then create a few volumes using the nomad volume create command. I then create a job with three tasks that use three volumes. Sometimes, after a while, the job may fail and I stop it. After that, when I try to re-plan the exact same job, I get that error.

    What confuses me is the warning. It differs every time I run job plan. First I saw

    - WARNING: Failed to place all allocations.
      Task Group "zookeeper1" (failed to place 1 allocation):
        * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 2 nodes excluded by filter
    
      Task Group "zookeeper2" (failed to place 1 allocation):
        * Constraint "CSI volume zookeeper2-data has exhausted its available writer claims": 2 nodes excluded by filter
    

    Then, running job plan again a few seconds later, I got

    - WARNING: Failed to place all allocations.
      Task Group "zookeeper1" (failed to place 1 allocation):
        * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 2 nodes excluded by filter
    
      Task Group "zookeeper2" (failed to place 1 allocation):
        * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter
    

    Then again,

    - WARNING: Failed to place all allocations.
      Task Group "zookeeper1" (failed to place 1 allocation):
        * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 1 nodes excluded by filter
        * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 1 nodes excluded by filter
    
      Task Group "zookeeper2" (failed to place 1 allocation):
        * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter
    

    I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.

    This is the output of nomad volume status.

    Container Storage Interface
    ID                  Name                Plugin ID  Schedulable  Access Mode
    zookeeper1-data     zookeeper1-data     ceph-csi   true         single-node-writer
    zookeeper1-datalog  zookeeper1-datalog  ceph-csi   true         single-node-writer
    zookeeper2-data     zookeeper2-data     ceph-csi   true         single-node-writer
    zookeeper2-datalog  zookeeper2-datalog  ceph-csi   true         single-node-writer
    zookeeper3-data     zookeeper3-data     ceph-csi   true         <none>
    zookeeper3-datalog  zookeeper3-datalog  ceph-csi   true         <none>
    

    It says that they are schedulable. This is the output of nomad volume status zookeeper1-datalog:

    ID                   = zookeeper1-datalog
    Name                 = zookeeper1-datalog
    External ID          = 0001-0024-72f28a72-0434-4045-be3a-b5165287253f-0000000000000003-72ec315b-e9f5-11eb-8af7-0242ac110002
    Plugin ID            = ceph-csi
    Provider             = cephfs.nomad.example.com
    Version              = v3.3.1
    Schedulable          = true
    Controllers Healthy  = 2
    Controllers Expected = 2
    Nodes Healthy        = 2
    Nodes Expected       = 2
    Access Mode          = single-node-writer
    Attachment Mode      = file-system
    Mount Options        = <none>
    Namespace            = default
    
    Allocations
    No allocations placed
    

    It says that there are no allocations placed.

    Reproduction steps

    This is unfortunately flaky, but it most likely happens when a job fails, is stopped, and is then re-planned. The problem persists even after I purge the job with nomad job stop -purge. Running nomad system gc, nomad system reconcile summary, or restarting Nomad does not help either.

    Expected Result

    I should be able to claim the volume again without having to detach, or deregister -force and register again. I created the volumes using nomad volume create, so those volumes all have generated external IDs. There are 6 volumes and 2 nodes; I don't want to type detach 12 times every time this happens (and it happens frequently).

    Actual Result

    See error logs above.

    Job file (if appropriate)

    I have three groups (zookeeper1, zookeeper2, zookeeper3), each with a volume stanza like this (each group uses its own volumes; this one is for zookeeper2):

        volume "data" {
          type = "csi"
          read_only = false
          source = "zookeeper2-data"
          attachment_mode = "file-system"
          access_mode     = "single-node-writer"
    
          mount_options {
            fs_type     = "ext4"
            mount_flags = ["noatime"]
          }
        }
        volume "datalog" {
          type = "csi"
          read_only = false
          source = "zookeeper2-datalog"
          attachment_mode = "file-system"
          access_mode     = "single-node-writer"
    
          mount_options {
            fs_type     = "ext4"
            mount_flags = ["noatime"]
          }
        }
    

    All groups have count = 1.

  • Ability to select private/public IP for specific task/port

    Extracted from #209.

    We use Nomad with the Docker driver to operate a cluster of machines. Some of them have both public and private interfaces. These two-NIC machines run internal services that need to listen only on a private interface, as well as public services, which should listen on a public interface.

    So we need a way of specifying whether a task should listen on a public or a private IP.

    I think this can be generalized to the ability to specify a subnet mask for a specific port:

    resources {
        network {
            mbits = 100
            port "http" {
                # Listen on all interfaces that match this mask; the task will not be
                # started on a machine that has no NICs with IPs in this subnet.
                netmask = "10.10.0.1/16"
            }
            port "internal-bus" {
                # The same with static port number
                static = 4050
                netmask = "127.0.0.1/32"
            }
        }
    }
    

    This would be the most flexible solution that would cover most, if not all, cases. For example, to listen on all interfaces, as requested in #209, you would just pass 0.0.0.0/0 netmask that matches all possible IPs. Maybe it makes sense to make this netmask the default, i.e. bind to all interfaces if no netmask is specified for a port.

    I think this is a really important feature, because its lack prevents people from running Nomad in VPC (virtual private cloud) environments, like Amazon VPC, Google Cloud Platform with subnetworks, OVH Dedicated Cloud, and many others, as well as in any other environment where some machines are connected to more than one network.


    Another solution is to allow specifying interface name(s), like eth0, but that wouldn't work in our case because:

    1. different machines may have different order and, thus, different names of network interfaces;
    2. to make things worse, some machines may have multiple IPs assigned to the same interface, e.g. see DigitalOcean's anchor ip which is enabled by default on each new machine.

    Example for point 1: assume that I want to start some task on all machines in the cluster, and that I want this task to listen only on the private interface to prevent exposing it to the outside world. The Consul agent is a nice example of such a service.

    Now, some machines in the cluster are connected to both public and private networks, and have two NICs:

    • eth0 corresponds to public network, say, 162.243.197.49/24;
    • eth1 corresponds to my private network 10.10.0.1/24.

    But the majority of machines are only connected to a private net, and have only one NIC:

    • eth0 corresponds to the private net 10.10.0.1/24.

    This is a fairly typical setup in VPC environments.

    You can see that it would be impossible to constrain my service to only the private subnet by specifying an interface name, because eth0 corresponds to different networks on different machines, and eth1 is missing entirely on some machines.
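
    For reference, a later Nomad feature that maps onto this request is host networks: the client labels an interface or CIDR, and a port can then be pinned to that label instead of an interface name. A minimal sketch, assuming the 10.10.0.0/16 private range from the example above:

    # Client configuration: label the address range that counts as "private" on this node.
    client {
      host_network "private" {
        cidr = "10.10.0.0/16"
      }
    }

    # Jobspec: bind this port only on addresses that belong to the "private" host network.
    network {
      port "internal-bus" {
        static       = 4050
        host_network = "private"
      }
    }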

  • Provide for dependencies between tasks in a group

    Tasks in a group sometimes need to be ordered to start up correctly.

    For example, to support the Ambassador pattern, proxy containers (P[n]) used for outbound request routing by a dependent application may be started only after the dependent application (A) is started. This is because Docker needs to know the name of A to configure shared-container networking when launching P[n].

    In the first approximation of the solution, ordering can be simple, e.g., by having the task list in a group be an array.
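
    For reference, a sketch of how this ordering can be expressed with task lifecycle hooks: a poststart sidecar only starts once the main task is running, which fits the Ambassador-style proxy described above (task and image names are illustrative):

    group "app" {
      task "app" {
        driver = "docker"

        config {
          image = "my-app:latest"     # illustrative image
        }
      }

      # Started only after "app" is running, and kept running alongside it.
      task "proxy" {
        driver = "docker"

        config {
          image = "my-proxy:latest"   # illustrative image
        }

        lifecycle {
          hook    = "poststart"
          sidecar = true
        }
      }
    }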

  • HTTP UI like consul-ui

    Much like consul-ui, it would be nice to have a nomad-ui project to visually access and modify jobs etc.

    Until Nomad has its own native UI, jippi/hashi-ui provides a Nomad and Consul UI.

  • Tens of thousands of open file descriptors to a single nomad alloc logs directory

    Nomad version

    Nomad v0.4.1

    Operating system and Environment details

    Linux ip-10-201-5-129 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

    Issue

    Nomad has tens of thousands of open file descriptors to an alloc log directory.

    nomad 2143 root *530r DIR 202,80 4096 8454154 /var/lib/ssi/nomad/alloc/14e62a40-8598-2fed-405e-ca237bc940c6/alloc/logs

    Something similar to that is repeated ~60000 times; lsof -p 2143 | wc -l returns ~60000.

    I stopped the alloc but the descriptors are still there.

    In addition, the nomad process is approaching 55 GB of memory used.

  • Unable to get nomad config/get template function_denylist option

    @notnoop @tgross hi guys! I updated to 1.2.4 but got another issue with Consul templating:

    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:53.234Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon path=/data/nomad/alloc/a7c04d65-2f29-c778-c34c-2513d29f25f4/alloc/logs/.worker-mpi-resolver.stdout.fifo timestamp=2022-01-25T08:44:53.234Z
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad[4342]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon path=/data/nomad/alloc/a7c04d65-2f29-c778-c34c-2513d29f25f4/alloc/logs/.worker-mpi-resolver.stdout.fifo timestamp=2022-01-25T08:44:53.234Z
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:53.234Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon path=/data/nomad/alloc/a7c04d65-2f29-c778-c34c-2513d29f25f4/alloc/logs/.worker-mpi-resolver.stderr.fifo timestamp=2022-01-25T08:44:53.234Z
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad[4342]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon path=/data/nomad/alloc/a7c04d65-2f29-c778-c34c-2513d29f25f4/alloc/logs/.worker-mpi-resolver.stderr.fifo timestamp=2022-01-25T08:44:53.234Z
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:53.965Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad[4342]: agent: (runner) creating new runner (dry: false, once: false)
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:53.966Z [INFO]  agent: (runner) creating watcher
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:53.966Z [INFO]  agent: (runner) starting
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad[4342]: agent: (runner) creating watcher
    Jan 25 08:44:53 microworker03.te01-shr.nl3 nomad[4342]: agent: (runner) starting
    Jan 25 08:44:54 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:54.307Z [INFO]  client.gc: marking allocation for GC: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4
    Jan 25 08:44:54 microworker03.te01-shr.nl3 nomad[4342]: client.gc: marking allocation for GC: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4
    Jan 25 08:44:58 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:58.309Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon timestamp=2022-01-25T08:44:58.309Z
    Jan 25 08:44:58 microworker03.te01-shr.nl3 nomad[4342]: client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon timestamp=2022-01-25T08:44:58.309Z
    Jan 25 08:44:58 microworker03.te01-shr.nl3 nomad: 2022-01-25T08:44:58.309Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon timestamp=2022-01-25T08:44:58.309Z
    Jan 25 08:44:58 microworker03.te01-shr.nl3 nomad[4342]: client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=a7c04d65-2f29-c778-c34c-2513d29f25f4 task=worker-mpi-resolver @module=logmon timestamp=2022-01-25T08:44:58.309Z
    

    Nomad side:

    Template failed: /data/nomad/alloc/3a20b272-9965-8c1f-6ab0-c841e303b623/worker-mpi-resolver/local/platformConfig/nl3.tmpl: execute: template: :1:36: executing "" at <plugin "/data/tools/consul.php">: error calling plugin: function is disabled

    Originally posted by @bubejur in https://github.com/hashicorp/nomad/issues/11547#issuecomment-1020940729
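
    The "function is disabled" part of the error comes from the client's template function denylist, which blocks the plugin function by default. A sketch of the knob involved, assuming you accept the security trade-off of re-enabling it on your clients:

    client {
      template {
        # The default denylist disables "plugin"; emptying it re-enables the function.
        # Only do this if you trust every template submitted to the cluster.
        function_denylist = []
      }
    }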

  • high memory usage in logmon

    I have a cluster of 20 nodes, all running "raw-exec" tasks in PHP.

    At random intervals after a while I get a lot of OOMs. I find the server at 100% swap usage, looking like this:

    Screenshot 2021-01-20 at 17 45 21

    If I restart the nomad agent, it all goes back to normal for a while.

    I also get this in the nomad log: "2021-01-20T17:52:28.522+0200 [INFO] client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (89) is over the limit (50)" <--- that message is extremely ambiguous, as everything runs normally and nomad was just restarted.
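
    For context, the "over the limit (50)" part of that message refers to the client's allocation GC threshold; a minimal sketch of the client settings involved, with illustrative values (the default for gc_max_allocs is 50, matching the log line above):

    client {
      # Number of allocations a client keeps before trying to GC terminal ones.
      gc_max_allocs = 150

      # How often the client garbage-collects terminal allocations.
      gc_interval = "1m"
    }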

  • Successfully completed batch job is re-run with new allocation.

    Nomad version

    Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)

    Operating system and Environment details

    3.10.0-327.36.3.el7.x86_64

    Issue

    A batch job executed and completed successfully; then, several hours later, when the allocation was garbage collected, it was re-run.

    Reproduction steps

    Not sure. Seems to be happening frequently on our cluster though.

    Nomad logs

        2018/05/14 23:30:50.541581 [DEBUG] worker: dequeued evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
        2018/05/14 23:30:50.541765 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
        2018/05/14 23:30:50.546101 [DEBUG] worker: submitted plan at index 355103 for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
        2018/05/14 23:30:50.546140 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
        2018/05/14 23:30:50.547618 [DEBUG] worker: updated evaluation <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
        2018/05/14 23:30:50.547683 [DEBUG] worker: ack for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
    
        2018/05/14 23:30:52.074437 [DEBUG] client: starting task runners for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
        2018/05/14 23:30:52.074769 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '5d0016ac-dd71-6626-f929-6398e80ef28e')
    2018-05-14T23:30:52.085-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/5d0016ac-dd71-6626-f929-6398e80ef28e/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
        2018/05/14 23:34:32.288406 [INFO] client: task "REDACTED-task" for alloc "5d0016ac-dd71-6626-f929-6398e80ef28e" completed successfully
        2018/05/14 23:34:32.288438 [INFO] client: Not restarting task: REDACTED-task for alloc: 5d0016ac-dd71-6626-f929-6398e80ef28e
        2018/05/14 23:34:32.289213 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
        2018/05/15 01:39:13.888635 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
        2018/05/15 01:39:15.389175 [WARN] client: failed to broadcast update to allocation "5d0016ac-dd71-6626-f929-6398e80ef28e"
        2018/05/15 01:39:15.389401 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
        2018/05/15 01:39:15.390656 [DEBUG] client: terminating runner for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
        2018/05/15 01:39:15.390714 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
        2018/05/15 04:25:10.541590 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
        2018/05/15 04:25:10.541626 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
        2018/05/15 05:46:37.119467 [DEBUG] worker: dequeued evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
        2018/05/15 05:46:37.139904 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
        2018/05/15 05:46:37.169051 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
        2018/05/15 05:46:37.169149 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to forced collection
        2018/05/15 05:46:37.169194 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
        2018/05/15 05:46:37.177470 [DEBUG] worker: submitted plan at index 373181 for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
        2018/05/15 05:46:37.177516 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
        2018/05/15 05:46:37.179391 [DEBUG] worker: updated evaluation <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
        2018/05/15 05:46:37.179783 [DEBUG] worker: ack for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
        2018/05/15 05:46:40.218701 [DEBUG] client: starting task runners for alloc '928b0562-b7ed-a3c7-d989-89519edadee9'
        2018/05/15 05:46:40.218982 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '928b0562-b7ed-a3c7-d989-89519edadee9')
    2018-05-15T05:46:40.230-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/928b0562-b7ed-a3c7-d989-89519edadee9/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
        2018/05/15 11:50:17.836313 [INFO] client: task "REDACTED-task" for alloc "928b0562-b7ed-a3c7-d989-89519edadee9" completed successfully
        2018/05/15 11:50:17.836336 [INFO] client: Not restarting task: REDACTED-task for alloc: 928b0562-b7ed-a3c7-d989-89519edadee9
        2018/05/15 11:50:17.836698 [INFO] client.gc: marking allocation 928b0562-b7ed-a3c7-d989-89519edadee9 for GC
    

    Job file (if appropriate)

    {
        "Job": {
            "AllAtOnce": false,
            "Constraints": [
                {
                    "LTarget": "${node.unique.id}",
                    "Operand": "=",
                    "RTarget": "52c7e5be-a5a0-3a34-1051-5209a91a0197"
                }
            ],
            "CreateIndex": 393646,
            "Datacenters": [
                "dc1"
            ],
            "ID": "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2",
            "JobModifyIndex": 393646,
            "Meta": null,
            "Migrate": null,
            "ModifyIndex": 393673,
            "Name": "REDACTED",
            "Namespace": "default",
            "ParameterizedJob": null,
            "ParentID": "REDACTED/dispatch-1526033570-3cdd72d9",
            "Payload": null,
            "Periodic": null,
            "Priority": 50,
            "Region": "global",
            "Reschedule": null,
            "Stable": false,
            "Status": "dead",
            "StatusDescription": "",
            "Stop": false,
            "SubmitTime": 1526403442162340993,
            "TaskGroups": [
                {
                    "Constraints": [
                        {
                            "LTarget": "${attr.os.signals}",
                            "Operand": "set_contains",
                            "RTarget": "SIGTERM"
                        }
                    ],
                    "Count": 1,
                    "EphemeralDisk": {
                        "Migrate": false,
                        "SizeMB": 300,
                        "Sticky": false
                    },
                    "Meta": null,
                    "Migrate": null,
                    "Name": "REDACTED",
                    "ReschedulePolicy": {
                        "Attempts": 1,
                        "Delay": 5000000000,
                        "DelayFunction": "constant",
                        "Interval": 86400000000000,
                        "MaxDelay": 0,
                        "Unlimited": false
                    },
                    "RestartPolicy": {
                        "Attempts": 1,
                        "Delay": 15000000000,
                        "Interval": 86400000000000,
                        "Mode": "fail"
                    },
                    "Tasks": [
                        {
                            "Artifacts": null,
                            "Config": {
                                "command": "REDACTED",
                                "args": [REDACTED]
                            },
                            "Constraints": null,
                            "DispatchPayload": null,
                            "Driver": "raw_exec",
                            "Env": {REDACTED},
                            "KillSignal": "SIGTERM",
                            "KillTimeout": 5000000000,
                            "Leader": false,
                            "LogConfig": {
                                "MaxFileSizeMB": 10,
                                "MaxFiles": 10
                            },
                            "Meta": null,
                            "Name": "REDACTED",
                            "Resources": {
                                "CPU": 100,
                                "DiskMB": 0,
                                "IOPS": 0,
                                "MemoryMB": 256,
                                "Networks": null
                            },
                            "Services": null,
                            "ShutdownDelay": 0,
                            "Templates": null,
                            "User": "",
                            "Vault": null
                        }
                    ],
                    "Update": null
                }
            ],
            "Type": "batch",
            "Update": {
                "AutoRevert": false,
                "Canary": 0,
                "HealthCheck": "",
                "HealthyDeadline": 0,
                "MaxParallel": 0,
                "MinHealthyTime": 0,
                "Stagger": 0
            },
            "VaultToken": "",
            "Version": 0
        }
    }
    

    What I can tell you for sure is that the allocation ran to completion and exited successfully.

    We're going to try turning off the reschedule and restart policies to see if that has any effect since we're taking care of re-running these on any sort of job failure anyway.
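
    For illustration, a sketch of what turning those policies off could look like at the group level, assuming the job handles its own retries (values mirror the intent described above, not a recommended default):

    group "REDACTED" {
      # Never restart the task in place on failure.
      restart {
        attempts = 0
        mode     = "fail"
      }

      # Never reschedule a failed allocation onto another node.
      reschedule {
        attempts  = 0
        unlimited = false
      }
    }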

  • failed to submit plan for evaluation: ... no such key \"\" in keyring error after moving cluster to 1.4.1

    Nomad version

    Nomad v1.4.1 (2aa7e66bdb526e25f59883952d74dad7ea9a014e)

    Operating system and Environment details

    Ubuntu 22.04, Nomad 1.4.1

    Issue

    After moving the Nomad servers and clients to v1.4.1, I noticed that sometimes (unfortunately not always), after cycling the Nomad server ASGs and Nomad client ASGs with new AMIs, jobs scheduled on the workers can't be allocated. So, to be precise:

    1. Pipeline creates new Nomad AMIs via Packer
    2. Pipeline terraforms Nomad server ASG with server config
    3. Pipeline terraforms client ASG or dedicated instances with updated AMI
    4. Lost jobs on the workers (like, for instance, the Traefik ingress job) can't be allocated

    This literally never happened before 1.4.X

    Client output looks like this:

    nomad eval list

    ID        Priority  Triggered By      Job ID                Namespace  Node ID  Status   Placement Failures
    427e9905  50        failed-follow-up  plugin-aws-ebs-nodes  default             pending  false
    35f4fdfb  50        failed-follow-up  plugin-aws-efs-nodes  default             pending  false
    46152dcd  50        failed-follow-up  spot-drainer          default             pending  false
    71e3e58a  50        failed-follow-up  plugin-aws-ebs-nodes  default             pending  false
    e86177a6  50        failed-follow-up  plugin-aws-efs-nodes  default             pending  false
    2289ba5f  50        failed-follow-up  spot-drainer          default             pending  false
    da3fdad6  50        failed-follow-up  plugin-aws-ebs-nodes  default             pending  false
    b445b976  50        failed-follow-up  plugin-aws-efs-nodes  default             pending  false
    48a6771e  50        failed-follow-up  ingress               default             pending  false

    Reproduction steps

    Unclear at this point. I seem to be able to somewhat force the issue when I cycle the Nomad server ASG with updated AMIs.

    Expected Result

    Client work that was lost should be rescheduled once the Nomad client comes up and reports readiness.

    Actual Result

    Lost jobs that can't be allocated on workers with an updated AMI.

    nomad status

    ID                         Type     Priority  Status   Submit Date
    auth-service               service  50        pending  2022-10-09T11:32:57+02:00
    ingress                    service  50        pending  2022-10-17T14:57:26+02:00
    plugin-aws-ebs-controller  service  50        running  2022-10-09T14:48:11+02:00
    plugin-aws-ebs-nodes       system   50        running  2022-10-09T14:48:11+02:00
    plugin-aws-efs-nodes       system   50        running  2022-10-09T11:37:04+02:00
    prometheus                 service  50        pending  2022-10-18T21:19:24+02:00
    spot-drainer               system   50        running  2022-10-11T18:04:49+02:00

    Job file (if appropriate)

    variable "stage" {
      type        = string
      description = "The stage for this jobfile."
    }
    
    variable "domain_suffix" {
      type        = string
      description = "The HDI stage specific domain suffix."
    }
    
    variable "acme_route" {
      type = string
    }
    
    variables {
      step_cli_version = "0.22.0"
      traefik_version  = "2.9.1"
    }
    
    job "ingress" {
    
      datacenters = [join("-", ["pd0011", var.stage])]
    
      type = "service"
    
      group "ingress" {
    
        constraint {
          attribute = meta.instance_type
          value     = "ingress"
        }
    
        count = 1
    
        service {
          name = "traefik"
          tags = [
            "traefik.enable=true",
    
            "traefik.http.routers.api.rule=Host(`ingress.dsp.${var.domain_suffix}`)",
            "traefik.http.routers.api.entrypoints=secure",
            "traefik.http.routers.api.service=api@internal",
            "traefik.http.routers.api.tls.certresolver=hdi_acme_resolver",
            "traefik.http.routers.api.tls.options=tls13@file",
            "traefik.http.routers.api.middlewares=dspDefaultPlusAdmin@file",
    
            "traefik.http.routers.ping.rule=Host(`ingress.dsp.${var.domain_suffix}`) && Path(`/ping`)",
            "traefik.http.routers.ping.entrypoints=secure",
            "traefik.http.routers.ping.service=ping@internal",
            "traefik.http.routers.ping.tls.certresolver=hdi_acme_resolver",
            "traefik.http.routers.ping.tls.options=tls13@file",
            "traefik.http.routers.ping.middlewares=dspDefault@file"
          ]
    
          port = "https"
    
          check {
            name     = "Traefik Ping Endpoint"
            type     = "http"
            protocol = "http"
            port     = "http"
            path     = "/ping"
            interval = "10s"
            timeout  = "2s"
          }
        }
    
        network {
    
          port "http" {
            static = 80
            to     = 80
          }
          port "https" {
            static = 443
            to     = 443
          }
        }
    
        ephemeral_disk {
          size    = "300"
          sticky  = true
          migrate = true
        }
    
        task "generate_consul_cert" {
    <snip>
        }
    
        task "generate_nomad_cert" {
    <snip>
        }
    
    
        task "traefik" {
    
          driver = "docker"
    
          env {
            LEGO_CA_CERTIFICATES = join(":", ["${NOMAD_SECRETS_DIR}/cacert.pem", "${NOMAD_SECRETS_DIR}/root_ca_${var.stage}.crt"])
            # LEGO_CA_SYSTEM_CERT_POOL = true
          }
    
          config {
            image = "traefik:${var.traefik_version}"
            volumes = [
              # Use absolute paths to mount arbitrary paths on the host
              "local/:/etc/traefik/",
              "/etc/timezone:/etc/timezone:ro",
              "/etc/localtime:/etc/localtime:ro",
            ]
            network_mode = "host"
            ports        = ["http", "https"]
          }
    
          resources {
            cpu    = 800
            memory = 128
          }
          # Controls the timeout between signalling a task it will be killed
          # and killing the task. If not set a default is used.
          kill_timeout = "60s"
    
          template {
            data        = <<EOH
    <snip>
        }
      }
    }
    
    

    Nomad Server logs (if appropriate)

    Oct 20 15:00:30 uat-nomad-95I nomad[485]:     2022-10-20T15:00:30.571+0200 [ERROR] worker: error invoking scheduler: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab error="failed to process evaluation: rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
    Oct 20 15:00:42 uat-nomad-95I nomad[485]:     2022-10-20T15:00:42.948+0200 [ERROR] worker: failed to submit plan for evaluation: worker_id=c4d91fc3-5e23-dbec-a85d-8fc830f375ab eval_id=827f0dfe-0584-b44a-92e2-9a92ab649c48 error="rpc error: no such key \"7d11bdf6-26f0-c4fa-5c04-b73b0f46eedb\" in keyring"
    

    Nomad Client logs (if appropriate)

    Oct 20 11:55:00 uat-worker-wZz nomad[464]:              Log Level: INFO
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:                 Region: europe (DC: pd0011-uat)
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:                 Server: false
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:                Version: 1.4.1
    Oct 20 11:55:00 uat-worker-wZz nomad[464]: ==> Nomad agent started! Log data will stream in below:
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.798+0200 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.817+0200 [INFO]  client: using state directory: state_dir=/opt/hsy/nomad/data/client
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.826+0200 [INFO]  client: using alloc directory: alloc_dir=/opt/hsy/nomad/data/alloc
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.826+0200 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.831+0200 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.852+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.856+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.870+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.897+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.900+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.900+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=device
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:54:58.906+0200 [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"https://127.0.0.1:8501/v1/catalog/datacenters\": dial tcp 127.0.0.1:8501: connect: connection refused"
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.437+0200 [INFO]  client: started client: node_id=5f21ebef-e0a9-8bd2-775a-61b3e32cac6e
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.437+0200 [WARN]  agent: not registering Nomad HTTPS Health Check because verify_https_client enabled
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.438+0200 [WARN]  client.server_mgr: no servers available
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.439+0200 [WARN]  client.server_mgr: no servers available
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.453+0200 [INFO]  client.consul: discovered following servers: servers=[10.194.73.146:4647, 10.194.74.253:4647, 10.194.75.103:4647]
    Oct 20 11:55:00 uat-worker-wZz nomad[464]:     2022-10-20T11:55:00.501+0200 [INFO]  client: node registration complete
    Oct 20 11:55:06 uat-worker-wZz nomad[464]:     2022-10-20T11:55:06.856+0200 [INFO]  client: node registration complete
    Oct 20 11:55:14 uat-worker-wZz nomad[464]:     2022-10-20T11:55:14.893+0200 [INFO]  client.fingerprint_mgr.consul: consul agent is available
    Oct 20 11:55:21 uat-worker-wZz nomad[464]:     2022-10-20T11:55:21.417+0200 [INFO]  client: node registration complete
    
  • docs: clarify shutdown_delay jobspec param and service behaviour.

    Clarifies that the task and group level shutdown_delay parameters do not influence each other and that they apply to both Nomad and Consul service registrations.

    The change also clarifies that service blocks apply to both Nomad and Consul service registrations.
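
    For illustration, a short sketch of the two parameters side by side (names and values are illustrative, not from the PR):

    group "web" {
      # Wait after deregistering the group's services before stopping its tasks.
      shutdown_delay = "10s"

      task "app" {
        driver = "docker"

        config {
          image = "my-app:latest"   # illustrative image
        }

        # Wait after deregistering this task's services before sending the kill
        # signal; independent of the group-level value above.
        shutdown_delay = "5s"
      }
    }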

    Closes #15602

  • build(deps): bump json5 from 1.0.1 to 1.0.2 in /website

    Bumps json5 from 1.0.1 to 1.0.2.

    Release notes

    Sourced from json5's releases.

    v1.0.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295). This has been backported to v1. (#298)
    Changelog

    Sourced from json5's changelog.

    Unreleased [code, diff]

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1 [code, diff]

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0 [code, diff]

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2 [code, diff]

    ... (truncated)

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the Security Alerts page.
  • Drivers: Make InternalCapabilities.DisableLogCollection Public

    Proposal

    The current set of internal driver capabilities has been relatively stable since the introduction of the plugin interface, and some of them would greatly benefit third-party users of Nomad and custom task drivers. This proposal suggests moving the DisableLogCollection feature to the public capabilities API.

    While not perfect in isolation (Nomad offers no way to disable log collection for individual tasks, which means when scheduling to a driver with log collection disabled, the task still has an associated set of resources allocated for log storage), this goes a long way towards reducing overhead in a range of cases.

    Use-cases

    There are many use-cases where a driver author might want to give operators the ability to save on the overhead of deploying an unnecessary logmon for every task. Similarly to the Nomad Docker Driver's usage, the most obvious is where a runtime offers the ability to manage logs externally to Nomad.

    This is particularly important when building high-density runtimes (and task drivers) - where no-op logmon overhead can quickly become an issue when deploying X000 allocations/node and logging is handled externally.
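
    For comparison, the Docker driver already surfaces this internally gated capability through its plugin configuration; a minimal sketch of that agent-side setting, as an illustration of the shape third-party drivers would like to offer:

    plugin "docker" {
      config {
        # Skip launching a logmon process per task; logs are expected to be
        # handled outside Nomad by the runtime's own logging pipeline.
        disable_log_collection = true
      }
    }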

    Attempted Solutions

    We currently deploy a fork of Nomad that has a slightly jankier approach 😅.

  • support realtime signals (SIGRTMIN/SIGRTMAX)

    Problem

    The Hashicorp Nomad client discovers Unix OS signals via consul-template/blob/main/signals/signals.go (Consul Template's Unix signals), and it is missing the SIGRTMIN/SIGRTMAX signals as of now.

    Dependency

    Hashicorp Nomad Client ---> consul-template

    Proposal

    If someone has knowledge about OS signals and Golang, please add this support. I have created an issue in the base library to auto-discover these signals: https://github.com/hashicorp/consul-template/issues/1691

    Use-cases

    In my use-case I am running Apache Impala, which uses SIGRTMIN as its graceful shutdown signal. I can't send this signal via a Hashicorp Nomad job since it is not discovered by the client, and the Hashicorp client doesn't implement any specific logic to find supported signals.
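
    Concretely, the jobspec change being asked for is just the existing kill_signal parameter accepting a realtime signal name; a sketch of what that would look like (driver, path, and timeout are illustrative, and the SIGRTMIN value is exactly what is currently rejected):

    task "impalad" {
      driver = "exec"   # illustrative driver

      config {
        command = "/opt/impala/bin/impalad"   # illustrative path
      }

      # What this issue asks for: realtime signals accepted here.
      kill_signal  = "SIGRTMIN"
      kill_timeout = "120s"
    }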

    Attempted Solutions

    The current workaround is to wrap Apache Impala in a bash script and handle/translate the kill signal from Nomad there.

  • Nomad duplicates logs to stdout when `enable_syslog` is true

    Nomad version

    Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)
    

    Operating system and Environment details

    $ lsb_release  -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 22.04.1 LTS
    Release:        22.04
    Codename:       jammy
    

    Issue

    When enable_syslog is true, Nomad logs both to syslog and to stdout.
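
    For context, a minimal sketch of the agent settings involved (facility and level are illustrative):

    log_level       = "INFO"
    enable_syslog   = true
    syslog_facility = "LOCAL0"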

    Reproduction steps

    journalctl shows this, but the log lines aren't always interspersed one by one; they rather appear in blocks. It's easy to check by filtering on _TRANSPORT.

    Expected Result

    $ sudo journalctl --lines=5 --unit=nomad.service _TRANSPORT=syslog
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) starting
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (clients) disabling vault SSL verification
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) creating watcher
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) starting
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) starting
    $ sudo journalctl --lines=5 --unit=nomad.service _TRANSPORT=stdout
    $
    

    Actual Result

    $ sudo journalctl --lines=5 --unit=nomad.service _TRANSPORT=syslog
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) starting
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (clients) disabling vault SSL verification
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) creating watcher
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) starting
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:  agent: (runner) starting
    $ sudo journalctl --lines=5 --unit=nomad.service _TRANSPORT=stdout
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:     2023-01-03T16:07:08.567+0100 [INFO]  agent: (runner) starting
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:     2023-01-03T16:07:08.567+0100 [WARN]  agent: (clients) disabling vault SSL verification
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:     2023-01-03T16:07:08.568+0100 [INFO]  agent: (runner) creating watcher
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:     2023-01-03T16:07:08.568+0100 [INFO]  agent: (runner) starting
    Jan 03 16:07:08 nomad-client-camel nomad[11714]:     2023-01-03T16:07:08.568+0100 [INFO]  agent: (runner) starting
    $
    

    Job file (if appropriate)

    Not applicable.

    Nomad Server logs (if appropriate)

    Not applicable.

    Nomad Client logs (if appropriate)

    Not applicable.

  • Running task using docker driver on RHEL 9.1 using podman-docker fails with cgroupv2 related error

    Nomad version

    Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)

    Operating system and Environment details

    Freshly installed VirtualBox VM using Alma Linux 9.1, following the tutorial:

    # dnf install docker # actually installs podman / podman-docker
    # dnf config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
    # dnf install nomad
    # nomad agent -dev -bind 0.0.0.0 -log-level INFO
    

    In another shell:

    # nomad job init -short
    # nomad run example.job
    

    Issue

    Service doesn't start due to a docker / podman / OCI cgroupsv2 error, log:

        2023-01-02T18:28:21.150+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
        2023-01-02T18:28:21.150+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
        2023-01-02T18:28:21.150+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
        2023-01-02T18:28:21.150+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
        2023-01-02T18:28:21.150+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
        2023-01-02T18:28:21.154+0100 [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:d73401c2-feb4-c983-f59e-f94c8153957b Address:[2001:638:50d:110c:a00:27ff:feba:f18b]:4647}]"
        2023-01-02T18:28:21.154+0100 [INFO]  nomad.raft: entering follower state: follower="Node at [2001:638:50d:110c:a00:27ff:feba:f18b]:4647 [Follower]" leader-address= leader-id=
        2023-01-02T18:28:21.156+0100 [INFO]  nomad: serf: EventMemberJoin: nomad-test.global 2001:638:50d:110c:a00:27ff:feba:f18b
        2023-01-02T18:28:21.156+0100 [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
        2023-01-02T18:28:21.156+0100 [INFO]  nomad: started scheduling worker(s): num_workers=4 schedulers=["service", "batch", "system", "sysbatch", "_core"]
        2023-01-02T18:28:21.156+0100 [INFO]  nomad: adding server: server="nomad-test.global (Addr: [2001:638:50d:110c:a00:27ff:feba:f18b]:4647) (DC: dc1)"
        2023-01-02T18:28:21.157+0100 [INFO]  client: using state directory: state_dir=/tmp/NomadClient1825089631
        2023-01-02T18:28:21.157+0100 [INFO]  client: using alloc directory: alloc_dir=/tmp/NomadClient3100991855
        2023-01-02T18:28:21.157+0100 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
        2023-01-02T18:28:21.160+0100 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
        2023-01-02T18:28:21.166+0100 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
        2023-01-02T18:28:21.169+0100 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
        2023-01-02T18:28:21.186+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
        2023-01-02T18:28:21.186+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
        2023-01-02T18:28:21.186+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=device
        2023-01-02T18:28:21.566+0100 [INFO]  client: started client: node_id=f3911116-a082-524e-96ca-fbe308ce2393
        2023-01-02T18:28:22.340+0100 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
        2023-01-02T18:28:22.340+0100 [INFO]  nomad.raft: entering candidate state: node="Node at [2001:638:50d:110c:a00:27ff:feba:f18b]:4647 [Candidate]" term=2
        2023-01-02T18:28:22.340+0100 [INFO]  nomad.raft: election won: term=2 tally=1
        2023-01-02T18:28:22.340+0100 [INFO]  nomad.raft: entering leader state: leader="Node at [2001:638:50d:110c:a00:27ff:feba:f18b]:4647 [Leader]"
        2023-01-02T18:28:22.341+0100 [INFO]  nomad: cluster leadership acquired
        2023-01-02T18:28:22.350+0100 [INFO]  nomad.core: established cluster id: cluster_id=abcaf78b-a317-b442-fc65-855c6bfe81a0 create_time=1672680502349978151
        2023-01-02T18:28:22.350+0100 [INFO]  nomad: eval broker status modified: paused=false
        2023-01-02T18:28:22.350+0100 [INFO]  nomad: blocked evals status modified: paused=false
        2023-01-02T18:28:22.358+0100 [INFO]  nomad.keyring: initialized keyring: id=9eed1ad5-4612-6541-8aab-61bf621017c9
        2023-01-02T18:28:22.421+0100 [INFO]  client: node registration complete
        2023-01-02T18:28:23.426+0100 [INFO]  client: node registration complete
        2023-01-02T18:28:48.069+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=3a4f9764-8493-b2a6-84bb-affdfbb0668a task=redis path=/tmp/NomadClient3100991855/3a4f9764-8493-b2a6-84bb-affdfbb0668a/alloc/logs/.redis.stdout.fifo @module=logmon timestamp="2023-01-02T18:28:48.069+0100"
        2023-01-02T18:28:48.069+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=3a4f9764-8493-b2a6-84bb-affdfbb0668a task=redis path=/tmp/NomadClient3100991855/3a4f9764-8493-b2a6-84bb-affdfbb0668a/alloc/logs/.redis.stderr.fifo @module=logmon timestamp="2023-01-02T18:28:48.069+0100"
        2023-01-02T18:28:53.578+0100 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=2eedc8d76ee4dfbe2bdc62463eb45e344e4c226a98af5574d5d7e3c8ee71c71b
        2023-01-02T18:30:05.885+0100 [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=2eedc8d76ee4dfbe2bdc62463eb45e344e4c226a98af5574d5d7e3c8ee71c71b error="API error (500): crun: cannot set memory swappiness with cgroupv2: OCI runtime error"
        2023-01-02T18:30:05.916+0100 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=3a4f9764-8493-b2a6-84bb-affdfbb0668a task=redis error="Failed to start container 2eedc8d76ee4dfbe2bdc62463eb45e344e4c226a98af5574d5d7e3c8ee71c71b: API error (500): crun: cannot set memory swappiness with cgroupv2: OCI runtime error"
    

    Reproduction steps

    See above

    Expected Result

    Works

    Actual Result

    See above

    Job file (if appropriate)

    Default nomad init job
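
    For comparison, one way to sidestep the Docker API shim (and the memory-swappiness option that crun rejects under cgroup v2) is the dedicated podman task driver; a sketch of the example job's redis task rewritten for it, assuming the nomad-driver-podman plugin is installed (image tag and resources are illustrative):

    task "redis" {
      driver = "podman"

      config {
        # Fully qualified image name, since podman does not assume docker.io.
        image = "docker.io/library/redis:7"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }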

    Nomad Server logs (if appropriate)

    See above

    Nomad Client logs (if appropriate)

    See above

Related tags
A batch scheduler of kubernetes for high performance workload, e.g. AI/ML, BigData, HPC

kube-batch kube-batch is a batch scheduler for Kubernetes, providing mechanisms for applications which would like to run batch jobs leveraging Kuberne

Jan 6, 2023
Deploy https certificates non-interactively to CDN services

certdeploy Deploy https certificates non-interactively to CDN services. Environment Variables CERT_PATH - Certificate file path, should contain certif

Nov 27, 2022
Natural-deploy - A natural and simple way to deploy workloads or anything on other machines.

Natural Deploy Its Go way of doing Ansibles: Motivation: Have you ever felt when using ansible or any declarative type of program that is used for dep

Jan 3, 2022
The Container Storage Interface (CSI) Driver for Fortress Block Storage This driver allows you to use Fortress Block Storage with your container orchestrator

fortress-csi The Container Storage Interface (CSI) Driver for Fortress Block Storage This driver allows you to use Fortress Block Storage with your co

Jan 23, 2022
Kubernetes is an open source system for managing containerized applications across multiple hosts.

Kubernetes Kubernetes is an open source system for managing containerized applications across multiple hosts. It provides basic mechanisms for deploym

Nov 25, 2021
Fleex allows you to create multiple VPS on cloud providers and use them to distribute your workload.

Fleex allows you to create multiple VPS on cloud providers and use them to distribute your workload. Run tools like masscan, puredns, ffuf, httpx or a

Dec 31, 2022
Workflow Orchestrator

Adagio - A Workflow Orchestrator This project is currently in a constant state of flux. Don't expect it to work. Thank you o/ Adagio is a workflow exe

Sep 2, 2022
Orchestrator Service - golang

Orchestrator Service - golang Prerequisites golang protoc compiler Code Editor (for ex. VS Code) Postman BloomRPC About Operating System Used for Deve

Feb 15, 2022
A Simple Orchestrator Service implemented using gRPC in Golang

Orchestrator Service The goal of this program is to build an orchestrator service that would read any request it receives and forwards it to other orc

Apr 5, 2022
Ensi-local-ctl - ELC - orchestrator of development environments

ELC - orchestrator of development environments With ELC you can: start a couple

Oct 13, 2022
Deploy, manage, and secure applications and resources across multiple clusters using CloudFormation and Shipa

CloudFormation provider Deploy, secure, and manage applications across multiple clusters using CloudFormation and Shipa. Development environment setup

Feb 12, 2022
Build and deploy Go applications on Kubernetes

ko: Easy Go Containers ko is a simple, fast container image builder for Go applications. It's ideal for use cases where your image contains a single G

Jan 5, 2023
Easily deploy your Go applications with Dokku.

dokku-go-example Easily deploy your Go applications with Dokku. Features: Deploy on your own server Auto deployment HTTPS Check the full step by step

Aug 21, 2022
Small and easy server for web-hooks to deploy software on push from gitlab/github/hg and so on

Deployment mini-service This mini web-server is made to deploy your code without yaml-files headache. If you just need to update your code somewhere a

Dec 4, 2022
DigitalOcean Droplets target plugin for HashiCorp Nomad Autoscaler

Nomad DigitalOcean Droplets Autoscaler The do-droplets target plugin allows for the scaling of the Nomad cluster clients via creating and destroying D

Dec 8, 2022
The Operator Pattern, in Nomad

Nomad Operator Example Repostiory to go along with my The Operator Pattern in Nomad blog post. Usage If you have tmux installed, you can run start.sh

May 12, 2022
A simple Go app and GitHub workflow that shows how to use GitHub Actions to test, build and deploy a Go app to Docker Hub

go-pipeline-demo A repository containing a simple Go app and GitHub workflow that shows how to use GitHub Actions to test, build and deploy a Go app t

Nov 17, 2021
Use Terraform to build and deploy configurations for Juniper SRX firewalls.

Juniper Terraform - SRX Overview The goal of this project is to provide an example method to interact with Juniper SRX products with Terraform. ?? Ter

Mar 16, 2022
Reconstruct Open API Specifications from real-time workload traffic seamlessly

Reconstruct Open API Specifications from real-time workload traffic seamlessly: Capture all API traffic in an existing environment using a service-mes

Jan 1, 2023